* [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
@ 2024-07-07  9:49 Yafang Shao
  2024-07-07  9:49 ` [PATCH 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Yafang Shao @ 2024-07-07  9:49 UTC (permalink / raw)
  To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao

Background
==========

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. The
workload is organized as separate processes rather than threads because
the Python Global Interpreter Lock (GIL) would be a bottleneck in a
multi-threaded setup. When these containers exit, other containers
hosted on the same machine experience significant latency spikes.

Investigation
=============

My investigation with perf tracing revealed that the root cause of
these spikes is the simultaneous execution of exit_mmap() by the
exiting processes. Their concurrent access to the zone->lock results
in contention, which becomes a hotspot and degrades performance. The
perf results below clearly show this contention as the primary
contributor to the observed latency.

+   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
-   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
   - 76.97% exit_mmap
      - 58.58% unmap_vmas
         - 58.55% unmap_single_vma
            - unmap_page_range
               - 58.32% zap_pte_range
                  - 42.88% tlb_flush_mmu
                     - 42.76% free_pages_and_swap_cache
                        - 41.22% release_pages
                           - 33.29% free_unref_page_list
                              - 32.37% free_unref_page_commit
                                 - 31.64% free_pcppages_bulk
                                    + 28.65% _raw_spin_lock
                                      1.28% __list_del_entry_valid
                           + 3.25% folio_lruvec_lock_irqsave
                           + 0.75% __mem_cgroup_uncharge_list
                             0.60% __mod_lruvec_state
                          1.07% free_swap_cache
                  + 11.69% page_remove_rmap
                    0.64% __mod_lruvec_page_state
      - 17.34% remove_vma
         - 17.25% vm_area_free
            - 17.23% kmem_cache_free
               - 17.15% __slab_free
                  - 14.56% discard_slab
                       free_slab
                       __free_slab
                       __free_pages
                     - free_unref_page
                        - 13.50% free_unref_page_commit
                           - free_pcppages_bulk
                              + 13.44% _raw_spin_lock

By enabling the mm_page_pcpu_drain tracepoint we can identify the pages
being drained; the majority of them are regular order-0 user pages.

          <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
e=1
           <...>-1540432 [224] d..3. 618048.023887: <stack trace>
 => free_pcppages_bulk
 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => free_pages_and_swap_cache
 => tlb_flush_mmu
 => zap_pte_range
 => unmap_page_range
 => unmap_single_vma
 => unmap_vmas
 => exit_mmap
 => mmput
 => do_exit
 => do_group_exit
 => get_signal
 => arch_do_signal_or_restart
 => exit_to_user_mode_prepare
 => syscall_exit_to_user_mode
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

The servers experiencing these issues have 256 CPUs and 1TB of memory,
all within a single NUMA node. The zoneinfo is as follows:

Node 0, zone   Normal
  pages free     144465775
        boost    0
        min      1309270
        low      1636587
        high     1963904
        spanned  564133888
        present  296747008
        managed  291974346
        cma      0
        protection: (0, 0, 0, 0)
...
  pagesets
    cpu: 0
              count: 2217
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 1
              count: 4510
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 2
              count: 3059
              high:  6392
              batch: 63

...

The pcp high (6392) is around 100 times the batch size (63).

I also traced the latency associated with the free_pcppages_bulk()
function during the container exit process:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 148      |*****************                       |
       512 -> 1023       : 334      |****************************************|
      1024 -> 2047       : 33       |***                                     |
      2048 -> 4095       : 5        |                                        |
      4096 -> 8191       : 7        |                                        |
      8192 -> 16383      : 12       |*                                       |
     16384 -> 32767      : 30       |***                                     |
     32768 -> 65535      : 21       |**                                      |
     65536 -> 131071     : 15       |*                                       |
    131072 -> 262143     : 27       |***                                     |
    262144 -> 524287     : 84       |**********                              |
    524288 -> 1048575    : 203      |************************                |
   1048576 -> 2097151    : 284      |**********************************      |
   2097152 -> 4194303    : 327      |*************************************** |
   4194304 -> 8388607    : 215      |*************************               |
   8388608 -> 16777215   : 116      |*************                           |
  16777216 -> 33554431   : 47       |*****                                   |
  33554432 -> 67108863   : 8        |                                        |
  67108864 -> 134217727  : 3        |                                        |

The latency can reach tens of milliseconds.

Experimenting
=============

vm.percpu_pagelist_high_fraction
--------------------------------

The kernel version currently deployed in our production environment is
stable 6.1.y, so my initial strategy was to tune the
vm.percpu_pagelist_high_fraction parameter. Increasing
vm.percpu_pagelist_high_fraction lowers the pcp high watermark and thus
the number of pages drained per batch, which leads to a substantial
reduction in latency. After setting the sysctl value to 0x7fffffff, I
observed a notable improvement in latency:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 120      |                                        |
       256 -> 511        : 365      |*                                       |
       512 -> 1023       : 201      |                                        |
      1024 -> 2047       : 103      |                                        |
      2048 -> 4095       : 84       |                                        |
      4096 -> 8191       : 87       |                                        |
      8192 -> 16383      : 4777     |**************                          |
     16384 -> 32767      : 10572    |*******************************         |
     32768 -> 65535      : 13544    |****************************************|
     65536 -> 131071     : 12723    |*************************************   |
    131072 -> 262143     : 8604     |*************************               |
    262144 -> 524287     : 3659     |**********                              |
    524288 -> 1048575    : 921      |**                                      |
   1048576 -> 2097151    : 122      |                                        |
   2097152 -> 4194303    : 5        |                                        |

However, increasing vm.percpu_pagelist_high_fraction also shrinks the
pcp high watermark, down to a floor of four times the batch size. While
this could theoretically affect throughput, as highlighted by Ying[0],
we have yet to observe any significant throughput difference in our
production environment after implementing this change.
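
To make the arithmetic concrete, here is a small userspace model of how
the pcp high watermark is derived (a paraphrase of the 6.1-era
zone_highsize() logic, not verbatim kernel code; the constants are
taken from the zoneinfo above):

#include <stdio.h>

/* Figures from the zoneinfo above. */
#define MANAGED_PAGES	291974346UL
#define LOW_WMARK	1636587UL
#define NR_LOCAL_CPUS	256UL	/* CPUs local to the (single) node */
#define BATCH		63UL

static unsigned long pcp_high(unsigned long fraction)
{
	/* Default: base high on the zone low watermark; otherwise on a
	 * fraction of the zone's managed pages. */
	unsigned long total = fraction ? MANAGED_PAGES / fraction : LOW_WMARK;
	unsigned long high = total / NR_LOCAL_CPUS;

	/* Never drop below four times the batch size. */
	return high > BATCH * 4 ? high : BATCH * 4;
}

int main(void)
{
	printf("default (fraction unset): high = %lu\n", pcp_high(0));
	printf("fraction = 8            : high = %lu\n", pcp_high(8));
	printf("fraction = 0x7fffffff   : high = %lu\n", pcp_high(0x7fffffffUL));
	return 0;
}

With the fraction unset this model yields 6392, matching the
"high: 6392" reported in the zoneinfo above, and with 0x7fffffff it
collapses to the 4 * batch = 252 floor.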

Backporting the series "mm: PCP high auto-tuning"
-------------------------------------------------

My second attempt was to backport the series "mm: PCP high
auto-tuning"[1], which comprises nine individual patches, to our 6.1.y
stable kernel. After deploying it in our production environment, I
observed a pronounced reduction in latency:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 2        |                                        |
      2048 -> 4095       : 11       |                                        |
      4096 -> 8191       : 3        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 2        |                                        |
     32768 -> 65535      : 7        |                                        |
     65536 -> 131071     : 198      |*********                               |
    131072 -> 262143     : 530      |************************                |
    262144 -> 524287     : 824      |**************************************  |
    524288 -> 1048575    : 852      |****************************************|
   1048576 -> 2097151    : 714      |*********************************       |
   2097152 -> 4194303    : 389      |******************                      |
   4194304 -> 8388607    : 143      |******                                  |
   8388608 -> 16777215   : 29       |*                                       |
  16777216 -> 33554431   : 1        |                                        |

Compared to the previous data, the maximum latency has been reduced to
less than 30ms.

Adjusting the CONFIG_PCP_BATCH_SCALE_MAX
----------------------------------------

As Ying suggested, lowering CONFIG_PCP_BATCH_SCALE_MAX can reduce the
PCP batch size without compromising the PCP high watermark size, which
could mitigate the latency spikes without adversely affecting
throughput. My third attempt therefore focused on modifying this
configuration.

To make adjustments easier, I replaced CONFIG_PCP_BATCH_SCALE_MAX with
a new sysctl knob named vm.pcp_batch_scale_max. Tuning
vm.pcp_batch_scale_max down from its default value of 5 to 0 reduced
the maximum latency further, to less than 2ms:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 36       |                                        |
      2048 -> 4095       : 5063     |*****                                   |
      4096 -> 8191       : 31226    |********************************        |
      8192 -> 16383      : 37606    |*************************************** |
     16384 -> 32767      : 38359    |****************************************|
     32768 -> 65535      : 30652    |*******************************         |
     65536 -> 131071     : 18714    |*******************                     |
    131072 -> 262143     : 7968     |********                                |
    262144 -> 524287     : 1996     |**                                      |
    524288 -> 1048575    : 302      |                                        |
   1048576 -> 2097151    : 19       |                                        |

Multiple trials produced consistent results, with no significant
differences between runs.
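
For reference, this knob caps how many pages a single pcp refill or
drain may handle, which is roughly the number of pages freed under one
zone->lock hold. A throwaway illustration (the batch size of 63 is the
one reported in the zoneinfo above):

#include <stdio.h>

int main(void)
{
	int batch = 63;	/* pcp->batch reported in the zoneinfo above */
	int scale;

	/* vm.pcp_batch_scale_max caps each pcp refill/drain at batch << N pages. */
	for (scale = 0; scale <= 6; scale++)
		printf("pcp_batch_scale_max = %d -> up to %d pages per drain\n",
		       scale, batch << scale);
	return 0;
}

At the default of 5 the cap is 2016 pages; at 0 it drops to a single
batch of 63, which is consistent with the latency reduction observed
above.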

The Proposal
============

This series encompasses two minor refinements to the PCP high watermark
auto-tuning mechanism, along with the introduction of a new sysctl knob
that serves as a more practical alternative to the previous configuration
method.

Future improvement to zone->lock
================================

To ultimately mitigate the zone->lock contention issue, several suggestions
have been proposed. One approach involves dividing large zones into multiple
smaller zones, as suggested by Matthew[2], while another entails splitting
the zone->lock using a mechanism similar to memory arenas and shifting away
from relying solely on zone_id to identify the range of free lists a
particular page belongs to[3]. However, implementing these solutions is
likely to necessitate a more extended development effort.

Link: https://lore.kernel.org/linux-mm/874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com/ [0]
Link: https://lore.kernel.org/all/20231016053002.756205-1-ying.huang@intel.com/ [1]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]

Changes:
- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
  https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/

Yafang Shao (3):
  mm/page_alloc: A minor fix to the calculation of pcp->free_count
  mm/page_alloc: Avoid changing pcp->high decaying when adjusting
    CONFIG_PCP_BATCH_SCALE_MAX
  mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

 Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++
 include/linux/sysctl.h                  |  1 +
 kernel/sysctl.c                         |  2 +-
 mm/Kconfig                              | 11 -------
 mm/page_alloc.c                         | 38 ++++++++++++++++++-------
 5 files changed, 45 insertions(+), 22 deletions(-)

-- 
2.43.5




* [PATCH 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count
  2024-07-07  9:49 [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
@ 2024-07-07  9:49 ` Yafang Shao
  2024-07-10  1:52   ` Huang, Ying
  2024-07-07  9:49 ` [PATCH 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-07  9:49 UTC (permalink / raw)
  To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao

Currently, at worst, pcp->free_count can reach
(batch - 1 + (1 << MAX_ORDER)), which may exceed the expected maximum of
(batch << CONFIG_PCP_BATCH_SCALE_MAX).

This issue was identified through code review, and no real problems have
been observed.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..8e2f4e1ab4f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2534,7 +2534,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
 	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
-		pcp->free_count += (1 << order);
+		pcp->free_count = min(pcp->free_count + (1 << order),
+				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
-- 
2.43.5




* [PATCH 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX
  2024-07-07  9:49 [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
  2024-07-07  9:49 ` [PATCH 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
@ 2024-07-07  9:49 ` Yafang Shao
  2024-07-10  1:51   ` Huang, Ying
  2024-07-07  9:49 ` [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
  2024-07-10  3:00 ` [PATCH 0/3] " Huang, Ying
  3 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-07  9:49 UTC (permalink / raw)
  To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao

When adjusting the CONFIG_PCP_BATCH_SCALE_MAX configuration from its
default value of 5 to a lower value, such as 0, it's important to ensure
that the pcp->high decaying is not inadvertently slowed down. Similarly,
when increasing CONFIG_PCP_BATCH_SCALE_MAX to a larger value, like 6, we
must avoid inadvertently increasing the number of pages freed in
free_pcppages_bulk() as a result of this change.

So the below improvements are made:
- hardcode the default value of 5 to avoid modifying the pcp->high
- refactor free_pcppages_bulk() into multiple steps, with each step
  processing a fixed batch size of pages

Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/page_alloc.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e2f4e1ab4f2..2b76754a48e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2247,7 +2247,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
 	int high_min, to_drain, batch;
-	int todo = 0;
+	int todo = 0, count = 0;
 
 	high_min = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
@@ -2257,18 +2257,25 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 	 * control latency.  This caps pcp->high decrement too.
 	 */
 	if (pcp->high > high_min) {
-		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+		/* When tuning the pcp batch scale value, we want to ensure that
+		 * the pcp->high decay rate is not slowed down. Therefore, we
+		 * hard-code the historical default scale value of 5 here to
+		 * prevent any unintended effects.
+		 */
+		pcp->high = max3(pcp->count - (batch << 5),
 				 pcp->high - (pcp->high >> 3), high_min);
 		if (pcp->high > high_min)
 			todo++;
 	}
 
 	to_drain = pcp->count - pcp->high;
-	if (to_drain > 0) {
+	while (count < to_drain) {
 		spin_lock(&pcp->lock);
-		free_pcppages_bulk(zone, to_drain, pcp, 0);
+		free_pcppages_bulk(zone, batch, pcp, 0);
 		spin_unlock(&pcp->lock);
+		count += batch;
 		todo++;
+		cond_resched();
 	}
 
 	return todo;
-- 
2.43.5




* [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-07  9:49 [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
  2024-07-07  9:49 ` [PATCH 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
  2024-07-07  9:49 ` [PATCH 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
@ 2024-07-07  9:49 ` Yafang Shao
  2024-07-10  2:49   ` Huang, Ying
  2024-07-10  3:00 ` [PATCH 0/3] " Huang, Ying
  3 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-07  9:49 UTC (permalink / raw)
  To: akpm
  Cc: ying.huang, mgorman, linux-mm, Yafang Shao, Matthew Wilcox,
	David Rientjes

The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
quickly experimenting with specific workloads in a production environment,
particularly when monitoring latency spikes caused by contention on the
zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
is introduced as a more practical alternative.
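
The knob can be flipped at run time without rebuilding or rebooting.
For example, the trivial snippet below (equivalent to
"sysctl -w vm.pcp_batch_scale_max=0", run as root) caps every pcp
refill/drain at a single batch:

#include <stdio.h>

int main(void)
{
	/* The proc path follows from the .procname added by this patch. */
	FILE *f = fopen("/proc/sys/vm/pcp_batch_scale_max", "w");

	if (!f) {
		perror("pcp_batch_scale_max");
		return 1;
	}
	fprintf(f, "0\n");
	fclose(f);
	return 0;
}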

To ultimately mitigate the zone->lock contention issue, several suggestions
have been proposed. One approach involves dividing large zones into multiple
smaller zones, as suggested by Matthew[0], while another entails splitting
the zone->lock using a mechanism similar to memory arenas and shifting away
from relying solely on zone_id to identify the range of free lists a
particular page belongs to[1]. However, implementing these solutions is
likely to necessitate a more extended development effort.

Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [1]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 15 +++++++++++++++
 include/linux/sysctl.h                  |  1 +
 kernel/sysctl.c                         |  2 +-
 mm/Kconfig                              | 11 -----------
 mm/page_alloc.c                         | 22 ++++++++++++++++------
 5 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e86c968a7a0e..eb9e5216eefe 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -66,6 +66,7 @@ Currently, these files are in /proc/sys/vm:
 - page_lock_unfairness
 - panic_on_oom
 - percpu_pagelist_high_fraction
+- pcp_batch_scale_max
 - stat_interval
 - stat_refresh
 - numa_stat
@@ -864,6 +865,20 @@ mark based on the low watermark for the zone and the number of local
 online CPUs.  If the user writes '0' to this sysctl, it will revert to
 this default behavior.
 
+pcp_batch_scale_max
+===================
+
+In page allocator, PCP (Per-CPU pageset) is refilled and drained in
+batches.  The batch number is scaled automatically to improve page
+allocation/free throughput.  But too large scale factor may hurt
+latency.  This option sets the upper limit of scale factor to limit
+the maximum latency.
+
+The range for this parameter spans from 0 to 6, with a default value of 5.
+The value assigned to 'N' signifies that during each refilling or draining
+process, a maximum of (batch << N) pages will be involved, where "batch"
+represents the default batch size automatically computed by the kernel for
+each zone.
 
 stat_interval
 =============
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 09db2f2e6488..fb797f1c0ef7 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -52,6 +52,7 @@ struct ctl_dir;
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 #define SYSCTL_MAXOLDUID		((void *)&sysctl_vals[10])
 #define SYSCTL_NEG_ONE			((void *)&sysctl_vals[11])
+#define SYSCTL_SIX			((void *)&sysctl_vals[12])
 
 extern const int sysctl_vals[];
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e0b917328cf9..430ac4f58eb7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -82,7 +82,7 @@
 #endif
 
 /* shared constants to be used in various sysctls */
-const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
+const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1, 6 };
 EXPORT_SYMBOL(sysctl_vals);
 
 const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..41fe4c13b7ac 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -663,17 +663,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
-config PCP_BATCH_SCALE_MAX
-	int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
-	default 5
-	range 0 6
-	help
-	  In page allocator, PCP (Per-CPU pageset) is refilled and drained in
-	  batches.  The batch number is scaled automatically to improve page
-	  allocation/free throughput.  But too large scale factor may hurt
-	  latency.  This option sets the upper limit of scale factor to limit
-	  the maximum latency.
-
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b76754a48e0..703eec22a997 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
+static int pcp_batch_scale_max = 5;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -2310,7 +2311,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	int count = READ_ONCE(pcp->count);
 
 	while (count) {
-		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
+		int to_drain = min(count, pcp->batch << pcp_batch_scale_max);
 		count -= to_drain;
 
 		spin_lock(&pcp->lock);
@@ -2438,7 +2439,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
 
 	/* Free as much as possible if batch freeing high-order pages. */
 	if (unlikely(free_high))
-		return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
+		return min(pcp->count, batch << pcp_batch_scale_max);
 
 	/* Check for PCP disabled or boot pageset */
 	if (unlikely(high < batch))
@@ -2470,7 +2471,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		return 0;
 
 	if (unlikely(free_high)) {
-		pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+		pcp->high = max(high - (batch << pcp_batch_scale_max),
 				high_min);
 		return 0;
 	}
@@ -2540,9 +2541,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+	if (pcp->free_count < (batch << pcp_batch_scale_max))
 		pcp->free_count = min(pcp->free_count + (1 << order),
-				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
+				      batch << pcp_batch_scale_max);
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2884,7 +2885,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 		 * subsequent allocation of order-0 pages without any freeing.
 		 */
 		if (batch <= max_nr_alloc &&
-		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+		    pcp->alloc_factor < pcp_batch_scale_max)
 			pcp->alloc_factor++;
 		batch = min(batch, max_nr_alloc);
 	}
@@ -6251,6 +6252,15 @@ static struct ctl_table page_alloc_sysctl_table[] = {
 		.proc_handler	= percpu_pagelist_high_fraction_sysctl_handler,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "pcp_batch_scale_max",
+		.data		= &pcp_batch_scale_max,
+		.maxlen		= sizeof(pcp_batch_scale_max),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_SIX,
+	},
 	{
 		.procname	= "lowmem_reserve_ratio",
 		.data		= &sysctl_lowmem_reserve_ratio,
-- 
2.43.5




* Re: [PATCH 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX
  2024-07-07  9:49 ` [PATCH 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
@ 2024-07-10  1:51   ` Huang, Ying
  2024-07-10  2:07     ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-10  1:51 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm

Yafang Shao <laoar.shao@gmail.com> writes:

> When adjusting the CONFIG_PCP_BATCH_SCALE_MAX configuration from its
> default value of 5 to a lower value, such as 0, it's important to ensure
> that the pcp->high decaying is not inadvertently slowed down. Similarly,
> when increasing CONFIG_PCP_BATCH_SCALE_MAX to a larger value, like 6, we
> must avoid inadvertently increasing the number of pages freed in
> free_pcppages_bulk() as a result of this change.
>
> So below improvements are made:
> - hardcode the default value of 5 to avoiding modifying the pcp->high
> - refactore free_pcppages_bulk() into multiple steps, with each step
>   processing a fixed batch size of pages

This is confusing.  You don't change free_pcppages_bulk() itself.  I
guess what you mean is "change free_pcppages_bulk() calling into
multiple steps".

>
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/page_alloc.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8e2f4e1ab4f2..2b76754a48e0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2247,7 +2247,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  {
>  	int high_min, to_drain, batch;
> -	int todo = 0;
> +	int todo = 0, count = 0;
>  
>  	high_min = READ_ONCE(pcp->high_min);
>  	batch = READ_ONCE(pcp->batch);
> @@ -2257,18 +2257,25 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  	 * control latency.  This caps pcp->high decrement too.
>  	 */
>  	if (pcp->high > high_min) {
> -		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
> +		/* When tuning the pcp batch scale value, we want to ensure that
> +		 * the pcp->high decay rate is not slowed down. Therefore, we
> +		 * hard-code the historical default scale value of 5 here to
> +		 * prevent any unintended effects.
> +		 */

This is good description for history.  But, in the result code, it's
not easy for people to connect the code with pcp batch scale directly.
How about something as follows,

We will decay 1/8 pcp->high each time in general, so that the idle PCP
pages can be returned to buddy system timely.  To control the max
latency of decay, we also constrain the number pages freed each time.

> +		pcp->high = max3(pcp->count - (batch << 5),
>  				 pcp->high - (pcp->high >> 3), high_min);
>  		if (pcp->high > high_min)
>  			todo++;
>  	}
>  
>  	to_drain = pcp->count - pcp->high;
> -	if (to_drain > 0) {
> +	while (count < to_drain) {
>  		spin_lock(&pcp->lock);
> -		free_pcppages_bulk(zone, to_drain, pcp, 0);
> +		free_pcppages_bulk(zone, batch, pcp, 0);

"to_drain - count" may < batch.

>  		spin_unlock(&pcp->lock);
> +		count += batch;
>  		todo++;
> +		cond_resched();
>  	}
>  
>  	return todo;

--
Best Regards,
Huang, Ying



* Re: [PATCH 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count
  2024-07-07  9:49 ` [PATCH 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
@ 2024-07-10  1:52   ` Huang, Ying
  0 siblings, 0 replies; 41+ messages in thread
From: Huang, Ying @ 2024-07-10  1:52 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm

Yafang Shao <laoar.shao@gmail.com> writes:

> Currently, At worst, the pcp->free_count can be
> (batch - 1 + (1 << MAX_ORDER)), which may exceed the expected max value of
> (batch << CONFIG_PCP_BATCH_SCALE_MAX).
>
> This issue was identified through code review, and no real problems have
> been observed.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5675ca..8e2f4e1ab4f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2534,7 +2534,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>  		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
>  	}
>  	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
> -		pcp->free_count += (1 << order);
> +		pcp->free_count = min(pcp->free_count + (1 << order),
> +				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
>  	high = nr_pcp_high(pcp, zone, batch, free_high);
>  	if (pcp->count >= high) {
>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),



* Re: [PATCH 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX
  2024-07-10  1:51   ` Huang, Ying
@ 2024-07-10  2:07     ` Yafang Shao
  0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2024-07-10  2:07 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm

On Wed, Jul 10, 2024 at 9:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > When adjusting the CONFIG_PCP_BATCH_SCALE_MAX configuration from its
> > default value of 5 to a lower value, such as 0, it's important to ensure
> > that the pcp->high decaying is not inadvertently slowed down. Similarly,
> > when increasing CONFIG_PCP_BATCH_SCALE_MAX to a larger value, like 6, we
> > must avoid inadvertently increasing the number of pages freed in
> > free_pcppages_bulk() as a result of this change.
> >
> > So below improvements are made:
> > - hardcode the default value of 5 to avoiding modifying the pcp->high
> > - refactore free_pcppages_bulk() into multiple steps, with each step
> >   processing a fixed batch size of pages
>
> This is confusing.  You don't change free_pcppages_bulk() itself.  I
> guess what you mean is "change free_pcppages_bulk() calling into
> multiple steps".

will change it.

>
> >
> > Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  mm/page_alloc.c | 15 +++++++++++----
> >  1 file changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8e2f4e1ab4f2..2b76754a48e0 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2247,7 +2247,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >  int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> >  {
> >       int high_min, to_drain, batch;
> > -     int todo = 0;
> > +     int todo = 0, count = 0;
> >
> >       high_min = READ_ONCE(pcp->high_min);
> >       batch = READ_ONCE(pcp->batch);
> > @@ -2257,18 +2257,25 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> >        * control latency.  This caps pcp->high decrement too.
> >        */
> >       if (pcp->high > high_min) {
> > -             pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
> > +             /* When tuning the pcp batch scale value, we want to ensure that
> > +              * the pcp->high decay rate is not slowed down. Therefore, we
> > +              * hard-code the historical default scale value of 5 here to
> > +              * prevent any unintended effects.
> > +              */
>
> This is good description for history.  But, in the result code, it's
> not easy for people to connect the code with pcp batch scale directly.
> How about something as follows,
>
> We will decay 1/8 pcp->high each time in general, so that the idle PCP
> pages can be returned to buddy system timely.  To control the max
> latency of decay, we also constrain the number pages freed each time.

Thanks for your suggestion.
will change it.

>
> > +             pcp->high = max3(pcp->count - (batch << 5),
> >                                pcp->high - (pcp->high >> 3), high_min);
> >               if (pcp->high > high_min)
> >                       todo++;
> >       }
> >
> >       to_drain = pcp->count - pcp->high;
> > -     if (to_drain > 0) {
> > +     while (count < to_drain) {
> >               spin_lock(&pcp->lock);
> > -             free_pcppages_bulk(zone, to_drain, pcp, 0);
> > +             free_pcppages_bulk(zone, batch, pcp, 0);
>
> "to_drain - count" may < batch.

Nice catch. will fix it.
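
For instance, something along these lines (an untested sketch on top of
this patch; I still need to verify it):

	to_drain = pcp->count - pcp->high;
	while (count < to_drain) {
		int nr = min(batch, to_drain - count);

		spin_lock(&pcp->lock);
		free_pcppages_bulk(zone, nr, pcp, 0);
		spin_unlock(&pcp->lock);
		count += nr;
		todo++;
		cond_resched();
	}

This way the last iteration frees only the remaining pages instead of a
full batch.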

>
> >               spin_unlock(&pcp->lock);
> > +             count += batch;
> >               todo++;
> > +             cond_resched();
> >       }
> >
> >       return todo;
>
> --
> Best Regards,
> Huang, Ying



-- 
Regards
Yafang



* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-07  9:49 ` [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
@ 2024-07-10  2:49   ` Huang, Ying
  2024-07-11  2:21     ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-10  2:49 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> quickly experimenting with specific workloads in a production environment,
> particularly when monitoring latency spikes caused by contention on the
> zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> is introduced as a more practical alternative.

In general, I'm neutral on the change.  I can understand that a kernel
config option isn't as flexible as a sysctl knob.  But a sysctl knob is
ABI too.

> To ultimately mitigate the zone->lock contention issue, several suggestions
> have been proposed. One approach involves dividing large zones into multi
> smaller zones, as suggested by Matthew[0], while another entails splitting
> the zone->lock using a mechanism similar to memory arenas and shifting away
> from relying solely on zone_id to identify the range of free lists a
> particular page belongs to[1]. However, implementing these solutions is
> likely to necessitate a more extended development effort.

Per my understanding, the change will hurt rather than improve
zone->lock contention; what it does reduce is page allocation/freeing
latency.

> Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
> Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [1]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  Documentation/admin-guide/sysctl/vm.rst | 15 +++++++++++++++
>  include/linux/sysctl.h                  |  1 +
>  kernel/sysctl.c                         |  2 +-
>  mm/Kconfig                              | 11 -----------
>  mm/page_alloc.c                         | 22 ++++++++++++++++------
>  5 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index e86c968a7a0e..eb9e5216eefe 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -66,6 +66,7 @@ Currently, these files are in /proc/sys/vm:
>  - page_lock_unfairness
>  - panic_on_oom
>  - percpu_pagelist_high_fraction
> +- pcp_batch_scale_max
>  - stat_interval
>  - stat_refresh
>  - numa_stat
> @@ -864,6 +865,20 @@ mark based on the low watermark for the zone and the number of local
>  online CPUs.  If the user writes '0' to this sysctl, it will revert to
>  this default behavior.
>  
> +pcp_batch_scale_max
> +===================
> +
> +In page allocator, PCP (Per-CPU pageset) is refilled and drained in
> +batches.  The batch number is scaled automatically to improve page
> +allocation/free throughput.  But too large scale factor may hurt
> +latency.  This option sets the upper limit of scale factor to limit
> +the maximum latency.
> +
> +The range for this parameter spans from 0 to 6, with a default value of 5.
> +The value assigned to 'N' signifies that during each refilling or draining
> +process, a maximum of (batch << N) pages will be involved, where "batch"
> +represents the default batch size automatically computed by the kernel for
> +each zone.
>  
>  stat_interval
>  =============
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index 09db2f2e6488..fb797f1c0ef7 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -52,6 +52,7 @@ struct ctl_dir;
>  /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
>  #define SYSCTL_MAXOLDUID		((void *)&sysctl_vals[10])
>  #define SYSCTL_NEG_ONE			((void *)&sysctl_vals[11])
> +#define SYSCTL_SIX			((void *)&sysctl_vals[12])
>  
>  extern const int sysctl_vals[];
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e0b917328cf9..430ac4f58eb7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -82,7 +82,7 @@
>  #endif
>  
>  /* shared constants to be used in various sysctls */
> -const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
> +const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1, 6 };
>  EXPORT_SYMBOL(sysctl_vals);
>  
>  const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b4cb45255a54..41fe4c13b7ac 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -663,17 +663,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE
>  config CONTIG_ALLOC
>  	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
>  
> -config PCP_BATCH_SCALE_MAX
> -	int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
> -	default 5
> -	range 0 6
> -	help
> -	  In page allocator, PCP (Per-CPU pageset) is refilled and drained in
> -	  batches.  The batch number is scaled automatically to improve page
> -	  allocation/free throughput.  But too large scale factor may hurt
> -	  latency.  This option sets the upper limit of scale factor to limit
> -	  the maximum latency.
> -
>  config PHYS_ADDR_T_64BIT
>  	def_bool 64BIT
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2b76754a48e0..703eec22a997 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
>  int user_min_free_kbytes = -1;
>  static int watermark_boost_factor __read_mostly = 15000;
>  static int watermark_scale_factor = 10;
> +static int pcp_batch_scale_max = 5;
>  
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
> @@ -2310,7 +2311,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  	int count = READ_ONCE(pcp->count);
>  
>  	while (count) {
> -		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +		int to_drain = min(count, pcp->batch << pcp_batch_scale_max);
>  		count -= to_drain;
>  
>  		spin_lock(&pcp->lock);
> @@ -2438,7 +2439,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
>  
>  	/* Free as much as possible if batch freeing high-order pages. */
>  	if (unlikely(free_high))
> -		return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +		return min(pcp->count, batch << pcp_batch_scale_max);
>  
>  	/* Check for PCP disabled or boot pageset */
>  	if (unlikely(high < batch))
> @@ -2470,7 +2471,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>  		return 0;
>  
>  	if (unlikely(free_high)) {
> -		pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
> +		pcp->high = max(high - (batch << pcp_batch_scale_max),
>  				high_min);
>  		return 0;
>  	}
> @@ -2540,9 +2541,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>  	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
>  		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
>  	}
> -	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
> +	if (pcp->free_count < (batch << pcp_batch_scale_max))
>  		pcp->free_count = min(pcp->free_count + (1 << order),
> -				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +				      batch << pcp_batch_scale_max);
>  	high = nr_pcp_high(pcp, zone, batch, free_high);
>  	if (pcp->count >= high) {
>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> @@ -2884,7 +2885,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>  		 * subsequent allocation of order-0 pages without any freeing.
>  		 */
>  		if (batch <= max_nr_alloc &&
> -		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
> +		    pcp->alloc_factor < pcp_batch_scale_max)
>  			pcp->alloc_factor++;
>  		batch = min(batch, max_nr_alloc);
>  	}
> @@ -6251,6 +6252,15 @@ static struct ctl_table page_alloc_sysctl_table[] = {
>  		.proc_handler	= percpu_pagelist_high_fraction_sysctl_handler,
>  		.extra1		= SYSCTL_ZERO,
>  	},
> +	{
> +		.procname	= "pcp_batch_scale_max",
> +		.data		= &pcp_batch_scale_max,
> +		.maxlen		= sizeof(pcp_batch_scale_max),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_SIX,
> +	},
>  	{
>  		.procname	= "lowmem_reserve_ratio",
>  		.data		= &sysctl_lowmem_reserve_ratio,

--
Best Regards,
Huang, Ying



* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-07  9:49 [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
                   ` (2 preceding siblings ...)
  2024-07-07  9:49 ` [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
@ 2024-07-10  3:00 ` Huang, Ying
  2024-07-11  2:25   ` Yafang Shao
  3 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-10  3:00 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm

Yafang Shao <laoar.shao@gmail.com> writes:

> Background
> ==========
>
> In our containerized environment, we have a specific type of container
> that runs 18 processes, each consuming approximately 6GB of RSS. These
> processes are organized as separate processes rather than threads due
> to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> multi-threaded setup. Upon the exit of these containers, other
> containers hosted on the same machine experience significant latency
> spikes.
>
> Investigation
> =============
>
> My investigation using perf tracing revealed that the root cause of
> these spikes is the simultaneous execution of exit_mmap() by each of
> the exiting processes. This concurrent access to the zone->lock
> results in contention, which becomes a hotspot and negatively impacts
> performance. The perf results clearly indicate this contention as a
> primary contributor to the observed latency issues.
>
> +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>    - 76.97% exit_mmap
>       - 58.58% unmap_vmas
>          - 58.55% unmap_single_vma
>             - unmap_page_range
>                - 58.32% zap_pte_range
>                   - 42.88% tlb_flush_mmu
>                      - 42.76% free_pages_and_swap_cache
>                         - 41.22% release_pages
>                            - 33.29% free_unref_page_list
>                               - 32.37% free_unref_page_commit
>                                  - 31.64% free_pcppages_bulk
>                                     + 28.65% _raw_spin_lock
>                                       1.28% __list_del_entry_valid
>                            + 3.25% folio_lruvec_lock_irqsave
>                            + 0.75% __mem_cgroup_uncharge_list
>                              0.60% __mod_lruvec_state
>                           1.07% free_swap_cache
>                   + 11.69% page_remove_rmap
>                     0.64% __mod_lruvec_page_state
>       - 17.34% remove_vma
>          - 17.25% vm_area_free
>             - 17.23% kmem_cache_free
>                - 17.15% __slab_free
>                   - 14.56% discard_slab
>                        free_slab
>                        __free_slab
>                        __free_pages
>                      - free_unref_page
>                         - 13.50% free_unref_page_commit
>                            - free_pcppages_bulk
>                               + 13.44% _raw_spin_lock

I don't think your change will reduce zone->lock contention cycles, so
I don't see the value of the above data.

> By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> with the majority of them being regular order-0 user pages.
>
>           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> e=1
>            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>  => free_pcppages_bulk
>  => free_unref_page_commit
>  => free_unref_page_list
>  => release_pages
>  => free_pages_and_swap_cache
>  => tlb_flush_mmu
>  => zap_pte_range
>  => unmap_page_range
>  => unmap_single_vma
>  => unmap_vmas
>  => exit_mmap
>  => mmput
>  => do_exit
>  => do_group_exit
>  => get_signal
>  => arch_do_signal_or_restart
>  => exit_to_user_mode_prepare
>  => syscall_exit_to_user_mode
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
>
> The servers experiencing these issues are equipped with impressive
> hardware specifications, including 256 CPUs and 1TB of memory, all
> within a single NUMA node. The zoneinfo is as follows,
>
> Node 0, zone   Normal
>   pages free     144465775
>         boost    0
>         min      1309270
>         low      1636587
>         high     1963904
>         spanned  564133888
>         present  296747008
>         managed  291974346
>         cma      0
>         protection: (0, 0, 0, 0)
> ...
>   pagesets
>     cpu: 0
>               count: 2217
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 1
>               count: 4510
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 2
>               count: 3059
>               high:  6392
>               batch: 63
>
> ...
>
> The pcp high is around 100 times the batch size.
>
> I also traced the latency associated with the free_pcppages_bulk()
> function during the container exit process:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 148      |*****************                       |
>        512 -> 1023       : 334      |****************************************|
>       1024 -> 2047       : 33       |***                                     |
>       2048 -> 4095       : 5        |                                        |
>       4096 -> 8191       : 7        |                                        |
>       8192 -> 16383      : 12       |*                                       |
>      16384 -> 32767      : 30       |***                                     |
>      32768 -> 65535      : 21       |**                                      |
>      65536 -> 131071     : 15       |*                                       |
>     131072 -> 262143     : 27       |***                                     |
>     262144 -> 524287     : 84       |**********                              |
>     524288 -> 1048575    : 203      |************************                |
>    1048576 -> 2097151    : 284      |**********************************      |
>    2097152 -> 4194303    : 327      |*************************************** |
>    4194304 -> 8388607    : 215      |*************************               |
>    8388608 -> 16777215   : 116      |*************                           |
>   16777216 -> 33554431   : 47       |*****                                   |
>   33554432 -> 67108863   : 8        |                                        |
>   67108864 -> 134217727  : 3        |                                        |
>
> The latency can reach tens of milliseconds.
>
> Experimenting
> =============
>
> vm.percpu_pagelist_high_fraction
> --------------------------------
>
> The kernel version currently deployed in our production environment is the
> stable 6.1.y, and my initial strategy involves optimizing the

IMHO, we should focus on upstream activity in the cover letter and patch
description.  And I don't think it's necessary to describe the
alternative solution in so much detail.

> vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> page draining, which subsequently leads to a substantial reduction in
> latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> improvement in latency.
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 120      |                                        |
>        256 -> 511        : 365      |*                                       |
>        512 -> 1023       : 201      |                                        |
>       1024 -> 2047       : 103      |                                        |
>       2048 -> 4095       : 84       |                                        |
>       4096 -> 8191       : 87       |                                        |
>       8192 -> 16383      : 4777     |**************                          |
>      16384 -> 32767      : 10572    |*******************************         |
>      32768 -> 65535      : 13544    |****************************************|
>      65536 -> 131071     : 12723    |*************************************   |
>     131072 -> 262143     : 8604     |*************************               |
>     262144 -> 524287     : 3659     |**********                              |
>     524288 -> 1048575    : 921      |**                                      |
>    1048576 -> 2097151    : 122      |                                        |
>    2097152 -> 4194303    : 5        |                                        |
>
> However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> pcp high watermark size to a minimum of four times the batch size. While
> this could theoretically affect throughput, as highlighted by Ying[0], we
> have yet to observe any significant difference in throughput within our
> production environment after implementing this change.
>
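
For reference, the tuning described above boils down to a single sysctl
write.  A minimal sketch, assuming a kernel that exposes the stock
vm.percpu_pagelist_high_fraction knob (0x7fffffff is 2147483647 in
decimal):

  # Shrink the PCP high watermark (and with it the drain batch) by
  # raising the fraction to its maximum; this may trade some throughput
  # for lower latency, as noted above.
  sysctl -w vm.percpu_pagelist_high_fraction=2147483647
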
> Backporting the series "mm: PCP high auto-tuning"
> -------------------------------------------------

Again, this is not upstream activity.  We can describe the upstream
behavior directly.

> My second endeavor was to backport the series titled
> "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> into our 6.1.y stable kernel version. Subsequent to its deployment in our
> production environment, I noted a pronounced reduction in latency. The
> observed outcomes are as enumerated below:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 2        |                                        |
>       2048 -> 4095       : 11       |                                        |
>       4096 -> 8191       : 3        |                                        |
>       8192 -> 16383      : 1        |                                        |
>      16384 -> 32767      : 2        |                                        |
>      32768 -> 65535      : 7        |                                        |
>      65536 -> 131071     : 198      |*********                               |
>     131072 -> 262143     : 530      |************************                |
>     262144 -> 524287     : 824      |**************************************  |
>     524288 -> 1048575    : 852      |****************************************|
>    1048576 -> 2097151    : 714      |*********************************       |
>    2097152 -> 4194303    : 389      |******************                      |
>    4194304 -> 8388607    : 143      |******                                  |
>    8388608 -> 16777215   : 29       |*                                       |
>   16777216 -> 33554431   : 1        |                                        |
>
> Compared to the previous data, the maximum latency has been reduced to
> less than 30ms.

People don't care too much about page freeing latency while processes
are exiting.  Instead, they care more about the process exit time, that
is, throughput.  So, it's better to show the page allocation latency
that is affected by the simultaneously exiting processes.
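
For example, something like the following sketch could capture that,
assuming bcc's funclatency tool is available and that __alloc_pages is
the allocation entry point on the kernel in question (the exact symbol
name varies across kernel versions):

  # Nanosecond-resolution histogram of page allocation latency, sampled
  # for 30 seconds while the containers are exiting.
  /usr/share/bcc/tools/funclatency -d 30 __alloc_pages
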

> Adjusting the CONFIG_PCP_BATCH_SCALE_MAX
> ----------------------------------------
>
> Upon Ying's suggestion, adjusting the CONFIG_PCP_BATCH_SCALE_MAX can
> potentially reduce the PCP batch size without compromising the PCP high
> watermark size. This approach could mitigate latency spikes without
> adversely affecting throughput. Consequently, my third attempt focused on
> modifying this configuration.
>
> To facilitate easier adjustments, I replaced CONFIG_PCP_BATCH_SCALE_MAX
> with a new sysctl knob named vm.pcp_batch_scale_max. By fine-tuning
> vm.pcp_batch_scale_max from its default value of 5 down to 0, I achieved a
> further reduction in the maximum latency, which was lowered to less than
> 2ms:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 36       |                                        |
>       2048 -> 4095       : 5063     |*****                                   |
>       4096 -> 8191       : 31226    |********************************        |
>       8192 -> 16383      : 37606    |*************************************** |
>      16384 -> 32767      : 38359    |****************************************|
>      32768 -> 65535      : 30652    |*******************************         |
>      65536 -> 131071     : 18714    |*******************                     |
>     131072 -> 262143     : 7968     |********                                |
>     262144 -> 524287     : 1996     |**                                      |
>     524288 -> 1048575    : 302      |                                        |
>    1048576 -> 2097151    : 19       |                                        |
>
> After multiple trials, I observed no significant differences between
> each attempt.
>
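
With the vm.pcp_batch_scale_max knob proposed here, the experiment
above would presumably reduce to a single command, for example:

  # Proposed knob: lower the batch scale factor from its default of 5
  # to 0 to keep the PCP free batch small without shrinking the PCP
  # high watermark.
  sysctl -w vm.pcp_batch_scale_max=0
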
> The Proposal
> ============
>
> This series encompasses two minor refinements to the PCP high watermark
> auto-tuning mechanism, along with the introduction of a new sysctl knob
> that serves as a more practical alternative to the previous configuration
> method.
>
> Future improvement to zone->lock
> ================================
>
> To ultimately mitigate the zone->lock contention issue, several suggestions
> have been proposed. One approach involves dividing large zones into multi
> smaller zones, as suggested by Matthew[2], while another entails splitting
> the zone->lock using a mechanism similar to memory arenas and shifting away
> from relying solely on zone_id to identify the range of free lists a
> particular page belongs to[3]. However, implementing these solutions is
> likely to necessitate a more extended development effort.
>
> Link: https://lore.kernel.org/linux-mm/874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com/ [0]
> Link: https://lore.kernel.org/all/20231016053002.756205-1-ying.huang@intel.com/ [1]
> Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
> Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]
>
> Changes:
> - mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
>   https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/
>
> Yafang Shao (3):
>   mm/page_alloc: A minor fix to the calculation of pcp->free_count
>   mm/page_alloc: Avoid changing pcp->high decaying when adjusting
>     CONFIG_PCP_BATCH_SCALE_MAX
>   mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
>
>  Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++
>  include/linux/sysctl.h                  |  1 +
>  kernel/sysctl.c                         |  2 +-
>  mm/Kconfig                              | 11 -------
>  mm/page_alloc.c                         | 38 ++++++++++++++++++-------
>  5 files changed, 45 insertions(+), 22 deletions(-)

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-10  2:49   ` Huang, Ying
@ 2024-07-11  2:21     ` Yafang Shao
  2024-07-11  6:42       ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-11  2:21 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> > quickly experimenting with specific workloads in a production environment,
> > particularly when monitoring latency spikes caused by contention on the
> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> > is introduced as a more practical alternative.
>
> In general, I'm neutral to the change.  I can understand that kernel
> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> too.
>
> > To ultimately mitigate the zone->lock contention issue, several suggestions
> > have been proposed. One approach involves dividing large zones into multi
> > smaller zones, as suggested by Matthew[0], while another entails splitting
> > the zone->lock using a mechanism similar to memory arenas and shifting away
> > from relying solely on zone_id to identify the range of free lists a
> > particular page belongs to[1]. However, implementing these solutions is
> > likely to necessitate a more extended development effort.
>
> Per my understanding, the change will hurt instead of improve zone->lock
> contention.  Instead, it will reduce page allocation/freeing latency.

I'm quite perplexed by your recent comment. You introduced a
configuration option that has proven difficult to use, and you have
been resistant to suggestions for turning it into a more user-friendly
and practical tuning interface. May I inquire about the rationale
behind introducing this configuration in the first place?

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-10  3:00 ` [PATCH 0/3] " Huang, Ying
@ 2024-07-11  2:25   ` Yafang Shao
  2024-07-11  6:38     ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-11  2:25 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm

On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > Background
> > ==========
> >
> > In our containerized environment, we have a specific type of container
> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> > processes are organized as separate processes rather than threads due
> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> > multi-threaded setup. Upon the exit of these containers, other
> > containers hosted on the same machine experience significant latency
> > spikes.
> >
> > Investigation
> > =============
> >
> > My investigation using perf tracing revealed that the root cause of
> > these spikes is the simultaneous execution of exit_mmap() by each of
> > the exiting processes. This concurrent access to the zone->lock
> > results in contention, which becomes a hotspot and negatively impacts
> > performance. The perf results clearly indicate this contention as a
> > primary contributor to the observed latency issues.
> >
> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
> >    - 76.97% exit_mmap
> >       - 58.58% unmap_vmas
> >          - 58.55% unmap_single_vma
> >             - unmap_page_range
> >                - 58.32% zap_pte_range
> >                   - 42.88% tlb_flush_mmu
> >                      - 42.76% free_pages_and_swap_cache
> >                         - 41.22% release_pages
> >                            - 33.29% free_unref_page_list
> >                               - 32.37% free_unref_page_commit
> >                                  - 31.64% free_pcppages_bulk
> >                                     + 28.65% _raw_spin_lock
> >                                       1.28% __list_del_entry_valid
> >                            + 3.25% folio_lruvec_lock_irqsave
> >                            + 0.75% __mem_cgroup_uncharge_list
> >                              0.60% __mod_lruvec_state
> >                           1.07% free_swap_cache
> >                   + 11.69% page_remove_rmap
> >                     0.64% __mod_lruvec_page_state
> >       - 17.34% remove_vma
> >          - 17.25% vm_area_free
> >             - 17.23% kmem_cache_free
> >                - 17.15% __slab_free
> >                   - 14.56% discard_slab
> >                        free_slab
> >                        __free_slab
> >                        __free_pages
> >                      - free_unref_page
> >                         - 13.50% free_unref_page_commit
> >                            - free_pcppages_bulk
> >                               + 13.44% _raw_spin_lock
>
> I don't think your change will reduce zone->lock contention cycles.  So,
> I don't find the value of the above data.
>
> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> > with the majority of them being regular order-0 user pages.
> >
> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> > e=1
> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >  => free_pcppages_bulk
> >  => free_unref_page_commit
> >  => free_unref_page_list
> >  => release_pages
> >  => free_pages_and_swap_cache
> >  => tlb_flush_mmu
> >  => zap_pte_range
> >  => unmap_page_range
> >  => unmap_single_vma
> >  => unmap_vmas
> >  => exit_mmap
> >  => mmput
> >  => do_exit
> >  => do_group_exit
> >  => get_signal
> >  => arch_do_signal_or_restart
> >  => exit_to_user_mode_prepare
> >  => syscall_exit_to_user_mode
> >  => do_syscall_64
> >  => entry_SYSCALL_64_after_hwframe
> >
> > The servers experiencing these issues are equipped with impressive
> > hardware specifications, including 256 CPUs and 1TB of memory, all
> > within a single NUMA node. The zoneinfo is as follows,
> >
> > Node 0, zone   Normal
> >   pages free     144465775
> >         boost    0
> >         min      1309270
> >         low      1636587
> >         high     1963904
> >         spanned  564133888
> >         present  296747008
> >         managed  291974346
> >         cma      0
> >         protection: (0, 0, 0, 0)
> > ...
> >   pagesets
> >     cpu: 0
> >               count: 2217
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 1
> >               count: 4510
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 2
> >               count: 3059
> >               high:  6392
> >               batch: 63
> >
> > ...
> >
> > The pcp high is around 100 times the batch size.
> >
> > I also traced the latency associated with the free_pcppages_bulk()
> > function during the container exit process:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 148      |*****************                       |
> >        512 -> 1023       : 334      |****************************************|
> >       1024 -> 2047       : 33       |***                                     |
> >       2048 -> 4095       : 5        |                                        |
> >       4096 -> 8191       : 7        |                                        |
> >       8192 -> 16383      : 12       |*                                       |
> >      16384 -> 32767      : 30       |***                                     |
> >      32768 -> 65535      : 21       |**                                      |
> >      65536 -> 131071     : 15       |*                                       |
> >     131072 -> 262143     : 27       |***                                     |
> >     262144 -> 524287     : 84       |**********                              |
> >     524288 -> 1048575    : 203      |************************                |
> >    1048576 -> 2097151    : 284      |**********************************      |
> >    2097152 -> 4194303    : 327      |*************************************** |
> >    4194304 -> 8388607    : 215      |*************************               |
> >    8388608 -> 16777215   : 116      |*************                           |
> >   16777216 -> 33554431   : 47       |*****                                   |
> >   33554432 -> 67108863   : 8        |                                        |
> >   67108864 -> 134217727  : 3        |                                        |
> >
> > The latency can reach tens of milliseconds.
> >
> > Experimenting
> > =============
> >
> > vm.percpu_pagelist_high_fraction
> > --------------------------------
> >
> > The kernel version currently deployed in our production environment is the
> > stable 6.1.y, and my initial strategy involves optimizing the
>
> IMHO, we should focus on upstream activity in the cover letter and patch
> description.  And I don't think that it's necessary to describe the
> alternative solution with too much details.
>
> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> > page draining, which subsequently leads to a substantial reduction in
> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> > improvement in latency.
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 120      |                                        |
> >        256 -> 511        : 365      |*                                       |
> >        512 -> 1023       : 201      |                                        |
> >       1024 -> 2047       : 103      |                                        |
> >       2048 -> 4095       : 84       |                                        |
> >       4096 -> 8191       : 87       |                                        |
> >       8192 -> 16383      : 4777     |**************                          |
> >      16384 -> 32767      : 10572    |*******************************         |
> >      32768 -> 65535      : 13544    |****************************************|
> >      65536 -> 131071     : 12723    |*************************************   |
> >     131072 -> 262143     : 8604     |*************************               |
> >     262144 -> 524287     : 3659     |**********                              |
> >     524288 -> 1048575    : 921      |**                                      |
> >    1048576 -> 2097151    : 122      |                                        |
> >    2097152 -> 4194303    : 5        |                                        |
> >
> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> > pcp high watermark size to a minimum of four times the batch size. While
> > this could theoretically affect throughput, as highlighted by Ying[0], we
> > have yet to observe any significant difference in throughput within our
> > production environment after implementing this change.
> >
> > Backporting the series "mm: PCP high auto-tuning"
> > -------------------------------------------------
>
> Again, not upstream activity.  We can describe the upstream behavior
> directly.

Andrew has requested that I provide a more comprehensive analysis of
this issue, and in response I have endeavored to outline all the
pertinent details as thoroughly as possible.

>
> > My second endeavor was to backport the series titled
> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> > production environment, I noted a pronounced reduction in latency. The
> > observed outcomes are as enumerated below:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 2        |                                        |
> >       2048 -> 4095       : 11       |                                        |
> >       4096 -> 8191       : 3        |                                        |
> >       8192 -> 16383      : 1        |                                        |
> >      16384 -> 32767      : 2        |                                        |
> >      32768 -> 65535      : 7        |                                        |
> >      65536 -> 131071     : 198      |*********                               |
> >     131072 -> 262143     : 530      |************************                |
> >     262144 -> 524287     : 824      |**************************************  |
> >     524288 -> 1048575    : 852      |****************************************|
> >    1048576 -> 2097151    : 714      |*********************************       |
> >    2097152 -> 4194303    : 389      |******************                      |
> >    4194304 -> 8388607    : 143      |******                                  |
> >    8388608 -> 16777215   : 29       |*                                       |
> >   16777216 -> 33554431   : 1        |                                        |
> >
> > Compared to the previous data, the maximum latency has been reduced to
> > less than 30ms.
>
> People don't care too much about page freeing latency during processes
> exiting.  Instead, they care more about the process exiting time, that
> is, throughput.  So, it's better to show the page allocation latency
> which is affected by the simultaneous processes exiting.

I'm confused as well. Is this issue really that hard to understand?


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  2:25   ` Yafang Shao
@ 2024-07-11  6:38     ` Huang, Ying
  2024-07-11  7:21       ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-11  6:38 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm

Yafang Shao <laoar.shao@gmail.com> writes:

> On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > Background
>> > ==========
>> >
>> > In our containerized environment, we have a specific type of container
>> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> > processes are organized as separate processes rather than threads due
>> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> > multi-threaded setup. Upon the exit of these containers, other
>> > containers hosted on the same machine experience significant latency
>> > spikes.
>> >
>> > Investigation
>> > =============
>> >
>> > My investigation using perf tracing revealed that the root cause of
>> > these spikes is the simultaneous execution of exit_mmap() by each of
>> > the exiting processes. This concurrent access to the zone->lock
>> > results in contention, which becomes a hotspot and negatively impacts
>> > performance. The perf results clearly indicate this contention as a
>> > primary contributor to the observed latency issues.
>> >
>> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
>> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>> >    - 76.97% exit_mmap
>> >       - 58.58% unmap_vmas
>> >          - 58.55% unmap_single_vma
>> >             - unmap_page_range
>> >                - 58.32% zap_pte_range
>> >                   - 42.88% tlb_flush_mmu
>> >                      - 42.76% free_pages_and_swap_cache
>> >                         - 41.22% release_pages
>> >                            - 33.29% free_unref_page_list
>> >                               - 32.37% free_unref_page_commit
>> >                                  - 31.64% free_pcppages_bulk
>> >                                     + 28.65% _raw_spin_lock
>> >                                       1.28% __list_del_entry_valid
>> >                            + 3.25% folio_lruvec_lock_irqsave
>> >                            + 0.75% __mem_cgroup_uncharge_list
>> >                              0.60% __mod_lruvec_state
>> >                           1.07% free_swap_cache
>> >                   + 11.69% page_remove_rmap
>> >                     0.64% __mod_lruvec_page_state
>> >       - 17.34% remove_vma
>> >          - 17.25% vm_area_free
>> >             - 17.23% kmem_cache_free
>> >                - 17.15% __slab_free
>> >                   - 14.56% discard_slab
>> >                        free_slab
>> >                        __free_slab
>> >                        __free_pages
>> >                      - free_unref_page
>> >                         - 13.50% free_unref_page_commit
>> >                            - free_pcppages_bulk
>> >                               + 13.44% _raw_spin_lock
>>
>> I don't think your change will reduce zone->lock contention cycles.  So,
>> I don't find the value of the above data.
>>
>> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
>> > with the majority of them being regular order-0 user pages.
>> >
>> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> > e=1
>> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >  => free_pcppages_bulk
>> >  => free_unref_page_commit
>> >  => free_unref_page_list
>> >  => release_pages
>> >  => free_pages_and_swap_cache
>> >  => tlb_flush_mmu
>> >  => zap_pte_range
>> >  => unmap_page_range
>> >  => unmap_single_vma
>> >  => unmap_vmas
>> >  => exit_mmap
>> >  => mmput
>> >  => do_exit
>> >  => do_group_exit
>> >  => get_signal
>> >  => arch_do_signal_or_restart
>> >  => exit_to_user_mode_prepare
>> >  => syscall_exit_to_user_mode
>> >  => do_syscall_64
>> >  => entry_SYSCALL_64_after_hwframe
>> >
>> > The servers experiencing these issues are equipped with impressive
>> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> > within a single NUMA node. The zoneinfo is as follows,
>> >
>> > Node 0, zone   Normal
>> >   pages free     144465775
>> >         boost    0
>> >         min      1309270
>> >         low      1636587
>> >         high     1963904
>> >         spanned  564133888
>> >         present  296747008
>> >         managed  291974346
>> >         cma      0
>> >         protection: (0, 0, 0, 0)
>> > ...
>> >   pagesets
>> >     cpu: 0
>> >               count: 2217
>> >               high:  6392
>> >               batch: 63
>> >   vm stats threshold: 125
>> >     cpu: 1
>> >               count: 4510
>> >               high:  6392
>> >               batch: 63
>> >   vm stats threshold: 125
>> >     cpu: 2
>> >               count: 3059
>> >               high:  6392
>> >               batch: 63
>> >
>> > ...
>> >
>> > The pcp high is around 100 times the batch size.
>> >
>> > I also traced the latency associated with the free_pcppages_bulk()
>> > function during the container exit process:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 148      |*****************                       |
>> >        512 -> 1023       : 334      |****************************************|
>> >       1024 -> 2047       : 33       |***                                     |
>> >       2048 -> 4095       : 5        |                                        |
>> >       4096 -> 8191       : 7        |                                        |
>> >       8192 -> 16383      : 12       |*                                       |
>> >      16384 -> 32767      : 30       |***                                     |
>> >      32768 -> 65535      : 21       |**                                      |
>> >      65536 -> 131071     : 15       |*                                       |
>> >     131072 -> 262143     : 27       |***                                     |
>> >     262144 -> 524287     : 84       |**********                              |
>> >     524288 -> 1048575    : 203      |************************                |
>> >    1048576 -> 2097151    : 284      |**********************************      |
>> >    2097152 -> 4194303    : 327      |*************************************** |
>> >    4194304 -> 8388607    : 215      |*************************               |
>> >    8388608 -> 16777215   : 116      |*************                           |
>> >   16777216 -> 33554431   : 47       |*****                                   |
>> >   33554432 -> 67108863   : 8        |                                        |
>> >   67108864 -> 134217727  : 3        |                                        |
>> >
>> > The latency can reach tens of milliseconds.
>> >
>> > Experimenting
>> > =============
>> >
>> > vm.percpu_pagelist_high_fraction
>> > --------------------------------
>> >
>> > The kernel version currently deployed in our production environment is the
>> > stable 6.1.y, and my initial strategy involves optimizing the
>>
>> IMHO, we should focus on upstream activity in the cover letter and patch
>> description.  And I don't think that it's necessary to describe the
>> alternative solution with too much details.
>>
>> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> > page draining, which subsequently leads to a substantial reduction in
>> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
>> > improvement in latency.
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 120      |                                        |
>> >        256 -> 511        : 365      |*                                       |
>> >        512 -> 1023       : 201      |                                        |
>> >       1024 -> 2047       : 103      |                                        |
>> >       2048 -> 4095       : 84       |                                        |
>> >       4096 -> 8191       : 87       |                                        |
>> >       8192 -> 16383      : 4777     |**************                          |
>> >      16384 -> 32767      : 10572    |*******************************         |
>> >      32768 -> 65535      : 13544    |****************************************|
>> >      65536 -> 131071     : 12723    |*************************************   |
>> >     131072 -> 262143     : 8604     |*************************               |
>> >     262144 -> 524287     : 3659     |**********                              |
>> >     524288 -> 1048575    : 921      |**                                      |
>> >    1048576 -> 2097151    : 122      |                                        |
>> >    2097152 -> 4194303    : 5        |                                        |
>> >
>> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> > pcp high watermark size to a minimum of four times the batch size. While
>> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> > have yet to observe any significant difference in throughput within our
>> > production environment after implementing this change.
>> >
>> > Backporting the series "mm: PCP high auto-tuning"
>> > -------------------------------------------------
>>
>> Again, not upstream activity.  We can describe the upstream behavior
>> directly.
>
> Andrew has requested that I provide a more comprehensive analysis of
> this issue, and in response, I have endeavored to outline all the
> pertinent details in a thorough and detailed manner.

IMHO, upstream activity can provide a comprehensive analysis of the
issue too.  And your patch has changed a lot since the first version.
It's better to describe your current version.

>>
>> > My second endeavor was to backport the series titled
>> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> > production environment, I noted a pronounced reduction in latency. The
>> > observed outcomes are as enumerated below:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 0        |                                        |
>> >        512 -> 1023       : 0        |                                        |
>> >       1024 -> 2047       : 2        |                                        |
>> >       2048 -> 4095       : 11       |                                        |
>> >       4096 -> 8191       : 3        |                                        |
>> >       8192 -> 16383      : 1        |                                        |
>> >      16384 -> 32767      : 2        |                                        |
>> >      32768 -> 65535      : 7        |                                        |
>> >      65536 -> 131071     : 198      |*********                               |
>> >     131072 -> 262143     : 530      |************************                |
>> >     262144 -> 524287     : 824      |**************************************  |
>> >     524288 -> 1048575    : 852      |****************************************|
>> >    1048576 -> 2097151    : 714      |*********************************       |
>> >    2097152 -> 4194303    : 389      |******************                      |
>> >    4194304 -> 8388607    : 143      |******                                  |
>> >    8388608 -> 16777215   : 29       |*                                       |
>> >   16777216 -> 33554431   : 1        |                                        |
>> >
>> > Compared to the previous data, the maximum latency has been reduced to
>> > less than 30ms.
>>
>> People don't care too much about page freeing latency during processes
>> exiting.  Instead, they care more about the process exiting time, that
>> is, throughput.  So, it's better to show the page allocation latency
>> which is affected by the simultaneous processes exiting.
>
> I'm confused also. Is this issue really hard to understand ?

IMHO, it's better to prove the issue directly.  If you cannot prove it
directly, you can try an alternative approach and describe why.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  2:21     ` Yafang Shao
@ 2024-07-11  6:42       ` Huang, Ying
  2024-07-11  7:25         ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-11  6:42 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> > quickly experimenting with specific workloads in a production environment,
>> > particularly when monitoring latency spikes caused by contention on the
>> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> > is introduced as a more practical alternative.
>>
>> In general, I'm neutral to the change.  I can understand that kernel
>> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> too.
>>
>> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> > have been proposed. One approach involves dividing large zones into multi
>> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> > from relying solely on zone_id to identify the range of free lists a
>> > particular page belongs to[1]. However, implementing these solutions is
>> > likely to necessitate a more extended development effort.
>>
>> Per my understanding, the change will hurt instead of improve zone->lock
>> contention.  Instead, it will reduce page allocation/freeing latency.
>
> I'm quite perplexed by your recent comment. You introduced a
> configuration that has proven to be difficult to use, and you have
> been resistant to suggestions for modifying it to a more user-friendly
> and practical tuning approach. May I inquire about the rationale
> behind introducing this configuration in the beginning?

Sorry, I don't understand your words.  Do you need me to explain what
"neutral" means?

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  6:38     ` Huang, Ying
@ 2024-07-11  7:21       ` Yafang Shao
  2024-07-11  8:36         ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-11  7:21 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm

On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > Background
> >> > ==========
> >> >
> >> > In our containerized environment, we have a specific type of container
> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> > processes are organized as separate processes rather than threads due
> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> > multi-threaded setup. Upon the exit of these containers, other
> >> > containers hosted on the same machine experience significant latency
> >> > spikes.
> >> >
> >> > Investigation
> >> > =============
> >> >
> >> > My investigation using perf tracing revealed that the root cause of
> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> > the exiting processes. This concurrent access to the zone->lock
> >> > results in contention, which becomes a hotspot and negatively impacts
> >> > performance. The perf results clearly indicate this contention as a
> >> > primary contributor to the observed latency issues.
> >> >
> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
> >> >    - 76.97% exit_mmap
> >> >       - 58.58% unmap_vmas
> >> >          - 58.55% unmap_single_vma
> >> >             - unmap_page_range
> >> >                - 58.32% zap_pte_range
> >> >                   - 42.88% tlb_flush_mmu
> >> >                      - 42.76% free_pages_and_swap_cache
> >> >                         - 41.22% release_pages
> >> >                            - 33.29% free_unref_page_list
> >> >                               - 32.37% free_unref_page_commit
> >> >                                  - 31.64% free_pcppages_bulk
> >> >                                     + 28.65% _raw_spin_lock
> >> >                                       1.28% __list_del_entry_valid
> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >                              0.60% __mod_lruvec_state
> >> >                           1.07% free_swap_cache
> >> >                   + 11.69% page_remove_rmap
> >> >                     0.64% __mod_lruvec_page_state
> >> >       - 17.34% remove_vma
> >> >          - 17.25% vm_area_free
> >> >             - 17.23% kmem_cache_free
> >> >                - 17.15% __slab_free
> >> >                   - 14.56% discard_slab
> >> >                        free_slab
> >> >                        __free_slab
> >> >                        __free_pages
> >> >                      - free_unref_page
> >> >                         - 13.50% free_unref_page_commit
> >> >                            - free_pcppages_bulk
> >> >                               + 13.44% _raw_spin_lock
> >>
> >> I don't think your change will reduce zone->lock contention cycles.  So,
> >> I don't find the value of the above data.
> >>
> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> >> > with the majority of them being regular order-0 user pages.
> >> >
> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> >> > e=1
> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >  => free_pcppages_bulk
> >> >  => free_unref_page_commit
> >> >  => free_unref_page_list
> >> >  => release_pages
> >> >  => free_pages_and_swap_cache
> >> >  => tlb_flush_mmu
> >> >  => zap_pte_range
> >> >  => unmap_page_range
> >> >  => unmap_single_vma
> >> >  => unmap_vmas
> >> >  => exit_mmap
> >> >  => mmput
> >> >  => do_exit
> >> >  => do_group_exit
> >> >  => get_signal
> >> >  => arch_do_signal_or_restart
> >> >  => exit_to_user_mode_prepare
> >> >  => syscall_exit_to_user_mode
> >> >  => do_syscall_64
> >> >  => entry_SYSCALL_64_after_hwframe
> >> >
> >> > The servers experiencing these issues are equipped with impressive
> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >
> >> > Node 0, zone   Normal
> >> >   pages free     144465775
> >> >         boost    0
> >> >         min      1309270
> >> >         low      1636587
> >> >         high     1963904
> >> >         spanned  564133888
> >> >         present  296747008
> >> >         managed  291974346
> >> >         cma      0
> >> >         protection: (0, 0, 0, 0)
> >> > ...
> >> >   pagesets
> >> >     cpu: 0
> >> >               count: 2217
> >> >               high:  6392
> >> >               batch: 63
> >> >   vm stats threshold: 125
> >> >     cpu: 1
> >> >               count: 4510
> >> >               high:  6392
> >> >               batch: 63
> >> >   vm stats threshold: 125
> >> >     cpu: 2
> >> >               count: 3059
> >> >               high:  6392
> >> >               batch: 63
> >> >
> >> > ...
> >> >
> >> > The pcp high is around 100 times the batch size.
> >> >
> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> > function during the container exit process:
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 0        |                                        |
> >> >        256 -> 511        : 148      |*****************                       |
> >> >        512 -> 1023       : 334      |****************************************|
> >> >       1024 -> 2047       : 33       |***                                     |
> >> >       2048 -> 4095       : 5        |                                        |
> >> >       4096 -> 8191       : 7        |                                        |
> >> >       8192 -> 16383      : 12       |*                                       |
> >> >      16384 -> 32767      : 30       |***                                     |
> >> >      32768 -> 65535      : 21       |**                                      |
> >> >      65536 -> 131071     : 15       |*                                       |
> >> >     131072 -> 262143     : 27       |***                                     |
> >> >     262144 -> 524287     : 84       |**********                              |
> >> >     524288 -> 1048575    : 203      |************************                |
> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >
> >> > The latency can reach tens of milliseconds.
> >> >
> >> > Experimenting
> >> > =============
> >> >
> >> > vm.percpu_pagelist_high_fraction
> >> > --------------------------------
> >> >
> >> > The kernel version currently deployed in our production environment is the
> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >>
> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> description.  And I don't think that it's necessary to describe the
> >> alternative solution with too much details.
> >>
> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> > page draining, which subsequently leads to a substantial reduction in
> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> > improvement in latency.
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 120      |                                        |
> >> >        256 -> 511        : 365      |*                                       |
> >> >        512 -> 1023       : 201      |                                        |
> >> >       1024 -> 2047       : 103      |                                        |
> >> >       2048 -> 4095       : 84       |                                        |
> >> >       4096 -> 8191       : 87       |                                        |
> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >
> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> > have yet to observe any significant difference in throughput within our
> >> > production environment after implementing this change.
> >> >
> >> > Backporting the series "mm: PCP high auto-tuning"
> >> > -------------------------------------------------
> >>
> >> Again, not upstream activity.  We can describe the upstream behavior
> >> directly.
> >
> > Andrew has requested that I provide a more comprehensive analysis of
> > this issue, and in response, I have endeavored to outline all the
> > pertinent details in a thorough and detailed manner.
>
> IMHO, upstream activity can provide comprehensive analysis of the issue
> too.  And, your patch has changed much from the first version.  It's
> better to describe your current version.

After backporting the PCP auto-tuning feature to the 6.1.y branch, the
code is almost the same as the upstream kernel with respect to the PCP
logic. I have thoroughly documented the data showing the effect of the
backported series, which should give a clear picture of the results.
However, I am unable to run the upstream kernel directly in our
production environment due to practical constraints.

>
> >>
> >> > My second endeavor was to backport the series titled
> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> > production environment, I noted a pronounced reduction in latency. The
> >> > observed outcomes are as enumerated below:
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 0        |                                        |
> >> >        256 -> 511        : 0        |                                        |
> >> >        512 -> 1023       : 0        |                                        |
> >> >       1024 -> 2047       : 2        |                                        |
> >> >       2048 -> 4095       : 11       |                                        |
> >> >       4096 -> 8191       : 3        |                                        |
> >> >       8192 -> 16383      : 1        |                                        |
> >> >      16384 -> 32767      : 2        |                                        |
> >> >      32768 -> 65535      : 7        |                                        |
> >> >      65536 -> 131071     : 198      |*********                               |
> >> >     131072 -> 262143     : 530      |************************                |
> >> >     262144 -> 524287     : 824      |**************************************  |
> >> >     524288 -> 1048575    : 852      |****************************************|
> >> >    1048576 -> 2097151    : 714      |*********************************       |
> >> >    2097152 -> 4194303    : 389      |******************                      |
> >> >    4194304 -> 8388607    : 143      |******                                  |
> >> >    8388608 -> 16777215   : 29       |*                                       |
> >> >   16777216 -> 33554431   : 1        |                                        |
> >> >
> >> > Compared to the previous data, the maximum latency has been reduced to
> >> > less than 30ms.
> >>
> >> People don't care too much about page freeing latency during processes
> >> exiting.  Instead, they care more about the process exiting time, that
> >> is, throughput.  So, it's better to show the page allocation latency
> >> which is affected by the simultaneous processes exiting.
> >
> > I'm confused also. Is this issue really hard to understand ?
>
> IMHO, it's better to prove the issue directly.  If you cannot prove it
> directly, you can try alternative one and describe why.

Not all data can be collected easily.  The primary concern is the
zone->lock contention, which means measuring the latency it introduces.
free_pcppages_bulk() is a convenient place to observe that latency, so I
chose to measure free_pcppages_bulk() specifically.

The reason I did not measure allocation latency is that it would require
a victim workload willing to tolerate the resulting delays, and nobody
volunteered for that.  Measuring free_pcppages_bulk(), in contrast, only
requires identifying and experimenting with the workload that causes the
delays, which makes it far more practical.
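
For reference, a measurement like this needs no kernel patch.  A minimal
kretprobe-based sketch is below (illustrative only: the 10ms threshold and
the reporting are arbitrary, and it assumes free_pcppages_bulk() is not
inlined, which the perf data above suggests):

/* Illustrative sketch of a free_pcppages_bulk() latency probe.  Any
 * tracer that can histogram function latency would work as well. */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ktime.h>

struct fpb_data {
	ktime_t entry;
};

static int fpb_entry(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	struct fpb_data *d = (struct fpb_data *)ri->data;

	d->entry = ktime_get();
	return 0;
}

static int fpb_ret(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	struct fpb_data *d = (struct fpb_data *)ri->data;
	s64 delta = ktime_to_ns(ktime_sub(ktime_get(), d->entry));

	/* Report only outliers; the 10ms threshold is arbitrary. */
	if (delta > 10 * NSEC_PER_MSEC)
		pr_info("free_pcppages_bulk: %lld ns\n", delta);
	return 0;
}

static struct kretprobe fpb_probe = {
	.kp.symbol_name	= "free_pcppages_bulk",
	.entry_handler	= fpb_entry,
	.handler	= fpb_ret,
	.data_size	= sizeof(struct fpb_data),
	.maxactive	= 256,
};

static int __init fpb_init(void)
{
	return register_kretprobe(&fpb_probe);
}

static void __exit fpb_exit(void)
{
	unregister_kretprobe(&fpb_probe);
}

module_init(fpb_init);
module_exit(fpb_exit);
MODULE_LICENSE("GPL");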


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  6:42       ` Huang, Ying
@ 2024-07-11  7:25         ` Yafang Shao
  2024-07-11  8:18           ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-11  7:25 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> > quickly experimenting with specific workloads in a production environment,
> >> > particularly when monitoring latency spikes caused by contention on the
> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> > is introduced as a more practical alternative.
> >>
> >> In general, I'm neutral to the change.  I can understand that kernel
> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> too.
> >>
> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> > have been proposed. One approach involves dividing large zones into multi
> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> > from relying solely on zone_id to identify the range of free lists a
> >> > particular page belongs to[1]. However, implementing these solutions is
> >> > likely to necessitate a more extended development effort.
> >>
> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >
> > I'm quite perplexed by your recent comment. You introduced a
> > configuration that has proven to be difficult to use, and you have
> > been resistant to suggestions for modifying it to a more user-friendly
> > and practical tuning approach. May I inquire about the rationale
> > behind introducing this configuration in the beginning?
>
> Sorry, I don't understand your words.  Do you need me to explain what is
> "neutral"?

No, thanks.
After consulting ChatGPT, I received a clear explanation of what "neutral"
means, so I understand the concept well enough now.

So, can you explain why you introduced it as a Kconfig option in the first
place?

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  7:25         ` Yafang Shao
@ 2024-07-11  8:18           ` Huang, Ying
  2024-07-11  9:51             ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-11  8:18 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> > quickly experimenting with specific workloads in a production environment,
>> >> > particularly when monitoring latency spikes caused by contention on the
>> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> > is introduced as a more practical alternative.
>> >>
>> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> too.
>> >>
>> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> > have been proposed. One approach involves dividing large zones into multi
>> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> > from relying solely on zone_id to identify the range of free lists a
>> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> > likely to necessitate a more extended development effort.
>> >>
>> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >
>> > I'm quite perplexed by your recent comment. You introduced a
>> > configuration that has proven to be difficult to use, and you have
>> > been resistant to suggestions for modifying it to a more user-friendly
>> > and practical tuning approach. May I inquire about the rationale
>> > behind introducing this configuration in the beginning?
>>
>> Sorry, I don't understand your words.  Do you need me to explain what is
>> "neutral"?
>
> No, thanks.
> After consulting with ChatGPT, I received a clear and comprehensive
> explanation of what "neutral" means, providing me with a better
> understanding of the concept.
>
> So, can you explain why you introduced it as a config in the beginning ?

I think I explained that in the log of commit 52166607ecc9 ("mm: restrict
the pcp batch scale factor to avoid too long latency"), which introduced
the config.

A sysctl knob is ABI, which needs to be maintained forever.  Can you
explain why you need it?  Why can't you use a fixed value after the
initial experiments?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  7:21       ` Yafang Shao
@ 2024-07-11  8:36         ` Huang, Ying
  2024-07-11  9:40           ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-11  8:36 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm

Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > Background
>> >> > ==========
>> >> >
>> >> > In our containerized environment, we have a specific type of container
>> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> > processes are organized as separate processes rather than threads due
>> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> > containers hosted on the same machine experience significant latency
>> >> > spikes.
>> >> >
>> >> > Investigation
>> >> > =============
>> >> >
>> >> > My investigation using perf tracing revealed that the root cause of
>> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> > the exiting processes. This concurrent access to the zone->lock
>> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> > performance. The perf results clearly indicate this contention as a
>> >> > primary contributor to the observed latency issues.
>> >> >
>> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
>> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>> >> >    - 76.97% exit_mmap
>> >> >       - 58.58% unmap_vmas
>> >> >          - 58.55% unmap_single_vma
>> >> >             - unmap_page_range
>> >> >                - 58.32% zap_pte_range
>> >> >                   - 42.88% tlb_flush_mmu
>> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >                         - 41.22% release_pages
>> >> >                            - 33.29% free_unref_page_list
>> >> >                               - 32.37% free_unref_page_commit
>> >> >                                  - 31.64% free_pcppages_bulk
>> >> >                                     + 28.65% _raw_spin_lock
>> >> >                                       1.28% __list_del_entry_valid
>> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >                              0.60% __mod_lruvec_state
>> >> >                           1.07% free_swap_cache
>> >> >                   + 11.69% page_remove_rmap
>> >> >                     0.64% __mod_lruvec_page_state
>> >> >       - 17.34% remove_vma
>> >> >          - 17.25% vm_area_free
>> >> >             - 17.23% kmem_cache_free
>> >> >                - 17.15% __slab_free
>> >> >                   - 14.56% discard_slab
>> >> >                        free_slab
>> >> >                        __free_slab
>> >> >                        __free_pages
>> >> >                      - free_unref_page
>> >> >                         - 13.50% free_unref_page_commit
>> >> >                            - free_pcppages_bulk
>> >> >                               + 13.44% _raw_spin_lock
>> >>
>> >> I don't think your change will reduce zone->lock contention cycles.  So,
>> >> I don't find the value of the above data.
>> >>
>> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
>> >> > with the majority of them being regular order-0 user pages.
>> >> >
>> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> >> > e=1
>> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >  => free_pcppages_bulk
>> >> >  => free_unref_page_commit
>> >> >  => free_unref_page_list
>> >> >  => release_pages
>> >> >  => free_pages_and_swap_cache
>> >> >  => tlb_flush_mmu
>> >> >  => zap_pte_range
>> >> >  => unmap_page_range
>> >> >  => unmap_single_vma
>> >> >  => unmap_vmas
>> >> >  => exit_mmap
>> >> >  => mmput
>> >> >  => do_exit
>> >> >  => do_group_exit
>> >> >  => get_signal
>> >> >  => arch_do_signal_or_restart
>> >> >  => exit_to_user_mode_prepare
>> >> >  => syscall_exit_to_user_mode
>> >> >  => do_syscall_64
>> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >
>> >> > The servers experiencing these issues are equipped with impressive
>> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >
>> >> > Node 0, zone   Normal
>> >> >   pages free     144465775
>> >> >         boost    0
>> >> >         min      1309270
>> >> >         low      1636587
>> >> >         high     1963904
>> >> >         spanned  564133888
>> >> >         present  296747008
>> >> >         managed  291974346
>> >> >         cma      0
>> >> >         protection: (0, 0, 0, 0)
>> >> > ...
>> >> >   pagesets
>> >> >     cpu: 0
>> >> >               count: 2217
>> >> >               high:  6392
>> >> >               batch: 63
>> >> >   vm stats threshold: 125
>> >> >     cpu: 1
>> >> >               count: 4510
>> >> >               high:  6392
>> >> >               batch: 63
>> >> >   vm stats threshold: 125
>> >> >     cpu: 2
>> >> >               count: 3059
>> >> >               high:  6392
>> >> >               batch: 63
>> >> >
>> >> > ...
>> >> >
>> >> > The pcp high is around 100 times the batch size.
>> >> >
>> >> > I also traced the latency associated with the free_pcppages_bulk()
>> >> > function during the container exit process:
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 148      |*****************                       |
>> >> >        512 -> 1023       : 334      |****************************************|
>> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >       2048 -> 4095       : 5        |                                        |
>> >> >       4096 -> 8191       : 7        |                                        |
>> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >
>> >> > The latency can reach tens of milliseconds.
>> >> >
>> >> > Experimenting
>> >> > =============
>> >> >
>> >> > vm.percpu_pagelist_high_fraction
>> >> > --------------------------------
>> >> >
>> >> > The kernel version currently deployed in our production environment is the
>> >> > stable 6.1.y, and my initial strategy involves optimizing the
>> >>
>> >> IMHO, we should focus on upstream activity in the cover letter and patch
>> >> description.  And I don't think that it's necessary to describe the
>> >> alternative solution with too much details.
>> >>
>> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> >> > page draining, which subsequently leads to a substantial reduction in
>> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
>> >> > improvement in latency.
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 120      |                                        |
>> >> >        256 -> 511        : 365      |*                                       |
>> >> >        512 -> 1023       : 201      |                                        |
>> >> >       1024 -> 2047       : 103      |                                        |
>> >> >       2048 -> 4095       : 84       |                                        |
>> >> >       4096 -> 8191       : 87       |                                        |
>> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >
>> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> >> > pcp high watermark size to a minimum of four times the batch size. While
>> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> >> > have yet to observe any significant difference in throughput within our
>> >> > production environment after implementing this change.
>> >> >
>> >> > Backporting the series "mm: PCP high auto-tuning"
>> >> > -------------------------------------------------
>> >>
>> >> Again, not upstream activity.  We can describe the upstream behavior
>> >> directly.
>> >
>> > Andrew has requested that I provide a more comprehensive analysis of
>> > this issue, and in response, I have endeavored to outline all the
>> > pertinent details in a thorough and detailed manner.
>>
>> IMHO, upstream activity can provide comprehensive analysis of the issue
>> too.  And, your patch has changed much from the first version.  It's
>> better to describe your current version.
>
> After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> code is almost the same with the upstream kernel wrt the pcp. I have
> thoroughly documented the detailed data showcasing the changes in the
> backported version, providing a clear picture of the results. However,
> it's crucial to note that I am unable to directly run the upstream
> kernel on our production environment due to practical constraints.

IMHO, the patch is for the upstream kernel, not some downstream kernel, so
the focus should be on the upstream kernel: what the issue is there and how
to resolve it.  The production environment test results can be used to
support the upstream change.

>> >> > My second endeavor was to backport the series titled
>> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> >> > production environment, I noted a pronounced reduction in latency. The
>> >> > observed outcomes are as enumerated below:
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 0        |                                        |
>> >> >        512 -> 1023       : 0        |                                        |
>> >> >       1024 -> 2047       : 2        |                                        |
>> >> >       2048 -> 4095       : 11       |                                        |
>> >> >       4096 -> 8191       : 3        |                                        |
>> >> >       8192 -> 16383      : 1        |                                        |
>> >> >      16384 -> 32767      : 2        |                                        |
>> >> >      32768 -> 65535      : 7        |                                        |
>> >> >      65536 -> 131071     : 198      |*********                               |
>> >> >     131072 -> 262143     : 530      |************************                |
>> >> >     262144 -> 524287     : 824      |**************************************  |
>> >> >     524288 -> 1048575    : 852      |****************************************|
>> >> >    1048576 -> 2097151    : 714      |*********************************       |
>> >> >    2097152 -> 4194303    : 389      |******************                      |
>> >> >    4194304 -> 8388607    : 143      |******                                  |
>> >> >    8388608 -> 16777215   : 29       |*                                       |
>> >> >   16777216 -> 33554431   : 1        |                                        |
>> >> >
>> >> > Compared to the previous data, the maximum latency has been reduced to
>> >> > less than 30ms.
>> >>
>> >> People don't care too much about page freeing latency during processes
>> >> exiting.  Instead, they care more about the process exiting time, that
>> >> is, throughput.  So, it's better to show the page allocation latency
>> >> which is affected by the simultaneous processes exiting.
>> >
>> > I'm confused also. Is this issue really hard to understand ?
>>
>> IMHO, it's better to prove the issue directly.  If you cannot prove it
>> directly, you can try alternative one and describe why.
>
> Not all data can be verified straightforwardly or effortlessly. The
> primary focus lies in the zone->lock contention, which necessitates
> measuring the latency it incurs. To accomplish this, the
> free_pcppages_bulk() function serves as an effective tool for
> evaluation. Therefore, I have opted to specifically measure the
> latency associated with free_pcppages_bulk().
>
> The rationale behind not measuring allocation latency is due to the
> necessity of finding a willing participant to endure potential delays,
> a task that proved unsuccessful as no one expressed interest. In
> contrast, assessing free_pcppages_bulk()'s latency solely requires
> identifying and experimenting with the source causing the delays,
> making it a more feasible approach.

Can you run a benchmark program that does a fair amount of memory
allocation yourself to test it?
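
A minimal sketch of such a probe could look like the following (the 2MB
chunk size and iteration count are arbitrary; it simply reports the slowest
anonymous-memory fault-in while the exiting containers run alongside):

/* Illustrative allocation-latency probe: fault in anonymous memory in
 * fixed-size chunks and report the slowest chunk, as a proxy for the
 * page allocation latency seen by co-located workloads. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define CHUNK	(2UL << 20)	/* 2MB per sample */
#define ROUNDS	10000

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	long long worst = 0;

	for (int i = 0; i < ROUNDS; i++) {
		char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		long long t0 = now_ns();
		memset(p, 1, CHUNK);	/* fault in and allocate the pages */
		long long dt = now_ns() - t0;

		if (dt > worst)
			worst = dt;
		munmap(p, CHUNK);	/* free the pages back via the pcp */
	}
	printf("worst %luMB fault-in: %lld ns\n", CHUNK >> 20, worst);
	return 0;
}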

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  8:36         ` Huang, Ying
@ 2024-07-11  9:40           ` Yafang Shao
  2024-07-11 11:03             ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-11  9:40 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm

On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > Background
> >> >> > ==========
> >> >> >
> >> >> > In our containerized environment, we have a specific type of container
> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> > processes are organized as separate processes rather than threads due
> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> > containers hosted on the same machine experience significant latency
> >> >> > spikes.
> >> >> >
> >> >> > Investigation
> >> >> > =============
> >> >> >
> >> >> > My investigation using perf tracing revealed that the root cause of
> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> > primary contributor to the observed latency issues.
> >> >> >
> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
> >> >> >    - 76.97% exit_mmap
> >> >> >       - 58.58% unmap_vmas
> >> >> >          - 58.55% unmap_single_vma
> >> >> >             - unmap_page_range
> >> >> >                - 58.32% zap_pte_range
> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >                         - 41.22% release_pages
> >> >> >                            - 33.29% free_unref_page_list
> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >                              0.60% __mod_lruvec_state
> >> >> >                           1.07% free_swap_cache
> >> >> >                   + 11.69% page_remove_rmap
> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >       - 17.34% remove_vma
> >> >> >          - 17.25% vm_area_free
> >> >> >             - 17.23% kmem_cache_free
> >> >> >                - 17.15% __slab_free
> >> >> >                   - 14.56% discard_slab
> >> >> >                        free_slab
> >> >> >                        __free_slab
> >> >> >                        __free_pages
> >> >> >                      - free_unref_page
> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >                            - free_pcppages_bulk
> >> >> >                               + 13.44% _raw_spin_lock
> >> >>
> >> >> I don't think your change will reduce zone->lock contention cycles.  So,
> >> >> I don't find the value of the above data.
> >> >>
> >> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> >> >> > with the majority of them being regular order-0 user pages.
> >> >> >
> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> >> >> > e=1
> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >> >  => free_pcppages_bulk
> >> >> >  => free_unref_page_commit
> >> >> >  => free_unref_page_list
> >> >> >  => release_pages
> >> >> >  => free_pages_and_swap_cache
> >> >> >  => tlb_flush_mmu
> >> >> >  => zap_pte_range
> >> >> >  => unmap_page_range
> >> >> >  => unmap_single_vma
> >> >> >  => unmap_vmas
> >> >> >  => exit_mmap
> >> >> >  => mmput
> >> >> >  => do_exit
> >> >> >  => do_group_exit
> >> >> >  => get_signal
> >> >> >  => arch_do_signal_or_restart
> >> >> >  => exit_to_user_mode_prepare
> >> >> >  => syscall_exit_to_user_mode
> >> >> >  => do_syscall_64
> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >
> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >
> >> >> > Node 0, zone   Normal
> >> >> >   pages free     144465775
> >> >> >         boost    0
> >> >> >         min      1309270
> >> >> >         low      1636587
> >> >> >         high     1963904
> >> >> >         spanned  564133888
> >> >> >         present  296747008
> >> >> >         managed  291974346
> >> >> >         cma      0
> >> >> >         protection: (0, 0, 0, 0)
> >> >> > ...
> >> >> >   pagesets
> >> >> >     cpu: 0
> >> >> >               count: 2217
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 1
> >> >> >               count: 4510
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 2
> >> >> >               count: 3059
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >
> >> >> > ...
> >> >> >
> >> >> > The pcp high is around 100 times the batch size.
> >> >> >
> >> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> >> > function during the container exit process:
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >
> >> >> > The latency can reach tens of milliseconds.
> >> >> >
> >> >> > Experimenting
> >> >> > =============
> >> >> >
> >> >> > vm.percpu_pagelist_high_fraction
> >> >> > --------------------------------
> >> >> >
> >> >> > The kernel version currently deployed in our production environment is the
> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >> >>
> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> >> description.  And I don't think that it's necessary to describe the
> >> >> alternative solution with too much details.
> >> >>
> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> >> > page draining, which subsequently leads to a substantial reduction in
> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> >> > improvement in latency.
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >
> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> >> > have yet to observe any significant difference in throughput within our
> >> >> > production environment after implementing this change.
> >> >> >
> >> >> > Backporting the series "mm: PCP high auto-tuning"
> >> >> > -------------------------------------------------
> >> >>
> >> >> Again, not upstream activity.  We can describe the upstream behavior
> >> >> directly.
> >> >
> >> > Andrew has requested that I provide a more comprehensive analysis of
> >> > this issue, and in response, I have endeavored to outline all the
> >> > pertinent details in a thorough and detailed manner.
> >>
> >> IMHO, upstream activity can provide comprehensive analysis of the issue
> >> too.  And, your patch has changed much from the first version.  It's
> >> better to describe your current version.
> >
> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> > code is almost the same with the upstream kernel wrt the pcp. I have
> > thoroughly documented the detailed data showcasing the changes in the
> > backported version, providing a clear picture of the results. However,
> > it's crucial to note that I am unable to directly run the upstream
> > kernel on our production environment due to practical constraints.
>
> IMHO, the patch is for upstream kernel, not some downstream kernel, so
> focus should be the upstream activity.  The issue of the upstream
> kernel, and how to resolve it.  The production environment test results
> can be used to support the upstream change.

The only difference in the pcp code between 6.1.y and the upstream kernel
is the changes made by you, and those changes have now been successfully
backported.  What else do you expect me to do?

>
> >> >> > My second endeavor was to backport the series titled
> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> >> > production environment, I noted a pronounced reduction in latency. The
> >> >> > observed outcomes are as enumerated below:
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >        512 -> 1023       : 0        |                                        |
> >> >> >       1024 -> 2047       : 2        |                                        |
> >> >> >       2048 -> 4095       : 11       |                                        |
> >> >> >       4096 -> 8191       : 3        |                                        |
> >> >> >       8192 -> 16383      : 1        |                                        |
> >> >> >      16384 -> 32767      : 2        |                                        |
> >> >> >      32768 -> 65535      : 7        |                                        |
> >> >> >      65536 -> 131071     : 198      |*********                               |
> >> >> >     131072 -> 262143     : 530      |************************                |
> >> >> >     262144 -> 524287     : 824      |**************************************  |
> >> >> >     524288 -> 1048575    : 852      |****************************************|
> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
> >> >> >    2097152 -> 4194303    : 389      |******************                      |
> >> >> >    4194304 -> 8388607    : 143      |******                                  |
> >> >> >    8388608 -> 16777215   : 29       |*                                       |
> >> >> >   16777216 -> 33554431   : 1        |                                        |
> >> >> >
> >> >> > Compared to the previous data, the maximum latency has been reduced to
> >> >> > less than 30ms.
> >> >>
> >> >> People don't care too much about page freeing latency during processes
> >> >> exiting.  Instead, they care more about the process exiting time, that
> >> >> is, throughput.  So, it's better to show the page allocation latency
> >> >> which is affected by the simultaneous processes exiting.
> >> >
> >> > I'm confused also. Is this issue really hard to understand ?
> >>
> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
> >> directly, you can try alternative one and describe why.
> >
> > Not all data can be verified straightforwardly or effortlessly. The
> > primary focus lies in the zone->lock contention, which necessitates
> > measuring the latency it incurs. To accomplish this, the
> > free_pcppages_bulk() function serves as an effective tool for
> > evaluation. Therefore, I have opted to specifically measure the
> > latency associated with free_pcppages_bulk().
> >
> > The rationale behind not measuring allocation latency is due to the
> > necessity of finding a willing participant to endure potential delays,
> > a task that proved unsuccessful as no one expressed interest. In
> > contrast, assessing free_pcppages_bulk()'s latency solely requires
> > identifying and experimenting with the source causing the delays,
> > making it a more feasible approach.
>
> Can you run a benchmark program that do quite some memory allocation by
> yourself to test it?

I can give it a try.
However, is that the key point here?  Why can't the lock contention be
measured on the freeing path?


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  8:18           ` Huang, Ying
@ 2024-07-11  9:51             ` Yafang Shao
  2024-07-11 10:49               ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-11  9:51 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> > is introduced as a more practical alternative.
> >> >>
> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> too.
> >> >>
> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> > likely to necessitate a more extended development effort.
> >> >>
> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >
> >> > I'm quite perplexed by your recent comment. You introduced a
> >> > configuration that has proven to be difficult to use, and you have
> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> > and practical tuning approach. May I inquire about the rationale
> >> > behind introducing this configuration in the beginning?
> >>
> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> "neutral"?
> >
> > No, thanks.
> > After consulting with ChatGPT, I received a clear and comprehensive
> > explanation of what "neutral" means, providing me with a better
> > understanding of the concept.
> >
> > So, can you explain why you introduced it as a config in the beginning ?
>
> I think that I have explained it in the commit log of commit
> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> latency").  Which introduces the config.

What specifically do you expect users to do with this config in real
production workloads?

>
> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> explain why you need it?  Why cannot you use a fixed value after initial
> experiments.

Our production environment has hundreds of thousands of servers.  Only a
few thousand of them need a non-default value for this setting; the rest
should keep the default.  How do you propose we manage that with a
build-time config?  Is it feasible to recompile and roll out a new kernel
for every workload where the default value falls short?  Surely there are
more practical approaches we can explore together to ensure good
performance across all workloads.

When making improvements or modifications, please make sure they are not
confined to a test or lab environment.  It's vital to also consider actual
users and the diverse workloads they run in their daily operations.
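
To be concrete about the operational difference: a sysctl value can be
flipped per host at runtime, while the config requires building and rolling
out a new kernel for each experiment.  A rough sketch of how such a knob
could be registered (illustrative only, not necessarily the actual patch):

/* Illustrative only, not necessarily the actual patch: registering a
 * runtime vm.pcp_batch_scale_max knob so the value can be changed per
 * host without rebuilding the kernel. */
#include <linux/init.h>
#include <linux/sysctl.h>

static int pcp_batch_scale_max = 5;	/* matches the Kconfig default */
static int pcp_batch_scale_lo;		/* 0 */
static int pcp_batch_scale_hi = 6;

static struct ctl_table pcp_batch_sysctl[] = {
	{
		.procname	= "pcp_batch_scale_max",
		.data		= &pcp_batch_scale_max,
		.maxlen		= sizeof(pcp_batch_scale_max),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &pcp_batch_scale_lo,
		.extra2		= &pcp_batch_scale_hi,
	},
	{ }
};

static int __init pcp_batch_sysctl_init(void)
{
	register_sysctl_init("vm", pcp_batch_sysctl);
	return 0;
}
late_initcall(pcp_batch_sysctl_init);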


--
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  9:51             ` Yafang Shao
@ 2024-07-11 10:49               ` Huang, Ying
  2024-07-11 12:45                 ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-11 10:49 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> > is introduced as a more practical alternative.
>> >> >>
>> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> too.
>> >> >>
>> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> > likely to necessitate a more extended development effort.
>> >> >>
>> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >
>> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> > configuration that has proven to be difficult to use, and you have
>> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> > and practical tuning approach. May I inquire about the rationale
>> >> > behind introducing this configuration in the beginning?
>> >>
>> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> "neutral"?
>> >
>> > No, thanks.
>> > After consulting with ChatGPT, I received a clear and comprehensive
>> > explanation of what "neutral" means, providing me with a better
>> > understanding of the concept.
>> >
>> > So, can you explain why you introduced it as a config in the beginning ?
>>
>> I think that I have explained it in the commit log of commit
>> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> latency").  Which introduces the config.
>
> What specifically are your expectations for how users should utilize
> this config in real production workload?
>
>>
>> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> explain why you need it?  Why cannot you use a fixed value after initial
>> experiments.
>
> Given the extensive scale of our production environment, with hundreds
> of thousands of servers, it begs the question: how do you propose we
> efficiently manage the various workloads that remain unaffected by the
> sysctl change implemented on just a few thousand servers? Is it
> feasible to expect us to recompile and release a new kernel for every
> instance where the default value falls short? Surely, there must be
> more practical and efficient approaches we can explore together to
> ensure optimal performance across all workloads.
>
> When making improvements or modifications, kindly ensure that they are
> not solely confined to a test or lab environment. It's vital to also
> consider the needs and requirements of our actual users, along with
> the diverse workloads they encounter in their daily operations.

Have you already found that your different systems require different
CONFIG_PCP_BATCH_SCALE_MAX values?  If not, I think it's better for you to
keep this patch in your downstream kernel for now.  When you find that it
is a common requirement, we can evaluate whether to make it a sysctl knob.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11  9:40           ` Yafang Shao
@ 2024-07-11 11:03             ` Huang, Ying
  2024-07-11 12:40               ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-11 11:03 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm

Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > Background
>> >> >> > ==========
>> >> >> >
>> >> >> > In our containerized environment, we have a specific type of container
>> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> > processes are organized as separate processes rather than threads due
>> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> > containers hosted on the same machine experience significant latency
>> >> >> > spikes.
>> >> >> >
>> >> >> > Investigation
>> >> >> > =============
>> >> >> >
>> >> >> > My investigation using perf tracing revealed that the root cause of
>> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> > primary contributor to the observed latency issues.
>> >> >> >
>> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
>> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>> >> >> >    - 76.97% exit_mmap
>> >> >> >       - 58.58% unmap_vmas
>> >> >> >          - 58.55% unmap_single_vma
>> >> >> >             - unmap_page_range
>> >> >> >                - 58.32% zap_pte_range
>> >> >> >                   - 42.88% tlb_flush_mmu
>> >> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >> >                         - 41.22% release_pages
>> >> >> >                            - 33.29% free_unref_page_list
>> >> >> >                               - 32.37% free_unref_page_commit
>> >> >> >                                  - 31.64% free_pcppages_bulk
>> >> >> >                                     + 28.65% _raw_spin_lock
>> >> >> >                                       1.28% __list_del_entry_valid
>> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >> >                              0.60% __mod_lruvec_state
>> >> >> >                           1.07% free_swap_cache
>> >> >> >                   + 11.69% page_remove_rmap
>> >> >> >                     0.64% __mod_lruvec_page_state
>> >> >> >       - 17.34% remove_vma
>> >> >> >          - 17.25% vm_area_free
>> >> >> >             - 17.23% kmem_cache_free
>> >> >> >                - 17.15% __slab_free
>> >> >> >                   - 14.56% discard_slab
>> >> >> >                        free_slab
>> >> >> >                        __free_slab
>> >> >> >                        __free_pages
>> >> >> >                      - free_unref_page
>> >> >> >                         - 13.50% free_unref_page_commit
>> >> >> >                            - free_pcppages_bulk
>> >> >> >                               + 13.44% _raw_spin_lock
>> >> >>
>> >> >> I don't think your change will reduce zone->lock contention cycles.  So,
>> >> >> I don't find the value of the above data.
>> >> >>
>> >> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
>> >> >> > with the majority of them being regular order-0 user pages.
>> >> >> >
>> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> >> >> > e=1
>> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >> >  => free_pcppages_bulk
>> >> >> >  => free_unref_page_commit
>> >> >> >  => free_unref_page_list
>> >> >> >  => release_pages
>> >> >> >  => free_pages_and_swap_cache
>> >> >> >  => tlb_flush_mmu
>> >> >> >  => zap_pte_range
>> >> >> >  => unmap_page_range
>> >> >> >  => unmap_single_vma
>> >> >> >  => unmap_vmas
>> >> >> >  => exit_mmap
>> >> >> >  => mmput
>> >> >> >  => do_exit
>> >> >> >  => do_group_exit
>> >> >> >  => get_signal
>> >> >> >  => arch_do_signal_or_restart
>> >> >> >  => exit_to_user_mode_prepare
>> >> >> >  => syscall_exit_to_user_mode
>> >> >> >  => do_syscall_64
>> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >
>> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >> >
>> >> >> > Node 0, zone   Normal
>> >> >> >   pages free     144465775
>> >> >> >         boost    0
>> >> >> >         min      1309270
>> >> >> >         low      1636587
>> >> >> >         high     1963904
>> >> >> >         spanned  564133888
>> >> >> >         present  296747008
>> >> >> >         managed  291974346
>> >> >> >         cma      0
>> >> >> >         protection: (0, 0, 0, 0)
>> >> >> > ...
>> >> >> >   pagesets
>> >> >> >     cpu: 0
>> >> >> >               count: 2217
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 1
>> >> >> >               count: 4510
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 2
>> >> >> >               count: 3059
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >
>> >> >> > ...
>> >> >> >
>> >> >> > The pcp high is around 100 times the batch size.
>> >> >> >
>> >> >> > I also traced the latency associated with the free_pcppages_bulk()
>> >> >> > function during the container exit process:
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >
>> >> >> > The latency can reach tens of milliseconds.
>> >> >> >
>> >> >> > Experimenting
>> >> >> > =============
>> >> >> >
>> >> >> > vm.percpu_pagelist_high_fraction
>> >> >> > --------------------------------
>> >> >> >
>> >> >> > The kernel version currently deployed in our production environment is the
>> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
>> >> >>
>> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
>> >> >> description.  And I don't think that it's necessary to describe the
>> >> >> alternative solution with too much details.
>> >> >>
>> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> >> >> > page draining, which subsequently leads to a substantial reduction in
>> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
>> >> >> > improvement in latency.
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >
>> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> >> >> > pcp high watermark size to a minimum of four times the batch size. While
>> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> >> >> > have yet to observe any significant difference in throughput within our
>> >> >> > production environment after implementing this change.
>> >> >> >
>> >> >> > Backporting the series "mm: PCP high auto-tuning"
>> >> >> > -------------------------------------------------
>> >> >>
>> >> >> Again, not upstream activity.  We can describe the upstream behavior
>> >> >> directly.
>> >> >
>> >> > Andrew has requested that I provide a more comprehensive analysis of
>> >> > this issue, and in response, I have endeavored to outline all the
>> >> > pertinent details in a thorough and detailed manner.
>> >>
>> >> IMHO, upstream activity can provide comprehensive analysis of the issue
>> >> too.  And, your patch has changed much from the first version.  It's
>> >> better to describe your current version.
>> >
>> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
>> > code is almost the same with the upstream kernel wrt the pcp. I have
>> > thoroughly documented the detailed data showcasing the changes in the
>> > backported version, providing a clear picture of the results. However,
>> > it's crucial to note that I am unable to directly run the upstream
>> > kernel on our production environment due to practical constraints.
>>
>> IMHO, the patch is for upstream kernel, not some downstream kernel, so
>> focus should be the upstream activity.  The issue of the upstream
>> kernel, and how to resolve it.  The production environment test results
>> can be used to support the upstream change.
>
>  The sole distinction in the pcp between version 6.1.y and the
> upstream kernel lies solely in the modifications made to the code by
> you. Furthermore, given that your code changes have now been
> successfully backported, what else do you expect me to do ?

If you could run the upstream kernel directly with some proxy workloads,
that would be better.  But I understand that this may not be easy for you.

So, what I really expect you to do is to organize the patch description
in an upstream-centric way: describe the issue in the upstream kernel
and how you resolve it, even though your test data comes from a
downstream kernel with the same page allocator behavior.

>>
>> >> >> > My second endeavor was to backport the series titled
>> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> >> >> > production environment, I noted a pronounced reduction in latency. The
>> >> >> > observed outcomes are as enumerated below:
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >        512 -> 1023       : 0        |                                        |
>> >> >> >       1024 -> 2047       : 2        |                                        |
>> >> >> >       2048 -> 4095       : 11       |                                        |
>> >> >> >       4096 -> 8191       : 3        |                                        |
>> >> >> >       8192 -> 16383      : 1        |                                        |
>> >> >> >      16384 -> 32767      : 2        |                                        |
>> >> >> >      32768 -> 65535      : 7        |                                        |
>> >> >> >      65536 -> 131071     : 198      |*********                               |
>> >> >> >     131072 -> 262143     : 530      |************************                |
>> >> >> >     262144 -> 524287     : 824      |**************************************  |
>> >> >> >     524288 -> 1048575    : 852      |****************************************|
>> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
>> >> >> >    2097152 -> 4194303    : 389      |******************                      |
>> >> >> >    4194304 -> 8388607    : 143      |******                                  |
>> >> >> >    8388608 -> 16777215   : 29       |*                                       |
>> >> >> >   16777216 -> 33554431   : 1        |                                        |
>> >> >> >
>> >> >> > Compared to the previous data, the maximum latency has been reduced to
>> >> >> > less than 30ms.
>> >> >>
>> >> >> People don't care too much about page freeing latency during processes
>> >> >> exiting.  Instead, they care more about the process exiting time, that
>> >> >> is, throughput.  So, it's better to show the page allocation latency
>> >> >> which is affected by the simultaneous processes exiting.
>> >> >
>> >> > I'm confused also. Is this issue really hard to understand ?
>> >>
>> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
>> >> directly, you can try alternative one and describe why.
>> >
>> > Not all data can be verified straightforwardly or effortlessly. The
>> > primary focus lies in the zone->lock contention, which necessitates
>> > measuring the latency it incurs. To accomplish this, the
>> > free_pcppages_bulk() function serves as an effective tool for
>> > evaluation. Therefore, I have opted to specifically measure the
>> > latency associated with free_pcppages_bulk().
>> >
>> > The rationale behind not measuring allocation latency is due to the
>> > necessity of finding a willing participant to endure potential delays,
>> > a task that proved unsuccessful as no one expressed interest. In
>> > contrast, assessing free_pcppages_bulk()'s latency solely requires
>> > identifying and experimenting with the source causing the delays,
>> > making it a more feasible approach.
>>
>> Can you run a benchmark program that do quite some memory allocation by
>> yourself to test it?
>
> I can have a try.

Thanks!

> However, is it the key point here?

It's better to prove the issue directly instead of indirectly.

> Why can't the lock contention be measured by the freeing?

Have you measured the lock contention after adjusting
CONFIG_PCP_BATCH_SCALE_MAX?  IIUC, the lock contention will become even
worse.  A smaller CONFIG_PCP_BATCH_SCALE_MAX helps latency, but it hurts
lock contention.  I have said this several times, but it seems that you
don't agree with me.  Can you prove me wrong with data?
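
To make the trade-off concrete, below is a small illustrative calculation
in userspace C (not kernel code).  It assumes, per the Kconfig help text of
CONFIG_PCP_BATCH_SCALE_MAX, that the number of pages freed per zone->lock
hold is capped at roughly batch << scale_max, and it reuses the batch/high
values from the zoneinfo quoted above (batch = 63, high = 6392):

#include <stdio.h>

int main(void)
{
        int batch = 63, high = 6392;    /* values from the zoneinfo above */
        int scale_max;

        for (scale_max = 0; scale_max <= 5; scale_max++) {
                /* rough cap on pages freed per zone->lock hold */
                int per_hold = batch << scale_max;
                /* lock acquisitions needed to drain a full pcp list */
                int holds = (high + per_hold - 1) / per_hold;

                printf("scale_max=%d: <=%4d pages per hold, ~%3d acquisitions\n",
                       scale_max, per_hold, holds);
        }
        return 0;
}

A smaller scale_max shortens each critical section (a better worst-case
wait for anyone else who needs zone->lock) but multiplies the number of
lock/unlock round trips, which is the contention overhead referred to
above.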

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11 11:03             ` Huang, Ying
@ 2024-07-11 12:40               ` Yafang Shao
  2024-07-12  2:32                 ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-11 12:40 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm

On Thu, Jul 11, 2024 at 7:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > Background
> >> >> >> > ==========
> >> >> >> >
> >> >> >> > In our containerized environment, we have a specific type of container
> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> >> > processes are organized as separate processes rather than threads due
> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> >> > containers hosted on the same machine experience significant latency
> >> >> >> > spikes.
> >> >> >> >
> >> >> >> > Investigation
> >> >> >> > =============
> >> >> >> >
> >> >> >> > My investigation using perf tracing revealed that the root cause of
> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> >> > primary contributor to the observed latency issues.
> >> >> >> >
> >> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> >> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
> >> >> >> >    - 76.97% exit_mmap
> >> >> >> >       - 58.58% unmap_vmas
> >> >> >> >          - 58.55% unmap_single_vma
> >> >> >> >             - unmap_page_range
> >> >> >> >                - 58.32% zap_pte_range
> >> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >> >                         - 41.22% release_pages
> >> >> >> >                            - 33.29% free_unref_page_list
> >> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >> >                              0.60% __mod_lruvec_state
> >> >> >> >                           1.07% free_swap_cache
> >> >> >> >                   + 11.69% page_remove_rmap
> >> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >> >       - 17.34% remove_vma
> >> >> >> >          - 17.25% vm_area_free
> >> >> >> >             - 17.23% kmem_cache_free
> >> >> >> >                - 17.15% __slab_free
> >> >> >> >                   - 14.56% discard_slab
> >> >> >> >                        free_slab
> >> >> >> >                        __free_slab
> >> >> >> >                        __free_pages
> >> >> >> >                      - free_unref_page
> >> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >> >                            - free_pcppages_bulk
> >> >> >> >                               + 13.44% _raw_spin_lock
> >> >> >>
> >> >> >> I don't think your change will reduce zone->lock contention cycles.  So,
> >> >> >> I don't find the value of the above data.
> >> >> >>
> >> >> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> >> >> >> > with the majority of them being regular order-0 user pages.
> >> >> >> >
> >> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> >> >> >> > e=1
> >> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >> >> >  => free_pcppages_bulk
> >> >> >> >  => free_unref_page_commit
> >> >> >> >  => free_unref_page_list
> >> >> >> >  => release_pages
> >> >> >> >  => free_pages_and_swap_cache
> >> >> >> >  => tlb_flush_mmu
> >> >> >> >  => zap_pte_range
> >> >> >> >  => unmap_page_range
> >> >> >> >  => unmap_single_vma
> >> >> >> >  => unmap_vmas
> >> >> >> >  => exit_mmap
> >> >> >> >  => mmput
> >> >> >> >  => do_exit
> >> >> >> >  => do_group_exit
> >> >> >> >  => get_signal
> >> >> >> >  => arch_do_signal_or_restart
> >> >> >> >  => exit_to_user_mode_prepare
> >> >> >> >  => syscall_exit_to_user_mode
> >> >> >> >  => do_syscall_64
> >> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >> >
> >> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >> >
> >> >> >> > Node 0, zone   Normal
> >> >> >> >   pages free     144465775
> >> >> >> >         boost    0
> >> >> >> >         min      1309270
> >> >> >> >         low      1636587
> >> >> >> >         high     1963904
> >> >> >> >         spanned  564133888
> >> >> >> >         present  296747008
> >> >> >> >         managed  291974346
> >> >> >> >         cma      0
> >> >> >> >         protection: (0, 0, 0, 0)
> >> >> >> > ...
> >> >> >> >   pagesets
> >> >> >> >     cpu: 0
> >> >> >> >               count: 2217
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >   vm stats threshold: 125
> >> >> >> >     cpu: 1
> >> >> >> >               count: 4510
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >   vm stats threshold: 125
> >> >> >> >     cpu: 2
> >> >> >> >               count: 3059
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >
> >> >> >> > ...
> >> >> >> >
> >> >> >> > The pcp high is around 100 times the batch size.
> >> >> >> >
> >> >> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> >> >> > function during the container exit process:
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >> >
> >> >> >> > The latency can reach tens of milliseconds.
> >> >> >> >
> >> >> >> > Experimenting
> >> >> >> > =============
> >> >> >> >
> >> >> >> > vm.percpu_pagelist_high_fraction
> >> >> >> > --------------------------------
> >> >> >> >
> >> >> >> > The kernel version currently deployed in our production environment is the
> >> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >> >> >>
> >> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> >> >> description.  And I don't think that it's necessary to describe the
> >> >> >> alternative solution with too much details.
> >> >> >>
> >> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> >> >> > page draining, which subsequently leads to a substantial reduction in
> >> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> >> >> > improvement in latency.
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >> >
> >> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> >> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> >> >> > have yet to observe any significant difference in throughput within our
> >> >> >> > production environment after implementing this change.
> >> >> >> >
> >> >> >> > Backporting the series "mm: PCP high auto-tuning"
> >> >> >> > -------------------------------------------------
> >> >> >>
> >> >> >> Again, not upstream activity.  We can describe the upstream behavior
> >> >> >> directly.
> >> >> >
> >> >> > Andrew has requested that I provide a more comprehensive analysis of
> >> >> > this issue, and in response, I have endeavored to outline all the
> >> >> > pertinent details in a thorough and detailed manner.
> >> >>
> >> >> IMHO, upstream activity can provide comprehensive analysis of the issue
> >> >> too.  And, your patch has changed much from the first version.  It's
> >> >> better to describe your current version.
> >> >
> >> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> >> > code is almost the same with the upstream kernel wrt the pcp. I have
> >> > thoroughly documented the detailed data showcasing the changes in the
> >> > backported version, providing a clear picture of the results. However,
> >> > it's crucial to note that I am unable to directly run the upstream
> >> > kernel on our production environment due to practical constraints.
> >>
> >> IMHO, the patch is for upstream kernel, not some downstream kernel, so
> >> focus should be the upstream activity.  The issue of the upstream
> >> kernel, and how to resolve it.  The production environment test results
> >> can be used to support the upstream change.
> >
> >  The sole distinction in the pcp between version 6.1.y and the
> > upstream kernel lies solely in the modifications made to the code by
> > you. Furthermore, given that your code changes have now been
> > successfully backported, what else do you expect me to do ?
>
> If you can run the upstream kernel directly with some proxy workloads,
> it will be better.  But, I understand that this may be not easy for you.
>
> So, what I really expect you to do is to organize the patch description
> in an upstream centric way.  Describe the issue of the upstream kernel,
> and how do you resolve it.  Although your test data comes from a
> downstream kernel with the same page allocator behavior.
>
> >>
> >> >> >> > My second endeavor was to backport the series titled
> >> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> >> >> > production environment, I noted a pronounced reduction in latency. The
> >> >> >> > observed outcomes are as enumerated below:
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 0        |                                        |
> >> >> >> >       1024 -> 2047       : 2        |                                        |
> >> >> >> >       2048 -> 4095       : 11       |                                        |
> >> >> >> >       4096 -> 8191       : 3        |                                        |
> >> >> >> >       8192 -> 16383      : 1        |                                        |
> >> >> >> >      16384 -> 32767      : 2        |                                        |
> >> >> >> >      32768 -> 65535      : 7        |                                        |
> >> >> >> >      65536 -> 131071     : 198      |*********                               |
> >> >> >> >     131072 -> 262143     : 530      |************************                |
> >> >> >> >     262144 -> 524287     : 824      |**************************************  |
> >> >> >> >     524288 -> 1048575    : 852      |****************************************|
> >> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
> >> >> >> >    2097152 -> 4194303    : 389      |******************                      |
> >> >> >> >    4194304 -> 8388607    : 143      |******                                  |
> >> >> >> >    8388608 -> 16777215   : 29       |*                                       |
> >> >> >> >   16777216 -> 33554431   : 1        |                                        |
> >> >> >> >
> >> >> >> > Compared to the previous data, the maximum latency has been reduced to
> >> >> >> > less than 30ms.
> >> >> >>
> >> >> >> People don't care too much about page freeing latency during processes
> >> >> >> exiting.  Instead, they care more about the process exiting time, that
> >> >> >> is, throughput.  So, it's better to show the page allocation latency
> >> >> >> which is affected by the simultaneous processes exiting.
> >> >> >
> >> >> > I'm confused also. Is this issue really hard to understand ?
> >> >>
> >> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
> >> >> directly, you can try alternative one and describe why.
> >> >
> >> > Not all data can be verified straightforwardly or effortlessly. The
> >> > primary focus lies in the zone->lock contention, which necessitates
> >> > measuring the latency it incurs. To accomplish this, the
> >> > free_pcppages_bulk() function serves as an effective tool for
> >> > evaluation. Therefore, I have opted to specifically measure the
> >> > latency associated with free_pcppages_bulk().
> >> >
> >> > The rationale behind not measuring allocation latency is due to the
> >> > necessity of finding a willing participant to endure potential delays,
> >> > a task that proved unsuccessful as no one expressed interest. In
> >> > contrast, assessing free_pcppages_bulk()'s latency solely requires
> >> > identifying and experimenting with the source causing the delays,
> >> > making it a more feasible approach.
> >>
> >> Can you run a benchmark program that do quite some memory allocation by
> >> yourself to test it?
> >
> > I can have a try.
>
> Thanks!
>
> > However, is it the key point here?
>
> It's better to prove the issue directly instead of indirectly.
>
> > Why can't the lock contention be measured by the freeing?
>
> Have you measured the lock contention after adjusting
> CONFIG_PCP_BATCH_SCALE_MAX?  IIUC, the lock contention will become even
> worse.  Smaller CONFIG_PCP_BATCH_SCALE_MAX helps latency, but it will
> hurt lock contention.  I have said it several times, but it seems that
> you don't agree with me.  Can you prove I'm wrong with data?

Now I understand the point. It seems we have different understandings
regarding the zone lock contention.

    CPU A (Freer)                      CPU B (Allocator)
    lock zone->lock
    free pages                         lock zone->lock
    unlock zone->lock                  alloc pages
                                       unlock zone->lock

If the Freer holds the zone lock for an extended period, the Allocator
has to wait, right? Isn't that a lock contention issue? Lock
contention affects not only CPU system usage but also latency.
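
As a self-contained illustration of that timeline (a userspace model using
pthreads, not kernel code), the sketch below has a "freer" thread hold a
shared lock for a long bulk operation while an "allocator" thread measures
how long it is blocked on the same lock:

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

static void *freer(void *arg)
{
        pthread_mutex_lock(&zone_lock);
        usleep(5000);                    /* stand-in for a long bulk free */
        pthread_mutex_unlock(&zone_lock);
        return NULL;
}

static void *allocator(void *arg)
{
        struct timespec t0, t1;

        usleep(1000);                    /* let the freer take the lock first */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_mutex_lock(&zone_lock);  /* blocks until the bulk free ends */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_mutex_unlock(&zone_lock);
        printf("allocator waited %ld us\n",
               (long)((t1.tv_sec - t0.tv_sec) * 1000000 +
                      (t1.tv_nsec - t0.tv_nsec) / 1000));
        return NULL;
}

int main(void)
{
        pthread_t f, a;

        pthread_create(&f, NULL, freer, NULL);
        pthread_create(&a, NULL, allocator, NULL);
        pthread_join(f, NULL);
        pthread_join(a, NULL);
        return 0;
}

The allocator's wait scales directly with the length of the freer's
critical section, which is exactly the latency component of the contention
described above.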

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11 10:49               ` Huang, Ying
@ 2024-07-11 12:45                 ` Yafang Shao
  2024-07-12  1:19                   ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-11 12:45 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> > is introduced as a more practical alternative.
> >> >> >>
> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> >> too.
> >> >> >>
> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >>
> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >
> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> > behind introducing this configuration in the beginning?
> >> >>
> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> "neutral"?
> >> >
> >> > No, thanks.
> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> > explanation of what "neutral" means, providing me with a better
> >> > understanding of the concept.
> >> >
> >> > So, can you explain why you introduced it as a config in the beginning ?
> >>
> >> I think that I have explained it in the commit log of commit
> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> latency").  Which introduces the config.
> >
> > What specifically are your expectations for how users should utilize
> > this config in real production workload?
> >
> >>
> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> experiments.
> >
> > Given the extensive scale of our production environment, with hundreds
> > of thousands of servers, it begs the question: how do you propose we
> > efficiently manage the various workloads that remain unaffected by the
> > sysctl change implemented on just a few thousand servers? Is it
> > feasible to expect us to recompile and release a new kernel for every
> > instance where the default value falls short? Surely, there must be
> > more practical and efficient approaches we can explore together to
> > ensure optimal performance across all workloads.
> >
> > When making improvements or modifications, kindly ensure that they are
> > not solely confined to a test or lab environment. It's vital to also
> > consider the needs and requirements of our actual users, along with
> > the diverse workloads they encounter in their daily operations.
>
> Have you found that your different systems requires different
> CONFIG_PCP_BATCH_SCALE_MAX value already?

For the specific workloads that introduce this latency, we set the value
to 0. For other workloads, we keep the default until we determine that
it, too, is suboptimal. What is the issue with this approach?
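
For reference, a runtime knob like the one proposed in this series could be
wired up with an ordinary ctl_table entry, roughly as in the sketch below
against a 6.1-era sysctl API.  This is an illustrative sketch only, not the
actual patch; the variable name, default, bounds, and registration point are
assumptions:

#include <linux/init.h>
#include <linux/sysctl.h>

static int pcp_batch_scale_max = 5;     /* assumed default, mirroring the Kconfig default */
static int scale_min;                   /* 0 */
static int scale_limit = 6;             /* assumed upper bound, for illustration */

static struct ctl_table pcp_batch_sysctl_table[] = {
        {
                .procname       = "pcp_batch_scale_max",
                .data           = &pcp_batch_scale_max,
                .maxlen         = sizeof(pcp_batch_scale_max),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &scale_min,
                .extra2         = &scale_limit,
        },
        { }
};

static int __init pcp_batch_sysctl_init(void)
{
        /* exposes /proc/sys/vm/pcp_batch_scale_max */
        register_sysctl("vm", pcp_batch_sysctl_table);
        return 0;
}
late_initcall(pcp_batch_sysctl_init);

With something along these lines in place, each host can be adjusted at
runtime (e.g. sysctl -w vm.pcp_batch_scale_max=0 on the latency-sensitive
servers) instead of rebuilding the kernel per workload class.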

>  If no, I think that it's
> better for you to keep this patch in your downstream kernel for now.
> When you find that it is a common requirement, we can evaluate whether
> to make it a sysctl knob.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11 12:45                 ` Yafang Shao
@ 2024-07-12  1:19                   ` Huang, Ying
  2024-07-12  2:25                     ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-12  1:19 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> >> > is introduced as a more practical alternative.
>> >> >> >>
>> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> >> too.
>> >> >> >>
>> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> >> > likely to necessitate a more extended development effort.
>> >> >> >>
>> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >> >
>> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> >> > configuration that has proven to be difficult to use, and you have
>> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> >> > and practical tuning approach. May I inquire about the rationale
>> >> >> > behind introducing this configuration in the beginning?
>> >> >>
>> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> >> "neutral"?
>> >> >
>> >> > No, thanks.
>> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> >> > explanation of what "neutral" means, providing me with a better
>> >> > understanding of the concept.
>> >> >
>> >> > So, can you explain why you introduced it as a config in the beginning ?
>> >>
>> >> I think that I have explained it in the commit log of commit
>> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> >> latency").  Which introduces the config.
>> >
>> > What specifically are your expectations for how users should utilize
>> > this config in real production workload?
>> >
>> >>
>> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> >> explain why you need it?  Why cannot you use a fixed value after initial
>> >> experiments.
>> >
>> > Given the extensive scale of our production environment, with hundreds
>> > of thousands of servers, it begs the question: how do you propose we
>> > efficiently manage the various workloads that remain unaffected by the
>> > sysctl change implemented on just a few thousand servers? Is it
>> > feasible to expect us to recompile and release a new kernel for every
>> > instance where the default value falls short? Surely, there must be
>> > more practical and efficient approaches we can explore together to
>> > ensure optimal performance across all workloads.
>> >
>> > When making improvements or modifications, kindly ensure that they are
>> > not solely confined to a test or lab environment. It's vital to also
>> > consider the needs and requirements of our actual users, along with
>> > the diverse workloads they encounter in their daily operations.
>>
>> Have you found that your different systems requires different
>> CONFIG_PCP_BATCH_SCALE_MAX value already?
>
> For specific workloads that introduce latency, we set the value to 0.
> For other workloads, we keep it unchanged until we determine that the
> default value is also suboptimal. What is the issue with this
> approach?

Firstly, this is a system-wide configuration, not a workload-specific
one.  So, other workloads running on the same system will be impacted
too.  Will you run only one workload on one system?

Secondly, we need some evidence to introduce a new system ABI.  For
example, evidence that different systems need different configurations,
otherwise some workloads will be hurt.  Can you provide some evidence to
support your change?  IMHO, it's not good enough to say "I don't know
why; I just don't want to change existing systems."  If so, it may be
better to wait until you have more evidence.

>>  If no, I think that it's
>> better for you to keep this patch in your downstream kernel for now.
>> When you find that it is a common requirement, we can evaluate whether
>> to make it a sysctl knob.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  1:19                   ` Huang, Ying
@ 2024-07-12  2:25                     ` Yafang Shao
  2024-07-12  3:05                       ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-12  2:25 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >>
> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> >> >> too.
> >> >> >> >>
> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >>
> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >> >
> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >>
> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> >> "neutral"?
> >> >> >
> >> >> > No, thanks.
> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> > understanding of the concept.
> >> >> >
> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >>
> >> >> I think that I have explained it in the commit log of commit
> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> latency").  Which introduces the config.
> >> >
> >> > What specifically are your expectations for how users should utilize
> >> > this config in real production workload?
> >> >
> >> >>
> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> >> experiments.
> >> >
> >> > Given the extensive scale of our production environment, with hundreds
> >> > of thousands of servers, it begs the question: how do you propose we
> >> > efficiently manage the various workloads that remain unaffected by the
> >> > sysctl change implemented on just a few thousand servers? Is it
> >> > feasible to expect us to recompile and release a new kernel for every
> >> > instance where the default value falls short? Surely, there must be
> >> > more practical and efficient approaches we can explore together to
> >> > ensure optimal performance across all workloads.
> >> >
> >> > When making improvements or modifications, kindly ensure that they are
> >> > not solely confined to a test or lab environment. It's vital to also
> >> > consider the needs and requirements of our actual users, along with
> >> > the diverse workloads they encounter in their daily operations.
> >>
> >> Have you found that your different systems requires different
> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >
> > For specific workloads that introduce latency, we set the value to 0.
> > For other workloads, we keep it unchanged until we determine that the
> > default value is also suboptimal. What is the issue with this
> > approach?
>
> Firstly, this is a system wide configuration, not workload specific.
> So, other workloads run on the same system will be impacted too.  Will
> you run one workload only on one system?

It seems we're living on different planets. You're happily working in
your lab environment, while I'm struggling with real-world production
issues.

For servers:

Server 1 to 10,000: vm.pcp_batch_scale_max = 0
Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
Server 1,000,001 and beyond: Happy with all values

Is this hard to understand?

In other words:

For applications:

Application 1 to 10,000: vm.pcp_batch_scale_max = 0
Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
Application 1,000,001 and beyond: Happy with all values


>
> Secondly, we need some evidences to introduce a new system ABI.  For
> example, we need to use different configuration on different systems
> otherwise some workloads will be hurt.  Can you provide some evidences
> to support your change?  IMHO, it's not good enough to say I don't know
> why I just don't want to change existing systems.  If so, it may be
> better to wait until you have more evidences.

It seems the community encourages developers to experiment with their
improvements in lab environments using meticulously designed test
cases A, B, C, and as many others as they can imagine, ultimately
obtaining perfect data. However, it discourages developers from
directly addressing real-world workloads. Sigh.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-11 12:40               ` Yafang Shao
@ 2024-07-12  2:32                 ` Huang, Ying
  0 siblings, 0 replies; 41+ messages in thread
From: Huang, Ying @ 2024-07-12  2:32 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm

Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 7:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > Background
>> >> >> >> > ==========
>> >> >> >> >
>> >> >> >> > In our containerized environment, we have a specific type of container
>> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> >> > processes are organized as separate processes rather than threads due
>> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> >> > containers hosted on the same machine experience significant latency
>> >> >> >> > spikes.
>> >> >> >> >
>> >> >> >> > Investigation
>> >> >> >> > =============
>> >> >> >> >
>> >> >> >> > My investigation using perf tracing revealed that the root cause of
>> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> >> > primary contributor to the observed latency issues.
>> >> >> >> >
>> >> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
>> >> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>> >> >> >> >    - 76.97% exit_mmap
>> >> >> >> >       - 58.58% unmap_vmas
>> >> >> >> >          - 58.55% unmap_single_vma
>> >> >> >> >             - unmap_page_range
>> >> >> >> >                - 58.32% zap_pte_range
>> >> >> >> >                   - 42.88% tlb_flush_mmu
>> >> >> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >> >> >                         - 41.22% release_pages
>> >> >> >> >                            - 33.29% free_unref_page_list
>> >> >> >> >                               - 32.37% free_unref_page_commit
>> >> >> >> >                                  - 31.64% free_pcppages_bulk
>> >> >> >> >                                     + 28.65% _raw_spin_lock
>> >> >> >> >                                       1.28% __list_del_entry_valid
>> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >> >> >                              0.60% __mod_lruvec_state
>> >> >> >> >                           1.07% free_swap_cache
>> >> >> >> >                   + 11.69% page_remove_rmap
>> >> >> >> >                     0.64% __mod_lruvec_page_state
>> >> >> >> >       - 17.34% remove_vma
>> >> >> >> >          - 17.25% vm_area_free
>> >> >> >> >             - 17.23% kmem_cache_free
>> >> >> >> >                - 17.15% __slab_free
>> >> >> >> >                   - 14.56% discard_slab
>> >> >> >> >                        free_slab
>> >> >> >> >                        __free_slab
>> >> >> >> >                        __free_pages
>> >> >> >> >                      - free_unref_page
>> >> >> >> >                         - 13.50% free_unref_page_commit
>> >> >> >> >                            - free_pcppages_bulk
>> >> >> >> >                               + 13.44% _raw_spin_lock
>> >> >> >>
>> >> >> >> I don't think your change will reduce zone->lock contention cycles.  So,
>> >> >> >> I don't find the value of the above data.
>> >> >> >>
>> >> >> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
>> >> >> >> > with the majority of them being regular order-0 user pages.
>> >> >> >> >
>> >> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> >> >> >> > e=1
>> >> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >> >> >  => free_pcppages_bulk
>> >> >> >> >  => free_unref_page_commit
>> >> >> >> >  => free_unref_page_list
>> >> >> >> >  => release_pages
>> >> >> >> >  => free_pages_and_swap_cache
>> >> >> >> >  => tlb_flush_mmu
>> >> >> >> >  => zap_pte_range
>> >> >> >> >  => unmap_page_range
>> >> >> >> >  => unmap_single_vma
>> >> >> >> >  => unmap_vmas
>> >> >> >> >  => exit_mmap
>> >> >> >> >  => mmput
>> >> >> >> >  => do_exit
>> >> >> >> >  => do_group_exit
>> >> >> >> >  => get_signal
>> >> >> >> >  => arch_do_signal_or_restart
>> >> >> >> >  => exit_to_user_mode_prepare
>> >> >> >> >  => syscall_exit_to_user_mode
>> >> >> >> >  => do_syscall_64
>> >> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >> >
>> >> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >> >> >
>> >> >> >> > Node 0, zone   Normal
>> >> >> >> >   pages free     144465775
>> >> >> >> >         boost    0
>> >> >> >> >         min      1309270
>> >> >> >> >         low      1636587
>> >> >> >> >         high     1963904
>> >> >> >> >         spanned  564133888
>> >> >> >> >         present  296747008
>> >> >> >> >         managed  291974346
>> >> >> >> >         cma      0
>> >> >> >> >         protection: (0, 0, 0, 0)
>> >> >> >> > ...
>> >> >> >> >   pagesets
>> >> >> >> >     cpu: 0
>> >> >> >> >               count: 2217
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >   vm stats threshold: 125
>> >> >> >> >     cpu: 1
>> >> >> >> >               count: 4510
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >   vm stats threshold: 125
>> >> >> >> >     cpu: 2
>> >> >> >> >               count: 3059
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >
>> >> >> >> > ...
>> >> >> >> >
>> >> >> >> > The pcp high is around 100 times the batch size.
>> >> >> >> >
>> >> >> >> > I also traced the latency associated with the free_pcppages_bulk()
>> >> >> >> > function during the container exit process:
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >> >
>> >> >> >> > The latency can reach tens of milliseconds.
>> >> >> >> >
>> >> >> >> > Experimenting
>> >> >> >> > =============
>> >> >> >> >
>> >> >> >> > vm.percpu_pagelist_high_fraction
>> >> >> >> > --------------------------------
>> >> >> >> >
>> >> >> >> > The kernel version currently deployed in our production environment is the
>> >> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
>> >> >> >>
>> >> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
>> >> >> >> description.  And I don't think that it's necessary to describe the
>> >> >> >> alternative solution with too much details.
>> >> >> >>
>> >> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> >> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> >> >> >> > page draining, which subsequently leads to a substantial reduction in
>> >> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
>> >> >> >> > improvement in latency.
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >> >
>> >> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> >> >> >> > pcp high watermark size to a minimum of four times the batch size. While
>> >> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> >> >> >> > have yet to observe any significant difference in throughput within our
>> >> >> >> > production environment after implementing this change.
>> >> >> >> >
>> >> >> >> > Backporting the series "mm: PCP high auto-tuning"
>> >> >> >> > -------------------------------------------------
>> >> >> >>
>> >> >> >> Again, not upstream activity.  We can describe the upstream behavior
>> >> >> >> directly.
>> >> >> >
>> >> >> > Andrew has requested that I provide a more comprehensive analysis of
>> >> >> > this issue, and in response, I have endeavored to outline all the
>> >> >> > pertinent details in a thorough and detailed manner.
>> >> >>
>> >> >> IMHO, upstream activity can provide comprehensive analysis of the issue
>> >> >> too.  And, your patch has changed much from the first version.  It's
>> >> >> better to describe your current version.
>> >> >
>> >> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
>> >> > code is almost the same with the upstream kernel wrt the pcp. I have
>> >> > thoroughly documented the detailed data showcasing the changes in the
>> >> > backported version, providing a clear picture of the results. However,
>> >> > it's crucial to note that I am unable to directly run the upstream
>> >> > kernel on our production environment due to practical constraints.
>> >>
>> >> IMHO, the patch is for upstream kernel, not some downstream kernel, so
>> >> focus should be the upstream activity.  The issue of the upstream
>> >> kernel, and how to resolve it.  The production environment test results
>> >> can be used to support the upstream change.
>> >
>> >  The sole distinction in the pcp between version 6.1.y and the
>> > upstream kernel lies solely in the modifications made to the code by
>> > you. Furthermore, given that your code changes have now been
>> > successfully backported, what else do you expect me to do ?
>>
>> If you can run the upstream kernel directly with some proxy workloads,
>> it will be better.  But, I understand that this may be not easy for you.
>>
>> So, what I really expect you to do is to organize the patch description
>> in an upstream centric way.  Describe the issue of the upstream kernel,
>> and how do you resolve it.  Although your test data comes from a
>> downstream kernel with the same page allocator behavior.
>>
>> >>
>> >> >> >> > My second endeavor was to backport the series titled
>> >> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> >> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> >> >> >> > production environment, I noted a pronounced reduction in latency. The
>> >> >> >> > observed outcomes are as enumerated below:
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 0        |                                        |
>> >> >> >> >       1024 -> 2047       : 2        |                                        |
>> >> >> >> >       2048 -> 4095       : 11       |                                        |
>> >> >> >> >       4096 -> 8191       : 3        |                                        |
>> >> >> >> >       8192 -> 16383      : 1        |                                        |
>> >> >> >> >      16384 -> 32767      : 2        |                                        |
>> >> >> >> >      32768 -> 65535      : 7        |                                        |
>> >> >> >> >      65536 -> 131071     : 198      |*********                               |
>> >> >> >> >     131072 -> 262143     : 530      |************************                |
>> >> >> >> >     262144 -> 524287     : 824      |**************************************  |
>> >> >> >> >     524288 -> 1048575    : 852      |****************************************|
>> >> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
>> >> >> >> >    2097152 -> 4194303    : 389      |******************                      |
>> >> >> >> >    4194304 -> 8388607    : 143      |******                                  |
>> >> >> >> >    8388608 -> 16777215   : 29       |*                                       |
>> >> >> >> >   16777216 -> 33554431   : 1        |                                        |
>> >> >> >> >
>> >> >> >> > Compared to the previous data, the maximum latency has been reduced to
>> >> >> >> > less than 30ms.
>> >> >> >>
>> >> >> >> People don't care too much about page freeing latency during processes
>> >> >> >> exiting.  Instead, they care more about the process exiting time, that
>> >> >> >> is, throughput.  So, it's better to show the page allocation latency
>> >> >> >> which is affected by the simultaneous processes exiting.
>> >> >> >
>> >> >> > I'm confused also. Is this issue really hard to understand ?
>> >> >>
>> >> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
>> >> >> directly, you can try alternative one and describe why.
>> >> >
>> >> > Not all data can be verified straightforwardly or effortlessly. The
>> >> > primary focus lies in the zone->lock contention, which necessitates
>> >> > measuring the latency it incurs. To accomplish this, the
>> >> > free_pcppages_bulk() function serves as an effective tool for
>> >> > evaluation. Therefore, I have opted to specifically measure the
>> >> > latency associated with free_pcppages_bulk().
>> >> >
>> >> > The rationale behind not measuring allocation latency is due to the
>> >> > necessity of finding a willing participant to endure potential delays,
>> >> > a task that proved unsuccessful as no one expressed interest. In
>> >> > contrast, assessing free_pcppages_bulk()'s latency solely requires
>> >> > identifying and experimenting with the source causing the delays,
>> >> > making it a more feasible approach.
>> >>
>> >> Can you run a benchmark program that do quite some memory allocation by
>> >> yourself to test it?
>> >
>> > I can have a try.
>>
>> Thanks!
>>
>> > However, is it the key point here?
>>
>> It's better to prove the issue directly instead of indirectly.
>>
>> > Why can't the lock contention be measured by the freeing?
>>
>> Have you measured the lock contention after adjusting
>> CONFIG_PCP_BATCH_SCALE_MAX?  IIUC, the lock contention will become even
>> worse.  Smaller CONFIG_PCP_BATCH_SCALE_MAX helps latency, but it will
>> hurt lock contention.  I have said it several times, but it seems that
>> you don't agree with me.  Can you prove I'm wrong with data?
>
> Now I understand the point. It seems we have different understandings
> regarding the zone lock contention.
>
> CPU A (Freer)           CPU B (Allocator)
> lock zone->lock
> free pages              lock zone->lock
> unlock zone->lock       alloc pages
>                         unlock zone->lock
>
> If the Freer holds the zone lock for an extended period, the Allocator
> has to wait, right? Isn't that a lock contention issue? Lock
> contention affects not only CPU system usage but also latency.
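
A minimal userspace stand-in for the interleaving quoted above (a sketch
only: a pthread mutex plays the role of zone->lock, a busy loop plays the
per-page free work, and BATCH and WORK_PER_PAGE are illustrative
assumptions, not kernel values):

/*
 * Stand-in only, not kernel code: the allocator reports how long it
 * waited while the freer held the lock for one whole batch.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define BATCH           (63 << 5)       /* assumed pcp->batch << batch scale */
#define WORK_PER_PAGE   2000            /* arbitrary stand-in for one page free */

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long work;

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *freer(void *arg)
{
        pthread_mutex_lock(&zone_lock);         /* lock zone->lock */
        for (long i = 0; i < (long)BATCH * WORK_PER_PAGE; i++)
                work++;                         /* "free" one page at a time */
        pthread_mutex_unlock(&zone_lock);       /* unlock zone->lock */
        return NULL;
}

static void *allocator(void *arg)
{
        double t0;

        usleep(1000);                   /* let the freer take the lock first */
        t0 = now_sec();
        pthread_mutex_lock(&zone_lock); /* blocks behind the whole batch */
        pthread_mutex_unlock(&zone_lock);
        printf("allocator waited %.3f ms for zone->lock\n",
               (now_sec() - t0) * 1e3);
        return NULL;
}

int main(void)
{
        pthread_t f, a;

        pthread_create(&f, NULL, freer, NULL);
        pthread_create(&a, NULL, allocator, NULL);
        pthread_join(f, NULL);
        pthread_join(a, NULL);
        return 0;
}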

Thanks for the explanation!  Now I understand our difference.

When I measure spin lock contention, I usually measure the CPU cycles
spent spinning on the lock.  The more spinning cycles, the more severe
the lock contention.  IIUC, we cannot measure the lock contention with
latency alone.  For example, if there is only one "Freer" and no
"Allocator" in the system, there is no lock contention at all, yet
increasing the page free batch number still increases the freeing
latency.  As another example, with a larger batch number the total spin
cycles decrease, but the spin cycles of each individual wait increase,
because there are fewer waits in total.  So, the lock contention becomes
better at the cost of latency.
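
To put rough numbers on that trade-off (illustrative assumptions only,
not measurements), the snippet below compares the number of zone->lock
acquisitions with the per-acquisition hold time for a few batch scale
values, assuming pcp->batch is 63 and an arbitrary per-page cost:

#include <stdio.h>

int main(void)
{
        const long total = 6L << 18;            /* assume ~6GB of order-0 pages to free */
        const long t_page_ns = 100;             /* assumed cost of freeing one page */
        const int scales[] = { 0, 2, 5 };       /* candidate batch scale values */

        for (unsigned i = 0; i < sizeof(scales) / sizeof(scales[0]); i++) {
                long batch = 63L << scales[i];  /* pcp->batch << scale */
                long acquisitions = (total + batch - 1) / batch;
                long hold_us = batch * t_page_ns / 1000;

                printf("scale=%d batch=%5ld acquisitions=%6ld hold/acquisition=%4ld us\n",
                       scales[i], batch, acquisitions, hold_us);
        }
        return 0;
}

The total hold time is roughly the same in every case; what changes is
how long a single waiter can be stuck behind one acquisition.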

So, I suggest using the latency spike to describe the issue and the
solution.

I also suggest measuring the process exit time, to evaluate the possible
negative impact of a smaller batch number on throughput.
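
For example, a rough harness along the lines below could work (the
process count, the per-child size, and the use of SIGTERM are all
assumptions on my side, not a benchmark I have run): the children fault
in anonymous memory, they are made to exit at the same time, and the
parent reports how long it takes to reap them all.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NPROC   18                      /* assumed number of exiting processes */
#define BYTES   (1UL << 30)             /* assumed RSS per child, scale as needed */

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        double start;
        int i;

        for (i = 0; i < NPROC; i++) {
                if (fork() == 0) {
                        char *p = mmap(NULL, BYTES, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                        if (p != MAP_FAILED)
                                memset(p, 1, BYTES);    /* populate the RSS */
                        pause();                        /* wait to be killed */
                        _exit(0);
                }
        }

        sleep(30);                      /* crude: let the children finish faulting */
        signal(SIGTERM, SIG_IGN);       /* the parent must survive the group signal */
        start = now_sec();
        kill(0, SIGTERM);               /* make all children exit simultaneously */
        while (wait(NULL) > 0)
                ;
        printf("simultaneous exit took %.3f s\n", now_sec() - start);
        return 0;
}

Comparing that wall time across batch scale values would show whether
the smaller batch number costs exit throughput.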

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  2:25                     ` Yafang Shao
@ 2024-07-12  3:05                       ` Huang, Ying
  2024-07-12  3:44                         ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-12  3:05 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> >> >> > is introduced as a more practical alternative.
>> >> >> >> >>
>> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> >> >> too.
>> >> >> >> >>
>> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> >> >> > likely to necessitate a more extended development effort.
>> >> >> >> >>
>> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >> >> >
>> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> >> >> > configuration that has proven to be difficult to use, and you have
>> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> >> >> >> > behind introducing this configuration in the beginning?
>> >> >> >>
>> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> >> >> "neutral"?
>> >> >> >
>> >> >> > No, thanks.
>> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> >> >> > explanation of what "neutral" means, providing me with a better
>> >> >> > understanding of the concept.
>> >> >> >
>> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> >> >>
>> >> >> I think that I have explained it in the commit log of commit
>> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> >> >> latency").  Which introduces the config.
>> >> >
>> >> > What specifically are your expectations for how users should utilize
>> >> > this config in real production workload?
>> >> >
>> >> >>
>> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> >> >> experiments.
>> >> >
>> >> > Given the extensive scale of our production environment, with hundreds
>> >> > of thousands of servers, it begs the question: how do you propose we
>> >> > efficiently manage the various workloads that remain unaffected by the
>> >> > sysctl change implemented on just a few thousand servers? Is it
>> >> > feasible to expect us to recompile and release a new kernel for every
>> >> > instance where the default value falls short? Surely, there must be
>> >> > more practical and efficient approaches we can explore together to
>> >> > ensure optimal performance across all workloads.
>> >> >
>> >> > When making improvements or modifications, kindly ensure that they are
>> >> > not solely confined to a test or lab environment. It's vital to also
>> >> > consider the needs and requirements of our actual users, along with
>> >> > the diverse workloads they encounter in their daily operations.
>> >>
>> >> Have you found that your different systems requires different
>> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> >
>> > For specific workloads that introduce latency, we set the value to 0.
>> > For other workloads, we keep it unchanged until we determine that the
>> > default value is also suboptimal. What is the issue with this
>> > approach?
>>
>> Firstly, this is a system wide configuration, not workload specific.
>> So, other workloads run on the same system will be impacted too.  Will
>> you run one workload only on one system?
>
> It seems we're living on different planets. You're happily working in
> your lab environment, while I'm struggling with real-world production
> issues.
>
> For servers:
>
> Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> Server 1,000,001 and beyond: Happy with all values
>
> Is this hard to understand?
>
> In other words:
>
> For applications:
>
> Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> Application 1,000,001 and beyond: Happy with all values

Good to know this.  Thanks!

>>
>> Secondly, we need some evidences to introduce a new system ABI.  For
>> example, we need to use different configuration on different systems
>> otherwise some workloads will be hurt.  Can you provide some evidences
>> to support your change?  IMHO, it's not good enough to say I don't know
>> why I just don't want to change existing systems.  If so, it may be
>> better to wait until you have more evidences.
>
> It seems the community encourages developers to experiment with their
> improvements in lab environments using meticulously designed test
> cases A, B, C, and as many others as they can imagine, ultimately
> obtaining perfect data. However, it discourages developers from
> directly addressing real-world workloads. Sigh.

Can you not tell whether, and how, your workloads benefit or are hurt by
different batch numbers in your production environment?  If you cannot,
how do you decide which workload deploys on which system (with which
batch number configuration)?  If you can, can you provide such
information to support your patch?

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  3:05                       ` Huang, Ying
@ 2024-07-12  3:44                         ` Yafang Shao
  2024-07-12  5:25                           ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-12  3:44 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >>
> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> >> >> >> too.
> >> >> >> >> >>
> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >>
> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >
> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >>
> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> >> >> "neutral"?
> >> >> >> >
> >> >> >> > No, thanks.
> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> > understanding of the concept.
> >> >> >> >
> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >> >>
> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> latency").  Which introduces the config.
> >> >> >
> >> >> > What specifically are your expectations for how users should utilize
> >> >> > this config in real production workload?
> >> >> >
> >> >> >>
> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> >> >> experiments.
> >> >> >
> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> > instance where the default value falls short? Surely, there must be
> >> >> > more practical and efficient approaches we can explore together to
> >> >> > ensure optimal performance across all workloads.
> >> >> >
> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> > consider the needs and requirements of our actual users, along with
> >> >> > the diverse workloads they encounter in their daily operations.
> >> >>
> >> >> Have you found that your different systems requires different
> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> >
> >> > For specific workloads that introduce latency, we set the value to 0.
> >> > For other workloads, we keep it unchanged until we determine that the
> >> > default value is also suboptimal. What is the issue with this
> >> > approach?
> >>
> >> Firstly, this is a system wide configuration, not workload specific.
> >> So, other workloads run on the same system will be impacted too.  Will
> >> you run one workload only on one system?
> >
> > It seems we're living on different planets. You're happily working in
> > your lab environment, while I'm struggling with real-world production
> > issues.
> >
> > For servers:
> >
> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> > Server 1,000,001 and beyond: Happy with all values
> >
> > Is this hard to understand?
> >
> > In other words:
> >
> > For applications:
> >
> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> > Application 1,000,001 and beyond: Happy with all values
>
> Good to know this.  Thanks!
>
> >>
> >> Secondly, we need some evidences to introduce a new system ABI.  For
> >> example, we need to use different configuration on different systems
> >> otherwise some workloads will be hurt.  Can you provide some evidences
> >> to support your change?  IMHO, it's not good enough to say I don't know
> >> why I just don't want to change existing systems.  If so, it may be
> >> better to wait until you have more evidences.
> >
> > It seems the community encourages developers to experiment with their
> > improvements in lab environments using meticulously designed test
> > cases A, B, C, and as many others as they can imagine, ultimately
> > obtaining perfect data. However, it discourages developers from
> > directly addressing real-world workloads. Sigh.
>
> Can you not tell whether, and how, your workloads benefit or are hurt by
> different batch numbers in your production environment?  If you cannot,
> how do you decide which workload deploys on which system (with which
> batch number configuration)?  If you can, can you provide such
> information to support your patch?

We leverage a meticulous selection of network metrics, particularly
focusing on TcpExt indicators, to keep a close eye on application
latency. This includes metrics such as TcpExt.TCPTimeouts,
TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.

In instances where a problematic container terminates, we've noticed a
sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
second, which serves as a clear indication that other applications are
experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
parameter to 0, we've been able to drastically reduce the maximum
frequency of these timeouts to less than one per second.
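
To make the sampling concrete, below is a minimal sketch (an
illustration only, not our production tooling): it simply prints the
per-second delta of TcpExt.TCPTimeouts read from /proc/net/netstat.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* TcpExt is exported as a header line followed by a matching value line. */
static long read_tcp_timeouts(void)
{
        char header[4096], values[4096];
        long result = -1;
        FILE *fp = fopen("/proc/net/netstat", "r");

        if (!fp)
                return -1;

        while (fgets(header, sizeof(header), fp)) {
                if (strncmp(header, "TcpExt:", 7))
                        continue;
                if (!fgets(values, sizeof(values), fp))
                        break;

                char *hs, *vs;
                char *h = strtok_r(header, " \n", &hs);
                char *v = strtok_r(values, " \n", &vs);

                while (h && v) {
                        if (!strcmp(h, "TCPTimeouts")) {
                                result = atol(v);
                                break;
                        }
                        h = strtok_r(NULL, " \n", &hs);
                        v = strtok_r(NULL, " \n", &vs);
                }
                break;
        }
        fclose(fp);
        return result;
}

int main(void)
{
        long prev = read_tcp_timeouts();

        while (prev >= 0) {
                long cur;

                sleep(1);
                cur = read_tcp_timeouts();
                if (cur < 0)
                        break;
                printf("TcpExt.TCPTimeouts: +%ld/s\n", cur - prev);
                prev = cur;
        }
        return 0;
}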

At present, we're selectively applying this adjustment to clusters
that exclusively host the identified problematic applications, and
we're closely monitoring their performance to ensure stability. To
date, we've observed no network latency issues as a result of this
change. However, we remain cautious about extending this optimization
to other clusters, as the decision ultimately depends on a variety of
factors.

It's important to note that we're not eager to implement this change
across our entire fleet, as we recognize the potential for unforeseen
consequences. Instead, we're taking a cautious approach by initially
applying it to a limited number of servers. This allows us to assess
its impact and make informed decisions about whether or not to expand
its use in the future.


[0]  'Cluster' refers to a Kubernetes concept, where a single cluster
comprises a specific group of servers designed to work in unison.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  3:44                         ` Yafang Shao
@ 2024-07-12  5:25                           ` Huang, Ying
  2024-07-12  5:41                             ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-12  5:25 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >>
>> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> >> >> >> > is introduced as a more practical alternative.
>> >> >> >> >> >>
>> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> >> >> >> too.
>> >> >> >> >> >>
>> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> >> >> >> > likely to necessitate a more extended development effort.
>> >> >> >> >> >>
>> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >> >> >> >
>> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> >> >> >> > configuration that has proven to be difficult to use, and you have
>> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> >> >> >> >> > behind introducing this configuration in the beginning?
>> >> >> >> >>
>> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> >> >> >> "neutral"?
>> >> >> >> >
>> >> >> >> > No, thanks.
>> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> >> >> >> > explanation of what "neutral" means, providing me with a better
>> >> >> >> > understanding of the concept.
>> >> >> >> >
>> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> >> >> >>
>> >> >> >> I think that I have explained it in the commit log of commit
>> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> >> >> >> latency").  Which introduces the config.
>> >> >> >
>> >> >> > What specifically are your expectations for how users should utilize
>> >> >> > this config in real production workload?
>> >> >> >
>> >> >> >>
>> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> >> >> >> experiments.
>> >> >> >
>> >> >> > Given the extensive scale of our production environment, with hundreds
>> >> >> > of thousands of servers, it begs the question: how do you propose we
>> >> >> > efficiently manage the various workloads that remain unaffected by the
>> >> >> > sysctl change implemented on just a few thousand servers? Is it
>> >> >> > feasible to expect us to recompile and release a new kernel for every
>> >> >> > instance where the default value falls short? Surely, there must be
>> >> >> > more practical and efficient approaches we can explore together to
>> >> >> > ensure optimal performance across all workloads.
>> >> >> >
>> >> >> > When making improvements or modifications, kindly ensure that they are
>> >> >> > not solely confined to a test or lab environment. It's vital to also
>> >> >> > consider the needs and requirements of our actual users, along with
>> >> >> > the diverse workloads they encounter in their daily operations.
>> >> >>
>> >> >> Have you found that your different systems requires different
>> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> >> >
>> >> > For specific workloads that introduce latency, we set the value to 0.
>> >> > For other workloads, we keep it unchanged until we determine that the
>> >> > default value is also suboptimal. What is the issue with this
>> >> > approach?
>> >>
>> >> Firstly, this is a system wide configuration, not workload specific.
>> >> So, other workloads run on the same system will be impacted too.  Will
>> >> you run one workload only on one system?
>> >
>> > It seems we're living on different planets. You're happily working in
>> > your lab environment, while I'm struggling with real-world production
>> > issues.
>> >
>> > For servers:
>> >
>> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> > Server 1,000,001 and beyond: Happy with all values
>> >
>> > Is this hard to understand?
>> >
>> > In other words:
>> >
>> > For applications:
>> >
>> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
>> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> > Application 1,000,001 and beyond: Happy with all values
>>
>> Good to know this.  Thanks!
>>
>> >>
>> >> Secondly, we need some evidences to introduce a new system ABI.  For
>> >> example, we need to use different configuration on different systems
>> >> otherwise some workloads will be hurt.  Can you provide some evidences
>> >> to support your change?  IMHO, it's not good enough to say I don't know
>> >> why I just don't want to change existing systems.  If so, it may be
>> >> better to wait until you have more evidences.
>> >
>> > It seems the community encourages developers to experiment with their
>> > improvements in lab environments using meticulously designed test
>> > cases A, B, C, and as many others as they can imagine, ultimately
>> > obtaining perfect data. However, it discourages developers from
>> > directly addressing real-world workloads. Sigh.
>>
>> Can you not tell whether, and how, your workloads benefit or are hurt by
>> different batch numbers in your production environment?  If you cannot,
>> how do you decide which workload deploys on which system (with which
>> batch number configuration)?  If you can, can you provide such
>> information to support your patch?
>
> We leverage a meticulous selection of network metrics, particularly
> focusing on TcpExt indicators, to keep a close eye on application
> latency. This includes metrics such as TcpExt.TCPTimeouts,
> TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
>
> In instances where a problematic container terminates, we've noticed a
> sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> second, which serves as a clear indication that other applications are
> experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> parameter to 0, we've been able to drastically reduce the maximum
> frequency of these timeouts to less than one per second.

Thanks a lot for sharing this.  I learned much from it!

> At present, we're selectively applying this adjustment to clusters
> that exclusively host the identified problematic applications, and
> we're closely monitoring their performance to ensure stability. To
> date, we've observed no network latency issues as a result of this
> change. However, we remain cautious about extending this optimization
> to other clusters, as the decision ultimately depends on a variety of
> factors.
>
> It's important to note that we're not eager to implement this change
> across our entire fleet, as we recognize the potential for unforeseen
> consequences. Instead, we're taking a cautious approach by initially
> applying it to a limited number of servers. This allows us to assess
> its impact and make informed decisions about whether or not to expand
> its use in the future.

So, you haven't observed any performance regression yet, right?  If you
haven't, I suggest keeping the patch in your downstream kernel for a
while.  In the future, if you find that the performance of some workloads
suffers because of the new batch number, you can repost the patch with
the supporting data.  If, in the end, the performance of more and more
workloads is good with the new batch number, you may consider making 0
the default value :-)

> [0]  'Cluster' refers to a Kubernetes concept, where a single cluster
> comprises a specific group of servers designed to work in unison.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  5:25                           ` Huang, Ying
@ 2024-07-12  5:41                             ` Yafang Shao
  2024-07-12  6:16                               ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-12  5:41 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >> >>
> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> >> >> >> >> too.
> >> >> >> >> >> >>
> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >> >
> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >> >>
> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> >> >> >> "neutral"?
> >> >> >> >> >
> >> >> >> >> > No, thanks.
> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> >> > understanding of the concept.
> >> >> >> >> >
> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >> >> >>
> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> >> latency").  Which introduces the config.
> >> >> >> >
> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> > this config in real production workload?
> >> >> >> >
> >> >> >> >>
> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> >> >> >> experiments.
> >> >> >> >
> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >
> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >>
> >> >> >> Have you found that your different systems requires different
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> >> >
> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> > default value is also suboptimal. What is the issue with this
> >> >> > approach?
> >> >>
> >> >> Firstly, this is a system wide configuration, not workload specific.
> >> >> So, other workloads run on the same system will be impacted too.  Will
> >> >> you run one workload only on one system?
> >> >
> >> > It seems we're living on different planets. You're happily working in
> >> > your lab environment, while I'm struggling with real-world production
> >> > issues.
> >> >
> >> > For servers:
> >> >
> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> > Server 1,000,001 and beyond: Happy with all values
> >> >
> >> > Is this hard to understand?
> >> >
> >> > In other words:
> >> >
> >> > For applications:
> >> >
> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> > Application 1,000,001 and beyond: Happy with all values
> >>
> >> Good to know this.  Thanks!
> >>
> >> >>
> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
> >> >> example, we need to use different configuration on different systems
> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
> >> >> to support your change?  IMHO, it's not good enough to say I don't know
> >> >> why I just don't want to change existing systems.  If so, it may be
> >> >> better to wait until you have more evidences.
> >> >
> >> > It seems the community encourages developers to experiment with their
> >> > improvements in lab environments using meticulously designed test
> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> > obtaining perfect data. However, it discourages developers from
> >> > directly addressing real-world workloads. Sigh.
> >>
> >> Can you not tell whether, and how, your workloads benefit or are hurt by
> >> different batch numbers in your production environment?  If you cannot,
> >> how do you decide which workload deploys on which system (with which
> >> batch number configuration)?  If you can, can you provide such
> >> information to support your patch?
> >
> > We leverage a meticulous selection of network metrics, particularly
> > focusing on TcpExt indicators, to keep a close eye on application
> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >
> > In instances where a problematic container terminates, we've noticed a
> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> > second, which serves as a clear indication that other applications are
> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> > parameter to 0, we've been able to drastically reduce the maximum
> > frequency of these timeouts to less than one per second.
>
> Thanks a lot for sharing this.  I learned much from it!
>
> > At present, we're selectively applying this adjustment to clusters
> > that exclusively host the identified problematic applications, and
> > we're closely monitoring their performance to ensure stability. To
> > date, we've observed no network latency issues as a result of this
> > change. However, we remain cautious about extending this optimization
> > to other clusters, as the decision ultimately depends on a variety of
> > factors.
> >
> > It's important to note that we're not eager to implement this change
> > across our entire fleet, as we recognize the potential for unforeseen
> > consequences. Instead, we're taking a cautious approach by initially
> > applying it to a limited number of servers. This allows us to assess
> > its impact and make informed decisions about whether or not to expand
> > its use in the future.
>
> So, you haven't observed any performance regression yet, right?

Right.

> If you
> haven't, I suggest keeping the patch in your downstream kernel for a
> while.  In the future, if you find that the performance of some workloads
> suffers because of the new batch number, you can repost the patch with
> the supporting data.  If, in the end, the performance of more and more
> workloads is good with the new batch number, you may consider making 0
> the default value :-)

That is not how the real world works.

In the real world:

- No one knows what may happen in the future.
  Therefore, if possible, we should make systems flexible, unless
there is a strong justification for using a hard-coded value.

- Minimize changes whenever possible.
  These systems have been working fine in the past, even if at lower
performance. Why make changes just for the sake of improving
performance? Does the key metric of your performance data truly matter
for their workloads?

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  5:41                             ` Yafang Shao
@ 2024-07-12  6:16                               ` Huang, Ying
  2024-07-12  6:41                                 ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-12  6:16 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >>
>> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> >> >> >> >> > is introduced as a more practical alternative.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> >> >> >> >> too.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >> >> >> >> >
>> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
>> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> >> >> >> >> >> > behind introducing this configuration in the beginning?
>> >> >> >> >> >>
>> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> >> >> >> >> "neutral"?
>> >> >> >> >> >
>> >> >> >> >> > No, thanks.
>> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> >> >> >> >> > explanation of what "neutral" means, providing me with a better
>> >> >> >> >> > understanding of the concept.
>> >> >> >> >> >
>> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> >> >> >> >>
>> >> >> >> >> I think that I have explained it in the commit log of commit
>> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> >> >> >> >> latency").  Which introduces the config.
>> >> >> >> >
>> >> >> >> > What specifically are your expectations for how users should utilize
>> >> >> >> > this config in real production workload?
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> >> >> >> >> experiments.
>> >> >> >> >
>> >> >> >> > Given the extensive scale of our production environment, with hundreds
>> >> >> >> > of thousands of servers, it begs the question: how do you propose we
>> >> >> >> > efficiently manage the various workloads that remain unaffected by the
>> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
>> >> >> >> > feasible to expect us to recompile and release a new kernel for every
>> >> >> >> > instance where the default value falls short? Surely, there must be
>> >> >> >> > more practical and efficient approaches we can explore together to
>> >> >> >> > ensure optimal performance across all workloads.
>> >> >> >> >
>> >> >> >> > When making improvements or modifications, kindly ensure that they are
>> >> >> >> > not solely confined to a test or lab environment. It's vital to also
>> >> >> >> > consider the needs and requirements of our actual users, along with
>> >> >> >> > the diverse workloads they encounter in their daily operations.
>> >> >> >>
>> >> >> >> Have you found that your different systems requires different
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> >> >> >
>> >> >> > For specific workloads that introduce latency, we set the value to 0.
>> >> >> > For other workloads, we keep it unchanged until we determine that the
>> >> >> > default value is also suboptimal. What is the issue with this
>> >> >> > approach?
>> >> >>
>> >> >> Firstly, this is a system wide configuration, not workload specific.
>> >> >> So, other workloads run on the same system will be impacted too.  Will
>> >> >> you run one workload only on one system?
>> >> >
>> >> > It seems we're living on different planets. You're happily working in
>> >> > your lab environment, while I'm struggling with real-world production
>> >> > issues.
>> >> >
>> >> > For servers:
>> >> >
>> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> >> > Server 1,000,001 and beyond: Happy with all values
>> >> >
>> >> > Is this hard to understand?
>> >> >
>> >> > In other words:
>> >> >
>> >> > For applications:
>> >> >
>> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
>> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> >> > Application 1,000,001 and beyond: Happy with all values
>> >>
>> >> Good to know this.  Thanks!
>> >>
>> >> >>
>> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
>> >> >> example, we need to use different configuration on different systems
>> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
>> >> >> to support your change?  IMHO, it's not good enough to say I don't know
>> >> >> why I just don't want to change existing systems.  If so, it may be
>> >> >> better to wait until you have more evidences.
>> >> >
>> >> > It seems the community encourages developers to experiment with their
>> >> > improvements in lab environments using meticulously designed test
>> >> > cases A, B, C, and as many others as they can imagine, ultimately
>> >> > obtaining perfect data. However, it discourages developers from
>> >> > directly addressing real-world workloads. Sigh.
>> >>
>> >> You cannot know whether your workloads benefit or hurt for the different
>> >> batch number and how in your production environment?  If you cannot, how
>> >> do you decide which workload deploys on which system (with different
>> >> batch number configuration).  If you can, can you provide such
>> >> information to support your patch?
>> >
>> > We leverage a meticulous selection of network metrics, particularly
>> > focusing on TcpExt indicators, to keep a close eye on application
>> > latency. This includes metrics such as TcpExt.TCPTimeouts,
>> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
>> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
>> >
>> > In instances where a problematic container terminates, we've noticed a
>> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
>> > second, which serves as a clear indication that other applications are
>> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
>> > parameter to 0, we've been able to drastically reduce the maximum
>> > frequency of these timeouts to less than one per second.
>>
>> Thanks a lot for sharing this.  I learned much from it!
>>
>> > At present, we're selectively applying this adjustment to clusters
>> > that exclusively host the identified problematic applications, and
>> > we're closely monitoring their performance to ensure stability. To
>> > date, we've observed no network latency issues as a result of this
>> > change. However, we remain cautious about extending this optimization
>> > to other clusters, as the decision ultimately depends on a variety of
>> > factors.
>> >
>> > It's important to note that we're not eager to implement this change
>> > across our entire fleet, as we recognize the potential for unforeseen
>> > consequences. Instead, we're taking a cautious approach by initially
>> > applying it to a limited number of servers. This allows us to assess
>> > its impact and make informed decisions about whether or not to expand
>> > its use in the future.
>>
>> So, you haven't observed any performance hurt yet.  Right?
>
> Right.
>
>> If you
>> haven't, I suggest you to keep the patch in your downstream kernel for a
>> while.  In the future, if you find the performance of some workloads
>> hurts because of the new batch number, you can repost the patch with the
>> supporting data.  If in the end, the performance of more and more
>> workloads is good with the new batch number.  You may consider to make 0
>> the default value :-)
>
> That is not how the real world works.
>
> In the real world:
>
> - No one knows what may happen in the future.
>   Therefore, if possible, we should make systems flexible, unless
> there is a strong justification for using a hard-coded value.
>
> - Minimize changes whenever possible.
>   These systems have been working fine in the past, even if with lower
> performance. Why make changes just for the sake of improving
> performance? Does the key metric of your performance data truly matter
> for their workload?

These are good policies for your organization and business.  But they
are not necessarily the policies that the upstream Linux kernel should
take.

The community needs to consider long-term maintenance overhead, so it
only adds new ABI (such as a sysctl or sysfs knob) to the kernel with
the necessary justification.  In general, it prefers a good default
value or an automatic algorithm that works for everyone.  The community
tries to avoid (or fix) regressions as much as possible, but that will
not stop the kernel from changing, even when the change is big.

IIUC, it is because of these different requirements that there are
upstream and downstream kernels.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  6:16                               ` Huang, Ying
@ 2024-07-12  6:41                                 ` Yafang Shao
  2024-07-12  7:04                                   ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-12  6:41 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> >> >> >> >> >> too.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> >> >> >> >> "neutral"?
> >> >> >> >> >> >
> >> >> >> >> >> > No, thanks.
> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> >> >> > understanding of the concept.
> >> >> >> >> >> >
> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >> >> >> >>
> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> >> >> latency").  Which introduces the config.
> >> >> >> >> >
> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> >> > this config in real production workload?
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> >> >> >> >> experiments.
> >> >> >> >> >
> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >> >
> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >> >>
> >> >> >> >> Have you found that your different systems requires different
> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> >> >> >
> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> >> > default value is also suboptimal. What is the issue with this
> >> >> >> > approach?
> >> >> >>
> >> >> >> Firstly, this is a system wide configuration, not workload specific.
> >> >> >> So, other workloads run on the same system will be impacted too.  Will
> >> >> >> you run one workload only on one system?
> >> >> >
> >> >> > It seems we're living on different planets. You're happily working in
> >> >> > your lab environment, while I'm struggling with real-world production
> >> >> > issues.
> >> >> >
> >> >> > For servers:
> >> >> >
> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> > Server 1,000,001 and beyond: Happy with all values
> >> >> >
> >> >> > Is this hard to understand?
> >> >> >
> >> >> > In other words:
> >> >> >
> >> >> > For applications:
> >> >> >
> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> > Application 1,000,001 and beyond: Happy with all values
> >> >>
> >> >> Good to know this.  Thanks!
> >> >>
> >> >> >>
> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
> >> >> >> example, we need to use different configuration on different systems
> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
> >> >> >> why I just don't want to change existing systems.  If so, it may be
> >> >> >> better to wait until you have more evidences.
> >> >> >
> >> >> > It seems the community encourages developers to experiment with their
> >> >> > improvements in lab environments using meticulously designed test
> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> >> > obtaining perfect data. However, it discourages developers from
> >> >> > directly addressing real-world workloads. Sigh.
> >> >>
> >> >> You cannot know whether your workloads benefit or hurt for the different
> >> >> batch number and how in your production environment?  If you cannot, how
> >> >> do you decide which workload deploys on which system (with different
> >> >> batch number configuration).  If you can, can you provide such
> >> >> information to support your patch?
> >> >
> >> > We leverage a meticulous selection of network metrics, particularly
> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> >
> >> > In instances where a problematic container terminates, we've noticed a
> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> > second, which serves as a clear indication that other applications are
> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> > frequency of these timeouts to less than one per second.
> >>
> >> Thanks a lot for sharing this.  I learned much from it!
> >>
> >> > At present, we're selectively applying this adjustment to clusters
> >> > that exclusively host the identified problematic applications, and
> >> > we're closely monitoring their performance to ensure stability. To
> >> > date, we've observed no network latency issues as a result of this
> >> > change. However, we remain cautious about extending this optimization
> >> > to other clusters, as the decision ultimately depends on a variety of
> >> > factors.
> >> >
> >> > It's important to note that we're not eager to implement this change
> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> > consequences. Instead, we're taking a cautious approach by initially
> >> > applying it to a limited number of servers. This allows us to assess
> >> > its impact and make informed decisions about whether or not to expand
> >> > its use in the future.
> >>
> >> So, you haven't observed any performance hurt yet.  Right?
> >
> > Right.
> >
> >> If you
> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> >> while.  In the future, if you find the performance of some workloads
> >> hurts because of the new batch number, you can repost the patch with the
> >> supporting data.  If in the end, the performance of more and more
> >> workloads is good with the new batch number.  You may consider to make 0
> >> the default value :-)
> >
> > That is not how the real world works.
> >
> > In the real world:
> >
> > - No one knows what may happen in the future.
> >   Therefore, if possible, we should make systems flexible, unless
> > there is a strong justification for using a hard-coded value.
> >
> > - Minimize changes whenever possible.
> >   These systems have been working fine in the past, even if with lower
> > performance. Why make changes just for the sake of improving
> > performance? Does the key metric of your performance data truly matter
> > for their workload?
>
> These are good policy in your organization and business.  But, it's not
> necessary the policy that Linux kernel upstream should take.

You mean the upstream Linux kernel is only designed for the lab?

>
> Community needs to consider long-term maintenance overhead, so it adds
> new ABI (such as sysfs knob) to kernel with the necessary justification.
> In general, it prefer to use a good default value or an automatic
> algorithm that works for everyone.  Community tries avoiding (or fixing)
> regressions as much as possible, but this will not stop kernel from
> changing, even if it's big.

Please explain to me why the kernel config is not ABI, but the sysctl is ABI.

>
> IIUC, because of the different requirements, there are upstream and
> downstream kernels.

Downstream developers backport features from the upstream kernel,
and if they find issues in the upstream kernel, they should contribute
the fixes back. That is how the Linux community works, right?

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  6:41                                 ` Yafang Shao
@ 2024-07-12  7:04                                   ` Huang, Ying
  2024-07-12  7:36                                     ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-12  7:04 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >>
>> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> >> >> >> >> >> too.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
>> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> >> >> >> >> >> "neutral"?
>> >> >> >> >> >> >
>> >> >> >> >> >> > No, thanks.
>> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
>> >> >> >> >> >> > understanding of the concept.
>> >> >> >> >> >> >
>> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> >> >> >> >> >>
>> >> >> >> >> >> I think that I have explained it in the commit log of commit
>> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> >> >> >> >> >> latency").  Which introduces the config.
>> >> >> >> >> >
>> >> >> >> >> > What specifically are your expectations for how users should utilize
>> >> >> >> >> > this config in real production workload?
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> >> >> >> >> >> experiments.
>> >> >> >> >> >
>> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
>> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
>> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
>> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
>> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
>> >> >> >> >> > instance where the default value falls short? Surely, there must be
>> >> >> >> >> > more practical and efficient approaches we can explore together to
>> >> >> >> >> > ensure optimal performance across all workloads.
>> >> >> >> >> >
>> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
>> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
>> >> >> >> >> > consider the needs and requirements of our actual users, along with
>> >> >> >> >> > the diverse workloads they encounter in their daily operations.
>> >> >> >> >>
>> >> >> >> >> Have you found that your different systems requires different
>> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> >> >> >> >
>> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
>> >> >> >> > For other workloads, we keep it unchanged until we determine that the
>> >> >> >> > default value is also suboptimal. What is the issue with this
>> >> >> >> > approach?
>> >> >> >>
>> >> >> >> Firstly, this is a system wide configuration, not workload specific.
>> >> >> >> So, other workloads run on the same system will be impacted too.  Will
>> >> >> >> you run one workload only on one system?
>> >> >> >
>> >> >> > It seems we're living on different planets. You're happily working in
>> >> >> > your lab environment, while I'm struggling with real-world production
>> >> >> > issues.
>> >> >> >
>> >> >> > For servers:
>> >> >> >
>> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> >> >> > Server 1,000,001 and beyond: Happy with all values
>> >> >> >
>> >> >> > Is this hard to understand?
>> >> >> >
>> >> >> > In other words:
>> >> >> >
>> >> >> > For applications:
>> >> >> >
>> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
>> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> >> >> > Application 1,000,001 and beyond: Happy with all values
>> >> >>
>> >> >> Good to know this.  Thanks!
>> >> >>
>> >> >> >>
>> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
>> >> >> >> example, we need to use different configuration on different systems
>> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
>> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
>> >> >> >> why I just don't want to change existing systems.  If so, it may be
>> >> >> >> better to wait until you have more evidences.
>> >> >> >
>> >> >> > It seems the community encourages developers to experiment with their
>> >> >> > improvements in lab environments using meticulously designed test
>> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
>> >> >> > obtaining perfect data. However, it discourages developers from
>> >> >> > directly addressing real-world workloads. Sigh.
>> >> >>
>> >> >> You cannot know whether your workloads benefit or hurt for the different
>> >> >> batch number and how in your production environment?  If you cannot, how
>> >> >> do you decide which workload deploys on which system (with different
>> >> >> batch number configuration).  If you can, can you provide such
>> >> >> information to support your patch?
>> >> >
>> >> > We leverage a meticulous selection of network metrics, particularly
>> >> > focusing on TcpExt indicators, to keep a close eye on application
>> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
>> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
>> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
>> >> >
>> >> > In instances where a problematic container terminates, we've noticed a
>> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
>> >> > second, which serves as a clear indication that other applications are
>> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
>> >> > parameter to 0, we've been able to drastically reduce the maximum
>> >> > frequency of these timeouts to less than one per second.
>> >>
>> >> Thanks a lot for sharing this.  I learned much from it!
>> >>
>> >> > At present, we're selectively applying this adjustment to clusters
>> >> > that exclusively host the identified problematic applications, and
>> >> > we're closely monitoring their performance to ensure stability. To
>> >> > date, we've observed no network latency issues as a result of this
>> >> > change. However, we remain cautious about extending this optimization
>> >> > to other clusters, as the decision ultimately depends on a variety of
>> >> > factors.
>> >> >
>> >> > It's important to note that we're not eager to implement this change
>> >> > across our entire fleet, as we recognize the potential for unforeseen
>> >> > consequences. Instead, we're taking a cautious approach by initially
>> >> > applying it to a limited number of servers. This allows us to assess
>> >> > its impact and make informed decisions about whether or not to expand
>> >> > its use in the future.
>> >>
>> >> So, you haven't observed any performance hurt yet.  Right?
>> >
>> > Right.
>> >
>> >> If you
>> >> haven't, I suggest you to keep the patch in your downstream kernel for a
>> >> while.  In the future, if you find the performance of some workloads
>> >> hurts because of the new batch number, you can repost the patch with the
>> >> supporting data.  If in the end, the performance of more and more
>> >> workloads is good with the new batch number.  You may consider to make 0
>> >> the default value :-)
>> >
>> > That is not how the real world works.
>> >
>> > In the real world:
>> >
>> > - No one knows what may happen in the future.
>> >   Therefore, if possible, we should make systems flexible, unless
>> > there is a strong justification for using a hard-coded value.
>> >
>> > - Minimize changes whenever possible.
>> >   These systems have been working fine in the past, even if with lower
>> > performance. Why make changes just for the sake of improving
>> > performance? Does the key metric of your performance data truly matter
>> > for their workload?
>>
>> These are good policy in your organization and business.  But, it's not
>> necessary the policy that Linux kernel upstream should take.
>
> You mean the Upstream Linux kernel only designed for the lab ?
>
>>
>> Community needs to consider long-term maintenance overhead, so it adds
>> new ABI (such as sysfs knob) to kernel with the necessary justification.
>> In general, it prefer to use a good default value or an automatic
>> algorithm that works for everyone.  Community tries avoiding (or fixing)
>> regressions as much as possible, but this will not stop kernel from
>> changing, even if it's big.
>
> Please explain to me why the kernel config is not ABI, but the sysctl is ABI.

The Linux kernel will not break an ABI until the last users stop using
it.  That usually means tens of years, if not forever.  Kernel config
options aren't considered ABI; they are used by developers and
distributions, and they come and go from version to version.
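
For example (purely illustrative, and assuming the proposed knob had
been merged), once a path like /proc/sys/vm/pcp_batch_scale_max exists,
user-space tooling can hard-code it, and from then on it has to keep
working:

#include <stdio.h>

int main(void)
{
	/* Hypothetical path: it exists only if the proposed knob is merged. */
	FILE *f = fopen("/proc/sys/vm/pcp_batch_scale_max", "r");
	int val;

	if (!f) {
		perror("pcp_batch_scale_max");
		return 1;
	}
	if (fscanf(f, "%d", &val) == 1)
		printf("vm.pcp_batch_scale_max = %d\n", val);
	fclose(f);
	return 0;
}

A Kconfig option, by contrast, is folded into a compile-time constant,
so nothing outside the kernel tree can depend on its name.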

>>
>> IIUC, because of the different requirements, there are upstream and
>> downstream kernels.
>
> The downstream developer backport features from the upsteam kernel,
> and if they find issues in the upstream kernel, they should contribute
> it back. That is how the Linux Community works, right ?

Yes, if they are issues for the upstream kernel too.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  7:04                                   ` Huang, Ying
@ 2024-07-12  7:36                                     ` Yafang Shao
  2024-07-12  8:24                                       ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-12  7:36 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> >> >> >> >> >> >> too.
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> >> >> >> >> >> "neutral"?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > No, thanks.
> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> >> >> >> > understanding of the concept.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >> >> >> >> >>
> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> >> >> >> latency").  Which introduces the config.
> >> >> >> >> >> >
> >> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> >> >> > this config in real production workload?
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> >> >> >> >> >> experiments.
> >> >> >> >> >> >
> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >> >> >
> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >> >> >>
> >> >> >> >> >> Have you found that your different systems requires different
> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> >> >> >> >
> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> >> >> > default value is also suboptimal. What is the issue with this
> >> >> >> >> > approach?
> >> >> >> >>
> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
> >> >> >> >> you run one workload only on one system?
> >> >> >> >
> >> >> >> > It seems we're living on different planets. You're happily working in
> >> >> >> > your lab environment, while I'm struggling with real-world production
> >> >> >> > issues.
> >> >> >> >
> >> >> >> > For servers:
> >> >> >> >
> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> > Server 1,000,001 and beyond: Happy with all values
> >> >> >> >
> >> >> >> > Is this hard to understand?
> >> >> >> >
> >> >> >> > In other words:
> >> >> >> >
> >> >> >> > For applications:
> >> >> >> >
> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> > Application 1,000,001 and beyond: Happy with all values
> >> >> >>
> >> >> >> Good to know this.  Thanks!
> >> >> >>
> >> >> >> >>
> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
> >> >> >> >> example, we need to use different configuration on different systems
> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
> >> >> >> >> better to wait until you have more evidences.
> >> >> >> >
> >> >> >> > It seems the community encourages developers to experiment with their
> >> >> >> > improvements in lab environments using meticulously designed test
> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> >> >> > obtaining perfect data. However, it discourages developers from
> >> >> >> > directly addressing real-world workloads. Sigh.
> >> >> >>
> >> >> >> You cannot know whether your workloads benefit or hurt for the different
> >> >> >> batch number and how in your production environment?  If you cannot, how
> >> >> >> do you decide which workload deploys on which system (with different
> >> >> >> batch number configuration).  If you can, can you provide such
> >> >> >> information to support your patch?
> >> >> >
> >> >> > We leverage a meticulous selection of network metrics, particularly
> >> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> >> >
> >> >> > In instances where a problematic container terminates, we've noticed a
> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> >> > second, which serves as a clear indication that other applications are
> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> >> > frequency of these timeouts to less than one per second.
> >> >>
> >> >> Thanks a lot for sharing this.  I learned much from it!
> >> >>
> >> >> > At present, we're selectively applying this adjustment to clusters
> >> >> > that exclusively host the identified problematic applications, and
> >> >> > we're closely monitoring their performance to ensure stability. To
> >> >> > date, we've observed no network latency issues as a result of this
> >> >> > change. However, we remain cautious about extending this optimization
> >> >> > to other clusters, as the decision ultimately depends on a variety of
> >> >> > factors.
> >> >> >
> >> >> > It's important to note that we're not eager to implement this change
> >> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> >> > consequences. Instead, we're taking a cautious approach by initially
> >> >> > applying it to a limited number of servers. This allows us to assess
> >> >> > its impact and make informed decisions about whether or not to expand
> >> >> > its use in the future.
> >> >>
> >> >> So, you haven't observed any performance hurt yet.  Right?
> >> >
> >> > Right.
> >> >
> >> >> If you
> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> >> >> while.  In the future, if you find the performance of some workloads
> >> >> hurts because of the new batch number, you can repost the patch with the
> >> >> supporting data.  If in the end, the performance of more and more
> >> >> workloads is good with the new batch number.  You may consider to make 0
> >> >> the default value :-)
> >> >
> >> > That is not how the real world works.
> >> >
> >> > In the real world:
> >> >
> >> > - No one knows what may happen in the future.
> >> >   Therefore, if possible, we should make systems flexible, unless
> >> > there is a strong justification for using a hard-coded value.
> >> >
> >> > - Minimize changes whenever possible.
> >> >   These systems have been working fine in the past, even if with lower
> >> > performance. Why make changes just for the sake of improving
> >> > performance? Does the key metric of your performance data truly matter
> >> > for their workload?
> >>
> >> These are good policy in your organization and business.  But, it's not
> >> necessary the policy that Linux kernel upstream should take.
> >
> > You mean the Upstream Linux kernel only designed for the lab ?
> >
> >>
> >> Community needs to consider long-term maintenance overhead, so it adds
> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
> >> In general, it prefer to use a good default value or an automatic
> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
> >> regressions as much as possible, but this will not stop kernel from
> >> changing, even if it's big.
> >
> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
>
> Linux kernel will not break ABI until the last users stop using it.

However, you haven't given a clear reference for why a sysctl knob is
an ABI.

> This usually means tens years if not forever.  Kernel config options
> aren't considered ABI, they are used by developers and distributions.
> They come and go from version to version.
>
> >>
> >> IIUC, because of the different requirements, there are upstream and
> >> downstream kernels.
> >
> > The downstream developer backport features from the upsteam kernel,
> > and if they find issues in the upstream kernel, they should contribute
> > it back. That is how the Linux Community works, right ?
>
> Yes.  If they are issues for upstream kernel too.
>
> --
> Best Regards,
> Huang, Ying



-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  7:36                                     ` Yafang Shao
@ 2024-07-12  8:24                                       ` Huang, Ying
  2024-07-12  8:49                                         ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-12  8:24 UTC (permalink / raw)
  To: Yafang Shao, akpm, Matthew Wilcox; +Cc: mgorman, linux-mm, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >>
>> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> >> >> >> >> >> >> too.
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
>> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> >> >> >> >> >> >> "neutral"?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > No, thanks.
>> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
>> >> >> >> >> >> >> > understanding of the concept.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
>> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> >> >> >> >> >> >> latency").  Which introduces the config.
>> >> >> >> >> >> >
>> >> >> >> >> >> > What specifically are your expectations for how users should utilize
>> >> >> >> >> >> > this config in real production workload?
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> >> >> >> >> >> >> experiments.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
>> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
>> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
>> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
>> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
>> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
>> >> >> >> >> >> > more practical and efficient approaches we can explore together to
>> >> >> >> >> >> > ensure optimal performance across all workloads.
>> >> >> >> >> >> >
>> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
>> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
>> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
>> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
>> >> >> >> >> >>
>> >> >> >> >> >> Have you found that your different systems requires different
>> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> >> >> >> >> >
>> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
>> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
>> >> >> >> >> > default value is also suboptimal. What is the issue with this
>> >> >> >> >> > approach?
>> >> >> >> >>
>> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
>> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
>> >> >> >> >> you run one workload only on one system?
>> >> >> >> >
>> >> >> >> > It seems we're living on different planets. You're happily working in
>> >> >> >> > your lab environment, while I'm struggling with real-world production
>> >> >> >> > issues.
>> >> >> >> >
>> >> >> >> > For servers:
>> >> >> >> >
>> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> >> >> >> > Server 1,000,001 and beyond: Happy with all values
>> >> >> >> >
>> >> >> >> > Is this hard to understand?
>> >> >> >> >
>> >> >> >> > In other words:
>> >> >> >> >
>> >> >> >> > For applications:
>> >> >> >> >
>> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
>> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> >> >> >> > Application 1,000,001 and beyond: Happy with all values
>> >> >> >>
>> >> >> >> Good to know this.  Thanks!
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
>> >> >> >> >> example, we need to use different configuration on different systems
>> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
>> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
>> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
>> >> >> >> >> better to wait until you have more evidences.
>> >> >> >> >
>> >> >> >> > It seems the community encourages developers to experiment with their
>> >> >> >> > improvements in lab environments using meticulously designed test
>> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
>> >> >> >> > obtaining perfect data. However, it discourages developers from
>> >> >> >> > directly addressing real-world workloads. Sigh.
>> >> >> >>
>> >> >> >> You cannot know whether your workloads benefit or hurt for the different
>> >> >> >> batch number and how in your production environment?  If you cannot, how
>> >> >> >> do you decide which workload deploys on which system (with different
>> >> >> >> batch number configuration).  If you can, can you provide such
>> >> >> >> information to support your patch?
>> >> >> >
>> >> >> > We leverage a meticulous selection of network metrics, particularly
>> >> >> > focusing on TcpExt indicators, to keep a close eye on application
>> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
>> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
>> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
>> >> >> >
>> >> >> > In instances where a problematic container terminates, we've noticed a
>> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
>> >> >> > second, which serves as a clear indication that other applications are
>> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
>> >> >> > parameter to 0, we've been able to drastically reduce the maximum
>> >> >> > frequency of these timeouts to less than one per second.
>> >> >>
>> >> >> Thanks a lot for sharing this.  I learned much from it!
>> >> >>
>> >> >> > At present, we're selectively applying this adjustment to clusters
>> >> >> > that exclusively host the identified problematic applications, and
>> >> >> > we're closely monitoring their performance to ensure stability. To
>> >> >> > date, we've observed no network latency issues as a result of this
>> >> >> > change. However, we remain cautious about extending this optimization
>> >> >> > to other clusters, as the decision ultimately depends on a variety of
>> >> >> > factors.
>> >> >> >
>> >> >> > It's important to note that we're not eager to implement this change
>> >> >> > across our entire fleet, as we recognize the potential for unforeseen
>> >> >> > consequences. Instead, we're taking a cautious approach by initially
>> >> >> > applying it to a limited number of servers. This allows us to assess
>> >> >> > its impact and make informed decisions about whether or not to expand
>> >> >> > its use in the future.
>> >> >>
>> >> >> So, you haven't observed any performance hurt yet.  Right?
>> >> >
>> >> > Right.
>> >> >
>> >> >> If you
>> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
>> >> >> while.  In the future, if you find the performance of some workloads
>> >> >> hurts because of the new batch number, you can repost the patch with the
>> >> >> supporting data.  If in the end, the performance of more and more
>> >> >> workloads is good with the new batch number.  You may consider to make 0
>> >> >> the default value :-)
>> >> >
>> >> > That is not how the real world works.
>> >> >
>> >> > In the real world:
>> >> >
>> >> > - No one knows what may happen in the future.
>> >> >   Therefore, if possible, we should make systems flexible, unless
>> >> > there is a strong justification for using a hard-coded value.
>> >> >
>> >> > - Minimize changes whenever possible.
>> >> >   These systems have been working fine in the past, even if with lower
>> >> > performance. Why make changes just for the sake of improving
>> >> > performance? Does the key metric of your performance data truly matter
>> >> > for their workload?
>> >>
>> >> These are good policy in your organization and business.  But, it's not
>> >> necessary the policy that Linux kernel upstream should take.
>> >
>> > You mean the Upstream Linux kernel only designed for the lab ?
>> >
>> >>
>> >> Community needs to consider long-term maintenance overhead, so it adds
>> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
>> >> In general, it prefer to use a good default value or an automatic
>> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
>> >> regressions as much as possible, but this will not stop kernel from
>> >> changing, even if it's big.
>> >
>> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
>>
>> Linux kernel will not break ABI until the last users stop using it.
>
> However, you haven't given a clear reference why the systl is an ABI.

TBH, after some searching I haven't found a formal document that says
this explicitly.

Hi, Andrew, Matthew,

Can you help me with this?  Is sysctl considered Linux kernel ABI, or
something similar?

>> This usually means tens years if not forever.  Kernel config options
>> aren't considered ABI, they are used by developers and distributions.
>> They come and go from version to version.
>>
>> >>
>> >> IIUC, because of the different requirements, there are upstream and
>> >> downstream kernels.
>> >
>> > The downstream developer backport features from the upsteam kernel,
>> > and if they find issues in the upstream kernel, they should contribute
>> > it back. That is how the Linux Community works, right ?
>>
>> Yes.  If they are issues for upstream kernel too.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  8:24                                       ` Huang, Ying
@ 2024-07-12  8:49                                         ` Yafang Shao
  2024-07-12  9:10                                           ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-12  8:49 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, Matthew Wilcox, mgorman, linux-mm, David Rientjes

On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> >> >> >> >> >> >> >> too.
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> >> >> >> >> >> >> "neutral"?
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > No, thanks.
> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> >> >> >> >> > understanding of the concept.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> >> >> >> >> latency").  Which introduces the config.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> >> >> >> > this config in real production workload?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> >> >> >> >> >> >> experiments.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Have you found that your different systems requires different
> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> >> >> >> >> >
> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
> >> >> >> >> >> > approach?
> >> >> >> >> >>
> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
> >> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
> >> >> >> >> >> you run one workload only on one system?
> >> >> >> >> >
> >> >> >> >> > It seems we're living on different planets. You're happily working in
> >> >> >> >> > your lab environment, while I'm struggling with real-world production
> >> >> >> >> > issues.
> >> >> >> >> >
> >> >> >> >> > For servers:
> >> >> >> >> >
> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
> >> >> >> >> >
> >> >> >> >> > Is this hard to understand?
> >> >> >> >> >
> >> >> >> >> > In other words:
> >> >> >> >> >
> >> >> >> >> > For applications:
> >> >> >> >> >
> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
> >> >> >> >>
> >> >> >> >> Good to know this.  Thanks!
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
> >> >> >> >> >> example, we need to use different configuration on different systems
> >> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
> >> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
> >> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
> >> >> >> >> >> better to wait until you have more evidences.
> >> >> >> >> >
> >> >> >> >> > It seems the community encourages developers to experiment with their
> >> >> >> >> > improvements in lab environments using meticulously designed test
> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> >> >> >> > obtaining perfect data. However, it discourages developers from
> >> >> >> >> > directly addressing real-world workloads. Sigh.
> >> >> >> >>
> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
> >> >> >> >> batch number and how in your production environment?  If you cannot, how
> >> >> >> >> do you decide which workload deploys on which system (with different
> >> >> >> >> batch number configuration).  If you can, can you provide such
> >> >> >> >> information to support your patch?
> >> >> >> >
> >> >> >> > We leverage a meticulous selection of network metrics, particularly
> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> >> >> >
> >> >> >> > In instances where a problematic container terminates, we've noticed a
> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> >> >> > second, which serves as a clear indication that other applications are
> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> >> >> > frequency of these timeouts to less than one per second.
> >> >> >>
> >> >> >> Thanks a lot for sharing this.  I learned much from it!
> >> >> >>
> >> >> >> > At present, we're selectively applying this adjustment to clusters
> >> >> >> > that exclusively host the identified problematic applications, and
> >> >> >> > we're closely monitoring their performance to ensure stability. To
> >> >> >> > date, we've observed no network latency issues as a result of this
> >> >> >> > change. However, we remain cautious about extending this optimization
> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
> >> >> >> > factors.
> >> >> >> >
> >> >> >> > It's important to note that we're not eager to implement this change
> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
> >> >> >> > applying it to a limited number of servers. This allows us to assess
> >> >> >> > its impact and make informed decisions about whether or not to expand
> >> >> >> > its use in the future.
> >> >> >>
> >> >> >> So, you haven't observed any performance hurt yet.  Right?
> >> >> >
> >> >> > Right.
> >> >> >
> >> >> >> If you
> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> >> >> >> while.  In the future, if you find the performance of some workloads
> >> >> >> hurts because of the new batch number, you can repost the patch with the
> >> >> >> supporting data.  If in the end, the performance of more and more
> >> >> >> workloads is good with the new batch number.  You may consider to make 0
> >> >> >> the default value :-)
> >> >> >
> >> >> > That is not how the real world works.
> >> >> >
> >> >> > In the real world:
> >> >> >
> >> >> > - No one knows what may happen in the future.
> >> >> >   Therefore, if possible, we should make systems flexible, unless
> >> >> > there is a strong justification for using a hard-coded value.
> >> >> >
> >> >> > - Minimize changes whenever possible.
> >> >> >   These systems have been working fine in the past, even if with lower
> >> >> > performance. Why make changes just for the sake of improving
> >> >> > performance? Does the key metric of your performance data truly matter
> >> >> > for their workload?
> >> >>
> >> >> These are good policy in your organization and business.  But, it's not
> >> >> necessary the policy that Linux kernel upstream should take.
> >> >
> >> > You mean the Upstream Linux kernel only designed for the lab ?
> >> >
> >> >>
> >> >> Community needs to consider long-term maintenance overhead, so it adds
> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
> >> >> In general, it prefer to use a good default value or an automatic
> >> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
> >> >> regressions as much as possible, but this will not stop kernel from
> >> >> changing, even if it's big.
> >> >
> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
> >>
> >> Linux kernel will not break ABI until the last users stop using it.
> >
> > However, you haven't given a clear reference why the systl is an ABI.
>
> TBH, after some searching I haven't found a formal document that says
> this explicitly.
>
> Hi, Andrew, Matthew,
>
> Can you help me with this?  Is sysctl considered Linux kernel ABI, or
> something similar?

In my experience, we consistently utilize an if-statement to configure
sysctl settings in our production environments.

    if [ -f ${sysctl_file} ]; then
        echo ${new_value} > ${sysctl_file}
    fi

Additionally, you can incorporate this into rc.local to ensure the
configuration is applied upon system reboot.

Even if you add it to the sysctl.conf without the if-statement, it
won't break anything.
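
For illustration, a minimal sketch of how such a guarded setting can be
applied with sysctl(8) rather than by writing to procfs directly
(vm.pcp_batch_scale_max is the knob proposed in this series; the value 0
and the drop-in file name are only examples):

    # Apply at runtime; -e tells sysctl to ignore unknown keys, so the
    # same script also works on kernels without this knob.
    sysctl -e -w vm.pcp_batch_scale_max=0

    # Or persist it across reboots via a sysctl.d drop-in; an unknown
    # key only produces a warning at boot and breaks nothing.
    echo "vm.pcp_batch_scale_max = 0" > /etc/sysctl.d/99-pcp-batch.conf
    sysctl --system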

The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction,
underwent a naming change along with a functional update from its
predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c
("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
this significant change, there have been no reported issues or
complaints, suggesting that the renaming and functional update have
not negatively impacted the system's functionality.
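
As a sketch of why such a rename does not break our tooling, a
deployment script can probe which knob the running kernel exposes before
deciding what, if anything, to write (the two paths below are the procfs
names before and after that commit; any value written would of course
have to match the semantics of the interface that is found):

    # Detect which pcp fraction knob this kernel provides, if any.
    if [ -f /proc/sys/vm/percpu_pagelist_high_fraction ]; then
        knob=vm.percpu_pagelist_high_fraction
    elif [ -f /proc/sys/vm/percpu_pagelist_fraction ]; then
        knob=vm.percpu_pagelist_fraction
    else
        knob=""
    fi
    # When ${knob} is non-empty, it can be set with sysctl -w using a
    # value appropriate to that interface.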

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  8:49                                         ` Yafang Shao
@ 2024-07-12  9:10                                           ` Huang, Ying
  2024-07-12  9:24                                             ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-12  9:10 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, Matthew Wilcox, mgorman, linux-mm, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >>
>> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
>> >> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> >> >> >> >> >> >> >> too.
>> >> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
>> >> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
>> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> >> >> >> >> >> >> >> "neutral"?
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > No, thanks.
>> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
>> >> >> >> >> >> >> >> > understanding of the concept.
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
>> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> >> >> >> >> >> >> >> latency").  Which introduces the config.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
>> >> >> >> >> >> >> > this config in real production workload?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> >> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> >> >> >> >> >> >> >> experiments.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
>> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
>> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
>> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
>> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
>> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
>> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
>> >> >> >> >> >> >> > ensure optimal performance across all workloads.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
>> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
>> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
>> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Have you found that your different systems requires different
>> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> >> >> >> >> >> >
>> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
>> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
>> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
>> >> >> >> >> >> > approach?
>> >> >> >> >> >>
>> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
>> >> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
>> >> >> >> >> >> you run one workload only on one system?
>> >> >> >> >> >
>> >> >> >> >> > It seems we're living on different planets. You're happily working in
>> >> >> >> >> > your lab environment, while I'm struggling with real-world production
>> >> >> >> >> > issues.
>> >> >> >> >> >
>> >> >> >> >> > For servers:
>> >> >> >> >> >
>> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
>> >> >> >> >> >
>> >> >> >> >> > Is this hard to understand?
>> >> >> >> >> >
>> >> >> >> >> > In other words:
>> >> >> >> >> >
>> >> >> >> >> > For applications:
>> >> >> >> >> >
>> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
>> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
>> >> >> >> >>
>> >> >> >> >> Good to know this.  Thanks!
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
>> >> >> >> >> >> example, we need to use different configuration on different systems
>> >> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
>> >> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
>> >> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
>> >> >> >> >> >> better to wait until you have more evidences.
>> >> >> >> >> >
>> >> >> >> >> > It seems the community encourages developers to experiment with their
>> >> >> >> >> > improvements in lab environments using meticulously designed test
>> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
>> >> >> >> >> > obtaining perfect data. However, it discourages developers from
>> >> >> >> >> > directly addressing real-world workloads. Sigh.
>> >> >> >> >>
>> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
>> >> >> >> >> batch number and how in your production environment?  If you cannot, how
>> >> >> >> >> do you decide which workload deploys on which system (with different
>> >> >> >> >> batch number configuration).  If you can, can you provide such
>> >> >> >> >> information to support your patch?
>> >> >> >> >
>> >> >> >> > We leverage a meticulous selection of network metrics, particularly
>> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
>> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
>> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
>> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
>> >> >> >> >
>> >> >> >> > In instances where a problematic container terminates, we've noticed a
>> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
>> >> >> >> > second, which serves as a clear indication that other applications are
>> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
>> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
>> >> >> >> > frequency of these timeouts to less than one per second.
>> >> >> >>
>> >> >> >> Thanks a lot for sharing this.  I learned much from it!
>> >> >> >>
>> >> >> >> > At present, we're selectively applying this adjustment to clusters
>> >> >> >> > that exclusively host the identified problematic applications, and
>> >> >> >> > we're closely monitoring their performance to ensure stability. To
>> >> >> >> > date, we've observed no network latency issues as a result of this
>> >> >> >> > change. However, we remain cautious about extending this optimization
>> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
>> >> >> >> > factors.
>> >> >> >> >
>> >> >> >> > It's important to note that we're not eager to implement this change
>> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
>> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
>> >> >> >> > applying it to a limited number of servers. This allows us to assess
>> >> >> >> > its impact and make informed decisions about whether or not to expand
>> >> >> >> > its use in the future.
>> >> >> >>
>> >> >> >> So, you haven't observed any performance hurt yet.  Right?
>> >> >> >
>> >> >> > Right.
>> >> >> >
>> >> >> >> If you
>> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
>> >> >> >> while.  In the future, if you find the performance of some workloads
>> >> >> >> hurts because of the new batch number, you can repost the patch with the
>> >> >> >> supporting data.  If in the end, the performance of more and more
>> >> >> >> workloads is good with the new batch number.  You may consider to make 0
>> >> >> >> the default value :-)
>> >> >> >
>> >> >> > That is not how the real world works.
>> >> >> >
>> >> >> > In the real world:
>> >> >> >
>> >> >> > - No one knows what may happen in the future.
>> >> >> >   Therefore, if possible, we should make systems flexible, unless
>> >> >> > there is a strong justification for using a hard-coded value.
>> >> >> >
>> >> >> > - Minimize changes whenever possible.
>> >> >> >   These systems have been working fine in the past, even if with lower
>> >> >> > performance. Why make changes just for the sake of improving
>> >> >> > performance? Does the key metric of your performance data truly matter
>> >> >> > for their workload?
>> >> >>
>> >> >> These are good policy in your organization and business.  But, it's not
>> >> >> necessary the policy that Linux kernel upstream should take.
>> >> >
>> >> > You mean the Upstream Linux kernel only designed for the lab ?
>> >> >
>> >> >>
>> >> >> Community needs to consider long-term maintenance overhead, so it adds
>> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
>> >> >> In general, it prefer to use a good default value or an automatic
>> >> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
>> >> >> regressions as much as possible, but this will not stop kernel from
>> >> >> changing, even if it's big.
>> >> >
>> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
>> >>
>> >> Linux kernel will not break ABI until the last users stop using it.
>> >
>> > However, you haven't given a clear reference why the systl is an ABI.
>>
>> TBH, after some searching I haven't found a formal document that says
>> this explicitly.
>>
>> Hi, Andrew, Matthew,
>>
>> Can you help me with this?  Is sysctl considered Linux kernel ABI, or
>> something similar?
>
> In my experience, we consistently utilize an if-statement to configure
> sysctl settings in our production environments.
>
>     if [ -f ${sysctl_file} ]; then
>         echo ${new_value} > ${sysctl_file}
>     fi
>
> Additionally, you can incorporate this into rc.local to ensure the
> configuration is applied upon system reboot.
>
> Even if you add it to the sysctl.conf without the if-statement, it
> won't break anything.
>
> The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction,
> underwent a naming change along with a functional update from its
> predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c
> ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
> this significant change, there have been no reported issues or
> complaints, suggesting that the renaming and functional update have
> not negatively impacted the system's functionality.

Thanks for the information.  From that commit, it appears sysctl isn't
considered kernel ABI.

Even so, IMHO, we shouldn't introduce a user-tunable knob without a
real-world requirement beyond more flexibility.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  9:10                                           ` Huang, Ying
@ 2024-07-12  9:24                                             ` Yafang Shao
  2024-07-12  9:46                                               ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-12  9:24 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, Matthew Wilcox, mgorman, linux-mm, David Rientjes

On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> >> >> >> >> >> >> >> >> >> too.
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> >> >> >> >> >> >> >> "neutral"?
> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> > No, thanks.
> >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> >> >> >> >> >> > understanding of the concept.
> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> >> >> >> >> >> latency").  Which introduces the config.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> >> >> >> >> > this config in real production workload?
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> >> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> >> >> >> >> >> >> >> experiments.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> >> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Have you found that your different systems requires different
> >> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
> >> >> >> >> >> >> > approach?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
> >> >> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
> >> >> >> >> >> >> you run one workload only on one system?
> >> >> >> >> >> >
> >> >> >> >> >> > It seems we're living on different planets. You're happily working in
> >> >> >> >> >> > your lab environment, while I'm struggling with real-world production
> >> >> >> >> >> > issues.
> >> >> >> >> >> >
> >> >> >> >> >> > For servers:
> >> >> >> >> >> >
> >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
> >> >> >> >> >> >
> >> >> >> >> >> > Is this hard to understand?
> >> >> >> >> >> >
> >> >> >> >> >> > In other words:
> >> >> >> >> >> >
> >> >> >> >> >> > For applications:
> >> >> >> >> >> >
> >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
> >> >> >> >> >>
> >> >> >> >> >> Good to know this.  Thanks!
> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
> >> >> >> >> >> >> example, we need to use different configuration on different systems
> >> >> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
> >> >> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
> >> >> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
> >> >> >> >> >> >> better to wait until you have more evidences.
> >> >> >> >> >> >
> >> >> >> >> >> > It seems the community encourages developers to experiment with their
> >> >> >> >> >> > improvements in lab environments using meticulously designed test
> >> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> >> >> >> >> > obtaining perfect data. However, it discourages developers from
> >> >> >> >> >> > directly addressing real-world workloads. Sigh.
> >> >> >> >> >>
> >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
> >> >> >> >> >> batch number and how in your production environment?  If you cannot, how
> >> >> >> >> >> do you decide which workload deploys on which system (with different
> >> >> >> >> >> batch number configuration).  If you can, can you provide such
> >> >> >> >> >> information to support your patch?
> >> >> >> >> >
> >> >> >> >> > We leverage a meticulous selection of network metrics, particularly
> >> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> >> >> >> >
> >> >> >> >> > In instances where a problematic container terminates, we've noticed a
> >> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> >> >> >> > second, which serves as a clear indication that other applications are
> >> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> >> >> >> > frequency of these timeouts to less than one per second.
> >> >> >> >>
> >> >> >> >> Thanks a lot for sharing this.  I learned much from it!
> >> >> >> >>
> >> >> >> >> > At present, we're selectively applying this adjustment to clusters
> >> >> >> >> > that exclusively host the identified problematic applications, and
> >> >> >> >> > we're closely monitoring their performance to ensure stability. To
> >> >> >> >> > date, we've observed no network latency issues as a result of this
> >> >> >> >> > change. However, we remain cautious about extending this optimization
> >> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
> >> >> >> >> > factors.
> >> >> >> >> >
> >> >> >> >> > It's important to note that we're not eager to implement this change
> >> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
> >> >> >> >> > applying it to a limited number of servers. This allows us to assess
> >> >> >> >> > its impact and make informed decisions about whether or not to expand
> >> >> >> >> > its use in the future.
> >> >> >> >>
> >> >> >> >> So, you haven't observed any performance hurt yet.  Right?
> >> >> >> >
> >> >> >> > Right.
> >> >> >> >
> >> >> >> >> If you
> >> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> >> >> >> >> while.  In the future, if you find the performance of some workloads
> >> >> >> >> hurts because of the new batch number, you can repost the patch with the
> >> >> >> >> supporting data.  If in the end, the performance of more and more
> >> >> >> >> workloads is good with the new batch number.  You may consider to make 0
> >> >> >> >> the default value :-)
> >> >> >> >
> >> >> >> > That is not how the real world works.
> >> >> >> >
> >> >> >> > In the real world:
> >> >> >> >
> >> >> >> > - No one knows what may happen in the future.
> >> >> >> >   Therefore, if possible, we should make systems flexible, unless
> >> >> >> > there is a strong justification for using a hard-coded value.
> >> >> >> >
> >> >> >> > - Minimize changes whenever possible.
> >> >> >> >   These systems have been working fine in the past, even if with lower
> >> >> >> > performance. Why make changes just for the sake of improving
> >> >> >> > performance? Does the key metric of your performance data truly matter
> >> >> >> > for their workload?
> >> >> >>
> >> >> >> These are good policy in your organization and business.  But, it's not
> >> >> >> necessary the policy that Linux kernel upstream should take.
> >> >> >
> >> >> > You mean the Upstream Linux kernel only designed for the lab ?
> >> >> >
> >> >> >>
> >> >> >> Community needs to consider long-term maintenance overhead, so it adds
> >> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
> >> >> >> In general, it prefer to use a good default value or an automatic
> >> >> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
> >> >> >> regressions as much as possible, but this will not stop kernel from
> >> >> >> changing, even if it's big.
> >> >> >
> >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
> >> >>
> >> >> Linux kernel will not break ABI until the last users stop using it.
> >> >
> >> > However, you haven't given a clear reference why the systl is an ABI.
> >>
> >> TBH, after some searching I haven't found a formal document that says
> >> this explicitly.
> >>
> >> Hi, Andrew, Matthew,
> >>
> >> Can you help me with this?  Is sysctl considered Linux kernel ABI, or
> >> something similar?
> >
> > In my experience, we consistently utilize an if-statement to configure
> > sysctl settings in our production environments.
> >
> >     if [ -f ${sysctl_file} ]; then
> >         echo ${new_value} > ${sysctl_file}
> >     fi
> >
> > Additionally, you can incorporate this into rc.local to ensure the
> > configuration is applied upon system reboot.
> >
> > Even if you add it to the sysctl.conf without the if-statement, it
> > won't break anything.
> >
> > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction,
> > underwent a naming change along with a functional update from its
> > predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c
> > ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
> > this significant change, there have been no reported issues or
> > complaints, suggesting that the renaming and functional update have
> > not negatively impacted the system's functionality.
>
> Thanks for the information.  From that commit, it appears sysctl isn't
> considered kernel ABI.
>
> Even so, IMHO, we shouldn't introduce a user-tunable knob without a
> real-world requirement beyond more flexibility.

Indeed, I do not reside in the physical realm but within a virtualized
universe. (Of course, that is your perspective.)


--
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  9:24                                             ` Yafang Shao
@ 2024-07-12  9:46                                               ` Yafang Shao
  2024-07-15  1:09                                                 ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2024-07-12  9:46 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, Matthew Wilcox, mgorman, linux-mm, David Rientjes

On Fri, Jul 12, 2024 at 5:24 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Yafang Shao <laoar.shao@gmail.com> writes:
> >
> > > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >>
> > >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >>
> > >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >>
> > >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >>
> > >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >> >>
> > >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >>
> > >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >> >> >>
> > >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >> >>
> > >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >> >> >> >>
> > >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> > >> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> > >> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> > >> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> > >> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> > >> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> > >> >> >> >> >> >> >> >> >> >> too.
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> > >> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> > >> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> > >> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> > >> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> > >> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> > >> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> > >> >> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> > >> >> >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> > >> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> > >> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> > >> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> > >> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> > >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> > >> >> >> >> >> >> >> >> >> "neutral"?
> > >> >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> >> > No, thanks.
> > >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> > >> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> > >> >> >> >> >> >> >> >> > understanding of the concept.
> > >> >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> > >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
> > >> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> > >> >> >> >> >> >> >> >> latency").  Which introduces the config.
> > >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
> > >> >> >> >> >> >> >> > this config in real production workload?
> > >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> > >> >> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> > >> >> >> >> >> >> >> >> experiments.
> > >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> > >> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> > >> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> > >> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> > >> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> > >> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
> > >> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
> > >> >> >> >> >> >> >> > ensure optimal performance across all workloads.
> > >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> > >> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> > >> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
> > >> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> > >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> Have you found that your different systems requires different
> > >> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> > >> >> >> >> >> >> >
> > >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> > >> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> > >> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
> > >> >> >> >> >> >> > approach?
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
> > >> >> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
> > >> >> >> >> >> >> you run one workload only on one system?
> > >> >> >> >> >> >
> > >> >> >> >> >> > It seems we're living on different planets. You're happily working in
> > >> >> >> >> >> > your lab environment, while I'm struggling with real-world production
> > >> >> >> >> >> > issues.
> > >> >> >> >> >> >
> > >> >> >> >> >> > For servers:
> > >> >> >> >> >> >
> > >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> > >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> > >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
> > >> >> >> >> >> >
> > >> >> >> >> >> > Is this hard to understand?
> > >> >> >> >> >> >
> > >> >> >> >> >> > In other words:
> > >> >> >> >> >> >
> > >> >> >> >> >> > For applications:
> > >> >> >> >> >> >
> > >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> > >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> > >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
> > >> >> >> >> >>
> > >> >> >> >> >> Good to know this.  Thanks!
> > >> >> >> >> >>
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
> > >> >> >> >> >> >> example, we need to use different configuration on different systems
> > >> >> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
> > >> >> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
> > >> >> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
> > >> >> >> >> >> >> better to wait until you have more evidences.
> > >> >> >> >> >> >
> > >> >> >> >> >> > It seems the community encourages developers to experiment with their
> > >> >> >> >> >> > improvements in lab environments using meticulously designed test
> > >> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> > >> >> >> >> >> > obtaining perfect data. However, it discourages developers from
> > >> >> >> >> >> > directly addressing real-world workloads. Sigh.
> > >> >> >> >> >>
> > >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
> > >> >> >> >> >> batch number and how in your production environment?  If you cannot, how
> > >> >> >> >> >> do you decide which workload deploys on which system (with different
> > >> >> >> >> >> batch number configuration).  If you can, can you provide such
> > >> >> >> >> >> information to support your patch?
> > >> >> >> >> >
> > >> >> >> >> > We leverage a meticulous selection of network metrics, particularly
> > >> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
> > >> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> > >> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> > >> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> > >> >> >> >> >
> > >> >> >> >> > In instances where a problematic container terminates, we've noticed a
> > >> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> > >> >> >> >> > second, which serves as a clear indication that other applications are
> > >> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> > >> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
> > >> >> >> >> > frequency of these timeouts to less than one per second.
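
For reference, the TcpExt counters listed above can be watched with the
nstat tool from iproute2; a minimal sketch (the one-second interval and
the choice of counters are only examples):

    # print per-interval deltas of the counters discussed above;
    # nstat reports changes since its previous run, -z includes zero values
    while true; do
        nstat -z TcpExt.TCPTimeouts TcpExt.RetransSegs TcpExt.TCPOFOQueue
        sleep 1
    done
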
> > >> >> >> >>
> > >> >> >> >> Thanks a lot for sharing this.  I learned much from it!
> > >> >> >> >>
> > >> >> >> >> > At present, we're selectively applying this adjustment to clusters
> > >> >> >> >> > that exclusively host the identified problematic applications, and
> > >> >> >> >> > we're closely monitoring their performance to ensure stability. To
> > >> >> >> >> > date, we've observed no network latency issues as a result of this
> > >> >> >> >> > change. However, we remain cautious about extending this optimization
> > >> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
> > >> >> >> >> > factors.
> > >> >> >> >> >
> > >> >> >> >> > It's important to note that we're not eager to implement this change
> > >> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
> > >> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
> > >> >> >> >> > applying it to a limited number of servers. This allows us to assess
> > >> >> >> >> > its impact and make informed decisions about whether or not to expand
> > >> >> >> >> > its use in the future.
> > >> >> >> >>
> > >> >> >> >> So, you haven't observed any performance hurt yet.  Right?
> > >> >> >> >
> > >> >> >> > Right.
> > >> >> >> >
> > >> >> >> >> If you
> > >> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> > >> >> >> >> while.  In the future, if you find the performance of some workloads
> > >> >> >> >> hurts because of the new batch number, you can repost the patch with the
> > >> >> >> >> supporting data.  If in the end, the performance of more and more
> > >> >> >> >> workloads is good with the new batch number.  You may consider to make 0
> > >> >> >> >> the default value :-)
> > >> >> >> >
> > >> >> >> > That is not how the real world works.
> > >> >> >> >
> > >> >> >> > In the real world:
> > >> >> >> >
> > >> >> >> > - No one knows what may happen in the future.
> > >> >> >> >   Therefore, if possible, we should make systems flexible, unless
> > >> >> >> > there is a strong justification for using a hard-coded value.
> > >> >> >> >
> > >> >> >> > - Minimize changes whenever possible.
> > >> >> >> >   These systems have been working fine in the past, even if with lower
> > >> >> >> > performance. Why make changes just for the sake of improving
> > >> >> >> > performance? Does the key metric of your performance data truly matter
> > >> >> >> > for their workload?
> > >> >> >>
> > >> >> >> These are good policy in your organization and business.  But, it's not
> > >> >> >> necessary the policy that Linux kernel upstream should take.
> > >> >> >
> > >> >> > You mean the Upstream Linux kernel only designed for the lab ?
> > >> >> >
> > >> >> >>
> > >> >> >> Community needs to consider long-term maintenance overhead, so it adds
> > >> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
> > >> >> >> In general, it prefer to use a good default value or an automatic
> > >> >> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
> > >> >> >> regressions as much as possible, but this will not stop kernel from
> > >> >> >> changing, even if it's big.
> > >> >> >
> > >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
> > >> >>
> > >> >> Linux kernel will not break ABI until the last users stop using it.
> > >> >
> > >> > However, you haven't given a clear reference why the systl is an ABI.
> > >>
> > >> TBH, I don't find a formal document said it explicitly after some
> > >> searching.
> > >>
> > >> Hi, Andrew, Matthew,
> > >>
> > >> Can you help me on this?  Whether sysctl is considered Linux kernel ABI?
> > >> Or something similar?
> > >
> > > In my experience, we consistently utilize an if-statement to configure
> > > sysctl settings in our production environments.
> > >
> > >     if [ -f ${sysctl_file} ]; then
> > >         echo ${new_value} > ${sysctl_file}
> > >     fi
> > >
> > > Additionally, you can incorporate this into rc.local to ensure the
> > > configuration is applied upon system reboot.
> > >
> > > Even if you add it to the sysctl.conf without the if-statement, it
> > > won't break anything.
> > >
> > > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction,
> > > underwent a naming change along with a functional update from its
> > > predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c
> > > ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
> > > this significant change, there have been no reported issues or
> > > complaints, suggesting that the renaming and functional update have
> > > not negatively impacted the system's functionality.
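
To make the if-statement and sysctl.conf approach quoted above concrete,
a fleet-wide rollout could ship a drop-in file and load it with sysctl's
-e (--ignore) option, which skips keys the running kernel does not
provide; a minimal sketch (the file name is only an example):

    # /etc/sysctl.d/90-pcp-batch.conf -- only takes effect on patched kernels
    vm.pcp_batch_scale_max = 0

    # apply the file, ignoring unknown keys on kernels without the knob
    sysctl -e -p /etc/sysctl.d/90-pcp-batch.conf
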
> >
> > Thanks for your information.  From the commit, sysctl isn't considered
> > as the kernel ABI.
> >
> > Even if so, IMHO, we shouldn't introduce a user tunable knob without a
> > real world requirements except more flexibility.
>
> Indeed, I do not reside in the physical realm but within a virtualized
> universe. (Of course, that is your perspective.)

One final note: you explained very well what "neutral" means. Thank
you for your comments.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-12  9:46                                               ` Yafang Shao
@ 2024-07-15  1:09                                                 ` Huang, Ying
  2024-07-15  4:32                                                   ` Yafang Shao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2024-07-15  1:09 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, Matthew Wilcox, mgorman, linux-mm, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Fri, Jul 12, 2024 at 5:24 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > Yafang Shao <laoar.shao@gmail.com> writes:
>> >
>> > > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >>
>> > >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >>
>> > >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >>
>> > >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >>
>> > >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >>
>> > >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >> >>
>> > >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >>
>> > >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >> >> >>
>> > >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >>
>> > >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >> >> >> >>
>> > >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >> >>
>> > >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >> >> >> >> >>
>> > >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> > >> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> > >> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> > >> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> > >> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> > >> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> > >> >> >> >> >> >> >> >> >> >> too.
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> > >> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> > >> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> > >> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> > >> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> > >> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> > >> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> > >> >> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> > >> >> >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> > >> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
>> > >> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> > >> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> > >> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
>> > >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> > >> >> >> >> >> >> >> >> >> "neutral"?
>> > >> >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >> > No, thanks.
>> > >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> > >> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
>> > >> >> >> >> >> >> >> >> > understanding of the concept.
>> > >> >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> > >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
>> > >> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> > >> >> >> >> >> >> >> >> latency").  Which introduces the config.
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
>> > >> >> >> >> >> >> >> > this config in real production workload?
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> > >> >> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> > >> >> >> >> >> >> >> >> experiments.
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
>> > >> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
>> > >> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
>> > >> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
>> > >> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
>> > >> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
>> > >> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
>> > >> >> >> >> >> >> >> > ensure optimal performance across all workloads.
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
>> > >> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
>> > >> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
>> > >> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
>> > >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> Have you found that your different systems requires different
>> > >> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> > >> >> >> >> >> >> >
>> > >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
>> > >> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
>> > >> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
>> > >> >> >> >> >> >> > approach?
>> > >> >> >> >> >> >>
>> > >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
>> > >> >> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
>> > >> >> >> >> >> >> you run one workload only on one system?
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > It seems we're living on different planets. You're happily working in
>> > >> >> >> >> >> > your lab environment, while I'm struggling with real-world production
>> > >> >> >> >> >> > issues.
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > For servers:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>> > >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> > >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Is this hard to understand?
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > In other words:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > For applications:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
>> > >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> > >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
>> > >> >> >> >> >>
>> > >> >> >> >> >> Good to know this.  Thanks!
>> > >> >> >> >> >>
>> > >> >> >> >> >> >>
>> > >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
>> > >> >> >> >> >> >> example, we need to use different configuration on different systems
>> > >> >> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
>> > >> >> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
>> > >> >> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
>> > >> >> >> >> >> >> better to wait until you have more evidences.
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > It seems the community encourages developers to experiment with their
>> > >> >> >> >> >> > improvements in lab environments using meticulously designed test
>> > >> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
>> > >> >> >> >> >> > obtaining perfect data. However, it discourages developers from
>> > >> >> >> >> >> > directly addressing real-world workloads. Sigh.
>> > >> >> >> >> >>
>> > >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
>> > >> >> >> >> >> batch number and how in your production environment?  If you cannot, how
>> > >> >> >> >> >> do you decide which workload deploys on which system (with different
>> > >> >> >> >> >> batch number configuration).  If you can, can you provide such
>> > >> >> >> >> >> information to support your patch?
>> > >> >> >> >> >
>> > >> >> >> >> > We leverage a meticulous selection of network metrics, particularly
>> > >> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
>> > >> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
>> > >> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
>> > >> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
>> > >> >> >> >> >
>> > >> >> >> >> > In instances where a problematic container terminates, we've noticed a
>> > >> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
>> > >> >> >> >> > second, which serves as a clear indication that other applications are
>> > >> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
>> > >> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
>> > >> >> >> >> > frequency of these timeouts to less than one per second.
>> > >> >> >> >>
>> > >> >> >> >> Thanks a lot for sharing this.  I learned much from it!
>> > >> >> >> >>
>> > >> >> >> >> > At present, we're selectively applying this adjustment to clusters
>> > >> >> >> >> > that exclusively host the identified problematic applications, and
>> > >> >> >> >> > we're closely monitoring their performance to ensure stability. To
>> > >> >> >> >> > date, we've observed no network latency issues as a result of this
>> > >> >> >> >> > change. However, we remain cautious about extending this optimization
>> > >> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
>> > >> >> >> >> > factors.
>> > >> >> >> >> >
>> > >> >> >> >> > It's important to note that we're not eager to implement this change
>> > >> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
>> > >> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
>> > >> >> >> >> > applying it to a limited number of servers. This allows us to assess
>> > >> >> >> >> > its impact and make informed decisions about whether or not to expand
>> > >> >> >> >> > its use in the future.
>> > >> >> >> >>
>> > >> >> >> >> So, you haven't observed any performance hurt yet.  Right?
>> > >> >> >> >
>> > >> >> >> > Right.
>> > >> >> >> >
>> > >> >> >> >> If you
>> > >> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
>> > >> >> >> >> while.  In the future, if you find the performance of some workloads
>> > >> >> >> >> hurts because of the new batch number, you can repost the patch with the
>> > >> >> >> >> supporting data.  If in the end, the performance of more and more
>> > >> >> >> >> workloads is good with the new batch number.  You may consider to make 0
>> > >> >> >> >> the default value :-)
>> > >> >> >> >
>> > >> >> >> > That is not how the real world works.
>> > >> >> >> >
>> > >> >> >> > In the real world:
>> > >> >> >> >
>> > >> >> >> > - No one knows what may happen in the future.
>> > >> >> >> >   Therefore, if possible, we should make systems flexible, unless
>> > >> >> >> > there is a strong justification for using a hard-coded value.
>> > >> >> >> >
>> > >> >> >> > - Minimize changes whenever possible.
>> > >> >> >> >   These systems have been working fine in the past, even if with lower
>> > >> >> >> > performance. Why make changes just for the sake of improving
>> > >> >> >> > performance? Does the key metric of your performance data truly matter
>> > >> >> >> > for their workload?
>> > >> >> >>
>> > >> >> >> These are good policy in your organization and business.  But, it's not
>> > >> >> >> necessary the policy that Linux kernel upstream should take.
>> > >> >> >
>> > >> >> > You mean the Upstream Linux kernel only designed for the lab ?
>> > >> >> >
>> > >> >> >>
>> > >> >> >> Community needs to consider long-term maintenance overhead, so it adds
>> > >> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
>> > >> >> >> In general, it prefer to use a good default value or an automatic
>> > >> >> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
>> > >> >> >> regressions as much as possible, but this will not stop kernel from
>> > >> >> >> changing, even if it's big.
>> > >> >> >
>> > >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
>> > >> >>
>> > >> >> Linux kernel will not break ABI until the last users stop using it.
>> > >> >
>> > >> > However, you haven't given a clear reference why the systl is an ABI.
>> > >>
>> > >> TBH, I don't find a formal document said it explicitly after some
>> > >> searching.
>> > >>
>> > >> Hi, Andrew, Matthew,
>> > >>
>> > >> Can you help me on this?  Whether sysctl is considered Linux kernel ABI?
>> > >> Or something similar?
>> > >
>> > > In my experience, we consistently utilize an if-statement to configure
>> > > sysctl settings in our production environments.
>> > >
>> > >     if [ -f ${sysctl_file} ]; then
>> > >         echo ${new_value} > ${sysctl_file}
>> > >     fi
>> > >
>> > > Additionally, you can incorporate this into rc.local to ensure the
>> > > configuration is applied upon system reboot.
>> > >
>> > > Even if you add it to the sysctl.conf without the if-statement, it
>> > > won't break anything.
>> > >
>> > > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction,
>> > > underwent a naming change along with a functional update from its
>> > > predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c
>> > > ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
>> > > this significant change, there have been no reported issues or
>> > > complaints, suggesting that the renaming and functional update have
>> > > not negatively impacted the system's functionality.
>> >
>> > Thanks for your information.  From the commit, sysctl isn't considered
>> > as the kernel ABI.
>> >
>> > Even if so, IMHO, we shouldn't introduce a user tunable knob without a
>> > real world requirements except more flexibility.
>>
>> Indeed, I do not reside in the physical realm but within a virtualized
>> universe. (Of course, that is your perspective.)
>
> One final note: you explained very well what "neutral" means. Thank
> you for your comments.

Originally, my opinion of the change was neutral.  But, after more
thought, I have changed my opinion to "we need more evidence to prove
that the knob is necessary".

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-07-15  1:09                                                 ` Huang, Ying
@ 2024-07-15  4:32                                                   ` Yafang Shao
  0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2024-07-15  4:32 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, Matthew Wilcox, mgorman, linux-mm, David Rientjes

On Mon, Jul 15, 2024 at 9:11 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Fri, Jul 12, 2024 at 5:24 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >
> >> > Yafang Shao <laoar.shao@gmail.com> writes:
> >> >
> >> > > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >>
> >> > >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >>
> >> > >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >>
> >> > >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >>
> >> > >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >>
> >> > >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >> >>
> >> > >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >>
> >> > >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >> >> >>
> >> > >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >> >> >>
> >> > >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >> >> >> >> >>
> >> > >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> > >> >> >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> > >> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> > >> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> > >> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> > >> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
> >> > >> >> >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
> >> > >> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
> >> > >> >> >> >> >> >> >> >> >> >> too.
> >> > >> >> >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> > >> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> > >> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> > >> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> > >> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> > >> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> > >> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> > >> >> >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> > >> >> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> > >> >> >> >> >> >> >> >> >> >
> >> > >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> > >> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> > >> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> > >> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> > >> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> > >> >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> > >> >> >> >> >> >> >> >> >> "neutral"?
> >> > >> >> >> >> >> >> >> >> >
> >> > >> >> >> >> >> >> >> >> > No, thanks.
> >> > >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> > >> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> > >> >> >> >> >> >> >> >> > understanding of the concept.
> >> > >> >> >> >> >> >> >> >> >
> >> > >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> > >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> > >> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> > >> >> >> >> >> >> >> >> latency").  Which introduces the config.
> >> > >> >> >> >> >> >> >> >
> >> > >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> > >> >> >> >> >> >> >> > this config in real production workload?
> >> > >> >> >> >> >> >> >> >
> >> > >> >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> > >> >> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> > >> >> >> >> >> >> >> >> experiments.
> >> > >> >> >> >> >> >> >> >
> >> > >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> > >> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> > >> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> > >> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> > >> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> > >> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> > >> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> > >> >> >> >> >> >> >> > ensure optimal performance across all workloads.
> >> > >> >> >> >> >> >> >> >
> >> > >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> > >> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> > >> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> > >> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> > >> >> >> >> >> >> >>
> >> > >> >> >> >> >> >> >> Have you found that your different systems requires different
> >> > >> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> > >> >> >> >> >> >> >
> >> > >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> > >> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> > >> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
> >> > >> >> >> >> >> >> > approach?
> >> > >> >> >> >> >> >>
> >> > >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
> >> > >> >> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
> >> > >> >> >> >> >> >> you run one workload only on one system?
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > It seems we're living on different planets. You're happily working in
> >> > >> >> >> >> >> > your lab environment, while I'm struggling with real-world production
> >> > >> >> >> >> >> > issues.
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > For servers:
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> > >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> > >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > Is this hard to understand?
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > In other words:
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > For applications:
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> > >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> > >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> Good to know this.  Thanks!
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> >>
> >> > >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
> >> > >> >> >> >> >> >> example, we need to use different configuration on different systems
> >> > >> >> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
> >> > >> >> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
> >> > >> >> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
> >> > >> >> >> >> >> >> better to wait until you have more evidences.
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > It seems the community encourages developers to experiment with their
> >> > >> >> >> >> >> > improvements in lab environments using meticulously designed test
> >> > >> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> > >> >> >> >> >> > obtaining perfect data. However, it discourages developers from
> >> > >> >> >> >> >> > directly addressing real-world workloads. Sigh.
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
> >> > >> >> >> >> >> batch number and how in your production environment?  If you cannot, how
> >> > >> >> >> >> >> do you decide which workload deploys on which system (with different
> >> > >> >> >> >> >> batch number configuration).  If you can, can you provide such
> >> > >> >> >> >> >> information to support your patch?
> >> > >> >> >> >> >
> >> > >> >> >> >> > We leverage a meticulous selection of network metrics, particularly
> >> > >> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> > >> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> > >> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> > >> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> > >> >> >> >> >
> >> > >> >> >> >> > In instances where a problematic container terminates, we've noticed a
> >> > >> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> > >> >> >> >> > second, which serves as a clear indication that other applications are
> >> > >> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> > >> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> > >> >> >> >> > frequency of these timeouts to less than one per second.
> >> > >> >> >> >>
> >> > >> >> >> >> Thanks a lot for sharing this.  I learned much from it!
> >> > >> >> >> >>
> >> > >> >> >> >> > At present, we're selectively applying this adjustment to clusters
> >> > >> >> >> >> > that exclusively host the identified problematic applications, and
> >> > >> >> >> >> > we're closely monitoring their performance to ensure stability. To
> >> > >> >> >> >> > date, we've observed no network latency issues as a result of this
> >> > >> >> >> >> > change. However, we remain cautious about extending this optimization
> >> > >> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
> >> > >> >> >> >> > factors.
> >> > >> >> >> >> >
> >> > >> >> >> >> > It's important to note that we're not eager to implement this change
> >> > >> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> > >> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
> >> > >> >> >> >> > applying it to a limited number of servers. This allows us to assess
> >> > >> >> >> >> > its impact and make informed decisions about whether or not to expand
> >> > >> >> >> >> > its use in the future.
> >> > >> >> >> >>
> >> > >> >> >> >> So, you haven't observed any performance hurt yet.  Right?
> >> > >> >> >> >
> >> > >> >> >> > Right.
> >> > >> >> >> >
> >> > >> >> >> >> If you
> >> > >> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> >> > >> >> >> >> while.  In the future, if you find the performance of some workloads
> >> > >> >> >> >> hurts because of the new batch number, you can repost the patch with the
> >> > >> >> >> >> supporting data.  If in the end, the performance of more and more
> >> > >> >> >> >> workloads is good with the new batch number.  You may consider to make 0
> >> > >> >> >> >> the default value :-)
> >> > >> >> >> >
> >> > >> >> >> > That is not how the real world works.
> >> > >> >> >> >
> >> > >> >> >> > In the real world:
> >> > >> >> >> >
> >> > >> >> >> > - No one knows what may happen in the future.
> >> > >> >> >> >   Therefore, if possible, we should make systems flexible, unless
> >> > >> >> >> > there is a strong justification for using a hard-coded value.
> >> > >> >> >> >
> >> > >> >> >> > - Minimize changes whenever possible.
> >> > >> >> >> >   These systems have been working fine in the past, even if with lower
> >> > >> >> >> > performance. Why make changes just for the sake of improving
> >> > >> >> >> > performance? Does the key metric of your performance data truly matter
> >> > >> >> >> > for their workload?
> >> > >> >> >>
> >> > >> >> >> These are good policy in your organization and business.  But, it's not
> >> > >> >> >> necessary the policy that Linux kernel upstream should take.
> >> > >> >> >
> >> > >> >> > You mean the Upstream Linux kernel only designed for the lab ?
> >> > >> >> >
> >> > >> >> >>
> >> > >> >> >> Community needs to consider long-term maintenance overhead, so it adds
> >> > >> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
> >> > >> >> >> In general, it prefer to use a good default value or an automatic
> >> > >> >> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
> >> > >> >> >> regressions as much as possible, but this will not stop kernel from
> >> > >> >> >> changing, even if it's big.
> >> > >> >> >
> >> > >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
> >> > >> >>
> >> > >> >> Linux kernel will not break ABI until the last users stop using it.
> >> > >> >
> >> > >> > However, you haven't given a clear reference why the systl is an ABI.
> >> > >>
> >> > >> TBH, I don't find a formal document said it explicitly after some
> >> > >> searching.
> >> > >>
> >> > >> Hi, Andrew, Matthew,
> >> > >>
> >> > >> Can you help me on this?  Whether sysctl is considered Linux kernel ABI?
> >> > >> Or something similar?
> >> > >
> >> > > In my experience, we consistently utilize an if-statement to configure
> >> > > sysctl settings in our production environments.
> >> > >
> >> > >     if [ -f ${sysctl_file} ]; then
> >> > >         echo ${new_value} > ${sysctl_file}
> >> > >     fi
> >> > >
> >> > > Additionally, you can incorporate this into rc.local to ensure the
> >> > > configuration is applied upon system reboot.
> >> > >
> >> > > Even if you add it to the sysctl.conf without the if-statement, it
> >> > > won't break anything.
> >> > >
> >> > > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction,
> >> > > underwent a naming change along with a functional update from its
> >> > > predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c
> >> > > ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
> >> > > this significant change, there have been no reported issues or
> >> > > complaints, suggesting that the renaming and functional update have
> >> > > not negatively impacted the system's functionality.
> >> >
> >> > Thanks for your information.  From the commit, sysctl isn't considered
> >> > as the kernel ABI.
> >> >
> >> > Even if so, IMHO, we shouldn't introduce a user tunable knob without a
> >> > real world requirements except more flexibility.
> >>
> >> Indeed, I do not reside in the physical realm but within a virtualized
> >> universe. (Of course, that is your perspective.)
> >
> > One final note: you explained very well what "neutral" means. Thank
> > you for your comments.
>
> Originally, my opinion of the change was neutral.  But, after more
> thought, I have changed my opinion to "we need more evidence to prove
> that the knob is necessary".
>

One obvious issue with your original patch is that its commit did not
cover any AMD CPUs, even though AMD CPUs are widely used nowadays. This
suggests that the default value was chosen without fully considering
them. However, I'm a developer at a company with limited resources, so
please don't ask me to verify it across a range of AMD CPUs the way you
did with your Intel CPUs.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2024-07-15  4:32 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-07  9:49 [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-07  9:49 ` [PATCH 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
2024-07-10  1:52   ` Huang, Ying
2024-07-07  9:49 ` [PATCH 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
2024-07-10  1:51   ` Huang, Ying
2024-07-10  2:07     ` Yafang Shao
2024-07-07  9:49 ` [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-10  2:49   ` Huang, Ying
2024-07-11  2:21     ` Yafang Shao
2024-07-11  6:42       ` Huang, Ying
2024-07-11  7:25         ` Yafang Shao
2024-07-11  8:18           ` Huang, Ying
2024-07-11  9:51             ` Yafang Shao
2024-07-11 10:49               ` Huang, Ying
2024-07-11 12:45                 ` Yafang Shao
2024-07-12  1:19                   ` Huang, Ying
2024-07-12  2:25                     ` Yafang Shao
2024-07-12  3:05                       ` Huang, Ying
2024-07-12  3:44                         ` Yafang Shao
2024-07-12  5:25                           ` Huang, Ying
2024-07-12  5:41                             ` Yafang Shao
2024-07-12  6:16                               ` Huang, Ying
2024-07-12  6:41                                 ` Yafang Shao
2024-07-12  7:04                                   ` Huang, Ying
2024-07-12  7:36                                     ` Yafang Shao
2024-07-12  8:24                                       ` Huang, Ying
2024-07-12  8:49                                         ` Yafang Shao
2024-07-12  9:10                                           ` Huang, Ying
2024-07-12  9:24                                             ` Yafang Shao
2024-07-12  9:46                                               ` Yafang Shao
2024-07-15  1:09                                                 ` Huang, Ying
2024-07-15  4:32                                                   ` Yafang Shao
2024-07-10  3:00 ` [PATCH 0/3] " Huang, Ying
2024-07-11  2:25   ` Yafang Shao
2024-07-11  6:38     ` Huang, Ying
2024-07-11  7:21       ` Yafang Shao
2024-07-11  8:36         ` Huang, Ying
2024-07-11  9:40           ` Yafang Shao
2024-07-11 11:03             ` Huang, Ying
2024-07-11 12:40               ` Yafang Shao
2024-07-12  2:32                 ` Huang, Ying
