From: Yafang Shao <laoar.shao@gmail.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
David Rientjes <rientjes@google.com>
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Fri, 12 Jul 2024 15:36:15 +0800
Message-ID: <CALOAHbCKzEXW6-8ApzYNsh=Ert+a0=GOS6k1enOMzMTVXg2Uqw@mail.gmail.com>
In-Reply-To: <874j8v851a.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
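
For illustration, a minimal sketch of how such a knob might be wired up (the
variable names, table, registration, and 0..6 range below are assumptions for
this sketch, not the actual patch):

#include <linux/init.h>
#include <linux/sysctl.h>

/* Sketch only: expose the scale factor as /proc/sys/vm/pcp_batch_scale_max,
 * defaulting to the current Kconfig value and clamped to an assumed 0..6
 * range. All identifiers here are made up for illustration. */
static int pcp_batch_scale_max = CONFIG_PCP_BATCH_SCALE_MAX;
static int pcp_batch_scale_lo = 0;
static int pcp_batch_scale_hi = 6;

static struct ctl_table pcp_batch_sysctl_table[] = {
	{
		.procname	= "pcp_batch_scale_max",
		.data		= &pcp_batch_scale_max,
		.maxlen		= sizeof(pcp_batch_scale_max),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &pcp_batch_scale_lo,
		.extra2		= &pcp_batch_scale_hi,
	},
};

static int __init pcp_batch_sysctl_init(void)
{
	register_sysctl("vm", pcp_batch_sysctl_table);
	return 0;
}
late_initcall(pcp_batch_sysctl_init);

The page allocator would then read pcp_batch_scale_max at run time in the
places that currently use the CONFIG_PCP_BATCH_SCALE_MAX constant.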
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> In general, I'm neutral about the change. I can understand that a kernel
> >> >> >> >> >> >> >> >> config option isn't as flexible as a sysctl knob. But a sysctl knob is
> >> >> >> >> >> >> >> >> ABI too.
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multiple
> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt zone->lock contention rather
> >> >> >> >> >> >> >> >> than improve it. What it reduces is page allocation/freeing latency.
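
(For concreteness, a toy userspace model of that trade-off, using assumed
numbers rather than measurements: a smaller scale factor caps the pages freed
per zone->lock hold, shortening each critical section at the cost of more
lock acquisitions to drain the same pcp list.)

#include <stdio.h>

/* Toy model only: batch and count are assumed values, not taken from a real
 * system. It just shows how the scale factor trades per-hold work against
 * the number of zone->lock acquisitions. */
int main(void)
{
	const unsigned int batch = 63;		/* assumed pcp->batch */
	const unsigned int count = 4032;	/* assumed pages queued on the pcp list */

	for (unsigned int scale = 0; scale <= 5; scale++) {
		unsigned int per_hold = batch << scale;			/* cap per lock hold */
		unsigned int holds = (count + per_hold - 1) / per_hold;	/* lock acquisitions */

		printf("scale=%u: up to %4u pages per zone->lock hold, %3u acquisitions\n",
		       scale, per_hold, holds);
	}
	return 0;
}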
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what
> >> >> >> >> >> >> >> "neutral" means?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > No, thanks.
> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> >> >> >> > understanding of the concept.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > So, can you explain why you introduced it as a config option in the first place?
> >> >> >> >> >> >>
> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> >> >> >> latency"), which introduced the config.
> >> >> >> >> >> >
> >> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> >> >> > this config in real production workloads?
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> A sysctl knob is ABI, which needs to be maintained forever. Can you
> >> >> >> >> >> >> explain why you need it? Why can't you use a fixed value after the
> >> >> >> >> >> >> initial experiments?
> >> >> >> >> >> >
> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >> >> >
> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >> >> >>
> >> >> >> >> >> Have you found that your different systems already require different
> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX values?
> >> >> >> >> >
> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> >> >> > default value is also suboptimal. What is the issue with this
> >> >> >> >> > approach?
> >> >> >> >>
> >> >> >> >> Firstly, this is a system-wide configuration, not a workload-specific one.
> >> >> >> >> So, other workloads running on the same system will be impacted too. Will
> >> >> >> >> you run only one workload on each system?
> >> >> >> >
> >> >> >> > It seems we're living on different planets. You're happily working in
> >> >> >> > your lab environment, while I'm struggling with real-world production
> >> >> >> > issues.
> >> >> >> >
> >> >> >> > For servers:
> >> >> >> >
> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> > Server 1,000,001 and beyond: Happy with all values
> >> >> >> >
> >> >> >> > Is this hard to understand?
> >> >> >> >
> >> >> >> > In other words:
> >> >> >> >
> >> >> >> > For applications:
> >> >> >> >
> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> > Application 1,000,001 and beyond: Happy with all values
> >> >> >>
> >> >> >> Good to know this. Thanks!
> >> >> >>
> >> >> >> >>
> >> >> >> >> Secondly, we need some evidence to introduce a new system ABI. For
> >> >> >> >> example, evidence that we need different configurations on different
> >> >> >> >> systems, otherwise some workloads will be hurt. Can you provide such
> >> >> >> >> evidence to support your change? IMHO, it's not good enough to say "I
> >> >> >> >> don't know why, I just don't want to change existing systems". If so,
> >> >> >> >> it may be better to wait until you have more evidence.
> >> >> >> >
> >> >> >> > It seems the community encourages developers to experiment with their
> >> >> >> > improvements in lab environments using meticulously designed test
> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> >> >> > obtaining perfect data. However, it discourages developers from
> >> >> >> > directly addressing real-world workloads. Sigh.
> >> >> >>
> >> >> >> Can you tell whether, and how much, your workloads benefit from or are
> >> >> >> hurt by a different batch number in your production environment? If you
> >> >> >> cannot, how do you decide which workload is deployed on which system
> >> >> >> (with which batch number configuration)? If you can, can you provide
> >> >> >> that information to support your patch?
> >> >> >
> >> >> > We leverage a meticulous selection of network metrics, particularly
> >> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> >> >
> >> >> > In instances where a problematic container terminates, we've noticed a
> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> >> > second, which serves as a clear indication that other applications are
> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> >> > frequency of these timeouts to less than one per second.
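
(For reference, counters like these can be sampled straight from
/proc/net/netstat; below is a minimal, illustrative reader for a single
counter. The file layout is a "TcpExt:" header line with field names followed
by a "TcpExt:" line with the matching values.)

#include <stdio.h>
#include <string.h>

/* Minimal illustration: print TcpExt.TCPTimeouts from /proc/net/netstat by
 * walking the header line and the value line in lockstep. */
int main(void)
{
	char names[8192], values[8192];
	FILE *fp = fopen("/proc/net/netstat", "r");

	if (!fp)
		return 1;

	while (fgets(names, sizeof(names), fp)) {
		if (strncmp(names, "TcpExt:", 7) != 0)
			continue;
		if (!fgets(values, sizeof(values), fp))
			break;

		char *nsave, *vsave;
		char *n = strtok_r(names, " \n", &nsave);
		char *v = strtok_r(values, " \n", &vsave);

		while (n && v) {
			if (strcmp(n, "TCPTimeouts") == 0)
				printf("TcpExt.TCPTimeouts = %s\n", v);
			n = strtok_r(NULL, " \n", &nsave);
			v = strtok_r(NULL, " \n", &vsave);
		}
		break;
	}

	fclose(fp);
	return 0;
}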
> >> >>
> >> >> Thanks a lot for sharing this. I learned much from it!
> >> >>
> >> >> > At present, we're selectively applying this adjustment to clusters
> >> >> > that exclusively host the identified problematic applications, and
> >> >> > we're closely monitoring their performance to ensure stability. To
> >> >> > date, we've observed no network latency issues as a result of this
> >> >> > change. However, we remain cautious about extending this optimization
> >> >> > to other clusters, as the decision ultimately depends on a variety of
> >> >> > factors.
> >> >> >
> >> >> > It's important to note that we're not eager to implement this change
> >> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> >> > consequences. Instead, we're taking a cautious approach by initially
> >> >> > applying it to a limited number of servers. This allows us to assess
> >> >> > its impact and make informed decisions about whether or not to expand
> >> >> > its use in the future.
> >> >>
> >> >> So, you haven't observed any performance regression yet, right?
> >> >
> >> > Right.
> >> >
> >> >> If you
> >> >> haven't, I suggest you keep the patch in your downstream kernel for a
> >> >> while. In the future, if you find that the performance of some workloads
> >> >> suffers because of the new batch number, you can repost the patch with
> >> >> the supporting data. If, in the end, the performance of more and more
> >> >> workloads is good with the new batch number, you may consider making 0
> >> >> the default value :-)
> >> >
> >> > That is not how the real world works.
> >> >
> >> > In the real world:
> >> >
> >> > - No one knows what may happen in the future.
> >> > Therefore, if possible, we should make systems flexible, unless
> >> > there is a strong justification for using a hard-coded value.
> >> >
> >> > - Minimize changes whenever possible.
> >> > These systems have been working fine in the past, even if with lower
> >> > performance. Why make changes just for the sake of improving
> >> > performance? Does the key metric of your performance data truly matter
> >> > for their workload?
> >>
> >> These are good policies in your organization and business. But they are
> >> not necessarily the policies that the upstream Linux kernel should adopt.
> >
> > You mean the upstream Linux kernel is designed only for the lab?
> >
> >>
> >> The community needs to consider long-term maintenance overhead, so it only
> >> adds new ABI (such as a sysfs knob) to the kernel with the necessary
> >> justification. In general, it prefers a good default value or an automatic
> >> algorithm that works for everyone. The community tries to avoid (or fix)
> >> regressions as much as possible, but this will not stop the kernel from
> >> changing, even if the change is big.
> >
> > Please explain to me why a kernel config option is not ABI, but a sysctl is.
>
> The Linux kernel will not break ABI until the last user stops using it.

However, you haven't given a clear reference for why a sysctl is ABI.

> This usually means tens of years, if not forever. Kernel config options
> aren't considered ABI; they are used by developers and distributions.
> They come and go from version to version.
>
> >>
> >> IIUC, because of the different requirements, there are upstream and
> >> downstream kernels.
> >
> > Downstream developers backport features from the upstream kernel,
> > and if they find issues in the upstream kernel, they should contribute
> > fixes back. That is how the Linux community works, right?
>
> Yes, if they are issues for the upstream kernel too.
>
> --
> Best Regards,
> Huang, Ying
--
Regards
Yafang
Thread overview: 41+ messages
2024-07-07 9:49 [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-07 9:49 ` [PATCH 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
2024-07-10 1:52 ` Huang, Ying
2024-07-07 9:49 ` [PATCH 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
2024-07-10 1:51 ` Huang, Ying
2024-07-10 2:07 ` Yafang Shao
2024-07-07 9:49 ` [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-10 2:49 ` Huang, Ying
2024-07-11 2:21 ` Yafang Shao
2024-07-11 6:42 ` Huang, Ying
2024-07-11 7:25 ` Yafang Shao
2024-07-11 8:18 ` Huang, Ying
2024-07-11 9:51 ` Yafang Shao
2024-07-11 10:49 ` Huang, Ying
2024-07-11 12:45 ` Yafang Shao
2024-07-12 1:19 ` Huang, Ying
2024-07-12 2:25 ` Yafang Shao
2024-07-12 3:05 ` Huang, Ying
2024-07-12 3:44 ` Yafang Shao
2024-07-12 5:25 ` Huang, Ying
2024-07-12 5:41 ` Yafang Shao
2024-07-12 6:16 ` Huang, Ying
2024-07-12 6:41 ` Yafang Shao
2024-07-12 7:04 ` Huang, Ying
2024-07-12 7:36 ` Yafang Shao [this message]
2024-07-12 8:24 ` Huang, Ying
2024-07-12 8:49 ` Yafang Shao
2024-07-12 9:10 ` Huang, Ying
2024-07-12 9:24 ` Yafang Shao
2024-07-12 9:46 ` Yafang Shao
2024-07-15 1:09 ` Huang, Ying
2024-07-15 4:32 ` Yafang Shao
2024-07-10 3:00 ` [PATCH 0/3] " Huang, Ying
2024-07-11 2:25 ` Yafang Shao
2024-07-11 6:38 ` Huang, Ying
2024-07-11 7:21 ` Yafang Shao
2024-07-11 8:36 ` Huang, Ying
2024-07-11 9:40 ` Yafang Shao
2024-07-11 11:03 ` Huang, Ying
2024-07-11 12:40 ` Yafang Shao
2024-07-12 2:32 ` Huang, Ying