From: Luis Chamberlain <mcgrof@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
Cc: Song Liu <song@kernel.org>,
bpf@vger.kernel.org, linux-mm@kvack.org,
akpm@linux-foundation.org, x86@kernel.org, peterz@infradead.org,
hch@lst.de, rick.p.edgecombe@intel.com, dave.hansen@intel.com,
zhengjun.xing@linux.intel.com, kbusch@kernel.org,
p.raghav@samsung.com, dave@stgolabs.net, vbabka@suse.cz,
mgorman@suse.de, willy@infradead.org,
torvalds@linux-foundation.org, a.manzanares@samsung.com
Subject: Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
Date: Thu, 3 Nov 2022 11:59:48 -0700
Message-ID: <Y2QPpODzdP+2YSMN@bombadil.infradead.org>
In-Reply-To: <Y2Pjnd3mxA9fTlox@kernel.org>
On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> Hi Luis,
>
> Thanks for looping me in.
>
> On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> > On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> > > vmalloc_exec is used to allocate memory to host dynamic kernel text
> > > (modules, BPF programs, etc.) with huge pages. This is similar to the
> > > proposal by Peter in [1].
> >
> > This is all great, but we need to clarify *why* we would go through the
> > trouble. If folks are not too excited about this series, that's
> > probably why. IMHO it lacks substance in its rationale, **and** implies a few
> > gains without any *clear* performance metrics. I have 0 experience with
> > mm so I'd like others' feedback on this -- I'm just trying to
> > decipher the rationale from prior "bpf prog pack" efforts.
> >
> > I'm sensing that the wires are a bit crossed in the messaging here, and we
> > need to provide a fuller picture of the rationale; right now it is being
> > completely missed and this work is being undersold. If my assessment is
> > accurate, though, the bpf prog pack strategy of sharing huge pages may prove
> > useful long term for more than just modules / ftrace / kprobes.
> >
> > I was surprised to see that this entire patch series, upgraded from RFC to
> > proper PATCH form, now completely fails to mention any of the original
> > motivations behind the "BPF prog pack", which you are making a truly heroic
> > effort to generalize, as the problem is hard. Let me try to help with that.
> > The rationale for the old BPF prog pack is documented as follows:
> >
> > * Most BPF programs are pretty small. Allocating a hole page for each
> > * program is sometime a waste. Many small bpf program also adds pressure
> > * to instruction TLB. To solve this issue, we introduce a BPF program pack
> > * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
> > * to host BPF programs.
> >
> > You have also stated, in earlier versions of this patch set:
> >
> > "Most BPF programs are small, but they consume a page each. For systems
> > with busy traffic and many BPF programs, this could also add significant
> > pressure to instruction TLB. High iTLB pressure usually causes slow down
> > for the whole system, which includes visible performance
> > degradation for production workloads."
> >
> > So it is implied here that one of the benefits is reduced iTLB misses.
> > But that's it; we have no visible numbers to look at, and for what
> > workload... Reducing iTLB misses doesn't always correlate directly with
> > overall improvements, and if the code change were small enough it would
> > obviously make sense to apply anyway. When the change is a bit more
> > intrusive, as in this patch series, a bit more rationale should be provided.
> >
> > Other than the "performance aspects" of your patchset, the *main* reason
> > I am engaged and like it is that it reduces the nasty mess of semantics
> > around dealing with special permissions on pages, which we see in modules
> > and a few other places that today completely open code it. That has proven
> > error prone and I'm glad to see efforts to generalize that nastiness. So
> > please ensure this is added as part of the documented rationale. Even
> > if the iTLB miss ratio improvement is not astronomical, I believe that
> > the gains in sanity from improving the semantics of special pages and
> > sharing code make it well worthwhile. The iTLB miss ratio improvement is
> > just a small cherry on top.
> >
> > Going back to the performance aspects, when Linus poked for more details
> > about this you elaborated further:
> >
> > "we have seen direct map fragmentation causing visible
> > performance drop for our major services. This is the shadow
> > production benchmark, so it is not possible to run it out of
> > our data centers. Tracing showed that BPF program was the top
> > trigger of these direct map splits."
> >
> > And the only other metric we have is:
> >
> > "For our web service production benchmark, bpf_prog_pack on 4kB pages
> > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."
> >
> > These metrics are completely arbitrary and opaque to us. We need
> > something tangible and reproducible and I have been suggesting that
> > from early on...
> >
> > I'm under the impression that the real missed, undocumented, major value-add
> > here is that the old "BPF prog pack" strategy helps reduce the direct map
> > fragmentation caused by heavy use of eBPF JIT programs, and this in
> > turn helps your overall system performance (regardless of what
> > it is you do). As I see it, the eBPF prog pack is just one strategy to
> > try to mitigate memory fragmentation on the direct map caused by the eBPF
> > JIT programs, so the "slow down" your team has observed should be due to the
> > eventual fragmentation caused on the direct map *while* eBPF programs
> > get heavily used.
>
> I believe that while the eBPF prog pack is helpful in mitigation of the
> direct map fragmentation caused by the eBPF JIT programs, the same strategy
> of allocating a large page, splitting its PMD entry and then reusing the
> memory for smaller allocations can be (and should be) generalized to other
> use cases that require non-default permissions in the page table. Most
> prominent use cases are those that allocate memory for code, but the same
> approach is relevant for other cases, like secretmem or page table
> protection with PKS.
>
> A while ago I suggested handling such caching of large pages at the
> page allocator level, but when we discussed it at LSF/MM/BPF, the prevailing
> opinion was that the added value does not justify changes to the page
> allocator, and it was suggested to handle such caching elsewhere.
I saw that in the LWN coverage.
> I had to put this project on a backburner for $VARIOUS_REASONS, but I still
> think that we need a generic allocator for memory with non-default
> permissions in the direct map and that code allocation should build on that
> allocator.
It seems this generalization of the bpf prog pack, so that it can possibly be
used for modules / kprobes / ftrace, is a small step in that direction.
> All that said, the direct map fragmentation problem is currently relevant
> only to x86 because it's the only architecture that supports splitting of
> the large pages in the direct map.
I was thinking even more long term too, using this as a proof of concept. If
this practice in general helps with fragmentation, could it later be used for
experimentation with compound pages, as a way to reduce possible
fragmentation?
> > Mike Rapoport presented the direct map fragmentation problem
> > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> > evaluation of whether using 2M/1G pages aggressively for the kernel direct
> > map helps performance [1] ends up generally recommending huge pages. The
> > work by Xing, though, was about using huge pages *alone*, not about a
> > strategy such as the "bpf prog pack" sharing one 2 MiB huge page for *all*
> > small eBPF programs, and that I think is the real golden nugget here.
> >
> > I contend therefore that the theoretical reduction of iTLB misses by using
> > huge pages for the "bpf prog pack" is not what makes your systems perform
> > somehow better. It should simply be that it reduces fragmentation, and
> > *this* generally can help with performance long term. If this is accurate,
> > then let's please separate these two aspects.
>
> The direct map fragmentation is the reason for higher TLB miss rate, both
> for iTLB and dTLB.
OK, so then whatever benchmark is running in tandem while the eBPF JIT is
hammered should *also* be measured with perf for iTLB and dTLB misses. I.e.,
the patch can provide such results as justification.
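Something like the below is what I have in mind. This is only a rough
sketch: the exact event names vary by CPU (check perf list), and the
benchmark script is just a placeholder for whatever workload you run:

  # system-wide, on a kernel with and then without the series
  perf stat -a \
          -e iTLB-load-misses,dTLB-load-misses,dTLB-store-misses \
          -- ./run-ebpf-heavy-benchmark.sh

Comparing the miss counts between the two kernels for the same workload
would give us a tangible number to put in the commit log.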
> Whenever a large page in the direct map is split, all
> kernel accesses via the direct map will use small pages, which requires
> dealing with 512 page table entries instead of one for a 2M range.
>
> Since small pages in the direct map are never collapsed back to large
> pages, a long-living system that heavily uses eBPF programs will have its
> direct map severely fragmented, a higher TLB miss rate, and worse overall
> performance.
Shouldn't compaction help with those situations?
> > There are two aspects, from a performance perspective, that I would like
> > to see actually mentioned in the commit logs:
> >
> > 1) The iTLB miss ratio with the "bpf prog pack" or this generalized
> > solution vs not using it at all:
>
> ...
>
> > 2) An estimate of the reduction in direct map fragmentation from using the
> > "bpf prog pack" or this generalized solution:
> >
> > For this I'd expect a benchmark similar to the workload you guys
> > run, or something memory intensive, while eBPF JITs are heavily used,
> > and after a certain amount of time somehow compute how fragmented
> > memory is. The only sensible thing I can think of to measure memory
> > fragmentation is to look at the memory compaction index
> > /sys/kernel/debug/extfrag/extfrag_index, but I highly welcome others'
> > ideas as I'm an mm n00b.
>
> The direct map fragmentation can be tracked with
>
> grep DirectMap /proc/meminfo
> grep direct_map /proc/vmstat
>
> and by looking at /sys/kernel/debug/page_tables/kernel
Thanks!
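For the archives, a before/after diff of those counters over the benchmark
run should be enough to show it. Rough sketch below, where the benchmark
script is again just a placeholder for the workload:

  grep DirectMap /proc/meminfo > directmap.before
  ./run-ebpf-heavy-benchmark.sh
  grep DirectMap /proc/meminfo > directmap.after
  diff -u directmap.before directmap.after

Growth of DirectMap4k at the expense of DirectMap2M / DirectMap1G over the
run would show the splits directly.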
Luis