From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Oskolkov <posk@posk.io>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Paul Turner <pjt@google.com>,
	linux-api <linux-api@vger.kernel.org>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Florian Weimer <fw@deneb.enyo.de>,
	David Laight <David.Laight@aculab.com>,
	carlos <carlos@redhat.com>, Chris Kennelly <ckennelly@google.com>,
	Peter Oskolkov <posk@google.com>
Subject: Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions
Date: Tue, 2 Aug 2022 11:01:19 -0400 (EDT)
Message-ID: <500891137.95782.1659452479846.JavaMail.zimbra@efficios.com>
In-Reply-To: <CAFTs51UAyc4Z5WUFdMXCTYR6zji6NwLeBxYsp9GQZvFdEtUm1w@mail.gmail.com>

----- On Aug 1, 2022, at 1:07 PM, Peter Oskolkov posk@posk.io wrote:

> On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.
> 
> Thanks a lot, Mathieu - it is really exciting to see this happening!
> 
> I'll share our experiences here, with the hope that it may be useful.
> I've also cc-ed Chris Kennelly, who worked on the userspace/tcmalloc side,
> as he can provide more context/details if I miss or misrepresent something.

Thanks for sharing your experiences at Google. This helps put things in
perspective.

> 
> The problem:
> 
> tcmalloc maintains per-cpu freelists in the userspace to make userspace
> memory allocations fast and efficient; it relies on rseq to do so, as
> any manipulation of the freelists has to be protected vs thread migrations.
> 
> However, as a typical userspace process at a Google datacenter is confined to
> a relatively small number of CPUs (8-16) via cgroups, while the servers
> typically have a much larger number of physical CPUs, the per-cpu freelist
> model is somewhat wasteful: if a process has only at most 10 threads running,
> for example, but these threads can "wander" across 100 CPUs over the lifetime
> of the process, keeping 100 freelists instead of 10 noticeably wastes memory.
> 
> Note that although a typical process at Google has a limited CPU quota, thus
> using only a small number of CPUs at any given time, the process may often
> have many hundreds or thousands of threads, so per-thread freelists are not
> a viable solution to the problem just described.
> 
> Our current solution:
> 
> As you outlined in patch 9, tracking the number of currently running threads per
> address space and exposing this information via a vcpu_id abstraction helps
> tcmalloc to noticeably reduce its freelist overhead in the "narrow process
> running on a wide server" situation, which is typical at Google.
> 
> We have experimented with several approaches here. The one that we are
> currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
> 
> We did try per-numa-node vcpus, but it did not show any material improvement
> over the "flat" model, perhaps because on our most "wide" servers the CPU
> topology is multi-level. Chris Kennelly may provide more details here.

I would really like to know more about Google's per-numa-node vcpus implementation.
I suspect you guys may have taken a different turn somewhere in the design which
led to these results. But having not seen that implementation, I can only guess.

I notice the following Google-specific prototype extension in tcmalloc:

  // This is a prototype extension to the rseq() syscall.  Since a process may
  // run on only a few cores at a time, we can use a dense set of "v(irtual)
  // cpus."  This can reduce cache requirements, as we only need N caches for
  // the cores we actually run on simultaneously, rather than a cache for every
  // physical core.
  union {
    struct {
      short numa_node_id;
      short vcpu_id;
    };
    int vcpu_flat;
  };

Can you tell me more about the way the numa_node_id and vcpu_id are allocated
internally, and how they are expected to be used by userspace?
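
To make the layout in that snippet concrete, here is a minimal sketch of what
the union allows, assuming a 16-bit short and a 32-bit int. The type name
vcpu_ident and both helper functions are hypothetical and do not come from
tcmalloc. It only illustrates that the two halves overlay a single 32-bit
word, so both can be read or compared in one access. It says nothing about how
the fields are actually allocated or used, which is what I am asking above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the quoted prototype fields, written with
 * fixed-width types so that the size assumption is explicit.
 * The anonymous struct member requires C11.
 */
union vcpu_ident {
    struct {
        int16_t numa_node_id;
        int16_t vcpu_id;
    };
    int32_t vcpu_flat;   /* both halves viewed as one 32-bit word */
};

/* Build an identifier from its two halves (illustrative helper). */
static union vcpu_ident vcpu_ident_make(int16_t node, int16_t vcpu)
{
    union vcpu_ident id;

    id.numa_node_id = node;
    id.vcpu_id = vcpu;
    return id;
}

/*
 * Two identifiers are equal if and only if both halves are equal,
 * and the union lets us check that with a single 32-bit compare.
 */
static int vcpu_ident_equal(union vcpu_ident a, union vcpu_ident b)
{
    return a.vcpu_flat == b.vcpu_flat;
}
```

The numeric value of vcpu_flat depends on endianness, so the sketch only
compares it for equality and never interprets it as a number.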

> 
> On a more technical note, we do use atomic operations extensively in the
> kernel to make sure vcpu IDs are "tightly packed", i.e. if only N threads of
> a process are currently running on physical CPUs, vcpu IDs will be in the
> range [0, N-1], i.e. no gaps, no going to N and above; this does consume some
> extra CPU cycles, but the RAM savings we gain far outweigh the extra CPU cost;
> it will be interesting to see what you can do with the optimizations you
> propose in this patchset.

The optimizations I propose keep those "tightly packed" characteristics, but skip
the atomic operations in common scenarios. I'll welcome benchmarks of the added
overhead in representative workloads.
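
For readers following along, the general idea of handing out dense IDs with
atomic operations can be sketched in userspace C11 as below. This is neither
the code from this patchset nor Google's implementation. The names vcpu_mask,
vcpu_id_get and vcpu_id_put are hypothetical, and the 64-ID limit is an
arbitrary assumption. The sketch always hands out the lowest free ID, so an ID
is never given out while a smaller one is free. It pays for an atomic
read-modify-write on every get and put, which is the cost under discussion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define VCPU_MAX 64   /* assumption: at most 64 IDs, one bit each */

/* Hypothetical per-process state: bit i set means ID i is in use. */
struct vcpu_mask {
    _Atomic uint64_t used;
};

/* Claim the lowest free ID, or return -1 if all IDs are taken. */
static int vcpu_id_get(struct vcpu_mask *m)
{
    uint64_t old = atomic_load(&m->used);

    for (;;) {
        int id = 0;

        /* Find the lowest clear bit in our snapshot of the mask. */
        while (id < VCPU_MAX && (old & (UINT64_C(1) << id)))
            id++;
        if (id == VCPU_MAX)
            return -1;
        /*
         * Try to publish the claim. On failure, old is refreshed
         * with the current mask value and the search is retried.
         */
        if (atomic_compare_exchange_weak(&m->used, &old,
                                         old | (UINT64_C(1) << id)))
            return id;
    }
}

/* Release a previously claimed ID so that it can be reused. */
static void vcpu_id_put(struct vcpu_mask *m, int id)
{
    assert(id >= 0 && id < VCPU_MAX);
    atomic_fetch_and(&m->used, ~(UINT64_C(1) << id));
}
```

A caller would invoke vcpu_id_get() when a thread starts running and
vcpu_id_put() when it stops. After claiming 0, 1 and 2 and releasing 1, the
next claim returns 1 again rather than 3.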

> Again, thanks a lot for this effort!

Thanks for your input. It really helps steer the effort in the right direction.

Mathieu

> 
> Peter
> 
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
