From: David Morel <david.morel@vates.tech>
To: Stefano Stabellini <sstabellini@kernel.org>
Cc: xen-devel@lists.xenproject.org, xenia.ragiadakou@amd.com,
	andrew.cooper3@citrix.com, "Jan Beulich" <jbeulich@suse.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>,
	"Juergen Gross" <jgross@suse.com>
Subject: Re: AMD EPYC VM to VM performance investigation
Date: Fri, 5 Jan 2024 10:03:28 +0100
Message-ID: <ZZfF4Pz1Dj1Xc9xu@raton>
In-Reply-To: <alpine.DEB.2.22.394.2401041624500.1322202@ubuntu-linux-20-04-desktop>

On Thu, Jan 04, 2024 at 16:39:46PM, Stefano Stabellini wrote:
> On Thu, 4 Jan 2024, David Morel wrote:
> > Hello,
> > 
> > We have a customer and multiple users on our forum seeing performance that
> > seems quite low relative to the general performance of the machines on AMD EPYC
> > Zen hosts when doing VM to VM networking.
> 
> By "VM to VM networking" I take it you mean VM-to-VM on the same host using
> PV network?

Yes, sorry, I thought I had mentioned it.
> 
> 
> > Below you'll find a write up about what we had a look at and what's in the
> > TODO on our side, but in the meantime we would like to ask here for some
> > feedback, suggestions and possible leads.
> > 
> > To sum up, the VM to VM performance on Zen generation server CPUs seems quite
> > low, and scales only minimally when adding threads. They are outperformed by
> > a 10 year old AMD desktop cpu and a pretty low frequency XEON bronze from 2014.
> > CPU usage does not seem to be the limiting factor, as neither the VM threads nor
> > the kthreads on the host seem to reach 100% cpu usage.
> > 
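
To put a number on "scales only minimally", one option is to express each
run as a fraction of perfect linear scaling from the single-thread result.
A minimal sketch: the function name and the sample throughputs are
hypothetical and are not measurements from these machines.

```python
def scaling_efficiency(throughput_by_threads):
    """Return {threads: efficiency} relative to the single-thread result.

    Efficiency is throughput(n) / (n * throughput(1)); 1.0 means perfect
    linear scaling, lower values mean extra threads add less throughput.
    """
    base = throughput_by_threads[1]
    if base <= 0:
        raise ValueError("single-thread throughput must be positive")
    return {
        n: gbps / (n * base)
        for n, gbps in sorted(throughput_by_threads.items())
    }


# Hypothetical Gbit/s figures for illustration only.
sample = {1: 4.0, 2: 6.0, 4: 8.0}
for n, eff in scaling_efficiency(sample).items():
    print(f"{n} thread(s): {eff:.2f} of linear scaling")
```

Feeding it the per-thread-count iperf2 totals from each host would make the
epyc results directly comparable with the FX and xeon ones.
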
> > As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a dom0 kernel
> > 4.19. I did try a Xen 4.18-rc2 and kernel 6.1.56 on a Zen4 epyc, but as it was
> > borrowed from a colleague I was unsure of the setup, so although it was
> > actually worse than on my other test setups, I would not consider that a
> > complete validation that the issue is also present on recent Xen versions.
> 
> I think it might be difficult to triage this if you are working on a
> Xen/Linux version that is so different from upstream

That's what I feared, which is also why I mentioned it here.
> 
> 
> > 1. Has anybody else noticed a similar behavior?
> > 2. Has anybody done any kind of investigation about it beside us?
> > 3. Any insight and suggestions of other points to look at would be welcome :)
> > 
> > And now the lengthy part about what we tested, I tried to make it shorter and
> > more legible than a full report…
> > 
> > Investigated
> > ------------
> > 
> > - Bench various cpu with iperf2 (iperf3 is not actually multithreaded):
> >   - amd fx8320e, xeon 3106: not impacted.
> >   - epyc 7451, 7443, 7302p, 7313p, 9124: impacted, but the zen4 one scales a
> >     bit more than zen1, 2 and 3.
> >   - ryzen 5950x, ryzen 7600: performances should likely be better than
> >     observed results, but still way better than epycs, and scaling nicely with
> >     more threads.
> > - Bench with tinymembench[1]: performance was as expected and didn't show
> >   issues with rep movsb as discussed in this article[2] and issue[3]. This
> >   makes sense, as it looks like this issue is related to ERMS support, which is
> >   not present on Zen1 and 2, where the issue has been raised.
> > - Bench skb allocation with a small kernel module measuring cycles: actually
> >   same or lower on epyc than on the xeon with higher frequency so can be
> >   considered faster and likely not related to our issue.
> > - mitigations: we tried disabling what can be disabled through boot
> >   parameters, both for xen, dom0 and guests, but this made no differences.
> > - disabling AVX: Zen cpus before zen4 are known to limit boost and cpu scaling
> >   when doing heavy AVX load on one core, there was no reason to think this was
> >   related, but it was a quick test and as expected had no effect.
> > - localhost iperf bench on dom0 and guests: we noticed that on other machines
> >   host/guest with 1 thread are almost 1:1, while with 4 threads guests are
> >   generally not scaling as well. On epyc machines, host tests were
> >   significantly slower than guests both with 1 and 4 threads; a first
> >   look at profiling has not helped find a cause yet. More in the
> >   profiling and TODO sections.
> 
> Wait, are you saying that the localhost iperf benchmark is faster in a
> VM compared to host ("host" I take means baremetal Linux without a
> hypervisor) ?   Maybe you meant the other way around?

I meant it is faster on domUs than on dom0. As mentioned below in the
profiling part, it does seem to come down to the kernel and/or userland
but, contrary to what I first thought, not to the copy_user_* functions,
as they are precisely the same. I was kind of hoping it could be related
to a better handling of rep movsb in those on newer kernels... An
additional test with an alma8, to get an environment closer to the dom0,
seems to yield performance similar to dom0.
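
To check which copy routine a given machine should end up with, the CPU
flags are enough. A minimal sketch, assuming the selection order used by
copy_user_generic() in 4.19-era x86 kernels (ERMS first, then REP_GOOD,
then the unrolled fallback); the helper names and the sample flags line
are hypothetical:

```python
def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def expected_copy_user_variant(flags):
    """Guess the copy_user_generic() variant from CPU flags.

    Assumption: the 4.19-era order, where ERMS wins over REP_GOOD and the
    unrolled loop is the fallback. Newer kernels reworked this code.
    """
    if "erms" in flags:
        return "copy_user_enhanced_fast_string"
    if "rep_good" in flags:
        return "copy_user_generic_string"
    return "copy_user_generic_unrolled"


# Hypothetical, shortened flags line for a CPU without ERMS.
sample = "processor\t: 0\nflags\t\t: fpu sse2 rep_good nopl avx2\n"
print(expected_copy_user_variant(cpu_flags(sample)))
```

On a real system the text would come from reading /proc/cpuinfo in dom0 and
in the guest. Identical results on both sides would be consistent with the
difference coming from elsewhere in the kernel or userland.
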

> > - cpu load: top/htop/xentop all seem to indicate that machines are not under
> >   full load; queue allocations on dom0 for VIF are at the default (1 per vcpu) and
> >   seem to be all used when traffic is running but at a percentage below 100%
> >   per core/thread.
> > - pinning: manually pinning dom0 and guests to the same node and avoiding
> >   sharing cpu "threads" between host and guests gives a minimal increase of a
> >   few percent, but nothing drastic. Note, we do not know about the
> >   ccd/ccx/node mapping on these cpus, so we are not sure all memory accesses are
> >   "local".
> > - sched weight: playing with sched weight to prioritize dom0 did not make a
> >   difference either, which makes sense as the systems are not under full load.
> > - cpu scaling: it is unlikely to be the core of the issue, but indeed the cpu
> >   scaling does not take advantage of the boost, never going above the base
> >   clock of these cpus. It also seems that fewer cores than the number of
> >   working kthreads/vcpus are going to base clock, which may be normal given
> >   that the system is not fully loaded; to be determined.
> >   - QUESTION: is the powernow support in xen cpufreq implementation sufficient
> >     for zen cpus? Recent kernels/distributions use acpi_cpufreq and can use
> >     amd_pstate or even amd_pstate_epp. More concerning than the turbo boost
> >     could be the handling of package power limitation used in Zen CPUs that
> >     could prevent even all cores from reaching base clock, to be checked…
> > 
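
For the pinning point, the split between dom0 and guests can be computed
rather than written by hand, so that no physical core is shared. A minimal
sketch: the helper names, the sample CPU list and the "thread n pairs with
thread n+8" sibling layout are assumptions for illustration, and the real
layout has to come from the host's topology information.

```python
def parse_cpulist(text):
    """Expand a Linux-style CPU list such as '0-3,8-11' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, sep, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return sorted(cpus)


def split_cores(core_siblings, dom0_cores):
    """Give the first dom0_cores physical cores to dom0, the rest to guests.

    core_siblings is a list of tuples, each holding the hardware threads of
    one physical core, so sibling threads always stay on the same side.
    """
    dom0 = [cpu for core in core_siblings[:dom0_cores] for cpu in core]
    guests = [cpu for core in core_siblings[dom0_cores:] for cpu in core]
    return sorted(dom0), sorted(guests)


# Hypothetical node: CPUs 0-3 and 8-11, with n and n+8 sharing a core.
node_cpus = parse_cpulist("0-3,8-11")
cores = [(c, c + 8) for c in node_cpus if c + 8 in node_cpus]
dom0, guests = split_cores(cores, 1)
print("dom0:", dom0, "guests:", guests)
```

The two lists can then be used as the pinning sets for dom0 and guest vcpus.
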
> > Profiling
> > ---------
> > 
> > We profiled iperf on dom0 and guests on epyc, older amd desktop, and xeon
> > machines and gathered profiling traces, but analysis is still ongoing.
> > 
> > - localhost:
> > Client and server were profiled both on dom0 and guest runs for a xeon, an
> > old FX and a zen platform, to analyze the discrepancy shown by the localhost
> > tests earlier. It shows we spend a larger chunk of time in the copyout() or
> > copyin() functions on epyc and fx. This is likely related to the use of
> > copy_user_generic_string() on epyc (zen1) and old FX, whereas xeon uses
> > copy_user_enhanced_fast_string(), as it has ERMS support.  But on the same
> > machine, guests are going way faster, and the implementation of
> > copy_user_generic_string() is the same between the dom0 and guests, so this is
> > likely related to other changes in kernel and userland, and not only to these
> > functions. Therefore it likely isn't directly linked to the issue.
> > 
> > - vm to vm: server, client & dom0 -> profiling traces to be analysed.
> > 
> > TODO
> > ----
> > 
> > - More Analysis of profiling traces in VM to VM case
> > - X2APIC (not enabled on the machines and setup we are using)
> > - Profiling at xen level / hypercalls
> > - Tests on a clean install of a newer Xen version
> > - Dig some more on cpu scaling, likely not the root of the problem but there
> >   could be some gain to be had.
> > 
> > [1] https://github.com/ssvb/tinymembench
> > [2] https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
> > [3] https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
> > 
> > -- 
> > David Morel
