* AMD EPYC VM to VM performance investigation
From: David Morel @ 2024-01-04 15:43 UTC
  To: xen-devel

Hello,

We have a customer and multiple users on our forum seeing performance that
seems quite low relative to the general performance of the machines on AMD EPYC
Zen hosts when doing VM to VM networking.

Below you'll find a write up about what we had a look at and what's in the
TODO on our side, but in the meantime we would like to ask here for some
feedback, suggestions and possible leads.

To sum up, the VM to VM performance on Zen generation server CPUs seems quite
low, and scales only minimally when adding threads. They are outperformed by a
10 year old AMD desktop CPU and a pretty low frequency Xeon Bronze from 2014.
CPU usage does not seem to be the limiting factor, as neither the VM threads nor
the kthreads on the host seem to reach 100% CPU usage.
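
To put a number on "scales only minimally", one option is a per-thread scaling efficiency: the aggregate throughput with N threads divided by N times the single-thread throughput. This is only a sketch; the function name is illustrative and the figures in the usage line are made-up placeholders, not measurements from these tests.

```python
def scaling_efficiency(throughput_by_threads):
    """Return {threads: efficiency}, where 1.0 means perfect linear scaling.

    throughput_by_threads maps a thread count to the aggregate throughput
    (any consistent unit); it must contain an entry for 1 thread.
    """
    base = throughput_by_threads[1]
    return {n: t / (n * base) for n, t in sorted(throughput_by_threads.items())}


if __name__ == "__main__":
    # Placeholder values for illustration only, NOT real measurements.
    print(scaling_efficiency({1: 10.0, 2: 15.0, 4: 20.0}))
```

An efficiency that drops quickly as threads are added, while no core is saturated, would match the behavior described above.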

As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a dom0 kernel
4.19. I did try Xen 4.18-rc2 and kernel 6.1.56 on a Zen4 EPYC, but as it was
borrowed from a colleague I was unsure of the setup, so although it was
actually worse than on my other test setups, I would not consider that a
complete validation that the issue is also present on recent Xen versions.

1. Has anybody else noticed a similar behavior?
2. Has anybody done any kind of investigation about it besides us?
3. Any insight and suggestions of other points to look at would be welcome :)

And now the lengthy part about what we tested. I tried to make it shorter and
more legible than a full report…

Investigated
------------

- Bench various cpu with iperf2 (iperf3 is not actually multithreaded):
  - amd fx8320e, xeon 3106: not impacted.
  - epyc 7451, 7443, 7302p, 7313p, 9124: impacted, but the zen4 one scales a
    bit more than zen1, 2 and 3.
  - ryzen 5950x, ryzen 7600: performance should likely be better than the
    observed results, but is still way better than on the EPYCs, and scales
    nicely with more threads.
- Bench with tinymembench[1]: performance was as expected and didn't show
  issues with rep movsb as discussed in this article[2] and issue[3]. That
  makes sense, as it looks like this issue is related to ERMS support, which is
  not present on Zen1 and 2, where the issue has been raised.
- Bench skb allocation with a small kernel module measuring cycles: cycle
  counts were actually the same or lower on EPYC than on the Xeon, and since
  the EPYC runs at a higher frequency it can be considered faster, so this is
  likely not related to our issue.
- mitigations: we tried disabling what can be disabled through boot
  parameters, for Xen, dom0 and guests, but this made no difference.
- disabling AVX: Zen CPUs before Zen4 are known to limit boost and CPU scaling
  when doing heavy AVX load on one core. There was no reason to think this was
  related, but it was a quick test and, as expected, it had no effect.
- localhost iperf bench on dom0 and guests: we noticed that on other machines
  host (dom0) and guest are almost 1:1 with 1 thread, while with 4 threads
  guests generally do not scale as well. On EPYC machines, host tests were
  significantly slower than guests with both 1 and 4 threads; a first look
  at profiling has not helped find a cause yet. More in the Profiling and
  TODO sections.
- cpu load: top/htop/xentop all seem to indicate that machines are not under
  full load. Queue allocation on dom0 for the VIFs is left at the default
  (1 per vcpu), and all queues seem to be used when traffic is running, but at
  a percentage below 100% per core/thread.
- pinning: manually pinning dom0 and guests to the same node and avoiding
  sharing cpu "threads" between host and guests gives a minimal increase of a
  few percent, but nothing drastic. Note, we do not know about the
  ccd/ccx/node mapping on these cpus, so we are not sure all memory accesses
  are "local".
- sched weight: playing with sched weight to prioritize dom0 did not make a
  difference either, which makes sense as the systems are not under full load.
- cpu scaling: it is unlikely to be the core of the issue, but indeed the cpu
  scaling does not take advantage of the boost, never going above the base
  clock of these cpus. It also seems that fewer cores than the number of
  working kthreads/vcpus reach base clock, which may be normal given that
  the system is not fully loaded; to be determined.
  - QUESTION: is the powernow support in the Xen cpufreq implementation
    sufficient for Zen CPUs? Recent kernels/distributions use acpi_cpufreq and
    can use amd_pstate or even amd_pstate_epp. More concerning than the turbo
    boost could be the handling of the package power limits used in Zen CPUs,
    which could prevent all cores from even reaching base clock. To be checked…
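
On the pinning point above, one way to approximate the CCX layout is to group CPUs by the set of CPUs that share their L3 cache. This is a minimal sketch and rests on several assumptions: that the Linux sysfs file `cache/index3/shared_cpu_list` exists, that `index3` is the L3 cache, and that the values reflect the physical topology. That last assumption may well not hold inside a PV dom0, where the Xen-side topology is likely the better source. The function names are illustrative only.

```python
import glob
import re


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_shared_l3(shared_lists):
    """Map {cpu: shared_cpu_list string} to a sorted list of unique CPU groups."""
    groups = {tuple(parse_cpu_list(s)) for s in shared_lists.values()}
    return sorted(list(g) for g in groups)


def read_l3_groups(sysfs_root="/sys/devices/system/cpu"):
    """Read L3 sharing information from sysfs (assumed path, Linux only)."""
    shared = {}
    pattern = f"{sysfs_root}/cpu[0-9]*/cache/index3/shared_cpu_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/cache", path).group(1))
        with open(path) as f:
            shared[cpu] = f.read()
    return group_by_shared_l3(shared)


if __name__ == "__main__":
    for group in read_l3_groups():
        print(group)
```

Each printed group is a candidate set of CPUs to keep a guest's vcpus (or the dom0 kthreads serving it) together on, if the topology seen there can be trusted.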

Profiling
---------

We profiled iperf on dom0 and guests on EPYC, older AMD desktop, and Xeon
machines and gathered profiling traces, but the analysis is still ongoing.

- localhost:
Client and server were profiled on both dom0 and guest runs for a Xeon, an
old FX and a Zen platform, to analyze the discrepancy shown by the localhost
tests earlier. It shows we spend a larger chunk of time in the copyout() or
copyin() functions on EPYC and FX. This is likely related to the use of
copy_user_generic_string() on EPYC (Zen1) and the old FX, whereas the Xeon uses
copy_user_enhanced_fast_string(), as it has ERMS support. But on the same
machine, guests are going way faster, and the implementation of
copy_user_generic_string() is the same between the dom0 and guests, so this is
likely related to other changes in kernel and userland, and not only to these
functions. Therefore it likely isn't directly linked to the issue.

- vm to vm: server, client & dom0 -> profiling traces to be analysed.
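
As a quick cross-check of which copy routine a given kernel is expected to use, the CPU flags in /proc/cpuinfo can be inspected. This is a minimal sketch, assuming an older x86-64 Linux kernel such as 4.19, where copy_user_generic() picks copy_user_enhanced_fast_string() when `erms` is set, otherwise copy_user_generic_string() when `rep_good` is set, otherwise copy_user_generic_unrolled(). The helper names are hypothetical, and a dom0 or guest only sees the flags Xen exposes to it.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def expected_copy_routine(flags):
    """Mirror the selection order assumed above for older kernels."""
    if "erms" in flags:
        return "copy_user_enhanced_fast_string"
    if "rep_good" in flags:
        return "copy_user_generic_string"
    return "copy_user_generic_unrolled"


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(expected_copy_routine(cpu_flags(f.read())))
```

Running this in dom0 and in a guest on the same machine should show whether both are expected to take the same path, which is what the profiles suggest.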

TODO
----

- More Analysis of profiling traces in VM to VM case
- X2APIC (not enabled on the machines and setup we are using)
- Profiling at xen level / hypercalls
- Tests on a clean install of a newer Xen version
- Dig some more on cpu scaling, likely not the root of the problem but there
  could be some gains to be made.

[1] https://github.com/ssvb/tinymembench
[2] https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
[3] https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515

-- 
David Morel



* Re: AMD EPYC VM to VM performance investigation
From: Stefano Stabellini @ 2024-01-05  0:39 UTC
  To: David Morel
  Cc: xen-devel, xenia.ragiadakou, andrew.cooper3, Jan Beulich,
	Roger Pau Monné, Juergen Gross, sstabellini

On Thu, 4 Jan 2024, David Morel wrote:
> Hello,
> 
> We have a customer and multiple users on our forum having performances that
> seems quite low related to the general performance of the machines on AMD EPYC
> Zen hosts when doing VM to VM networking.

By "VM to VM networking" I take it you mean VM-to-VM on the same host using
PV network?


> Below you'll find a write up about what we had a look at and what's in the
> TODO on our side, but in the meantime we would like to ask here for some
> feedback, suggestions and possible leads.
> 
> To sum up, the VM to VM performance on Zen generation server CPUs seems quite
> low, and only minimally scaling when adding threads. They are outperformed by
> 10 year old AMD desktop cpu and pretty low frequency XEON bronze from 2014.
> CPU usage does not seem to be the limiting factor as neither the VM threads or
> the kthreads on host seems to go to a 100% cpu usage.
> 
> As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a dom0 kernel
> 4.19. I did try a Xen 4.18-rc2 and kernel 6.1.56 on a Zen4 epyc, but as it was
> borrowed from a colleague I was unsure of the setup, so although it was
> actually worse than on my other test setups, I would not consider that a
> complete validation the issues is also present on recent Xen versions.

I think it might be difficult to triage this if you are working on a
Xen/Linux version that is so different from upstream.


> 1. Has anybody else noticed a similar behavior?
> 2. Has anybody done any kind of investigation about it beside us?
> 3. Any insight and suggestions of other points to look at would be welcome :)
> 
> And now the lengthy part about what we tested, I tried to make it shorter and
> more legible than a full report…
> 
> Investigated
> ------------
> 
> - Bench various cpu with iperf2 (iperf3 is not actually multithreaded):
>   - amd fx8320e, xeon 3106: not impacted.
>   - epyc 7451, 7443, 7302p, 7313p, 9124: impacted, but the zen4 one scales a
>     bit more than zen1, 2 and 3.
>   - ryzen 5950x, ryzen 7600: performances should likely be better than
>     observed results, but still way better than epycs, and scaling nicely with
>     more threads.
> - Bench with tinymembench[1]: performances were as expected and didn't show
>   issues with rep movsb as discussed in this article[2] and issue[3]. Which
>   makes sense as it looks like this issues is related to ERMS support which is
>   not present on Zen1 and 2 where the issue has been raised.
> - Bench skb allocation with a small kernel module measuring cycles: actually
>   same or lower on epyc than on the xeon with higher frequency so can be
>   considered faster and likely not related to our issue.
> - mitigations: we tried disabling what can be disabled through boot
>   parameters, both for xen, dom0 and guests, but this made no differences.
> - disabling AVX; Zen cpus before zen4 are know to limit boost and cpu scaling
>   when doing heavy AVX load on one core, there was no reason to think this was
>   related, but it was a quick test and as expected had no effect.
> - localhost iperf bench on dom0 and guests: we noticed that on other machines
>   host/guest with 1 threads are almost 1:1, with 4 threads guests are about
>   generally not scaling as well in guests. On epyc machines, host tests were
>   significantly slower than guests both with 1 and 4 threads, first
>   investigation of profiling didn't help finding a cause yet. More in the
>   profiling and TODO.

Wait, are you saying that the localhost iperf benchmark is faster in a
VM compared to host ("host" I take to mean baremetal Linux without a
hypervisor)? Maybe you meant the other way around?




* Re: AMD EPYC VM to VM performance investigation
From: David Morel @ 2024-01-05  9:03 UTC
  To: Stefano Stabellini
  Cc: xen-devel, xenia.ragiadakou, andrew.cooper3, Jan Beulich,
	Roger Pau Monné, Juergen Gross

On Thu, Jan 04, 2024 at 16:39:46PM, Stefano Stabellini wrote:
> On Thu, 4 Jan 2024, David Morel wrote:
> > Hello,
> > 
> > We have a customer and multiple users on our forum having performances that
> > seems quite low related to the general performance of the machines on AMD EPYC
> > Zen hosts when doing VM to VM networking.
> 
> By "VM to VM networking" I take you mean VM-to-VM on the same host using
> PV network?

Yes, sorry, I thought I had mentioned it.
> 
> 
> > Below you'll find a write up about what we had a look at and what's in the
> > TODO on our side, but in the meantime we would like to ask here for some
> > feedback, suggestions and possible leads.
> > 
> > To sum up, the VM to VM performance on Zen generation server CPUs seems quite
> > low, and only minimally scaling when adding threads. They are outperformed by
> > 10 year old AMD desktop cpu and pretty low frequency XEON bronze from 2014.
> > CPU usage does not seem to be the limiting factor as neither the VM threads or
> > the kthreads on host seems to go to a 100% cpu usage.
> > 
> > As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a dom0 kernel
> > 4.19. I did try a Xen 4.18-rc2 and kernel 6.1.56 on a Zen4 epyc, but as it was
> > borrowed from a colleague I was unsure of the setup, so although it was
> > actually worse than on my other test setups, I would not consider that a
> > complete validation the issues is also present on recent Xen versions.
> 
> I think it might be difficult to triage this if you are working on a
> Xen/Linux version that is so different from upstream

That's what I feared, which is also why I listed it here.
> 
> 
> > 1. Has anybody else noticed a similar behavior?
> > 2. Has anybody done any kind of investigation about it beside us?
> > 3. Any insight and suggestions of other points to look at would be welcome :)
> > 
> > And now the lengthy part about what we tested, I tried to make it shorter and
> > more legible than a full report…
> > 
> > Investigated
> > ------------
> > 
> > - Bench various cpu with iperf2 (iperf3 is not actually multithreaded):
> >   - amd fx8320e, xeon 3106: not impacted.
> >   - epyc 7451, 7443, 7302p, 7313p, 9124: impacted, but the zen4 one scales a
> >     bit more than zen1, 2 and 3.
> >   - ryzen 5950x, ryzen 7600: performances should likely be better than
> >     observed results, but still way better than epycs, and scaling nicely with
> >     more threads.
> > - Bench with tinymembench[1]: performances were as expected and didn't show
> >   issues with rep movsb as discussed in this article[2] and issue[3]. Which
> >   makes sense as it looks like this issues is related to ERMS support which is
> >   not present on Zen1 and 2 where the issue has been raised.
> > - Bench skb allocation with a small kernel module measuring cycles: actually
> >   same or lower on epyc than on the xeon with higher frequency so can be
> >   considered faster and likely not related to our issue.
> > - mitigations: we tried disabling what can be disabled through boot
> >   parameters, both for xen, dom0 and guests, but this made no differences.
> > - disabling AVX; Zen cpus before zen4 are know to limit boost and cpu scaling
> >   when doing heavy AVX load on one core, there was no reason to think this was
> >   related, but it was a quick test and as expected had no effect.
> > - localhost iperf bench on dom0 and guests: we noticed that on other machines
> >   host/guest with 1 threads are almost 1:1, with 4 threads guests are about
> >   generally not scaling as well in guests. On epyc machines, host tests were
> >   significantly slower than guests both with 1 and 4 threads, first
> >   investigation of profiling didn't help finding a cause yet. More in the
> >   profiling and TODO.
> 
> Wait, are you saying that the localhost iperf benchmark is faster in a
> VM compared to host ("host" I take means baremetal Linux without a
> hypervisor) ?   Maybe you meant the other way around?

I meant it is faster on domUs than on dom0. As mentioned below in the
profiling part, it does seem to come down to the kernel and/or userland,
but, unlike what I thought at first, not to the copy_user_* functions, as
they are precisely the same. I was kind of hoping it could be related to
better handling of rep movsb in those on newer kernels... An
additional test with AlmaLinux 8, to have an environment closer to the dom0,
seems to yield performance similar to dom0.

> > - cpu load: top/htop/xentop all seem to indicate that machines are not under
> >   full load, queue allocations on dom0 for VIF are by default (1 per vcpu) and
> >   seem to be all used when traffic is running but at a percentage below 100%
> >   per core/thread.
> > - pinning: manually pinning dom0 and guests to the same node and avoiding
> >   sharing cpu "threads" between host and guests gives a minimal increase of a
> >   few percents, but nothing drastic. Note, we do not know about the
> >   ccd/ccx/node mapping on these cpus, so we are not sure all memory access are
> >   "local".
> > - sched weight: playing with sched weight to prioritize dom0 did not make a
> >   difference either, which makes sense as the system are not under full load.
> > - cpu scaling: it is unlikely the core of the issue, but indeed the cpu
> >   scaling does not take advantage of the boost, never going above the base
> >   clock of these cpus. Also it also seems that less cores that the number of
> >   working kthreads/vcpus are going to base clock, may be normal in regard to
> >   the system not being fully loaded, to be defined.
> >   - QUESTION: is the powernow support in xen cpufreq implementation sufficient
> >     for zen cpus? Recent kernels/distributions use acpi_cpufreq and can use
> >     amd_pstate or even amd_pstate_epp. More concerning than the turbo boost
> >     could be the handling of package power limitation used in Zen CPUs that
> >     could prevent even all cores to base clock, to be checked…
> > 
> > Profiling
> > ---------
> > 
> > We profiled iperf on dom0 and guests on epyc, older amd desktop, and xeon
> > machines and gathered profiling traces, but analysis are still ongoing.
> > 
> > - localhost:
> > Client and server were profiled both on dom0 and guests runs for a xeon, an
> > old FX and a zen platform, to analyze the discrepancy shown by the localhost
> > tests earlier. It shows we spend a larger chunk of time in the copyout() or
> > copyin() functions on epyc and fx. This is likely related to the use of
> > copy_user_generic_string() on epyc (zen1) and old FX, whereas xeon uses
> > copy_user_enhanced_fast_string(), as it has ERMS support.  But on the same
> > machine, guests are going way faster, and the implementation of
> > copy_user_generic_string() is the same between the dom0 and guests, so this is
> > likely related to other changes in kernel and userland, and not only to these
> > function. Therefore it likely isn't directly linked to the issue.
> > 
> > - vm to vm: server, client & dom0 -> profiling traces to be analysed.
> > 
> > TODO
> > ----
> > 
> > - More Analysis of profiling traces in VM to VM case
> > - X2APIC (not enabled on the machines and setup we are using)
> > - Profiling at xen level / hypercalls
> > - Tests on a clean install of a newer Xen version
> > - Dig some more on cpu scaling, likely not the root of the problem but could
> >   be some gain to make.
> > 
> > [1] https://github.com/ssvb/tinymembench
> > [2] https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
> > [3] https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
> > 
> > -- 
> > David Morel



* Re: AMD EPYC VM to VM performance investigation
From: David Morel @ 2024-01-10 10:21 UTC
  To: Stefano Stabellini
  Cc: xen-devel, xenia.ragiadakou, andrew.cooper3, Jan Beulich,
	Roger Pau Monné, Juergen Gross

On Thu, Jan 04, 2024 at 16:39:46PM, Stefano Stabellini wrote:
> On Thu, 4 Jan 2024, David Morel wrote:
> > Hello,
> > 
> > We have a customer and multiple users on our forum having performances that
> > seems quite low related to the general performance of the machines on AMD EPYC
> > Zen hosts when doing VM to VM networking.
> 
> By "VM to VM networking" I take you mean VM-to-VM on the same host using
> PV network?
> 
> 
> > Below you'll find a write up about what we had a look at and what's in the
> > TODO on our side, but in the meantime we would like to ask here for some
> > feedback, suggestions and possible leads.
> > 
> > To sum up, the VM to VM performance on Zen generation server CPUs seems quite
> > low, and only minimally scaling when adding threads. They are outperformed by
> > 10 year old AMD desktop cpu and pretty low frequency XEON bronze from 2014.
> > CPU usage does not seem to be the limiting factor as neither the VM threads or
> > the kthreads on host seems to go to a 100% cpu usage.
> > 
> > As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a dom0 kernel
> > 4.19. I did try a Xen 4.18-rc2 and kernel 6.1.56 on a Zen4 epyc, but as it was
> > borrowed from a colleague I was unsure of the setup, so although it was
> > actually worse than on my other test setups, I would not consider that a
> > complete validation the issues is also present on recent Xen versions.
> 
> I think it might be difficult to triage this if you are working on a
> Xen/Linux version that is so different from upstream

I ran some tests on a Xen 4.13.5 with a 6.6.10 dom0 kernel, and on XCP-ng on
the same machine. The performance is similar: a few percent better on
the recent Xen, but still pretty low for such a machine and similar to
the other EPYCs we looked at.




* Re: AMD EPYC VM to VM performance investigation
From: David Morel @ 2024-01-26  7:35 UTC
  To: Stefano Stabellini
  Cc: xen-devel, xenia.ragiadakou, andrew.cooper3, Jan Beulich,
	Roger Pau Monné, Juergen Gross

On Wed, Jan 10, 2024 at 11:21:06AM, David Morel wrote:
> > I think it might be difficult to triage this if you are working on a
> > Xen/Linux version that is so different from upstream
> I ran some tests on a Xen 4.13.5 with a dom0 in 6.6.10, and on an XCP-ng on
> the same machine, the performances are similar, a few percent better on
> the recent Xen, but still pretty low for such a machine and similar to
> other EPYC we looked at.

I only recently went over my message and realized I typed the XCP-ng
version out of habit... The test I was talking about was actually on
Xen 4.17.3, so recent Xen and recent kernel. Sorry about that.

