Kernel KVM virtualization development
 help / color / mirror / Atom feed
* Re: [PATCH v3 00/18] vDSO: Introduce generic data storage
       [not found] <20250204-vdso-store-rng-v3-0-13a4669dfc8c@linutronix.de>
@ 2025-02-06  9:31 ` David Woodhouse
  2025-02-06 10:59   ` Thomas Weißschuh
  2025-02-14 11:34   ` Thomas Gleixner
  0 siblings, 2 replies; 5+ messages in thread
From: David Woodhouse @ 2025-02-06  9:31 UTC (permalink / raw)
  To: Thomas Weißschuh, James E.J. Bottomley, Helge Deller,
	Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino,
	Anna-Maria Behnsen, Frederic Weisbecker, Andrew Morton,
	Catalin Marinas, Will Deacon, Theodore Ts'o,
	Jason A. Donenfeld, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Huacai Chen, WANG Xuerui, Russell King, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Bogendoerfer, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Naveen N Rao,
	Madhavan Srinivasan, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86, H. Peter Anvin, Arnd Bergmann, Guo Ren
  Cc: linux-parisc, linux-kernel, linux-arm-kernel, linux-riscv,
	loongarch, linux-s390, linux-mips, linuxppc-dev, linux-arch,
	Nam Cao, linux-csky, Ridoux, Julien, Luu, Ryan, kvm

[-- Attachment #1: Type: text/plain, Size: 1724 bytes --]

On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> Currently each architecture defines the setup of the vDSO data page on
> its own, mostly through copy-and-paste from some other architecture.
> Extend the existing generic vDSO implementation to also provide generic
> data storage.
> This removes duplicated code and paves the way for further changes to
> the generic vDSO implementation without having to go through a lot of
> per-architecture changes.
> 
> Based on v6.14-rc1 and intended to be merged through the tip tree.

Thanks for working on this. Is there a plan to expose the time data
directly to userspace in a form which is usable *other* than by
function calls which get the value of the clock at a given moment?

For populating the vmclock device¹ we need to know the actual
relationship between the hardware counter (TSC, arch timer, etc.) and
real time in order to propagate that to the guest.

I see two options for doing this:

 1. Via userspace, exposing the vdso time data (and a notification when
    it changes?) and letting the userspace VMM populate the vmclock.
    This is complex for x86 because of TSC scaling; in fact userspace
    doesn't currently know the precise scaling from host to guest TSC
    so we'd have to be able to extract that from KVM.

 2. In kernel, asking KVM to populate the vmclock structure much like
    it does other pvclocks shared with the guest. KVM/x86 already uses
    pvclock_gtod_register_notifier() to hook changes; should we expand
    on that? The problem with that notifier is that it seems to be
    called far more frequently than I'd expect.



¹ https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3 00/18] vDSO: Introduce generic data storage
  2025-02-06  9:31 ` [PATCH v3 00/18] vDSO: Introduce generic data storage David Woodhouse
@ 2025-02-06 10:59   ` Thomas Weißschuh
  2025-02-07 10:15     ` David Woodhouse
  2025-02-14 11:34   ` Thomas Gleixner
  1 sibling, 1 reply; 5+ messages in thread
From: Thomas Weißschuh @ 2025-02-06 10:59 UTC (permalink / raw)
  To: David Woodhouse
  Cc: James E.J. Bottomley, Helge Deller, Andy Lutomirski,
	Thomas Gleixner, Vincenzo Frascino, Anna-Maria Behnsen,
	Frederic Weisbecker, Andrew Morton, Catalin Marinas, Will Deacon,
	Theodore Ts'o, Jason A. Donenfeld, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Huacai Chen, WANG Xuerui, Russell King,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, Naveen N Rao,
	Madhavan Srinivasan, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86, H. Peter Anvin, Arnd Bergmann, Guo Ren, linux-parisc,
	linux-kernel, linux-arm-kernel, linux-riscv, loongarch,
	linux-s390, linux-mips, linuxppc-dev, linux-arch, Nam Cao,
	linux-csky, Ridoux, Julien, Luu, Ryan, kvm

On Thu, Feb 06, 2025 at 09:31:42AM +0000, David Woodhouse wrote:
> On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> > Currently each architecture defines the setup of the vDSO data page on
> > its own, mostly through copy-and-paste from some other architecture.
> > Extend the existing generic vDSO implementation to also provide generic
> > data storage.
> > This removes duplicated code and paves the way for further changes to
> > the generic vDSO implementation without having to go through a lot of
> > per-architecture changes.
> > 
> > Based on v6.14-rc1 and intended to be merged through the tip tree.

Note: The real answer will need to come from the timekeeping
maintainers, my personal two cents below.

> Thanks for working on this. Is there a plan to expose the time data
> directly to userspace in a form which is usable *other* than by
> function calls which get the value of the clock at a given moment?

There are no current plans that I am aware of.

> For populating the vmclock device¹ we need to know the actual
> relationship between the hardware counter (TSC, arch timer, etc.) and
> real time in order to propagate that to the guest.
> 
> I see two options for doing this:
> 
>  1. Via userspace, exposing the vdso time data (and a notification when
>     it changes?) and letting the userspace VMM populate the vmclock.
>     This is complex for x86 because of TSC scaling; in fact userspace
>     doesn't currently know the precise scaling from host to guest TSC
>     so we'd have to be able to extract that from KVM.

Exposing the raw vdso time data is problematic as it precludes any
evolution to its datastructures, like the one we are currently doing.

An additional, trimmed down and stable data structure could be used.
But I don't think it makes sense. The vDSO is all about a stable
highlevel function interface on top of an unstable data interface.
However the vmclock needs the lowlevel data to populate its own
datastructure, wrapping raw data access in function calls is unnecessary.
If no functions are involved then the vDSO is not needed. The data can
be maintained separately in any other place in the kernel and accessed
or mapped by userspace from there.
Also the vDSO does not have an active notification mechanism, this would
probably be implemented through a filedescriptor, but then the data
can also be mapped through exactly that fd.

>  2. In kernel, asking KVM to populate the vmclock structure much like
>     it does other pvclocks shared with the guest. KVM/x86 already uses
>     pvclock_gtod_register_notifier() to hook changes; should we expand
>     on that? The problem with that notifier is that it seems to be
>     called far more frequently than I'd expect.

This sounds better, especially as any custom ABI from the host kernel to
the VMM would look a lot like the vmclock structure anyways.

Timekeeper updates are indeed very frequent, but what are the concrete
issues? That frequency is fine for regular vDSO data page updates,
updating the vmclock data page should be very similar.
The timekeeper core can pass context to the notifier callbacks, maybe
this can be used to skip some expensive steps where possible.

> ¹ https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3 00/18] vDSO: Introduce generic data storage
  2025-02-06 10:59   ` Thomas Weißschuh
@ 2025-02-07 10:15     ` David Woodhouse
  0 siblings, 0 replies; 5+ messages in thread
From: David Woodhouse @ 2025-02-07 10:15 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: James E.J. Bottomley, Helge Deller, Andy Lutomirski,
	Thomas Gleixner, Vincenzo Frascino, Anna-Maria Behnsen,
	Frederic Weisbecker, Andrew Morton, Catalin Marinas, Will Deacon,
	Theodore Ts'o, Jason A. Donenfeld, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Huacai Chen, WANG Xuerui, Russell King,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, Naveen N Rao,
	Madhavan Srinivasan, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86, H. Peter Anvin, Arnd Bergmann, Guo Ren, linux-parisc,
	linux-kernel, linux-arm-kernel, linux-riscv, loongarch,
	linux-s390, linux-mips, linuxppc-dev, linux-arch, Nam Cao,
	linux-csky, Ridoux, Julien, Luu, Ryan, kvm

[-- Attachment #1: Type: text/plain, Size: 4065 bytes --]

On Thu, 2025-02-06 at 11:59 +0100, Thomas Weißschuh wrote:
> On Thu, Feb 06, 2025 at 09:31:42AM +0000, David Woodhouse wrote:
> > On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> > > Currently each architecture defines the setup of the vDSO data page on
> > > its own, mostly through copy-and-paste from some other architecture.
> > > Extend the existing generic vDSO implementation to also provide generic
> > > data storage.
> > > This removes duplicated code and paves the way for further changes to
> > > the generic vDSO implementation without having to go through a lot of
> > > per-architecture changes.
> > > 
> > > Based on v6.14-rc1 and intended to be merged through the tip tree.
> 
> Note: The real answer will need to come from the timekeeping
> maintainers, my personal two cents below.
> 
> > Thanks for working on this. Is there a plan to expose the time data
> > directly to userspace in a form which is usable *other* than by
> > function calls which get the value of the clock at a given moment?
> 
> There are no current plans that I am aware of.
> 
> > For populating the vmclock device¹ we need to know the actual
> > relationship between the hardware counter (TSC, arch timer, etc.) and
> > real time in order to propagate that to the guest.
> > 
> > I see two options for doing this:
> > 
> >  1. Via userspace, exposing the vdso time data (and a notification when
> >     it changes?) and letting the userspace VMM populate the vmclock.
> >     This is complex for x86 because of TSC scaling; in fact userspace
> >     doesn't currently know the precise scaling from host to guest TSC
> >     so we'd have to be able to extract that from KVM.
> 
> Exposing the raw vdso time data is problematic as it precludes any
> evolution to its datastructures, like the one we are currently doing.
> 
> An additional, trimmed down and stable data structure could be used.
> But I don't think it makes sense. The vDSO is all about a stable
> highlevel function interface on top of an unstable data interface.
> However the vmclock needs the lowlevel data to populate its own
> datastructure, wrapping raw data access in function calls is unnecessary.
> If no functions are involved then the vDSO is not needed. The data can
> be maintained separately in any other place in the kernel and accessed
> or mapped by userspace from there.
> Also the vDSO does not have an active notification mechanism, this would
> probably be implemented through a filedescriptor, but then the data
> can also be mapped through exactly that fd.
> 
> >  2. In kernel, asking KVM to populate the vmclock structure much like
> >     it does other pvclocks shared with the guest. KVM/x86 already uses
> >     pvclock_gtod_register_notifier() to hook changes; should we expand
> >     on that? The problem with that notifier is that it seems to be
> >     called far more frequently than I'd expect.
> 
> This sounds better, especially as any custom ABI from the host kernel to
> the VMM would look a lot like the vmclock structure anyways.
> 
> Timekeeper updates are indeed very frequent, but what are the concrete
> issues? That frequency is fine for regular vDSO data page updates,
> updating the vmclock data page should be very similar.
> The timekeeper core can pass context to the notifier callbacks, maybe
> this can be used to skip some expensive steps where possible.

In the context of a hypervisor with lots of guests running, that's a
lot of pointless steal time. But it isn't just that; ISTR the result
was also *inaccurate*.

I need to go back and reproduce the testing, but I think it was
constantly adjusting the apparent rate even with no changed inputs from
NTP. Where the number of clock counts per jiffy wasn't an integer, the
notification would be constantly changing, for example to report 333333
counts per jiffy for most of the time, and occasionally 333334 counts
for a single jiffy before flipping back again. Or something like that.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3 00/18] vDSO: Introduce generic data storage
  2025-02-06  9:31 ` [PATCH v3 00/18] vDSO: Introduce generic data storage David Woodhouse
  2025-02-06 10:59   ` Thomas Weißschuh
@ 2025-02-14 11:34   ` Thomas Gleixner
  2025-02-14 12:04     ` David Woodhouse
  1 sibling, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2025-02-14 11:34 UTC (permalink / raw)
  To: David Woodhouse, Thomas Weißschuh, James E.J. Bottomley,
	Helge Deller, Andy Lutomirski, Vincenzo Frascino,
	Anna-Maria Behnsen, Frederic Weisbecker, Andrew Morton,
	Catalin Marinas, Will Deacon, Theodore Ts'o,
	Jason A. Donenfeld, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Huacai Chen, WANG Xuerui, Russell King, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Bogendoerfer, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Naveen N Rao,
	Madhavan Srinivasan, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86, H. Peter Anvin, Arnd Bergmann, Guo Ren
  Cc: linux-parisc, linux-kernel, linux-arm-kernel, linux-riscv,
	loongarch, linux-s390, linux-mips, linuxppc-dev, linux-arch,
	Nam Cao, linux-csky, Ridoux, Julien, Luu, Ryan, kvm

David!

On Thu, Feb 06 2025 at 09:31, David Woodhouse wrote:
> Thanks for working on this. Is there a plan to expose the time data
> directly to userspace in a form which is usable *other* than by
> function calls which get the value of the clock at a given moment?
>
> For populating the vmclock device¹ we need to know the actual
> relationship between the hardware counter (TSC, arch timer, etc.) and
> real time in order to propagate that to the guest.
>
> I see two options for doing this:
>
>  1. Via userspace, exposing the vdso time data (and a notification when
>     it changes?) and letting the userspace VMM populate the vmclock.
>     This is complex for x86 because of TSC scaling; in fact userspace
>     doesn't currently know the precise scaling from host to guest TSC
>     so we'd have to be able to extract that from KVM.

Exposing the raw data is not going to happen as we would create an ABI
preventing any modifications to the internals. VDSO data is considered a
fully internal (think kernel) representation and the accessor functions
create an ABI around it. So if at all you can add a accessor function
which exposes data to user space so that the internal data
representation can still be modified as necessary.

>  2. In kernel, asking KVM to populate the vmclock structure much like
>     it does other pvclocks shared with the guest. KVM/x86 already uses
>     pvclock_gtod_register_notifier() to hook changes; should we expand
>     on that? The problem with that notifier is that it seems to be
>     called far more frequently than I'd expect.

It's called once per tick to expose the continous updates to the
conversion factors and related internal data.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3 00/18] vDSO: Introduce generic data storage
  2025-02-14 11:34   ` Thomas Gleixner
@ 2025-02-14 12:04     ` David Woodhouse
  0 siblings, 0 replies; 5+ messages in thread
From: David Woodhouse @ 2025-02-14 12:04 UTC (permalink / raw)
  To: Thomas Gleixner, Thomas Weißschuh, James E.J. Bottomley,
	Helge Deller, Andy Lutomirski, Vincenzo Frascino,
	Anna-Maria Behnsen, Frederic Weisbecker, Andrew Morton,
	Catalin Marinas, Will Deacon, Theodore Ts'o,
	Jason A. Donenfeld, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Huacai Chen, WANG Xuerui, Russell King, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Bogendoerfer, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Naveen N Rao,
	Madhavan Srinivasan, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86, H. Peter Anvin, Arnd Bergmann, Guo Ren
  Cc: linux-parisc, linux-kernel, linux-arm-kernel, linux-riscv,
	loongarch, linux-s390, linux-mips, linuxppc-dev, linux-arch,
	Nam Cao, linux-csky, Ridoux, Julien, Luu, Ryan, kvm

[-- Attachment #1: Type: text/plain, Size: 2581 bytes --]

On Fri, 2025-02-14 at 12:34 +0100, Thomas Gleixner wrote:
> >  2. In kernel, asking KVM to populate the vmclock structure much like
> >     it does other pvclocks shared with the guest. KVM/x86 already uses
> >     pvclock_gtod_register_notifier() to hook changes; should we expand
> >     on that? The problem with that notifier is that it seems to be
> >     called far more frequently than I'd expect.
> 
> It's called once per tick to expose the continous updates to the
> conversion factors and related internal data.

My recollection (a vague one) is that it's called, and reports
"changes", even when there *are* no changes to underlying conversion
factors. Something along the lines of "N ticks at 333 counts per tick,
then one tick at 334 counts per tick to catch up" because it can't
express the division factor completely without that discontinuity?

The actual 'error' caused by the apparent fluctuation in rate is
probably entirely negligible, but I am slightly concerned about the
steal time, if the hypervisor then spends stolen CPU time relaying all
those "changes" to the guest, and then the guest has to spend time
feeding the "changes" into its own timekeeping.

I'd like to strive for a mode where we only adjust what we tell guests,
when adjtimex actually changes the real timing factors.

In fact if we have a userspace tool like chrony feeding adjtimex based
on external NTP/PPS/whatever, that tool could probably calibrate a
stable host TSC directly against the external real time. And in that
mode maybe we don't even need to feed the guest from the kernel's
CLOCK_REALTIME; that would be just another conversion step to introduce
noise.

We might end up with the direct setup for dedicated hosting
environments, but I do also want to support the general-purpose QEMU-
based setup where we expose the host's CLOCK_REALTIME as efficiently as
possible.

How about this: A KVM feature to provide/populate the VMCLOCK, since
only KVM knows the precise TSC scaling (and can immediately flip the
VMCLOCK to report invalid state if the TSC becomes unreliable).

It can *either* be fed the precise TSC/realtime relationship from
userspace (maybe in a vmclock structure that *userspace* populates, so
all the kernel has to do is scale/offset to account for the guest TSC
being different from the host TSC).

Or it can be in 'automatic' mode, where it derives from the host's
timekeeping. Which at the moment would have "too many" updates for my
liking, but we can worry about that later if necessary.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-02-14 12:05 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20250204-vdso-store-rng-v3-0-13a4669dfc8c@linutronix.de>
2025-02-06  9:31 ` [PATCH v3 00/18] vDSO: Introduce generic data storage David Woodhouse
2025-02-06 10:59   ` Thomas Weißschuh
2025-02-07 10:15     ` David Woodhouse
2025-02-14 11:34   ` Thomas Gleixner
2025-02-14 12:04     ` David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox