* Re: PROPOSAL: Extend inline asm syntax with size spec
From: Borislav Petkov @ 2018-10-07 14:13 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Kate Stewart, Peter Zijlstra, Christopher Li, virtualization,
Masahiro Yamada, Nadav Amit, Jan Beulich, H. Peter Anvin,
Sam Ravnborg, Thomas Gleixner, x86, linux-sparse, Ingo Molnar,
linux-xtensa, Kees Cook, Chris Zankel, Michael Matz,
Josh Poimboeuf, Alok Kataria, Juergen Gross, gcc, Richard Biener,
Max Filippov, Greg Kroah-Hartman, linux-kernel, Philippe
In-Reply-To: <20181007132228.GJ29268@gate.crashing.org>
On Sun, Oct 07, 2018 at 08:22:28AM -0500, Segher Boessenkool wrote:
> GCC already estimates the *size* of inline asm, and this is required
> *for correctness*.
I didn't say it didn't - but the heuristic could use improving.
> So I guess the real issue is that the inline asm size estimate for x86
> isn't very good (since it has to be pessimistic, and x86 insns can be
> huge)?
Well, the size thing could be just a "parameter" or "hint" of sorts, to
tell gcc to inline the function X which is inlining the asm statement
into the function Y which is calling function X. If you look at the
patchset, it is moving everything to asm macros where gcc is apparently
able to do better inlining.
> > 3) asm ("...") __attribute__((asm_size(<size-expr>)));
>
> Eww.
Why?
> More precise *size* estimates, yes. And if the user lies he should not
> be surprised to get assembler errors, etc.
Yes.
Another option would be if gcc parses the inline asm directly and
does a more precise size estimation. Which is a lot more involved and
complicated solution so I guess we wanna look at the simpler ones first.
:-)
> I don't like 2) either. But 1) looks interesting, depends what its
> semantics would be? "Don't count this insn's size for inlining decisions",
> maybe?
Or simply "this asm statement has a size of 1" to mean, inline it
everywhere. Which has the same caveats as above.
> Another option is to just force inlining for those few functions where
> GCC currently makes an inlining decision you don't like. Or are there
> more than a few?
I'm afraid they're more than a few and this should work automatically,
if possible.
Thx.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
^ permalink raw reply
* PROPOSAL: Extend inline asm syntax with size spec
From: Borislav Petkov @ 2018-10-07 9:18 UTC (permalink / raw)
To: gcc, Richard Biener, Michael Matz
Cc: Kate Stewart, Peter Zijlstra, Christopher Li, virtualization,
Masahiro Yamada, Nadav Amit, Jan Beulich, H. Peter Anvin,
Sam Ravnborg, Thomas Gleixner, x86, linux-sparse, Ingo Molnar,
linux-xtensa, Kees Cook, Josh Poimboeuf, Alok Kataria,
Juergen Gross, Chris Zankel, Max Filippov, Greg Kroah-Hartman,
linux-kernel, Philippe Ombredanne, Linus Torvalds
In-Reply-To: <20181003213100.189959-1-namit@vmware.com>
Hi people,
this is an attempt to see whether gcc's inline asm heuristic when
estimating inline asm statements' cost for better inlining can be
improved.
AFAIU, the problematic arises when one ends up using a lot of inline
asm statements in the kernel but due to the inline asm cost estimation
heuristic which counts lines, I think, for example like in this here
macro:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/cpufeature.h#n162
the resulting code ends up not inlining the functions themselves which
use this macro. I.e., you see a CALL <function> instead of its body
getting inlined directly.
Even though it should be because the actual instructions are only a
couple in most cases and all those other directives end up in another
section anyway.
The issue is explained below in the forwarded mail in a larger detail
too.
Now, Richard suggested doing something like:
1) inline asm ("...")
2) asm ("..." : : : : <size-expr>)
3) asm ("...") __attribute__((asm_size(<size-expr>)));
with which user can tell gcc what the size of that inline asm statement
is and thus allow for more precise cost estimation and in the end better
inlining.
And FWIW 3) looks pretty straight-forward to me because attributes are
pretty common anyways.
But I'm sure there are other options and I'm sure people will have
better/different ideas so feel free to chime in.
Thx.
On Wed, Oct 03, 2018 at 02:30:50PM -0700, Nadav Amit wrote:
> This patch-set deals with an interesting yet stupid problem: kernel code
> that does not get inlined despite its simplicity. There are several
> causes for this behavior: "cold" attribute on __init, different function
> optimization levels; conditional constant computations based on
> __builtin_constant_p(); and finally large inline assembly blocks.
>
> This patch-set deals with the inline assembly problem. I separated these
> patches from the others (that were sent in the RFC) for easier
> inclusion. I also separated the removal of unnecessary new-lines which
> would be sent separately.
>
> The problem with inline assembly is that inline assembly is often used
> by the kernel for things that are other than code - for example,
> assembly directives and data. GCC however is oblivious to the content of
> the blocks and assumes their cost in space and time is proportional to
> the number of the perceived assembly "instruction", according to the
> number of newlines and semicolons. Alternatives, paravirt and other
> mechanisms are affected, causing code not to be inlined, and degrading
> compilation quality in general.
>
> The solution that this patch-set carries for this problem is to create
> an assembly macro, and then call it from the inline assembly block. As
> a result, the compiler sees a single "instruction" and assigns the more
> appropriate cost to the code.
>
> To avoid uglification of the code, as many noted, the macros are first
> precompiled into an assembly file, which is later assembled together
> with the C files. This also enables to avoid duplicate implementation
> that was set before for the asm and C code. This can be seen in the
> exception table changes.
>
> Overall this patch-set slightly increases the kernel size (my build was
> done using my Ubuntu 18.04 config + localyesconfig for the record):
>
> text data bss dec hex filename
> 18140829 10224724 2957312 31322865 1ddf2f1 ./vmlinux before
> 18163608 10227348 2957312 31348268 1de562c ./vmlinux after (+0.1%)
>
> The number of static functions in the image is reduced by 379, but
> actually inlining is even better, which does not always shows in these
> numbers: a function may be inlined causing the calling function not to
> be inlined.
>
> I ran some limited number of benchmarks, and in general the performance
> impact is not very notable. You can still see >10 cycles shaved off some
> syscalls that manipulate page-tables (e.g., mprotect()), in which
> paravirt caused many functions not to be inlined. In addition this
> patch-set can prevent issues such as [1], and improves code readability
> and maintainability.
>
> Update: Rasmus recently caused me (inadvertently) to become paranoid
> about the dependencies. To clarify: if any of the headers changes, any c
> file which uses macros that are included in macros.S would be fine as
> long as it includes the header as well (as it should). Adding an
> assertion to check this is done might become slightly ugly, and nobody
> else is concerned about it. Another minor issue is that changes of
> macros.S would not trigger a global rebuild, but that is pretty similar
> to changes of the Makefile that do not trigger a rebuild.
>
> [1] https://patchwork.kernel.org/patch/10450037/
>
> v8->v9: * Restoring the '-pipe' parameter (Rasmus)
> * Adding Kees's tested-by tag (Kees)
>
> v7->v8: * Add acks (Masahiro, Max)
> * Rebase on 4.19 (Ingo)
>
> v6->v7: * Fix context switch tracking (Ingo)
> * Fix xtensa build error (Ingo)
> * Rebase on 4.18-rc8
>
> v5->v6: * Removing more code from jump-labels (PeterZ)
> * Fix build issue on i386 (0-day, PeterZ)
>
> v4->v5: * Makefile fixes (Masahiro, Sam)
>
> v3->v4: * Changed naming of macros in 2 patches (PeterZ)
> * Minor cleanup of the paravirt patch
>
> v2->v3: * Several build issues resolved (0-day)
> * Wrong comments fix (Josh)
> * Change asm vs C order in refcount (Kees)
>
> v1->v2: * Compiling the macros into a separate .s file, improving
> readability (Linus)
> * Improving assembly formatting, applying most of the comments
> according to my judgment (Jan)
> * Adding exception-table, cpufeature and jump-labels
> * Removing new-line cleanup; to be submitted separately
>
> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> Cc: Sam Ravnborg <sam@ravnborg.org>
> Cc: Alok Kataria <akataria@vmware.com>
> Cc: Christopher Li <sparse@chrisli.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Kate Stewart <kstewart@linuxfoundation.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: linux-sparse@vger.kernel.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Philippe Ombredanne <pombredanne@nexb.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: virtualization@lists.linux-foundation.org
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: x86@kernel.org
> Cc: Chris Zankel <chris@zankel.net>
> Cc: Max Filippov <jcmvbkbc@gmail.com>
> Cc: linux-xtensa@linux-xtensa.org
>
> Nadav Amit (10):
> xtensa: defining LINKER_SCRIPT for the linker script
> Makefile: Prepare for using macros for inline asm
> x86: objtool: use asm macro for better compiler decisions
> x86: refcount: prevent gcc distortions
> x86: alternatives: macrofy locks for better inlining
> x86: bug: prevent gcc distortions
> x86: prevent inline distortion by paravirt ops
> x86: extable: use macros instead of inline assembly
> x86: cpufeature: use macros instead of inline assembly
> x86: jump-labels: use macros instead of inline assembly
>
> Makefile | 9 ++-
> arch/x86/Makefile | 7 ++
> arch/x86/entry/calling.h | 2 +-
> arch/x86/include/asm/alternative-asm.h | 20 ++++--
> arch/x86/include/asm/alternative.h | 11 +--
> arch/x86/include/asm/asm.h | 61 +++++++---------
> arch/x86/include/asm/bug.h | 98 +++++++++++++++-----------
> arch/x86/include/asm/cpufeature.h | 82 ++++++++++++---------
> arch/x86/include/asm/jump_label.h | 77 ++++++++------------
> arch/x86/include/asm/paravirt_types.h | 56 +++++++--------
> arch/x86/include/asm/refcount.h | 74 +++++++++++--------
> arch/x86/kernel/macros.S | 16 +++++
> arch/xtensa/kernel/Makefile | 4 +-
> include/asm-generic/bug.h | 8 +--
> include/linux/compiler.h | 56 +++++++++++----
> scripts/Kbuild.include | 4 +-
> scripts/mod/Makefile | 2 +
> 17 files changed, 331 insertions(+), 256 deletions(-)
> create mode 100644 arch/x86/kernel/macros.S
>
> --
> 2.17.1
>
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Andy Lutomirski @ 2018-10-06 22:28 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andrew Lutomirski, devel,
Paolo Bonzini, Thomas Gleixner, Matt Rickard
In-Reply-To: <20181006202731.GC7129@amt.cnet>
On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> > For better or for worse, I'm trying to understand this code. So far,
> > I've come up with this patch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> >
> > Is it correct, or am I missing some subtlety?
>
> The master clock, when initialized, has a pair
>
> masterclockvalues=(TSC value, time-of-day data).
>
> When updating the guest clock, we only update relative to (TSC value)
> that was read on masterclock initialization.
I don't see the problem. The masterclock data is updated here:
host_tsc_clocksource = kvm_get_time_and_clockread(
&ka->master_kernel_ns,
&ka->master_cycle_now);
kvm_get_time_and_clockread() gets those values from
do_monotonic_boot(), which, barring bugs, should cause
get_kvmclock_ns() to return exactly the same thing as
ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather
roundabout manner.
So what am I missing? Is there actually something wrong with my patch?
>
> See the following comment on x86.c:
I read that comment, and it's not obvious to me how it's related.
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Marcelo Tosatti @ 2018-10-06 20:49 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, devel, Paolo Bonzini, Thomas Gleixner,
Matt Rickard
In-Reply-To: <CALCETrXNKCNZzUXj=2sp1vBd_o7HHzfW6RwyY5v3-wMfbVaZnQ@mail.gmail.com>
On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> For better or for worse, I'm trying to understand this code. So far,
> I've come up with this patch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
>
> Is it correct, or am I missing some subtlety?
"In the non-fallback case, a bunch of gnarly arithmetic is done: a
hopefully matched pair of (TSC, boot time) is read, then all locks
are dropped, then the TSC frequency is read, a branch new multiplier
and shift is read, and the result is turned into a time.
This seems quite racy to me. Because locks are not held, I don't
see what keeps TSC frequency used in sync with the master clock
data."
If tsc_khz changes, the host TSC clocksource will not be used, which
disables masterclock:
if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) ||
(val == CPUFREQ_POSTCHANGE && freq->old >
freq->new)) {
*lpj = cpufreq_scale(loops_per_jiffy_ref, ref_freq,
freq->new);
tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq,
freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS))
mark_tsc_unstable("cpufreq changes");
In which case it ends up in:
- spin_lock(&ka->pvclock_gtod_sync_lock);
- if (!ka->use_master_clock) {
- spin_unlock(&ka->pvclock_gtod_sync_lock);
- return ktime_get_boot_ns() + ka->kvmclock_offset;
- }
masterclock -> non masterclock transition sets
a REQUEST bit on each vCPU, so as to invalidate any previous
clock reads.
static void kvm_gen_update_masterclock(struct kvm *kvm)
{
#ifdef CONFIG_X86_64
int i;
struct kvm_vcpu *vcpu;
struct kvm_arch *ka = &kvm->arch;
spin_lock(&ka->pvclock_gtod_sync_lock);
kvm_make_mclock_inprogress_request(kvm);
/* no guest entries from this point */
pvclock_update_vm_gtod_copy(kvm);
kvm_for_each_vcpu(i, vcpu, kvm)
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
/* guest entries allowed */
kvm_for_each_vcpu(i, vcpu, kvm)
kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
spin_unlock(&ka->pvclock_gtod_sync_lock);
#endif
/*
* If the host uses TSC clock, then passthrough TSC as stable
* to the guest.
*/
host_tsc_clocksource = kvm_get_time_and_clockread(
&ka->master_kernel_ns,
&ka->master_cycle_now);
ka->use_master_clock = host_tsc_clocksource && vcpus_matched
&& !ka->backwards_tsc_observed
&& !ka->boot_vcpu_runs_old_kvmclock;
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Marcelo Tosatti @ 2018-10-06 20:27 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, devel, Paolo Bonzini, Thomas Gleixner,
Matt Rickard
In-Reply-To: <CALCETrXNKCNZzUXj=2sp1vBd_o7HHzfW6RwyY5v3-wMfbVaZnQ@mail.gmail.com>
On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> For better or for worse, I'm trying to understand this code. So far,
> I've come up with this patch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
>
> Is it correct, or am I missing some subtlety?
The master clock, when initialized, has a pair
masterclockvalues=(TSC value, time-of-day data).
When updating the guest clock, we only update relative to (TSC value)
that was read on masterclock initialization.
See the following comment on x86.c:
/*
*
* Assuming a stable TSC across physical CPUS, and a stable TSC
* across virtual CPUs, the following condition is possible.
* Each numbered line represents an event visible to both
* CPUs at the next numbered event.
...
When updating the "masterclockvalues" pair, all vcpus are
stopped.
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Marcelo Tosatti @ 2018-10-05 21:18 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andy Lutomirski, devel, Paolo Bonzini,
Thomas Gleixner, Matt Rickard
In-Reply-To: <87k1mycfju.fsf@vitty.brq.redhat.com>
On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
> Marcelo Tosatti <mtosatti@redhat.com> writes:
>
> > On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote:
> >>
> >> There is a very long history of different (hardware) issues Marcelo was
> >> fighting with and the current code is the survived Frankenstein.
> >
> > Right, the code has to handle different TSC modes.
> >
> >> E.g. it
> >> is very, very unclear what "catchup", "always catchup" and
> >> masterclock-less mode in general are and if we still need them.
> >
> > Catchup means handling exposed (to guest) TSC frequency smaller than
> > HW TSC frequency.
> >
> > That is "frankenstein" code, could be removed.
> >
> >> That said I'm all for simplification. I'm not sure if we still need to
> >> care about buggy hardware though.
> >
> > What simplification is that again?
> >
>
> I was hoping to hear this from you :-) If I am to suggest how we can
> move forward I'd propose:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).
In that case just use TSC clocksource on the guest directly: i am
writing code for that now (its faster than pvclock syscall).
> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?
As Peter mentioned, non sync TSC hardware might exist in the future.
And older hardware must be supported.
^ permalink raw reply
* Re: [PATCH v2] virtio-gpu: add VIRTIO_GPU_F_EDID feature
From: Gerd Hoffmann @ 2018-10-05 15:55 UTC (permalink / raw)
To: Christophe de Dinechin
Cc: kvm, Michael S. Tsirkin, David Airlie, open list, dri-devel,
virtio-dev, Daniel Vetter, open list:VIRTIO GPU DRIVER
In-Reply-To: <2CB183E2-B7ED-481C-9899-CA908558D39F@redhat.com>
Hi,
> >>> +/* VIRTIO_GPU_RESP_OK_EDID */
> >>> +struct virtio_gpu_resp_edid {
> >>> + struct virtio_gpu_ctrl_hdr hdr;
> >>> + __le32 scanout;
> >>> + __le32 size;
> >>> + __u8 edid[1024];
> >>
> >> Wouldn’t it be enough to stick to EDID 2.0 (256 bytes)?
> >>
> >> If not, maybe add comment to explain why you chose 1024.
> >
> > EDID in the wild can be up to 512 bytes.
>
> Does this return a physical EDID? I thought it would be made-up by virtio-gpu.
Well, edid has extensions, so it can become pretty large in theory, and
I've figured it would be a good idea to leave some room just in case.
It should be a rather infrequent operation, so the unused buffer space
should not hurt much.
Yes, it will be a edid generated by qemu. The current generator code will
use at most 256 bytes.
cheers,
Gerd
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2] virtio-gpu: add VIRTIO_GPU_F_EDID feature
From: Daniel Vetter @ 2018-10-05 14:41 UTC (permalink / raw)
To: Christophe de Dinechin
Cc: kvm, Michael S. Tsirkin, David Airlie, open list, dri-devel,
virtio-dev, open list:VIRTIO GPU DRIVER
In-Reply-To: <1BA83F18-87E5-4B01-A2D9-01777078B637@redhat.com>
On Fri, Oct 05, 2018 at 04:38:11PM +0200, Christophe de Dinechin wrote:
>
>
> > On 5 Oct 2018, at 14:51, Gerd Hoffmann <kraxel@redhat.com> wrote:
> >
> > The feature allows the guest request an EDID blob (describing monitor
> > capabilities) for a given scanout (aka virtual monitor connector).
> >
> > It brings a new command message, which has just a scanout field (beside
> > the standard virtio-gpu header) and a response message which carries the
> > EDID data.
> >
> > Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> > ---
> > include/uapi/linux/virtio_gpu.h | 17 +++++++++++++++++
> > 1 file changed, 17 insertions(+)
> >
> > diff --git a/include/uapi/linux/virtio_gpu.h b/include/uapi/linux/virtio_gpu.h
> > index f43c3c6171..1cef1ff339 100644
> > --- a/include/uapi/linux/virtio_gpu.h
> > +++ b/include/uapi/linux/virtio_gpu.h
> > @@ -41,6 +41,7 @@
> > #include <linux/types.h>
> >
> > #define VIRTIO_GPU_F_VIRGL 0
> > +#define VIRTIO_GPU_F_EDID 1
> >
> > enum virtio_gpu_ctrl_type {
> > VIRTIO_GPU_UNDEFINED = 0,
> > @@ -56,6 +57,7 @@ enum virtio_gpu_ctrl_type {
> > VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING,
> > VIRTIO_GPU_CMD_GET_CAPSET_INFO,
> > VIRTIO_GPU_CMD_GET_CAPSET,
> > + VIRTIO_GPU_CMD_GET_EDID,
> >
> > /* 3d commands */
> > VIRTIO_GPU_CMD_CTX_CREATE = 0x0200,
> > @@ -76,6 +78,7 @@ enum virtio_gpu_ctrl_type {
> > VIRTIO_GPU_RESP_OK_DISPLAY_INFO,
> > VIRTIO_GPU_RESP_OK_CAPSET_INFO,
> > VIRTIO_GPU_RESP_OK_CAPSET,
> > + VIRTIO_GPU_RESP_OK_EDID,
> >
> > /* error responses */
> > VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200,
> > @@ -291,6 +294,20 @@ struct virtio_gpu_resp_capset {
> > __u8 capset_data[];
> > };
> >
> > +/* VIRTIO_GPU_CMD_GET_EDID */
> > +struct virtio_gpu_get_edid {
> > + struct virtio_gpu_ctrl_hdr hdr;
> > + __le32 scanout;
> > +};
> > +
> > +/* VIRTIO_GPU_RESP_OK_EDID */
> > +struct virtio_gpu_resp_edid {
> > + struct virtio_gpu_ctrl_hdr hdr;
> > + __le32 scanout;
> > + __le32 size;
> > + __u8 edid[1024];
>
> Wouldn’t it be enough to stick to EDID 2.0 (256 bytes)?
>
> If not, maybe add comment to explain why you chose 1024.
EDID in the wild can be up to 512 bytes.
-Daniel
>
> > +};
> > +
> > #define VIRTIO_GPU_EVENT_DISPLAY (1 << 0)
> >
> > struct virtio_gpu_config {
> > --
> > 2.9.3
> >
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH v2] virtio-gpu: add VIRTIO_GPU_F_EDID feature
From: Gerd Hoffmann @ 2018-10-05 12:51 UTC (permalink / raw)
To: virtio-dev
Cc: kvm, Michael S. Tsirkin, David Airlie, open list, dri-devel,
open list:VIRTIO GPU DRIVER
The feature allows the guest request an EDID blob (describing monitor
capabilities) for a given scanout (aka virtual monitor connector).
It brings a new command message, which has just a scanout field (beside
the standard virtio-gpu header) and a response message which carries the
EDID data.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
include/uapi/linux/virtio_gpu.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/include/uapi/linux/virtio_gpu.h b/include/uapi/linux/virtio_gpu.h
index f43c3c6171..1cef1ff339 100644
--- a/include/uapi/linux/virtio_gpu.h
+++ b/include/uapi/linux/virtio_gpu.h
@@ -41,6 +41,7 @@
#include <linux/types.h>
#define VIRTIO_GPU_F_VIRGL 0
+#define VIRTIO_GPU_F_EDID 1
enum virtio_gpu_ctrl_type {
VIRTIO_GPU_UNDEFINED = 0,
@@ -56,6 +57,7 @@ enum virtio_gpu_ctrl_type {
VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING,
VIRTIO_GPU_CMD_GET_CAPSET_INFO,
VIRTIO_GPU_CMD_GET_CAPSET,
+ VIRTIO_GPU_CMD_GET_EDID,
/* 3d commands */
VIRTIO_GPU_CMD_CTX_CREATE = 0x0200,
@@ -76,6 +78,7 @@ enum virtio_gpu_ctrl_type {
VIRTIO_GPU_RESP_OK_DISPLAY_INFO,
VIRTIO_GPU_RESP_OK_CAPSET_INFO,
VIRTIO_GPU_RESP_OK_CAPSET,
+ VIRTIO_GPU_RESP_OK_EDID,
/* error responses */
VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200,
@@ -291,6 +294,20 @@ struct virtio_gpu_resp_capset {
__u8 capset_data[];
};
+/* VIRTIO_GPU_CMD_GET_EDID */
+struct virtio_gpu_get_edid {
+ struct virtio_gpu_ctrl_hdr hdr;
+ __le32 scanout;
+};
+
+/* VIRTIO_GPU_RESP_OK_EDID */
+struct virtio_gpu_resp_edid {
+ struct virtio_gpu_ctrl_hdr hdr;
+ __le32 scanout;
+ __le32 size;
+ __u8 edid[1024];
+};
+
#define VIRTIO_GPU_EVENT_DISPLAY (1 << 0)
struct virtio_gpu_config {
--
2.9.3
^ permalink raw reply related
* [PATCH v2] drm/bochs: add edid support.
From: Gerd Hoffmann @ 2018-10-05 12:09 UTC (permalink / raw)
To: dri-devel
Cc: David Airlie, open list,
open list:DRM DRIVER FOR BOCHS VIRTUAL GPU
Recent qemu (latest master branch, upcoming 3.1 release) got support
for EDID data. This patch adds guest driver support.
EDID support in qemu is not (yet) enabled by default, so please use
'qemu -device VGA,edid=on' for testing.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
drivers/gpu/drm/bochs/bochs.h | 1 +
drivers/gpu/drm/bochs/bochs_hw.c | 38 ++++++++++++++++++++++++++++++++++++++
drivers/gpu/drm/bochs/bochs_kms.c | 18 +++++++++++++++---
3 files changed, 54 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/bochs/bochs.h b/drivers/gpu/drm/bochs/bochs.h
index e7a69077e4..06b8166efa 100644
--- a/drivers/gpu/drm/bochs/bochs.h
+++ b/drivers/gpu/drm/bochs/bochs.h
@@ -66,6 +66,7 @@ struct bochs_device {
u16 yres_virtual;
u32 stride;
u32 bpp;
+ struct edid *edid;
/* drm */
struct drm_device *dev;
diff --git a/drivers/gpu/drm/bochs/bochs_hw.c b/drivers/gpu/drm/bochs/bochs_hw.c
index cacff73a64..6ce4cdac38 100644
--- a/drivers/gpu/drm/bochs/bochs_hw.c
+++ b/drivers/gpu/drm/bochs/bochs_hw.c
@@ -69,6 +69,41 @@ static void bochs_hw_set_little_endian(struct bochs_device *bochs)
#define bochs_hw_set_native_endian(_b) bochs_hw_set_little_endian(_b)
#endif
+static int bochs_load_edid(struct bochs_device *bochs)
+{
+ uint8_t *blob;
+ size_t i, len;
+ uint8_t num_exts;
+
+ if (!bochs->mmio)
+ return -1;
+
+ if ((readb(bochs->mmio+0) != 0x00 ||
+ readb(bochs->mmio+1) != 0xff))
+ return -1;
+
+ num_exts = readb(bochs->mmio + 126);
+ len = EDID_LENGTH * (1 + num_exts);
+ if (len > 0x400 /* vga register offset */)
+ return -1;
+
+ kfree(bochs->edid);
+ bochs->edid = kmalloc(len, GFP_KERNEL);
+ blob = (void *)bochs->edid;
+ for (i = 0; i < len; i++) {
+ blob[i] = readb(bochs->mmio+i);
+ }
+
+ if (!drm_edid_is_valid(bochs->edid)) {
+ DRM_ERROR("EDID is not valid, ignoring.\n");
+ kfree(bochs->edid);
+ bochs->edid = NULL;
+ return -1;
+ }
+
+ return 0;
+}
+
int bochs_hw_init(struct drm_device *dev)
{
struct bochs_device *bochs = dev->dev_private;
@@ -150,6 +185,9 @@ int bochs_hw_init(struct drm_device *dev)
}
noext:
+ if (bochs_load_edid(bochs) == 0)
+ DRM_INFO("Found EDID data blob.\n");
+
return 0;
}
diff --git a/drivers/gpu/drm/bochs/bochs_kms.c b/drivers/gpu/drm/bochs/bochs_kms.c
index 9bc5b438ae..b9931443a7 100644
--- a/drivers/gpu/drm/bochs/bochs_kms.c
+++ b/drivers/gpu/drm/bochs/bochs_kms.c
@@ -213,10 +213,17 @@ static void bochs_encoder_init(struct drm_device *dev)
static int bochs_connector_get_modes(struct drm_connector *connector)
{
- int count;
+ struct bochs_device *bochs =
+ container_of(connector, struct bochs_device, connector);
+ int count = 0;
- count = drm_add_modes_noedid(connector, 8192, 8192);
- drm_set_preferred_mode(connector, defx, defy);
+ if (bochs->edid)
+ count = drm_add_edid_modes(connector, bochs->edid);
+
+ if (!count) {
+ count = drm_add_modes_noedid(connector, 8192, 8192);
+ drm_set_preferred_mode(connector, defx, defy);
+ }
return count;
}
@@ -271,6 +278,11 @@ static void bochs_connector_init(struct drm_device *dev)
drm_connector_helper_add(connector,
&bochs_connector_connector_helper_funcs);
drm_connector_register(connector);
+
+ if (bochs->edid) {
+ drm_connector_attach_edid_property(connector);
+ drm_connector_update_edid_property(connector, bochs->edid);
+ }
}
--
2.9.3
^ permalink raw reply related
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Andy Lutomirski @ 2018-10-04 22:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Wanpeng Li, Florian Weimer, X86 ML, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Stephen Boyd, LKML, Linux Virtualization,
John Stultz, Andrew Lutomirski, devel, Paolo Bonzini,
Thomas Gleixner, Matt Rickard
In-Reply-To: <499807AB-E779-40C3-AA3F-E8C77A7770EC@amacapital.net>
For better or for worse, I'm trying to understand this code. So far,
I've come up with this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
Is it correct, or am I missing some subtlety?
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Andy Lutomirski @ 2018-10-04 20:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra,
X86 ML, LKML, Linux Virtualization, Stephen Boyd, John Stultz,
Andrew Lutomirski, Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>
On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> implementation, which extended the clockid switch case and added yet
> another slightly different copy of the same code.
>
> Especially the extended switch case is problematic as the compiler tends to
> generate a jump table which then requires to use retpolines. If jump tables
> are disabled it adds yet another conditional to the existing maze.
>
> This series takes a different approach by consolidating the almost
> identical functions into one implementation for high resolution clocks and
> one for the coarse grained clock ids by storing the base data for each
> clock id in an array which is indexed by the clock id.
>
> This completely eliminates the switch case and allows further
> simplifications of the code base, which at the end all together gain a few
> cycles performance or at least stay on par with todays code. The resulting
> performance depends heavily on the micro architecture and the compiler.
tglx, please consider this whole series Acked-by: Andy Lutomirski
<luto@kernel.org>
Please feel free to push it top tip:x86/vdso, as long as you pull in
tip/x86/urgent first. Once it lands, I'll email out a couple of
follow-up patches.
--Andy
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Andy Lutomirski @ 2018-10-04 20:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Wanpeng Li, Florian Weimer, X86 ML, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Stephen Boyd, LKML, Linux Virtualization,
John Stultz, Andy Lutomirski, devel, Paolo Bonzini,
Thomas Gleixner, Matt Rickard
In-Reply-To: <20181004193150.GQ19272@hirez.programming.kicks-ass.net>
> On Oct 4, 2018, at 12:31 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 04, 2018 at 07:00:45AM -0700, Andy Lutomirski wrote:
>>> On Oct 4, 2018, at 1:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
>>>> I was hoping to hear this from you :-) If I am to suggest how we can
>>>> move forward I'd propose:
>>>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>>>> is supported).
>>>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>>>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>>>> we think that all the buggy hardware is already gone?
>>>
>>> No, and it is not the hardware you have to worry about (mostly), it is
>>> the frigging PoS firmware people put on it.
>>>
>>> Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
>>> after which it still can be, but bets are off). But even relatively
>>> recent systems fail the TSC sync test because firmware messes it up by
>>> writing to either MSR_TSC or MSR_TSC_ADJUST.
>>>
>>> But the thing is, if the TSC is not synced, you cannot use it for
>>> timekeeping, full stop. So having a single page is fine, it either
>>> contains a mult/shift that is valid, or it indicates TSC is messed up
>>> and you fall back to something else.
>>>
>>> There is no inbetween there.
>>>
>>> For sched_clock we can still use the global page, because the rate will
>>> still be the same for each cpu, it's just offset between CPUs and the
>>> code compensates for that.
>>
>> But if we’re in a KVM guest, then the clock will jump around on the
>> same *vCPU* when the vCPU migrates.
>
> Urgh yes..
>
>> But I don’t see how kvmclock helps here, since I don’t think it’s used
>> for sched_clock.
>
> I get hits on kvm_sched_clock, but haven't looked further.
Ok, so KVM is using the per-vCPU pvclock data for sched_clock. Which hopefully does something intelligent.
Regardless of any TSC syncing issues, a paravirt clock should presumably be used for sched_clock to account for time that the vCPU was stopped.
It would be fantastic if this stuff were documented much better, both in terms of the data structures and the Linux code.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Peter Zijlstra @ 2018-10-04 19:31 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Wanpeng Li, Florian Weimer, X86 ML, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Stephen Boyd, LKML, Linux Virtualization,
John Stultz, Andy Lutomirski, devel, Paolo Bonzini,
Thomas Gleixner, Matt Rickard
In-Reply-To: <B8C05946-43BB-40A4-A564-53904FAF8CC0@amacapital.net>
On Thu, Oct 04, 2018 at 07:00:45AM -0700, Andy Lutomirski wrote:
> > On Oct 4, 2018, at 1:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
> >> I was hoping to hear this from you :-) If I am to suggest how we can
> >> move forward I'd propose:
> >> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> >> is supported).
> >> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> >> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> >> we think that all the buggy hardware is already gone?
> >
> > No, and it is not the hardware you have to worry about (mostly), it is
> > the frigging PoS firmware people put on it.
> >
> > Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
> > after which it still can be, but bets are off). But even relatively
> > recent systems fail the TSC sync test because firmware messes it up by
> > writing to either MSR_TSC or MSR_TSC_ADJUST.
> >
> > But the thing is, if the TSC is not synced, you cannot use it for
> > timekeeping, full stop. So having a single page is fine, it either
> > contains a mult/shift that is valid, or it indicates TSC is messed up
> > and you fall back to something else.
> >
> > There is no inbetween there.
> >
> > For sched_clock we can still use the global page, because the rate will
> > still be the same for each cpu, it's just offset between CPUs and the
> > code compensates for that.
>
> But if we’re in a KVM guest, then the clock will jump around on the
> same *vCPU* when the vCPU migrates.
Urgh yes..
> But I don’t see how kvmclock helps here, since I don’t think it’s used
> for sched_clock.
I get hits on kvm_sched_clock, but haven't looked further.
Anyway, Most of the argument still holds, either TSC is synced or it is
not and it _really_ should not be used. Not much middle ground there.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Vitaly Kuznetsov @ 2018-10-04 17:28 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andrew Lutomirski, devel,
Paolo Bonzini, Thomas Gleixner, Matt Rickard
In-Reply-To: <CALCETrWbWLM5Jjm7iJCE7S=BJ9OFw2QQGJ2Ao-qsuaN50z=y0A@mail.gmail.com>
Andy Lutomirski <luto@kernel.org> writes:
> On Thu, Oct 4, 2018 at 9:43 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>> On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote:
>> > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > >
>> > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
>> > > > Hi Vitaly, Paolo, Radim, etc.,
>> > > >
>> > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > > > >
>> > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
>> > > > > implementation, which extended the clockid switch case and added yet
>> > > > > another slightly different copy of the same code.
>> > > > >
>> > > > > Especially the extended switch case is problematic as the compiler tends to
>> > > > > generate a jump table which then requires to use retpolines. If jump tables
>> > > > > are disabled it adds yet another conditional to the existing maze.
>> > > > >
>> > > > > This series takes a different approach by consolidating the almost
>> > > > > identical functions into one implementation for high resolution clocks and
>> > > > > one for the coarse grained clock ids by storing the base data for each
>> > > > > clock id in an array which is indexed by the clock id.
>> > > > >
>> > > >
>> > > > I was trying to understand more of the implications of this patch
>> > > > series, and I was again reminded that there is an entire extra copy of
>> > > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of
>> > > > that code is very, very opaque.
>> > > >
>> > > > Can one of you explain what the code is even doing? From a couple of
>> > > > attempts to read through it, it's a whole bunch of
>> > > > probably-extremely-buggy code that,
>> > >
>> > > Yes, probably.
>> > >
>> > > > drumroll please, tries to atomically read the TSC value and the time. And decide whether the
>> > > > result is "based on the TSC".
>> > >
>> > > I think "based on the TSC" refers to whether TSC clocksource is being
>> > > used.
>> > >
>> > > > And then synthesizes a TSC-to-ns
>> > > > multiplier and shift, based on *something other than the actual
>> > > > multiply and shift used*.
>> > > >
>> > > > IOW, unless I'm totally misunderstanding it, the code digs into the
>> > > > private arch clocksource data intended for the vDSO, uses a poorly
>> > > > maintained copy of the vDSO code to read the time (instead of doing
>> > > > the sane thing and using the kernel interfaces for this), and
>> > > > propagates a totally made up copy to the guest.
>> > >
>> > > I posted kernel interfaces for this, and it was suggested to
>> > > instead write a "in-kernel user of pvclock data".
>> > >
>> > > If you can get kernel interfaces to replace that, go for it. I prefer
>> > > kernel interfaces as well.
>> > >
>> > > > And gets it entirely
>> > > > wrong when doing nested virt, since, unless there's some secret in
>> > > > this maze, it doesn't acutlaly use the scaling factor from the host
>> > > > when it tells the guest what to do.
>> > > >
>> > > > I am really, seriously tempted to send a patch to simply delete all
>> > > > this code.
>> > >
>> > > If your patch which deletes the code gets the necessary features right,
>> > > sure, go for it.
>> > >
>> > > > The correct way to do it is to hook
>> > >
>> > > Can you expand on the correct way to do it?
>> > >
>> > > > And I don't see how it's even possible to pass kvmclock correctly to
>> > > > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but
>> > > > L1 isn't notified when the data structure changes, so how the heck is
>> > > > it supposed to update the kvmclock structure?
>> > >
>> > > I don't parse your question.
>> >
>> > Let me ask it more intelligently: when the "reenlightenment" IRQ
>> > happens, what tells KVM to do its own update for its guests?
>>
>> Update of what, and why it needs to update anything from IRQ?
>>
>> The update i can think of is from host kernel clocksource,
>> which there is a notifier for.
>>
>>
>
> Unless I've missed some serious magic, L2 guests see kvmclock, not hv.
> So we have the following sequence of events:
>
> - L0 migrates the whole VM. Starting now, RDTSC is emulated to match
> the old host, which applies in L1 and L2.
>
> - An IRQ is queued to L1.
>
> - L1 acknowledges that it noticed the TSC change.
Before the acknowledgement we actually pause all guests so they don't
notice the change ....
> RDTSC stops being emulated for L1 and L2.
.... and right after that we update all kvmclocks for all L2s and
unpause them so all their readings are still correct (see
kvm_hyperv_tsc_notifier()).
>
> - L2 reads the TSC. It has no idea that anything changed, and it
> gets the wrong answer.
I have to admit I forgot what happens if L2 uses raw TSC. I *think* that
we actually adjust TSC offset along with adjusting kvmclocks so the
reading is still correct. I'll have to check this.
All bets are off in case L2 was using TSC for time interval
measurements: frequency, of course, changes.
>
> - At some point, kvm clock updates.
>
> What prevents this? Vitaly, am I missing some subtlety of what
> actually happens?
--
Vitaly
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Andy Lutomirski @ 2018-10-04 17:08 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andrew Lutomirski, devel,
Paolo Bonzini, Thomas Gleixner, Matt Rickard
In-Reply-To: <20181004163705.GA25129@amt.cnet>
On Thu, Oct 4, 2018 at 9:43 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote:
> > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > > > Hi Vitaly, Paolo, Radim, etc.,
> > > >
> > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > >
> > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > > > implementation, which extended the clockid switch case and added yet
> > > > > another slightly different copy of the same code.
> > > > >
> > > > > Especially the extended switch case is problematic as the compiler tends to
> > > > > generate a jump table which then requires to use retpolines. If jump tables
> > > > > are disabled it adds yet another conditional to the existing maze.
> > > > >
> > > > > This series takes a different approach by consolidating the almost
> > > > > identical functions into one implementation for high resolution clocks and
> > > > > one for the coarse grained clock ids by storing the base data for each
> > > > > clock id in an array which is indexed by the clock id.
> > > > >
> > > >
> > > > I was trying to understand more of the implications of this patch
> > > > series, and I was again reminded that there is an entire extra copy of
> > > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of
> > > > that code is very, very opaque.
> > > >
> > > > Can one of you explain what the code is even doing? From a couple of
> > > > attempts to read through it, it's a whole bunch of
> > > > probably-extremely-buggy code that,
> > >
> > > Yes, probably.
> > >
> > > > drumroll please, tries to atomically read the TSC value and the time. And decide whether the
> > > > result is "based on the TSC".
> > >
> > > I think "based on the TSC" refers to whether TSC clocksource is being
> > > used.
> > >
> > > > And then synthesizes a TSC-to-ns
> > > > multiplier and shift, based on *something other than the actual
> > > > multiply and shift used*.
> > > >
> > > > IOW, unless I'm totally misunderstanding it, the code digs into the
> > > > private arch clocksource data intended for the vDSO, uses a poorly
> > > > maintained copy of the vDSO code to read the time (instead of doing
> > > > the sane thing and using the kernel interfaces for this), and
> > > > propagates a totally made up copy to the guest.
> > >
> > > I posted kernel interfaces for this, and it was suggested to
> > > instead write a "in-kernel user of pvclock data".
> > >
> > > If you can get kernel interfaces to replace that, go for it. I prefer
> > > kernel interfaces as well.
> > >
> > > > And gets it entirely
> > > > wrong when doing nested virt, since, unless there's some secret in
> > > > this maze, it doesn't acutlaly use the scaling factor from the host
> > > > when it tells the guest what to do.
> > > >
> > > > I am really, seriously tempted to send a patch to simply delete all
> > > > this code.
> > >
> > > If your patch which deletes the code gets the necessary features right,
> > > sure, go for it.
> > >
> > > > The correct way to do it is to hook
> > >
> > > Can you expand on the correct way to do it?
> > >
> > > > And I don't see how it's even possible to pass kvmclock correctly to
> > > > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but
> > > > L1 isn't notified when the data structure changes, so how the heck is
> > > > it supposed to update the kvmclock structure?
> > >
> > > I don't parse your question.
> >
> > Let me ask it more intelligently: when the "reenlightenment" IRQ
> > happens, what tells KVM to do its own update for its guests?
>
> Update of what, and why it needs to update anything from IRQ?
>
> The update i can think of is from host kernel clocksource,
> which there is a notifier for.
>
>
Unless I've missed some serious magic, L2 guests see kvmclock, not hv.
So we have the following sequence of events:
- L0 migrates the whole VM. Starting now, RDTSC is emulated to match
the old host, which applies in L1 and L2.
- An IRQ is queued to L1.
- L1 acknowledges that it noticed the TSC change. RDTSC stops being
emulated for L1 and L2.
- L2 reads the TSC. It has no idea that anything changed, and it
gets the wrong answer.
- At some point, kvm clock updates.
What prevents this? Vitaly, am I missing some subtlety of what
actually happens?
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Marcelo Tosatti @ 2018-10-04 16:37 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, devel, Paolo Bonzini, Thomas Gleixner,
Matt Rickard
In-Reply-To: <CALCETrXwEDGNry=9KKvWyc7Eot3eEDifkJ234igDLrea2mVfuA@mail.gmail.com>
On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > > Hi Vitaly, Paolo, Radim, etc.,
> > >
> > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >
> > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > > implementation, which extended the clockid switch case and added yet
> > > > another slightly different copy of the same code.
> > > >
> > > > Especially the extended switch case is problematic as the compiler tends to
> > > > generate a jump table which then requires to use retpolines. If jump tables
> > > > are disabled it adds yet another conditional to the existing maze.
> > > >
> > > > This series takes a different approach by consolidating the almost
> > > > identical functions into one implementation for high resolution clocks and
> > > > one for the coarse grained clock ids by storing the base data for each
> > > > clock id in an array which is indexed by the clock id.
> > > >
> > >
> > > I was trying to understand more of the implications of this patch
> > > series, and I was again reminded that there is an entire extra copy of
> > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of
> > > that code is very, very opaque.
> > >
> > > Can one of you explain what the code is even doing? From a couple of
> > > attempts to read through it, it's a whole bunch of
> > > probably-extremely-buggy code that,
> >
> > Yes, probably.
> >
> > > drumroll please, tries to atomically read the TSC value and the time. And decide whether the
> > > result is "based on the TSC".
> >
> > I think "based on the TSC" refers to whether TSC clocksource is being
> > used.
> >
> > > And then synthesizes a TSC-to-ns
> > > multiplier and shift, based on *something other than the actual
> > > multiply and shift used*.
> > >
> > > IOW, unless I'm totally misunderstanding it, the code digs into the
> > > private arch clocksource data intended for the vDSO, uses a poorly
> > > maintained copy of the vDSO code to read the time (instead of doing
> > > the sane thing and using the kernel interfaces for this), and
> > > propagates a totally made up copy to the guest.
> >
> > I posted kernel interfaces for this, and it was suggested to
> > instead write a "in-kernel user of pvclock data".
> >
> > If you can get kernel interfaces to replace that, go for it. I prefer
> > kernel interfaces as well.
> >
> > > And gets it entirely
> > > wrong when doing nested virt, since, unless there's some secret in
> > > this maze, it doesn't acutlaly use the scaling factor from the host
> > > when it tells the guest what to do.
> > >
> > > I am really, seriously tempted to send a patch to simply delete all
> > > this code.
> >
> > If your patch which deletes the code gets the necessary features right,
> > sure, go for it.
> >
> > > The correct way to do it is to hook
> >
> > Can you expand on the correct way to do it?
> >
> > > And I don't see how it's even possible to pass kvmclock correctly to
> > > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but
> > > L1 isn't notified when the data structure changes, so how the heck is
> > > it supposed to update the kvmclock structure?
> >
> > I don't parse your question.
>
> Let me ask it more intelligently: when the "reenlightenment" IRQ
> happens, what tells KVM to do its own update for its guests?
Update of what, and why it needs to update anything from IRQ?
The update i can think of is from host kernel clocksource,
which there is a notifier for.
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Andy Lutomirski @ 2018-10-04 14:04 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wanpeng Li, Florian Weimer, X86 ML, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andy Lutomirski, devel,
Thomas Gleixner, Matt Rickard
In-Reply-To: <1832e2af-3189-cb50-f4b6-45e74cdcf4b4@redhat.com>
> On Oct 4, 2018, at 5:00 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 04/10/2018 09:54, Vitaly Kuznetsov wrote:
>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>> is supported).
>
> Not if you want to migrate to pre-Skylake systems.
>
>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>> we think that all the buggy hardware is already gone?
>
> No. :( We still get reports whenever we break 2007-2008 hardware.
>
>
Does the KVM non-masterclock mode actually help? It’s not clear to me exactly how it’s supposed to work, but it seems like it’s trying to expose per-vCPU adjustments to the guest. Which is dubious at best, since the guest can’t validly use them for anything other than sched_clock, since they aren’t fully corrected by anything KVM can do.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Andy Lutomirski @ 2018-10-04 14:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Wanpeng Li, Florian Weimer, X86 ML, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Stephen Boyd, LKML, Linux Virtualization,
John Stultz, Andy Lutomirski, devel, Paolo Bonzini,
Thomas Gleixner, Matt Rickard
In-Reply-To: <20181004081100.GI19272@hirez.programming.kicks-ass.net>
> On Oct 4, 2018, at 1:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
>> I was hoping to hear this from you :-) If I am to suggest how we can
>> move forward I'd propose:
>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>> is supported).
>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>> we think that all the buggy hardware is already gone?
>
> No, and it is not the hardware you have to worry about (mostly), it is
> the frigging PoS firmware people put on it.
>
> Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
> after which it still can be, but bets are off). But even relatively
> recent systems fail the TSC sync test because firmware messes it up by
> writing to either MSR_TSC or MSR_TSC_ADJUST.
>
> But the thing is, if the TSC is not synced, you cannot use it for
> timekeeping, full stop. So having a single page is fine, it either
> contains a mult/shift that is valid, or it indicates TSC is messed up
> and you fall back to something else.
>
> There is no inbetween there.
>
> For sched_clock we can still use the global page, because the rate will
> still be the same for each cpu, it's just offset between CPUs and the
> code compensates for that.
But if we’re in a KVM guest, then the clock will jump around on the same *vCPU* when the vCPU migrates.
But I don’t see how kvmclock helps here, since I don’t think it’s used for sched_clock.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Paolo Bonzini @ 2018-10-04 12:00 UTC (permalink / raw)
To: Vitaly Kuznetsov, Marcelo Tosatti
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andy Lutomirski, devel,
Thomas Gleixner, Matt Rickard
In-Reply-To: <87k1mycfju.fsf@vitty.brq.redhat.com>
On 04/10/2018 09:54, Vitaly Kuznetsov wrote:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).
Not if you want to migrate to pre-Skylake systems.
> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?
No. :( We still get reports whenever we break 2007-2008 hardware.
Paolo
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Peter Zijlstra @ 2018-10-04 8:11 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Wanpeng Li, Florian Weimer, X86 ML, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Stephen Boyd, LKML, Linux Virtualization,
John Stultz, Andy Lutomirski, devel, Paolo Bonzini,
Thomas Gleixner, Matt Rickard
In-Reply-To: <87k1mycfju.fsf@vitty.brq.redhat.com>
On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
> I was hoping to hear this from you :-) If I am to suggest how we can
> move forward I'd propose:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).
> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?
No, and it is not the hardware you have to worry about (mostly), it is
the frigging PoS firmware people put on it.
Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
after which it still can be, but bets are off). But even relatively
recent systems fail the TSC sync test because firmware messes it up by
writing to either MSR_TSC or MSR_TSC_ADJUST.
But the thing is, if the TSC is not synced, you cannot use it for
timekeeping, full stop. So having a single page is fine, it either
contains a mult/shift that is valid, or it indicates TSC is messed up
and you fall back to something else.
There is no inbetween there.
For sched_clock we can still use the global page, because the rate will
still be the same for each cpu, it's just offset between CPUs and the
code compensates for that.
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Vitaly Kuznetsov @ 2018-10-04 7:54 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andy Lutomirski, devel, Paolo Bonzini,
Thomas Gleixner, Matt Rickard
In-Reply-To: <20181003190617.GC21381@amt.cnet>
Marcelo Tosatti <mtosatti@redhat.com> writes:
> On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote:
>>
>> There is a very long history of different (hardware) issues Marcelo was
>> fighting with and the current code is the survived Frankenstein.
>
> Right, the code has to handle different TSC modes.
>
>> E.g. it
>> is very, very unclear what "catchup", "always catchup" and
>> masterclock-less mode in general are and if we still need them.
>
> Catchup means handling exposed (to guest) TSC frequency smaller than
> HW TSC frequency.
>
> That is "frankenstein" code, could be removed.
>
>> That said I'm all for simplification. I'm not sure if we still need to
>> care about buggy hardware though.
>
> What simplification is that again?
>
I was hoping to hear this from you :-) If I am to suggest how we can
move forward I'd propose:
- Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
is supported).
- Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
clocksource is a single page for the whole VM, not a per-cpu thing. Can
we think that all the buggy hardware is already gone?
--
Vitaly
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Andy Lutomirski @ 2018-10-03 22:32 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andrew Lutomirski, devel,
Paolo Bonzini, Thomas Gleixner, Matt Rickard
In-Reply-To: <20181003190026.GB21381@amt.cnet>
On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > Hi Vitaly, Paolo, Radim, etc.,
> >
> > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > implementation, which extended the clockid switch case and added yet
> > > another slightly different copy of the same code.
> > >
> > > Especially the extended switch case is problematic as the compiler tends to
> > > generate a jump table which then requires to use retpolines. If jump tables
> > > are disabled it adds yet another conditional to the existing maze.
> > >
> > > This series takes a different approach by consolidating the almost
> > > identical functions into one implementation for high resolution clocks and
> > > one for the coarse grained clock ids by storing the base data for each
> > > clock id in an array which is indexed by the clock id.
> > >
> >
> > I was trying to understand more of the implications of this patch
> > series, and I was again reminded that there is an entire extra copy of
> > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of
> > that code is very, very opaque.
> >
> > Can one of you explain what the code is even doing? From a couple of
> > attempts to read through it, it's a whole bunch of
> > probably-extremely-buggy code that,
>
> Yes, probably.
>
> > drumroll please, tries to atomically read the TSC value and the time. And decide whether the
> > result is "based on the TSC".
>
> I think "based on the TSC" refers to whether TSC clocksource is being
> used.
>
> > And then synthesizes a TSC-to-ns
> > multiplier and shift, based on *something other than the actual
> > multiply and shift used*.
> >
> > IOW, unless I'm totally misunderstanding it, the code digs into the
> > private arch clocksource data intended for the vDSO, uses a poorly
> > maintained copy of the vDSO code to read the time (instead of doing
> > the sane thing and using the kernel interfaces for this), and
> > propagates a totally made up copy to the guest.
>
> I posted kernel interfaces for this, and it was suggested to
> instead write a "in-kernel user of pvclock data".
>
> If you can get kernel interfaces to replace that, go for it. I prefer
> kernel interfaces as well.
>
> > And gets it entirely
> > wrong when doing nested virt, since, unless there's some secret in
> > this maze, it doesn't acutlaly use the scaling factor from the host
> > when it tells the guest what to do.
> >
> > I am really, seriously tempted to send a patch to simply delete all
> > this code.
>
> If your patch which deletes the code gets the necessary features right,
> sure, go for it.
>
> > The correct way to do it is to hook
>
> Can you expand on the correct way to do it?
>
> > And I don't see how it's even possible to pass kvmclock correctly to
> > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but
> > L1 isn't notified when the data structure changes, so how the heck is
> > it supposed to update the kvmclock structure?
>
> I don't parse your question.
Let me ask it more intelligently: when the "reenlightenment" IRQ
happens, what tells KVM to do its own update for its guests?
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Marcelo Tosatti @ 2018-10-03 19:06 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, Andy Lutomirski, devel, Paolo Bonzini,
Thomas Gleixner, Matt Rickard
In-Reply-To: <87sh1ne64t.fsf@vitty.brq.redhat.com>
On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote:
> Andy Lutomirski <luto@kernel.org> writes:
>
> > Hi Vitaly, Paolo, Radim, etc.,
> >
> > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> >> implementation, which extended the clockid switch case and added yet
> >> another slightly different copy of the same code.
> >>
> >> Especially the extended switch case is problematic as the compiler tends to
> >> generate a jump table which then requires to use retpolines. If jump tables
> >> are disabled it adds yet another conditional to the existing maze.
> >>
> >> This series takes a different approach by consolidating the almost
> >> identical functions into one implementation for high resolution clocks and
> >> one for the coarse grained clock ids by storing the base data for each
> >> clock id in an array which is indexed by the clock id.
> >>
> >
> > I was trying to understand more of the implications of this patch
> > series, and I was again reminded that there is an entire extra copy of
> > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of
> > that code is very, very opaque.
> >
> > Can one of you explain what the code is even doing? From a couple of
> > attempts to read through it, it's a whole bunch of
> > probably-extremely-buggy code that, drumroll please, tries to
> > atomically read the TSC value and the time. And decide whether the
> > result is "based on the TSC". And then synthesizes a TSC-to-ns
> > multiplier and shift, based on *something other than the actual
> > multiply and shift used*.
> >
> > IOW, unless I'm totally misunderstanding it, the code digs into the
> > private arch clocksource data intended for the vDSO, uses a poorly
> > maintained copy of the vDSO code to read the time (instead of doing
> > the sane thing and using the kernel interfaces for this), and
> > propagates a totally made up copy to the guest. And gets it entirely
> > wrong when doing nested virt, since, unless there's some secret in
> > this maze, it doesn't acutlaly use the scaling factor from the host
> > when it tells the guest what to do.
> >
> > I am really, seriously tempted to send a patch to simply delete all
> > this code. The correct way to do it is to hook
>
> "I have discovered a truly marvelous proof of this, which this margin is
> too narrow to contain" :-)
>
> There is a very long history of different (hardware) issues Marcelo was
> fighting with and the current code is the survived Frankenstein.
Right, the code has to handle different TSC modes.
> E.g. it
> is very, very unclear what "catchup", "always catchup" and
> masterclock-less mode in general are and if we still need them.
Catchup means handling exposed (to guest) TSC frequency smaller than
HW TSC frequency.
That is "frankenstein" code, could be removed.
> That said I'm all for simplification. I'm not sure if we still need to
> care about buggy hardware though.
What simplification is that again?
> >
> > And I don't see how it's even possible to pass kvmclock correctly to
> > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but
> > L1 isn't notified when the data structure changes, so how the heck is
> > it supposed to update the kvmclock structure?
>
> Well, this kind of works in the the followin way:
> L1's clocksource is 'tsc_page' which is, basically, a compliment to TSC:
> two numbers provided by L0: offset and scale and KVM was tought to treat
> this clocksource as a good one (see b0c39dc68e3b "x86/kvm: Pass stable
> clocksource to guests when running nested on Hyper-V").
>
> The notification you're talking about exists, it is called
> Reenligntenment, see 0092e4346f49 "x86/kvm: Support Hyper-V
> reenlightenment"). When TSC page changes (and this only happens when L1
> is migrated to a different host with a different TSC frequency and TSC
> scaling is not supported by the CPU) we receive an interrupt in L1 (at
> this moment all TSC accesses are emulated which guarantees the
> correctness of the readings), pause all L2 guests, update their kvmclock
> structures with new data (we already know the new TSC frequency) and
> then tell L0 that we're done and it can stop emulating TSC accesses.
>
> (Nothing like this exists for KVM-on-KVM, by the way, when L1's
> clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.)
>
> --
> Vitaly
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support\
From: Marcelo Tosatti @ 2018-10-03 19:05 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Wanpeng Li, Florian Weimer, Juergen Gross, Arnd Bergmann,
Radim Krcmar, Peter Zijlstra, X86 ML, LKML, Linux Virtualization,
Stephen Boyd, John Stultz, devel, Paolo Bonzini, Thomas Gleixner,
Matt Rickard
In-Reply-To: <20181003190026.GB21381@amt.cnet>
On Wed, Oct 03, 2018 at 04:00:29PM -0300, Marcelo Tosatti wrote:
> On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > Hi Vitaly, Paolo, Radim, etc.,
> >
> > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > implementation, which extended the clockid switch case and added yet
> > > another slightly different copy of the same code.
> > >
> > > Especially the extended switch case is problematic as the compiler tends to
> > > generate a jump table which then requires to use retpolines. If jump tables
> > > are disabled it adds yet another conditional to the existing maze.
> > >
> > > This series takes a different approach by consolidating the almost
> > > identical functions into one implementation for high resolution clocks and
> > > one for the coarse grained clock ids by storing the base data for each
> > > clock id in an array which is indexed by the clock id.
> > >
> >
> > I was trying to understand more of the implications of this patch
> > series, and I was again reminded that there is an entire extra copy of
> > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of
> > that code is very, very opaque.
> >
> > Can one of you explain what the code is even doing? From a couple of
> > attempts to read through it, it's a whole bunch of
> > probably-extremely-buggy code that,
>
> Yes, probably.
>
> > drumroll please, tries to atomically read the TSC value and the time. And decide whether the
> > result is "based on the TSC".
>
> I think "based on the TSC" refers to whether TSC clocksource is being
> used.
>
> > And then synthesizes a TSC-to-ns
> > multiplier and shift, based on *something other than the actual
> > multiply and shift used*.
> >
> > IOW, unless I'm totally misunderstanding it, the code digs into the
> > private arch clocksource data intended for the vDSO, uses a poorly
> > maintained copy of the vDSO code to read the time (instead of doing
> > the sane thing and using the kernel interfaces for this), and
> > propagates a totally made up copy to the guest.
>
> I posted kernel interfaces for this, and it was suggested to
> instead write a "in-kernel user of pvclock data".
>
> If you can get kernel interfaces to replace that, go for it. I prefer
> kernel interfaces as well.
And cleanup patches, to make that code look nicer, are also very
welcome!
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox