linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Maxim Levitsky <mlevitsk@redhat.com>,
	kvm list <kvm@vger.kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Jim Mattson <jmattson@google.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	"open list\:KERNEL SELFTEST FRAMEWORK" 
	<linux-kselftest@vger.kernel.org>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	open list <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	"maintainer\:X86 ARCHITECTURE \(32-BIT AND 64-BIT\)"
	<x86@kernel.org>, Joerg Roedel <joro@8bytes.org>,
	Borislav Petkov <bp@alien8.de>, Shuah Khan <shuah@kernel.org>,
	Andrew Jones <drjones@redhat.com>,
	Oliver Upton <oupton@google.com>,
	"open list\:DOCUMENTATION" <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE
Date: Wed, 09 Dec 2020 11:14:48 +0100	[thread overview]
Message-ID: <87lfe71e1z.fsf@nanos.tec.linutronix.de> (raw)
In-Reply-To: <CALCETrXeXCvbxAuRuLwWoF3-zvjhzzjj46VZ3RfgUEhb0SeK6A@mail.gmail.com>

Andy,

On Tue, Dec 08 2020 at 20:08, Andy Lutomirski wrote:
> On Tue, Dec 8, 2020 at 4:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, Dec 08 2020 at 12:32, Andy Lutomirski wrote:
>> all the way through the end and then come up with a real proposal which
>> solves all of the issues mentioned there.
>
> You're misunderstanding me, which is entirely reasonable, since my
> description was crap.  In particular, what I meant by smearing is not
> at all what's done today.  Let me try again.  The thing below is my
> proposal, not necessarily a description of exactly what happens now.

Fair enough. /me rewinds grump and starts over.

> All of these are only valid if t_start <= read_time() <= t_end and,
> and they all assume that read_time() hasn't wrapped and gotten into
> that interval again.  There is nothing at all we can do in software if
> we wrap like this.  t_end isn't necessarily something we compute
> explicitly --- it might just be the case that, if read_time() > t_end,
> our arithmetic overflows and we return garbage.  But t_end might be a
> real thing on architectures where vdso_cycles_ok() actually does
> something (sigh, x86).
>
> If t > t_end, then we fall back to a syscall if we're in user mode and
> we fall back to hypercall or we just spin if we're in the kernel.  But
> see below.

Yes, we could do that.

> CLOCK_SMEARED_REALTIME:
> return mult[smeared_realtime] * (read_time() - t_start) +
> offset[smeared_realtime]
> This is a leap-second-smeared variant of CLOCK_SANE_REALTIME.
>
> CLOCK_REALTIME: maps to CLOCK_SANE_REALTIME or CLOCK_SMEARED_REALTIME
> depending on user preference.  Doing this without an extra branch
> somewhere might take a bit of thought.

Plus adding support for clock_*(CLOCK_DISTORTED_TIME) and make that work
correctly. Plus dealing with all the other interesting problems vs. file
time stamps and whatever. Time is all over the place and not just in
clock_gettime(CLOCK*).

But what's more problematic is the basic requirement that time all over
the place has to be consistent.

On machines which use DISTORTED_REALTIME everything _IS_ consistent
within the distorted universe they created. It's still inconsistent
vs. the outside, but that's unsolvable and none of our problems.

TLDR: Do not even think about opening pandoras box.

> As far as I can tell, if the kernel were to do something utterly
> asinine like adding some arbitrary value to TSC_ADJUST on all CPUs,
> the kernel could do so correctly by taking the seqlock, making the
> change, updating everything, and releasing the seqlock.

Plus a few other things, but yes that's similar to the scheme I
outlined. Using stomp_machine() ensures that _all_ possible ways to
wreckage things are covered.

> This would be nuts, but it's more or less the same thing that happens
> when a VM migrates.  So I think a VM could migrate a guest without any
> particular magic, except that there's a potential race if the old and
> new systems happen to have close enough seqlock values that the guest
> might start reading on the old host, finish on the new host, see the
> same seqlock value, and end up with utter garbage.  One way to
> mitigate this would be, in paravirt mode, to have an extra per-guest
> page that contains a count of how many times the guest has migrated.
>
> Timens would work a lot like it does today, but the mechanism that
> tells the vdso code to use timens might need tweaking.
>
> I could easily be missing something that prevents this from working,
> but I'm not seeing any fundamental problems.

It can be made work.

> If we want to get fancy, we can make a change that I've contemplated
> for a while -- we could make t_end explicit and have two copies of all
> these data structures.  The reader would use one copy if t < t_change
> and a different copy if t >= t_change.

See below.

> This would allow NTP-like code in usermode to schedule a frequency
> shift to start at a specific time.

That's an orthogonal problem and can be done without changing the
reader side.

> With some care, it would also allow the timekeeping code to update the
> data structures without causing clock_gettime() to block while the
> timekeeping code is running on a different CPU.

It still has to block, i.e. retry, because the data set becomes invalid
when t_end is reached. So the whole thing would be:

       do {
       		seq = read_seqcount_latch();
                data = select_data(seq);
                delta = read_clocksource() - data->base;
                if (delta >= data->max_delta)
                	continue;
                ....
      } while (read_seqcount_latch_retry());

TBH, I like the idea for exactly one reason: It increases robustness.

For X86 we already have the comparison for dealing with TSC < base
which would be covered by

                if (delta >= data->max_delta)
                	continue;

automatically. Non X86 gains this extra conditional, but I think it's
worth to pay that price.

It won't solve the VM migration problem on it's own though. You still
have to be careful about the inner workings of everything related to
timekeeping itself.

> One other thing that might be worth noting: there's another thread
> about "vmgenid".  It's plausible that it's worth considering stopping
> the guest or perhaps interrupting all vCPUs to allow it to take some
> careful actions on migration for reasons that have nothing to do with
> timekeeping.

How surprising. Who could have thought about that?

OMG, virtualization seems to have gone off into a virtual reality long
ago.

Thanks,

        tglx


  reply	other threads:[~2020-12-09 10:15 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-03 17:11 [PATCH v2 0/3] RFC: Precise TSC migration Maxim Levitsky
2020-12-03 17:11 ` [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE Maxim Levitsky
2020-12-06 16:19   ` Thomas Gleixner
2020-12-07 12:16     ` Maxim Levitsky
2020-12-07 13:16       ` Vitaly Kuznetsov
2020-12-07 17:41         ` Thomas Gleixner
2020-12-08  9:48           ` Peter Zijlstra
2020-12-10 11:42           ` Paolo Bonzini
2020-12-10 12:14             ` Peter Zijlstra
2020-12-10 12:22               ` Paolo Bonzini
2020-12-10 13:01                 ` Peter Zijlstra
2020-12-10 20:20                   ` Thomas Gleixner
2020-12-07 16:38       ` Thomas Gleixner
2020-12-07 16:53         ` Andy Lutomirski
2020-12-07 17:00           ` Maxim Levitsky
2020-12-07 18:04             ` Andy Lutomirski
2020-12-07 23:11               ` Marcelo Tosatti
2020-12-08 17:43                 ` Andy Lutomirski
2020-12-08 19:24                   ` Thomas Gleixner
2020-12-08 20:32                     ` Andy Lutomirski
2020-12-09  0:19                       ` Thomas Gleixner
2020-12-09  4:08                         ` Andy Lutomirski
2020-12-09 10:14                           ` Thomas Gleixner [this message]
2020-12-10 23:42                             ` Andy Lutomirski
2020-12-08 11:24               ` Maxim Levitsky
2020-12-08  9:35         ` Peter Zijlstra
2020-12-07 23:34     ` Marcelo Tosatti
2020-12-07 17:29   ` Oliver Upton
2020-12-08 11:13     ` Maxim Levitsky
2020-12-08 15:57       ` Oliver Upton
2020-12-08 15:58         ` Oliver Upton
2020-12-08 17:10           ` Maxim Levitsky
2020-12-08 16:40       ` Thomas Gleixner
2020-12-08 17:08         ` Maxim Levitsky
2020-12-10 11:48           ` Paolo Bonzini
2020-12-10 14:25             ` Maxim Levitsky
2020-12-07 23:29   ` Marcelo Tosatti
2020-12-08 14:50     ` Maxim Levitsky
2020-12-08 16:02       ` Thomas Gleixner
2020-12-08 16:25         ` Maxim Levitsky
2020-12-08 17:33           ` Andy Lutomirski
2020-12-08 21:25             ` Thomas Gleixner
2020-12-08 18:12           ` Marcelo Tosatti
2020-12-08 21:35             ` Thomas Gleixner
2020-12-08 21:20           ` Thomas Gleixner
2020-12-10 11:48             ` Paolo Bonzini
2020-12-10 14:52               ` Maxim Levitsky
2020-12-10 15:16                 ` Andy Lutomirski
2020-12-10 17:59                   ` Oliver Upton
2020-12-10 18:05                     ` Paolo Bonzini
2020-12-10 18:13                       ` Oliver Upton
2020-12-10 21:25                   ` Thomas Gleixner
2020-12-10 22:01                     ` Andy Lutomirski
2020-12-10 22:28                       ` Thomas Gleixner
2020-12-10 23:19                         ` Andy Lutomirski
2020-12-11  0:03                           ` Thomas Gleixner
2020-12-08 18:11         ` Marcelo Tosatti
2020-12-08 21:33           ` Thomas Gleixner
2020-12-09 16:34             ` Marcelo Tosatti
2020-12-09 20:58               ` Thomas Gleixner
2020-12-10 15:26                 ` Marcelo Tosatti
2020-12-10 21:48                   ` Thomas Gleixner
2020-12-11  0:27                     ` Marcelo Tosatti
2020-12-11 13:30                       ` Thomas Gleixner
2020-12-11 14:18                         ` Marcelo Tosatti
2020-12-11 21:04                           ` Thomas Gleixner
2020-12-11 21:59                             ` Paolo Bonzini
2020-12-12 13:03                               ` Thomas Gleixner
2020-12-15 10:59                               ` Marcelo Tosatti
2020-12-15 16:55                                 ` Andy Lutomirski
2020-12-15 22:34                                 ` Thomas Gleixner
2020-12-11 13:37                       ` Paolo Bonzini
2020-12-08 17:35       ` Marcelo Tosatti
2020-12-03 17:11 ` [PATCH v2 2/3] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS Maxim Levitsky
2020-12-03 17:11 ` [PATCH v2 3/3] kvm/selftests: update tsc_msrs_test to cover KVM_X86_QUIRK_TSC_HOST_ACCESS Maxim Levitsky
2020-12-07 23:16 ` [PATCH v2 0/3] RFC: Precise TSC migration Marcelo Tosatti
2020-12-10 11:48 ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87lfe71e1z.fsf@nanos.tec.linutronix.de \
    --to=tglx@linutronix.de \
    --cc=bp@alien8.de \
    --cc=corbet@lwn.net \
    --cc=drjones@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=mlevitsk@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=oupton@google.com \
    --cc=pbonzini@redhat.com \
    --cc=sean.j.christopherson@intel.com \
    --cc=shuah@kernel.org \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).