From: Joao Martins <joao.m.martins@oracle.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
xen-devel@lists.xenproject.org
Subject: Re: [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
Date: Fri, 26 Aug 2016 16:44:07 +0100 [thread overview]
Message-ID: <57C063C7.7070505@oracle.com> (raw)
In-Reply-To: <57BEE6910200007800108E9C@prv-mh.provo.novell.com>
On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>> This patch proposes relying on host TSC synchronization and
>> passthrough to the guest, when running on a TSC-safe platform. On
>> time_calibration we retrieve the platform time in ns and the counter
>> read by the clocksource that was used to compute system time. We
>> introduce a new rendezous function which doesn't require
>> synchronization between master and slave CPUS and just reads
>> calibration_rendezvous struct and writes it down the stime and stamp
>> to the cpu_calibration struct to be used later on. We can guarantee that
>> on a platform with a constant and reliable TSC, that the time read on
>> vcpu B right after A is bigger independently of the VCPU calibration
>> error. Since pvclock time infos are monotonic as seen by any vCPU set
>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux.
>> IIUC, this is similar to how it's implemented on KVM.
>
> Without any tools side change, how is it guaranteed that a guest
> which observed the stable bit won't get migrated to a host not
> providing that guarantee?
Do you want to prevent migration in such cases? The worst that can happen is that the
guest might need to fallback to a system call if this bit is 0 and would keep doing
so if the bit is 0.
>> Changes since v2:
>> - Add XEN_ prefix to pvclock flags.
>> - Adapter time_calibration_rendezvous_tail to have the case of setting master
>> tsc/stime and use it for the nop_rendezvous.
>> - Removed hotplug CPU option that was added in v1
>> - Prevent online of CPUs when clocksource is tsc.
>
> Some of the above listed changes don't seem to belong here, but
> rather in one of the earlier patches.
OK - as mentioned in patch two this was intended. Will merge these changes into patch 2.
>
>> --- a/xen/arch/x86/platform_hypercall.c
>> +++ b/xen/arch/x86/platform_hypercall.c
>> @@ -631,7 +631,8 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
>> if ( ret )
>> break;
>>
>> - if ( cpu >= nr_cpu_ids || !cpu_present(cpu) )
>> + if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ||
>> + host_tsc_is_clocksource() )
>
> I have to admit that I consider the "host" here confusing: What other
> TSCs could one think of on x86? Maybe clocksource_is_tsc()?
Hmm, didn't thought of any other TSC, I was just merely follow the existent naming
style in the header ("host_tsc_is_safe()"). I am also fine with clocksource_is_tsc().
>
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>
>> static s64 __init init_tsctimer(struct platform_timesource *pts)
>> {
>> + if ( nr_cpu_ids != num_present_cpus() )
>> + {
>> + printk(XENLOG_INFO "TSC: CPU Hotplug intended,"
>> + "not using TSC as clocksource\n");
>
> Please don't split log messages across lines. Keep the XENLOG_INFO
> on one line, and then the whole literal on the next.
>
Fixed.
> Also you/we will need to take some measures to avoid this triggering
> on systems where extra MADT entries exist just as dummies (perhaps
> to ease firmware implementation, as was the case prior to commit
> 0875433389 ["hvmloader: limit CPUs exposed to guests"] with our
> own ACPI tables we present to HVM guests).
OK. I think my laptop might be one of those but I need to
check (at least do need to adjust maxcpus= to use clocksource=tsc, but that's the
only case. Other boxes I don't need to)
>
>> @@ -1328,12 +1337,22 @@ struct calibration_rendezvous {
>> };
>>
>> static void
>> -time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
>> +time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
>> + bool_t master_tsc)
>
> Please use plain"bool" in new code.
OK.
> And then I'm not convinced this
> refactoring is better than simply having the new rendezvous function
> not call time_calibration_rendezvous_tail().
I will move it to the nop_rendezvous.
>
>> {
>> struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>
>> - c->local_tsc = rdtsc_ordered();
>> - c->local_stime = get_s_time_fixed(c->local_tsc);
>> + if ( master_tsc )
>> + {
>> + c->local_tsc = r->master_tsc_stamp;
>
> Doesn't this require the TSCs to run in perfect sync (not even off
> wrt each other by a single cycle)? Is such even possible on multi
> socket systems? I.e. how would multiple chips get into such a
> mode in the first place, considering their TSCs can't possibly start
> ticking at exactly the same (perhaps even sub-)nanosecond?
They do require to be in sync with multi-sockets, otherwise this wouldn't work.
Invariant TSC only refers to cores in a package, but multi-socket is up to board
vendors (no manuals mention this guarantee across sockets). That one of the reasons
TSC is such a burden :(
Looking at datasheets (on the oldest processor I was testing this) it mentions this note:
"In order In order to ensure Timestamp Counter (TSC) synchronization across sockets
in multi-socket systems, the RESET# deassertion edge should arrive at the same BCLK
rising edge at both sockets and should meet the Tsu and Th requirement of 600ps
relative to BCLK, as outlined in Table 2-26.".
[0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf
The BCLK looks to be the global reference clock shared across sockets IIUC used in
the PLLs in the individual packages (to generate the signal where the TSC is
extrapolated from). ( Please read it with a grain of salt, as I may be doing wrong
readings of these datasheets ). But If it was a box with TSC skewed among sockets,
wouldn't we see that at boot time in the tsc warp test? Or maybe TSC sync check isn't
potentially fast enough to catch any oddities? Our docs
(https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to mention
something along these lines on multi-socket systems. And Linux tsc code seems to
assume that Intel boxes have synchronized TSCs across sockets [1] and that the
exceptions cases should mark tsc=skewed (we also have such parameter).
[1] arch/x86/kernel/tsc.c#L1094
As reassurance I've been running tests for days long (currently in almost a week on
2-socket system) and I'll keep running to see if it catches any issues or time going
backwards. Could also run in the biggest boxes we have with 8 sockets. But still it
would represent only a tiny fraction of what x86 has available these days.
Other than the things above I am not sure how to go about this :( Should we start
adjusting the TSCs if we find disparities or skew is observed on the long run? Or
allow only TSCs on vCPUS of the same package to expose this flag? Hmm, what's your
take on this? Appreciate your feedback.
>> +static void time_calibration_nop_rendezvous(void *rv)
>> +{
>> + const struct calibration_rendezvous *r = rv;
>> +
>> + time_calibration_rendezvous_tail(r, true);
>> }
>
> I don't see what you need the local variable for, unless you
> stopped calling time_calibration_rendezvous_tail() as suggested
> above.
I like the one above as you suggested.
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
next prev parent reply other threads:[~2016-08-26 15:42 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
2016-08-24 12:43 ` [PATCH v3 1/6] x86/time: refactor init_platform_time() Joao Martins
2016-08-25 10:03 ` Jan Beulich
2016-08-26 14:54 ` Joao Martins
2016-08-24 12:43 ` [PATCH v3 2/6] x86/time: implement tsc as clocksource Joao Martins
2016-08-25 10:06 ` Jan Beulich
2016-08-26 15:11 ` Joao Martins
2016-08-29 9:36 ` Jan Beulich
2016-08-30 12:08 ` Joao Martins
2016-08-30 12:30 ` Jan Beulich
2016-08-30 13:59 ` Joao Martins
2016-08-24 12:43 ` [PATCH v3 3/6] x86/time: streamline platform time init on plt_update() Joao Martins
2016-08-25 10:13 ` Jan Beulich
2016-08-26 15:12 ` Joao Martins
2016-08-29 9:41 ` Jan Beulich
2016-08-30 12:10 ` Joao Martins
2016-08-30 12:31 ` Jan Beulich
2016-09-09 16:32 ` Joao Martins
2016-09-12 7:26 ` Jan Beulich
2016-09-12 10:35 ` Joao Martins
2016-08-24 12:43 ` [PATCH v3 4/6] x86/time: refactor read_platform_stime() Joao Martins
2016-08-25 10:17 ` Jan Beulich
2016-08-26 15:13 ` Joao Martins
2016-08-29 9:42 ` Jan Beulich
2016-08-24 12:43 ` [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT Joao Martins
2016-08-25 10:37 ` Jan Beulich
2016-08-26 15:44 ` Joao Martins [this message]
2016-08-29 10:06 ` Jan Beulich
2016-08-30 12:26 ` Joao Martins
2016-08-30 12:45 ` Jan Beulich
2016-08-30 14:14 ` Joao Martins
2016-08-24 12:43 ` [PATCH v3 6/6] docs: update clocksource option Joao Martins
2016-08-25 10:38 ` Jan Beulich
2016-08-26 15:13 ` Joao Martins
2016-08-24 12:50 ` [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=57C063C7.7070505@oracle.com \
--to=joao.m.martins@oracle.com \
--cc=JBeulich@suse.com \
--cc=andrew.cooper3@citrix.com \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.