From: Andy Lutomirski <luto@kernel.org>
To: Fenghua Yu <fenghua.yu@intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, H Peter Anvin <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>, Alan Cox <alan@linux.intel.com>,
Ravi V Shankar <ravi.v.shankar@intel.com>,
linux-kernel <linux-kernel@vger.kernel.org>, x86 <x86@kernel.org>
Subject: Re: [PATCH 6/7] x86/vdso: Add vDSO functions for user wait instructions
Date: Mon, 23 Jul 2018 19:11:48 -0700 [thread overview]
Message-ID: <2267fbe6-37e8-7063-d48f-1879f31d3258@kernel.org> (raw)
In-Reply-To: <1532350557-98388-7-git-send-email-fenghua.yu@intel.com>
On 07/23/2018 05:55 AM, Fenghua Yu wrote:
> User wants to query if user wait instructions (umonitor, umwait, and
> tpause) are supported and use the instructions. The vDSO functions
> provides fast interface for user to check the support and use the
> instructions.
>
> waitpkg_supported and its alias __vdso_waitpkg_supported check if
> user wait instructions (a.k.a. wait package feature) are supported
>
> umonitor and its alias __vdso_umonitor provide user APIs for calling
> umonitor instruction.
>
> umwait and its alias __vdso_umwait provide user APIs for calling
> umwait instruction.
>
> tpause and its alias __vdso_tpause provide user APIs for calling
> tpause instruction.
>
> nsec_to_tsc and its alias __vdso_nsec_to_tsc converts nanoseconds
> to TSC counter if TSC frequency is known. It will fail if TSC frequency
> is unknown.
>
> The instructions can be implemented in intrinsic functions in future
> GCC. But the vDSO interfaces are available to user without the
> intrinsic functions support in GCC and the API waitpkg_supported and
> nsec_to_tsc cannot be implemented as GCC functions.
>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> ---
> arch/x86/entry/vdso/Makefile | 2 +-
> arch/x86/entry/vdso/vdso.lds.S | 10 ++
> arch/x86/entry/vdso/vma.c | 9 ++
> arch/x86/entry/vdso/vuserwait.c | 233 +++++++++++++++++++++++++++++++++
> arch/x86/include/asm/vdso_funcs_data.h | 3 +
> 5 files changed, 256 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/entry/vdso/vuserwait.c
>
> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> index af4fcae5de83..fb0062b09b3c 100644
> --- a/arch/x86/entry/vdso/Makefile
> +++ b/arch/x86/entry/vdso/Makefile
> @@ -17,7 +17,7 @@ VDSO32-$(CONFIG_X86_32) := y
> VDSO32-$(CONFIG_IA32_EMULATION) := y
>
> # files to link into the vdso
> -vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o
> +vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o vuserwait.o
>
> # files to link into kernel
> obj-y += vma.o
> diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
> index 097cdcda43a5..0942710608bf 100644
> --- a/arch/x86/entry/vdso/vdso.lds.S
> +++ b/arch/x86/entry/vdso/vdso.lds.S
> @@ -35,6 +35,16 @@ VERSION {
> __vdso_movdir64b_supported;
> movdir64b;
> __vdso_movdir64b;
> + waitpkg_supported;
> + __vdso_waitpkg_supported;
> + umonitor;
> + __vdso_umonitor;
> + umwait;
> + __vdso_umwait;
> + tpause;
> + __vdso_tpause;
> + nsec_to_tsc;
> + __vdso_nsec_to_tsc;
> local: *;
> };
> }
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index edbe5e63e5c2..006dfb5e5003 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -372,10 +372,19 @@ static int vgetcpu_online(unsigned int cpu)
>
> static void __init init_vdso_funcs_data(void)
> {
> + struct system_counterval_t sys_counterval;
> +
> if (static_cpu_has(X86_FEATURE_MOVDIRI))
> vdso_funcs_data.movdiri_supported = true;
> if (static_cpu_has(X86_FEATURE_MOVDIR64B))
> vdso_funcs_data.movdir64b_supported = true;
> + if (static_cpu_has(X86_FEATURE_WAITPKG))
> + vdso_funcs_data.waitpkg_supported = true;
> + if (static_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> + vdso_funcs_data.tsc_known_freq = true;
> + sys_counterval = convert_art_ns_to_tsc(1);
> + vdso_funcs_data.tsc_per_nsec = sys_counterval.cycles;
> + }
You're losing a ton of precision here. You might even be losing *all*
of the precision and malfunctioning rather badly.
The correct way to do this is:
tsc_counts = ns * mul >> shift;
and the vclock code illustrates it. convert_art_ns_to_tsc() is a bad
example because it uses an expensive division operation for no good
reason except that no one bothered to optimize it.
> +notrace int __vdso_nsec_to_tsc(unsigned long nsec, unsigned long *tsc)
> +{
> + if (!_vdso_funcs_data->tsc_known_freq)
> + return -ENODEV;
> +
> + *tsc = _vdso_funcs_data->tsc_per_nsec * nsec;
> +
> + return 0;
> +}
Please don't expose this one at all. It would be nice for programs that
use waitpkg to be migratable using CRIU-like tools, and this export
actively harms any such effort. If you omit this function, then the
kernel could learn to abort an in-progress __vdso_umwait if preempted
(rseq-style) and CRIU would just work. It would be a bit of a hack, but
it solves a real problem.
> +notrace int __vdso_umwait(int state, unsigned long nsec)
__vdso_umwait_relative(), please. Because some day (possibly soon)
someone will want __vdso_umwait_absolute() and its friend
__vdso_read_art_ns() so they can do:
u64 start = __vdso_read_art_ns();
__vdso_umonitor(...);
... do something potentially slow or that might fault ...
__vdso_umwait_absolute(start + timeout);
Also, this patch appears to have a subtle but show-stopping race. Consider:
1. Task A does UMONITOR on CPU 1
2. Task A is preempted.
3. Task B does UMONITOR on CPU 1 at a different address
4. Task A resumes
5. Task A does UMWAIT
Now task A hangs, at least until the next external wakeup happens.
It's not entirely clear to me how you're supposed to fix this without
some abomination that's so bad that it torpedoes the entire feature.
Except that there is no chicken bit to turn this thing off. Sigh.
next prev parent reply other threads:[~2018-07-24 2:11 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-07-23 12:55 [PATCH 0/7] x86: Enable a few new instructions Fenghua Yu
2018-07-23 12:55 ` [PATCH 1/7] x86/cpufeatures: Enumerate MOVDIRI instruction Fenghua Yu
2018-07-23 12:55 ` [PATCH 2/7] x86/cpufeatures: Enumerate MOVDIR64B instruction Fenghua Yu
2018-07-23 12:55 ` [PATCH 3/7] x86/cpufeatures: Enumerate UMONITOR, UMWAIT, and TPAUSE instructions Fenghua Yu
2018-07-23 12:55 ` [PATCH 4/7] x86/umwait_contro: Set global umwait maximum time limit and umwait C0.2 state Fenghua Yu
2018-07-24 1:41 ` Andy Lutomirski
2018-08-01 9:01 ` Thomas Gleixner
2018-07-23 12:55 ` [PATCH 5/7] x86/vdso: Add vDSO functions for direct store instructions Fenghua Yu
2018-07-24 1:48 ` Andy Lutomirski
2018-07-24 3:42 ` Fenghua Yu
2018-07-24 5:27 ` Andy Lutomirski
2018-07-25 22:18 ` Fenghua Yu
2018-07-23 12:55 ` [PATCH 6/7] x86/vdso: Add vDSO functions for user wait instructions Fenghua Yu
2018-07-24 2:11 ` Andy Lutomirski [this message]
2018-07-24 15:14 ` Andy Lutomirski
2018-07-31 21:22 ` Thomas Gleixner
2018-07-31 21:38 ` Andy Lutomirski
2018-08-01 8:55 ` Thomas Gleixner
2018-07-23 12:55 ` [PATCH 7/7] selftests/vDSO: Add selftest to test vDSO functions for direct store and " Fenghua Yu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2267fbe6-37e8-7063-d48f-1879f31d3258@kernel.org \
--to=luto@kernel.org \
--cc=alan@linux.intel.com \
--cc=ashok.raj@intel.com \
--cc=fenghua.yu@intel.com \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=ravi.v.shankar@intel.com \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.