Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH net-next V2] virtio_net: ethtool tx napi configuration
From: Jason Wang @ 2018-09-14  3:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, willemb, virtualization, linux-kernel, mst
In-Reply-To: <20180913.155253.713713547323242272.davem@davemloft.net>



On 2018年09月14日 06:52, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Thu, 13 Sep 2018 13:35:45 +0800
>
>> Toggle tx napi through a bit in tx-frames.
> This is not what the code implements as the interface any more.
>
> Please fix the commit message to match the code.
>
> Thanks.

Will fix this.

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: Sources of initialized memory in virtio?
From: Jason Wang @ 2018-09-14  3:50 UTC (permalink / raw)
  To: Alexander Potapenko, stefanha, Michael S. Tsirkin, kraxel
  Cc: kvm, virtualization
In-Reply-To: <CAG_fn=URm9Bv0sTy6Ex8ohrLVStaxoZtwH_BOEao6sqGCzQMiA@mail.gmail.com>



On 2018年09月13日 21:00, Alexander Potapenko wrote:
> Hi mighty virtio maintainers,
>
> I'm working on KMSAN, a new runtime detector of uninitialized memory
> based on compiler instrumentation (https://github.com/google/kmsan)
> KMSAN is mostly being tested on QEMU with KVM enabled, so my kernel
> interacts a lot with various virtio drivers, that's why I'm seeking
> your help.
>
> By default KMSAN treats kernel memory allocated by kmalloc() and
> alloc_page() as uninitialized. Writing a constant to memory or using
> it in copy_from_user() makes that memory initialized.
> Unfortunately a lot of writes to memory from KVM (mostly in the disk
> and network drivers) remain unnoticed by the tool, therefore we're
> seeing a lot of false positive reports (along with actual bugs, like
> CVE-2018-1118).
>
> KMSAN has an API function `kmsan_unpoison_shadow(void *buf, int len)`,
> which means "from now on, till this memory is freed or written to,
> mark it as initialized".
> I've tried playing Whack-a-Mole adding it to various places where the
> data comes from KVM, but failed to find them all. In fact, some of my
> annotations were wrong, so I ended up with the following two patches:
>
> https://github.com/google/kmsan/commit/76c671199a4de5bbe73cd13210a5e28848211bd1
> https://github.com/google/kmsan/commit/40ba1c8e2a3c6bbe8f34037413e253894251a405
>
> But I'm far from being sure this is the complete list of places where
> the memory is initialized by virtio drivers.
> May I ask you to help me find the places where we actually need to
> annotate the memory in virtio?

It looks to me another one is the used ring which device writes back the 
completed descriptor id and length. It (vr->used) was a part of a page 
which was allocated in vring_alloc_queue() through alloc_pages_exact() 
with __GFP_ZERO. So I'm not we need care about it.

Thanks

>
> Thanks in advance,

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next V2] virtio_net: ethtool tx napi configuration
From: Jason Wang @ 2018-09-14  8:24 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Willem de Bruijn, Michael S. Tsirkin, Network Development, LKML,
	virtualization, David Miller
In-Reply-To: <CAF=yD-LSgCEnL_kTH44A8-FH3nut3y=bonf4Bgf75=jd=D64hw@mail.gmail.com>



On 2018年09月13日 23:20, Willem de Bruijn wrote:
> On Thu, Sep 13, 2018 at 1:40 AM Jason Wang <jasowang@redhat.com> wrote:
>> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
>> Interrupt moderation is currently not supported, so these accept and
>> display the default settings of 0 usec and 1 frame.
>>
>> Toggle tx napi through a bit in tx-frames. So as to not interfere
>> with possible future interrupt moderation, value 1 means tx napi while
>> value 0 means not.
>>
>> To properly synchronize with the data path, tx napi is disabled and
>> tx lock is held when changing the value of napi weight. And two more
>> places that can access tx napi weight:
>>
>> - speculative tx polling in rx napi, we can leave it as is since it
>>    not a must for correctness.
>> - skb_xmit_done(), one more check of napi weight is added before
>>    trying to enable tx to avoid tx to be disabled forever if napi is
>>    disabled after skb_xmit_done() but before the napi
>>
>> Link: https://patchwork.ozlabs.org/patch/948149/
>> Suggested-by: Jason Wang <jasowang@redhat.com>
>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>> Changes from V1:
>> - try to synchronize with datapath to allow changing mode when
>>    interface is up.
>> - use tx-frames 0 as to disable tx napi while tx-frames 1 to enable tx napi
>> ---
>>   drivers/net/virtio_net.c | 64 +++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 63 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 765920905226..6e70864f5899 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -66,6 +66,8 @@ DECLARE_EWMA(pkt_len, 0, 64)
>>
>>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>>
>> +static const u32 ethtool_coalesce_napi_mask = (1UL << 10);
>> +
> This is no longer needed

Yes, will remove this.

>
>>   static const unsigned long guest_offloads[] = {
>>          VIRTIO_NET_F_GUEST_TSO4,
>>          VIRTIO_NET_F_GUEST_TSO6,
>> @@ -1444,7 +1446,10 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>>
>>          virtqueue_napi_complete(napi, sq->vq, 0);
>>
>> -       if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
>> +       /* Check napi.weight to avoid tx stall since it could be set
>> +        * to zero by ethtool after skb_xmit_done().
>> +        */
>> +       if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS || !sq->napi.weight)
>>                  netif_tx_wake_queue(txq);
> I see. This assumes that the napi handler will always be called on
> conversion from napi to no-napi mode.
>
> That is safe to assume because if it isn't called (and will not call
> netif_tx_wake_queue) that implies that napi was not scheduled, and
> thus the tx interrupt was not suppressed and thus there was no tx
> completion work to be scheduled?

If it isn't called it means skb_xmit_done() wakeup tx directly instead 
of schedule tx. This could be a little bit early since there may be 
still lots of pending tx packets. But it doesn't harm, start_xmit() can 
handle this by re enable a delayed tx interrupt and disable TX.

But there's a bug, look like I need remove the check of 
(!sq->napi.weight) in the beginning of the function.

>
>>          return 0;
>> @@ -2181,6 +2186,61 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
>>          return 0;
>>   }
>>
>> +static int virtnet_set_coalesce(struct net_device *dev,
>> +                               struct ethtool_coalesce *ec)
>> +{
>> +       struct ethtool_coalesce ec_default = {
>> +               .cmd = ETHTOOL_SCOALESCE,
>> +               .rx_max_coalesced_frames = 1,
>> +       };
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +       int i, napi_weight;
>> +
>> +       if (ec->tx_max_coalesced_frames > 1)
>> +               return -EINVAL;
>> +
>> +       ec_default.tx_max_coalesced_frames = ec->tx_max_coalesced_frames;
>> +       napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
>> +
>> +       /* disallow changes to fields not explicitly tested above */
>> +       if (memcmp(ec, &ec_default, sizeof(ec_default)))
>> +               return -EINVAL;
>> +
>> +       if (napi_weight ^ vi->sq[0].napi.weight) {
>> +               for (i = 0; i < vi->max_queue_pairs; i++) {
>> +                       struct netdev_queue *txq =
>> +                              netdev_get_tx_queue(vi->dev, i);
>> +
>> +                       virtnet_napi_tx_disable(&vi->sq[i].napi);
>> +                       __netif_tx_lock_bh(txq);
>> +                       vi->sq[i].napi.weight = napi_weight;
>> +                       __netif_tx_unlock_bh(txq);
>> +                       virtnet_napi_tx_enable(vi, vi->sq[i].vq,
>> +                                              &vi->sq[i].napi);
>> +               }
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int virtnet_get_coalesce(struct net_device *dev,
>> +                               struct ethtool_coalesce *ec)
>> +{
>> +       struct ethtool_coalesce ec_default = {
>> +               .cmd = ETHTOOL_GCOALESCE,
>> +               .rx_max_coalesced_frames = 1,
>> +               .tx_max_coalesced_frames = 0,
> no need to explicitly initialize to 0 (unless you did this for
> documentation purposes, which is fine).

Yes.

Thanks

>> +       };
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +
>> +       memcpy(ec, &ec_default, sizeof(ec_default));
>> +
>> +       if (vi->sq[0].napi.weight)
>> +               ec->tx_max_coalesced_frames = 1;
>> +
>> +       return 0;
>> +}
>> +
>>   static void virtnet_init_settings(struct net_device *dev)
>>   {
>>          struct virtnet_info *vi = netdev_priv(dev);
>> @@ -2219,6 +2279,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
>>          .get_ts_info = ethtool_op_get_ts_info,
>>          .get_link_ksettings = virtnet_get_link_ksettings,
>>          .set_link_ksettings = virtnet_set_link_ksettings,
>> +       .set_coalesce = virtnet_set_coalesce,
>> +       .get_coalesce = virtnet_get_coalesce,
>>   };
>>
>>   static void virtnet_freeze_down(struct virtio_device *vdev)
>> --
>> 2.17.1
>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* ICITS'19 - Quito, Ecuador | Extended deadline: September 23
From: Marle @ 2018-09-14 11:13 UTC (permalink / raw)
  To: virtualization

[-- Attachment #1.1: Type: text/plain, Size: 6043 bytes --]

* Proceedings by Springer, indexed in Scopus, ISI, etc.

***********************************************************

ICITS'19 - The 2019 International Conference on Information Technology & Systems

Quito, Ecuador, 6 - 8 February 2019

http://www.icits.me <http://www.icits.me/>

********************************************************************************

ICITS'19 - The 2019 International Conference on Information Technology & Systems, to be held at Quito, Ecuador, 6 - 8 February 2019, is an international forum for researchers and practitioners to present and discuss the most recent innovations, trends, results, experiences and concerns in the several perspectives of Information Technology & Systems.

We are pleased to invite you to submit your papers to ICITS'19. They can be written in English, Spanish or Portuguese. All submissions will be reviewed on the basis of relevance, originality, importance and clarity.

Topics

Submitted papers should be related with one or more of the main themes proposed for the Conference:

A) Information and Knowledge Management (IKM);

B) Organizational Models and Information Systems (OMIS);

C) Software and Systems Modeling (SSM);

D) Software Systems, Architectures, Applications and Tools (SSAAT);

E) Multimedia Systems and Applications (MSA);

F) Computer Networks, Mobility and Pervasive Systems (CNMPS);

G) Intelligent and Decision Support Systems (IDSS);

H) Big Data Analytics and Applications (BDAA);

I) Human-Computer Interaction (HCI);

J) Ethics, Computers and Security (ECS)

K) Health Informatics (HIS);

L) Information Technologies in Education (ITE);

M) Cybersecurity and Cyber-defense;

N) Electromagnetics, Sensors and Antennas for Security.

Submission and Decision

Submitted papers written in English (until 10-page limit) must comply with the format of Advances in Intelligent Systems and Computing series (see Instructions for Authors at Springer Website <http://www.springer.com/series/11156> or download a DOC example <http://www.icits.me/springerformat.doc>), must not have been published before, not be under review for any other conference or publication and not include any information leading to the authors’ identification. Therefore, the authors’ names, affiliations and bibliographic references should not be included in the version for evaluation by the Scientific Committee. This information should only be included in the camera-ready version, saved in Word or Latex format and also in PDF format. These files must be accompanied by the Consent to Publish form <http://www.icits.me/copyright.pdf> filled out, in a ZIP file, and uploaded at the conference management system.

Submitted papers written in Spanish or Portuguese (until 15-page limit) must comply with the format of RISTI <http://www.risti.xyz/> - Revista Ibérica de Sistemas e Tecnologias de Informação (download instructions/template for authors in Spanish <http://www.risti.xyz/formato-es.doc> or Portuguese <http://www.risti.xyz/formato-pt.doc>), must not have been published before, not be under review for any other conference or publication and not include any information leading to the authors’ identification. Therefore, the authors’ names, affiliations and bibliographic references should not be included in the version for evaluation by the Scientific Committee. This information should only be included in the camera-ready version, saved in Word. These file must be uploaded at the conference management system in a ZIP file.

All papers will be subjected to a “double-blind review” by at least two members of the Scientific Committee.

Based on Scientific Committee evaluation, a paper can be rejected or accepted by the Conference Chairs. In the later case, it can be accepted as paper or poster.

The authors of papers accepted as posters must build and print a poster to be exhibited during the Conference. This poster must follow an A1 or A2 vertical format. The Conference can includes Work Sessions where these posters are presented and orally discussed, with a 7 minute limit per poster.

The authors of accepted papers will have 15 minutes to present their work in a Conference Work Session; approximately 5 minutes of discussion will follow each presentation.

Publication and Indexing

To ensure that an accepted paper is published, at least one of the authors must be fully registered by the 9th of November 2018, and the paper must comply with the suggested layout and page-limit. Additionally, all recommended changes must be addressed by the authors before they submit the camera-ready version.

No more than one paper per registration will be published. An extra fee must be paid for publication of additional papers, with a maximum of one additional paper per registration. One registration permits only the participation of one author in the conference.

Papers written in English and accepted and registered will be published in Proceedings by Springer, in a book of the Advances in Intelligent Systems and Computing <http://www.springer.com/series/11156>series, will  be submitted for indexation by ISI, EI-Compendex, SCOPUS and DBLP, among others, and will be available in the SpringerLink Digital Library <http://link.springer.com/>.

Papers written in Spanish or Portuguese and accepted and registered will be published in a Special Issue of RISTI <http://www.risti.xyz/index.php?option=com_content&view=article&id=3&Itemid=104&lang=es> and will be submitted for indexation by SCOPUS, among others.

Important Dates

Paper Submission: September 16 23, 2018

Notification of Acceptance: October 21, 2018

Payment of Registration, to ensure the inclusion of an accepted paper in the conference proceedings: November 9, 2018.

Camera-ready Submission: November 9, 2018

ICITS'19 website: http://www.icits.me <http://www.icits.me/>

---
Este e-mail foi verificado em termos de vírus pelo AVG.
http://www.avg.com

[-- Attachment #1.2: Type: text/html, Size: 8378 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard

Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
implementation, which extended the clockid switch case and added yet
another slightly different copy of the same code.

Especially the extended switch case is problematic as the compiler tends to
generate a jump table which then requires to use retpolines. If jump tables
are disabled it adds yet another conditional to the existing maze.

This series takes a different approach by consolidating the almost
identical functions into one implementation for high resolution clocks and
one for the coarse grained clock ids by storing the base data for each
clock id in an array which is indexed by the clock id.

This completely eliminates the switch case and allows further
simplifications of the code base, which at the end all together gain a few
cycles performance or at least stay on par with todays code. The resulting
performance depends heavily on the micro architecture and the compiler.

Thanks,

	tglx

8<-------------------
 arch/x86/Kconfig                        |    1 
 arch/x86/entry/vdso/vclock_gettime.c    |  199 ++++++++------------------------
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   55 ++++----
 arch/x86/include/asm/vgtod.h            |   46 ++++---
 arch/x86/kernel/time.c                  |   22 +++
 include/linux/clocksource.h             |    5 
 kernel/time/Kconfig                     |    4 
 kernel/time/clocksource.c               |    2 
 8 files changed, 144 insertions(+), 190 deletions(-)

^ permalink raw reply

* [patch 01/11] clocksource: Provide clocksource_arch_init()
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: clocksource--Provide-clocksource_arch_init--.patch --]
[-- Type: text/plain, Size: 1492 bytes --]

Architectures have extra archdata in the clocksource, e.g. for VDSO
support. There are no sanity checks or general initializations for this
available. Add support for that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/clocksource.h |    5 +++++
 kernel/time/Kconfig         |    4 ++++
 kernel/time/clocksource.c   |    2 ++
 3 files changed, 11 insertions(+)

--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -241,6 +241,11 @@ static inline void __clocksource_update_
 	__clocksource_update_freq_scale(cs, 1000, khz);
 }
 
+#ifdef CONFIG_ARCH_CLOCKSOURCE_INIT
+extern void clocksource_arch_init(struct clocksource *cs);
+#else
+static inline void clocksource_arch_init(struct clocksource *cs) { }
+#endif
 
 extern int timekeeping_notify(struct clocksource *clock);
 
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -12,6 +12,10 @@ config CLOCKSOURCE_WATCHDOG
 config ARCH_CLOCKSOURCE_DATA
 	bool
 
+# Architecture has extra clocksource init called from registration
+config ARCH_CLOCKSOURCE_INIT
+	bool
+
 # Clocksources require validation of the clocksource against the last
 # cycle update - x86/TSC misfeature
 config CLOCKSOURCE_VALIDATE_LAST_CYCLE
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -937,6 +937,8 @@ int __clocksource_register_scale(struct
 {
 	unsigned long flags;
 
+	clocksource_arch_init(cs);
+
 	/* Initialize mult/shift and max_idle_ns */
 	__clocksource_update_freq_scale(cs, scale, freq);

^ permalink raw reply

* [patch 02/11] x86/time: Implement clocksource_arch_init()
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-time--Implement-clocksource_arch_init--.patch --]
[-- Type: text/plain, Size: 1337 bytes --]

Runtime validate the VCLOCK_MODE in clocksource::archdata and disable
VCLOCK if invalid, which disables the VDSO but keeps the system running.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig       |    1 +
 arch/x86/kernel/time.c |   16 ++++++++++++++++
 2 files changed, 17 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -48,6 +48,7 @@ config X86
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
 	select ANON_INODES
 	select ARCH_CLOCKSOURCE_DATA
+	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_DISCARD_MEMBLOCK
 	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
 	select ARCH_HAS_DEBUG_VIRTUAL
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -10,6 +10,7 @@
  *
  */
 
+#include <linux/clocksource.h>
 #include <linux/clockchips.h>
 #include <linux/interrupt.h>
 #include <linux/irq.h>
@@ -105,3 +106,18 @@ void __init time_init(void)
 {
 	late_time_init = x86_late_time_init;
 }
+
+/*
+ * Sanity check the vdso related archdata content.
+ */
+void clocksource_arch_init(struct clocksource *cs)
+{
+	if (cs->archdata.vclock_mode == VCLOCK_NONE)
+		return;
+
+	if (cs->archdata.vclock_mode >= VCLOCK_MAX) {
+		pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n",
+			cs->name, cs->archdata.vclock_mode);
+		cs->archdata.vclock_mode = VCLOCK_NONE;
+	}
+}

^ permalink raw reply

* [patch 03/11] x86/vdso: Enforce 64bit clocksource
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-vdso--Enforce-64bit-clocksource.patch --]
[-- Type: text/plain, Size: 1074 bytes --]

All VDSO clock sources are TSC based and use CLOCKSOURCE_MASK(64). There is
no point in masking with all FF. Get rid of it and enforce the mask in the
sanity checker.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |    2 +-
 arch/x86/kernel/time.c               |    6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -199,7 +199,7 @@ notrace static inline u64 vgetsns(int *m
 #endif
 	else
 		return 0;
-	v = (cycles - gtod->cycle_last) & gtod->mask;
+	v = cycles - gtod->cycle_last;
 	return v * gtod->mult;
 }
 
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -120,4 +120,10 @@ void clocksource_arch_init(struct clocks
 			cs->name, cs->archdata.vclock_mode);
 		cs->archdata.vclock_mode = VCLOCK_NONE;
 	}
+
+	if (cs->mask != CLOCKSOURCE_MASK(64)) {
+		pr_warn("clocksource %s registered with invalid mask %016llx. Disabling vclock.\n",
+			cs->name, cs->mask);
+		cs->archdata.vclock_mode = VCLOCK_NONE;
+	}
 }

^ permalink raw reply

* [patch 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-vdso--Mintor-cleanups.patch --]
[-- Type: text/plain, Size: 2383 bytes --]

The sequence count in vgtod_data is unsigned int, but the call sites use
unsigned long, which is a pointless exercise. Fix the call sites and
replace 'unsigned' with unsinged 'int' while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |    8 ++++----
 arch/x86/include/asm/vgtod.h         |   10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,7 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
-	unsigned long seq;
+	unsigned int seq;
 	u64 ns;
 	int mode;
 
@@ -227,7 +227,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
-	unsigned long seq;
+	unsigned int seq;
 	u64 ns;
 	int mode;
 
@@ -248,7 +248,7 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
-	unsigned long seq;
+	unsigned int seq;
 	do {
 		seq = gtod_read_begin(gtod);
 		ts->tv_sec = gtod->wall_time_coarse_sec;
@@ -258,7 +258,7 @@ notrace static void do_realtime_coarse(s
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
-	unsigned long seq;
+	unsigned int seq;
 	do {
 		seq = gtod_read_begin(gtod);
 		ts->tv_sec = gtod->monotonic_time_coarse_sec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -15,9 +15,9 @@ typedef unsigned long gtod_long_t;
  * so be carefull by modifying this structure.
  */
 struct vsyscall_gtod_data {
-	unsigned seq;
+	unsigned int seq;
 
-	int vclock_mode;
+	int	vclock_mode;
 	u64	cycle_last;
 	u64	mask;
 	u32	mult;
@@ -44,9 +44,9 @@ static inline bool vclock_was_used(int v
 	return READ_ONCE(vclocks_used) & (1 << vclock);
 }
 
-static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s)
+static inline unsigned int gtod_read_begin(const struct vsyscall_gtod_data *s)
 {
-	unsigned ret;
+	unsigned int ret;
 
 repeat:
 	ret = READ_ONCE(s->seq);
@@ -59,7 +59,7 @@ static inline unsigned gtod_read_begin(c
 }
 
 static inline int gtod_read_retry(const struct vsyscall_gtod_data *s,
-					unsigned start)
+				  unsigned int start)
 {
 	smp_rmb();
 	return unlikely(s->seq != start);

^ permalink raw reply

* [patch 05/11] x86/vdso: Introduce and use vgtod_ts
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-vdso--Introduce-and-use-vgtod_ts.patch --]
[-- Type: text/plain, Size: 7483 bytes --]

It's desired to support more clocks in the VDSO, e.g. CLOCK_TAI. This
results either in indirect calls due to the larger switch case, which then
requires retpolines or when the compiler is forced to avoid jump tables it
results in even more conditionals.

To avoid both variants which are bad for performance the high resolution
functions and the coarse grained functions will be collapsed into one for
each. That requires to store the clock specific base time in an array.

Introcude struct vgtod_ts for storage and convert the data store, the
update function and the individual clock functions over to use it.

The new storage does not longer use gtod_long_t for seconds depending on 32
or 64 bit compile because this needs to be the full 64bit value even for
32bit when a Y2038 function is added. No point in keeping the distinction
alive in the internal representation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c    |   24 +++++++++------
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   51 ++++++++++++++++----------------
 arch/x86/include/asm/vgtod.h            |   36 ++++++++++++----------
 3 files changed, 61 insertions(+), 50 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,6 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
+	struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
 	unsigned int seq;
 	u64 ns;
 	int mode;
@@ -213,8 +214,8 @@ notrace static int __always_inline do_re
 	do {
 		seq = gtod_read_begin(gtod);
 		mode = gtod->vclock_mode;
-		ts->tv_sec = gtod->wall_time_sec;
-		ns = gtod->wall_time_snsec;
+		ts->tv_sec = base->sec;
+		ns = base->nsec;
 		ns += vgetsns(&mode);
 		ns >>= gtod->shift;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -227,6 +228,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
+	struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
 	unsigned int seq;
 	u64 ns;
 	int mode;
@@ -234,8 +236,8 @@ notrace static int __always_inline do_mo
 	do {
 		seq = gtod_read_begin(gtod);
 		mode = gtod->vclock_mode;
-		ts->tv_sec = gtod->monotonic_time_sec;
-		ns = gtod->monotonic_time_snsec;
+		ts->tv_sec = base->sec;
+		ns = base->nsec;
 		ns += vgetsns(&mode);
 		ns >>= gtod->shift;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -248,21 +250,25 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
+	struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
 	unsigned int seq;
+
 	do {
 		seq = gtod_read_begin(gtod);
-		ts->tv_sec = gtod->wall_time_coarse_sec;
-		ts->tv_nsec = gtod->wall_time_coarse_nsec;
+		ts->tv_sec = base->sec;
+		ts->tv_nsec = base->nsec;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
+	struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
 	unsigned int seq;
+
 	do {
 		seq = gtod_read_begin(gtod);
-		ts->tv_sec = gtod->monotonic_time_coarse_sec;
-		ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
+		ts->tv_sec = base->sec;
+		ts->tv_nsec = base->nsec;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
@@ -318,7 +324,7 @@ int gettimeofday(struct timeval *, struc
 notrace time_t __vdso_time(time_t *t)
 {
 	/* This is atomic on x86 so we don't need any locks. */
-	time_t result = READ_ONCE(gtod->wall_time_sec);
+	time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec);
 
 	if (t)
 		*t = result;
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper *
 {
 	int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
 	struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+	struct vgtod_ts *base;
+	u64 nsec;
 
 	/* Mark the new vclock used. */
 	BUILD_BUG_ON(VCLOCK_MAX >= 32);
@@ -45,34 +47,33 @@ void update_vsyscall(struct timekeeper *
 	vdata->mult		= tk->tkr_mono.mult;
 	vdata->shift		= tk->tkr_mono.shift;
 
-	vdata->wall_time_sec		= tk->xtime_sec;
-	vdata->wall_time_snsec		= tk->tkr_mono.xtime_nsec;
-
-	vdata->monotonic_time_sec	= tk->xtime_sec
-					+ tk->wall_to_monotonic.tv_sec;
-	vdata->monotonic_time_snsec	= tk->tkr_mono.xtime_nsec
-					+ ((u64)tk->wall_to_monotonic.tv_nsec
-						<< tk->tkr_mono.shift);
-	while (vdata->monotonic_time_snsec >=
-					(((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
-		vdata->monotonic_time_snsec -=
-					((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
-		vdata->monotonic_time_sec++;
+	base = &vdata->basetime[CLOCK_REALTIME];
+	base->sec = tk->xtime_sec;
+	base->nsec = tk->tkr_mono.xtime_nsec;
+
+	base = &vdata->basetime[CLOCK_MONOTONIC];
+	base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
+	nsec = tk->tkr_mono.xtime_nsec;
+	nsec +=	((u64)tk->wall_to_monotonic.tv_nsec << tk->tkr_mono.shift);
+	while (nsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
+		nsec -= ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
+		base->sec++;
 	}
+	base->nsec = nsec;
 
-	vdata->wall_time_coarse_sec	= tk->xtime_sec;
-	vdata->wall_time_coarse_nsec	= (long)(tk->tkr_mono.xtime_nsec >>
-						 tk->tkr_mono.shift);
-
-	vdata->monotonic_time_coarse_sec =
-		vdata->wall_time_coarse_sec + tk->wall_to_monotonic.tv_sec;
-	vdata->monotonic_time_coarse_nsec =
-		vdata->wall_time_coarse_nsec + tk->wall_to_monotonic.tv_nsec;
-
-	while (vdata->monotonic_time_coarse_nsec >= NSEC_PER_SEC) {
-		vdata->monotonic_time_coarse_nsec -= NSEC_PER_SEC;
-		vdata->monotonic_time_coarse_sec++;
+	base = &vdata->basetime[CLOCK_REALTIME_COARSE];
+	base->sec = tk->xtime_sec;
+	base->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+
+	base = &vdata->basetime[CLOCK_MONOTONIC_COARSE];
+	base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
+	nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	nsec += tk->wall_to_monotonic.tv_nsec;
+	while (nsec >= NSEC_PER_SEC) {
+		nsec -= NSEC_PER_SEC;
+		base->sec++;
 	}
+	base->nsec = nsec;
 
 	gtod_write_end(vdata);
 }
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -5,33 +5,37 @@
 #include <linux/compiler.h>
 #include <linux/clocksource.h>
 
+#include <uapi/linux/time.h>
+
 #ifdef BUILD_VDSO32_64
 typedef u64 gtod_long_t;
 #else
 typedef unsigned long gtod_long_t;
 #endif
+
+struct vgtod_ts {
+	u64		sec;
+	u64		nsec;
+};
+
+#define VGTOD_BASES	(CLOCK_MONOTONIC_COARSE + 1)
+#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_COARSE	(BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
+
 /*
  * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
  * so be carefull by modifying this structure.
  */
 struct vsyscall_gtod_data {
-	unsigned int seq;
+	unsigned int	seq;
+
+	int		vclock_mode;
+	u64		cycle_last;
+	u64		mask;
+	u32		mult;
+	u32		shift;
 
-	int	vclock_mode;
-	u64	cycle_last;
-	u64	mask;
-	u32	mult;
-	u32	shift;
-
-	/* open coded 'struct timespec' */
-	u64		wall_time_snsec;
-	gtod_long_t	wall_time_sec;
-	gtod_long_t	monotonic_time_sec;
-	u64		monotonic_time_snsec;
-	gtod_long_t	wall_time_coarse_sec;
-	gtod_long_t	wall_time_coarse_nsec;
-	gtod_long_t	monotonic_time_coarse_sec;
-	gtod_long_t	monotonic_time_coarse_nsec;
+	struct vgtod_ts	basetime[VGTOD_BASES];
 
 	int		tz_minuteswest;
 	int		tz_dsttime;

^ permalink raw reply

* [patch 06/11] x86/vdso: Collapse high resolution functions
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-vdso--Collapse-high-res-functions.patch --]
[-- Type: text/plain, Size: 2221 bytes --]

do_realtime() and do_monotonic() are now the same except for the storage
array index. Hand the index in as an argument and collapse the functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   35 +++++++----------------------------
 1 file changed, 7 insertions(+), 28 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -203,35 +203,12 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
 
-/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
-	struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
+	struct vgtod_ts *base = &gtod->basetime[clk];
 	unsigned int seq;
-	u64 ns;
 	int mode;
-
-	do {
-		seq = gtod_read_begin(gtod);
-		mode = gtod->vclock_mode;
-		ts->tv_sec = base->sec;
-		ns = base->nsec;
-		ns += vgetsns(&mode);
-		ns >>= gtod->shift;
-	} while (unlikely(gtod_read_retry(gtod, seq)));
-
-	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
-	ts->tv_nsec = ns;
-
-	return mode;
-}
-
-notrace static int __always_inline do_monotonic(struct timespec *ts)
-{
-	struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
-	unsigned int seq;
 	u64 ns;
-	int mode;
 
 	do {
 		seq = gtod_read_begin(gtod);
@@ -276,11 +253,11 @@ notrace int __vdso_clock_gettime(clockid
 {
 	switch (clock) {
 	case CLOCK_REALTIME:
-		if (do_realtime(ts) == VCLOCK_NONE)
+		if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
 	case CLOCK_MONOTONIC:
-		if (do_monotonic(ts) == VCLOCK_NONE)
+		if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
 	case CLOCK_REALTIME_COARSE:
@@ -303,7 +280,9 @@ int clock_gettime(clockid_t, struct time
 notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
 	if (likely(tv != NULL)) {
-		if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+		struct timespec *ts = (struct timespec *) tv;
+
+		if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
 			return vdso_fallback_gtod(tv, tz);
 		tv->tv_usec /= 1000;
 	}

^ permalink raw reply

* [patch 07/11] x86/vdso: Collapse coarse functions
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-vdso--Collapse-coarse-functions.patch --]
[-- Type: text/plain, Size: 1421 bytes --]

do_realtime_coarse() and do_monotonic_coarse() are now the same except for
the storage array index. Hand the index in as an argument and collapse the
functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -225,21 +225,9 @@ notrace static int do_hres(clockid_t clk
 	return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static void do_coarse(clockid_t clk, struct timespec *ts)
 {
-	struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
-	unsigned int seq;
-
-	do {
-		seq = gtod_read_begin(gtod);
-		ts->tv_sec = base->sec;
-		ts->tv_nsec = base->nsec;
-	} while (unlikely(gtod_read_retry(gtod, seq)));
-}
-
-notrace static void do_monotonic_coarse(struct timespec *ts)
-{
-	struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
+	struct vgtod_ts *base = &gtod->basetime[clk];
 	unsigned int seq;
 
 	do {
@@ -261,10 +249,10 @@ notrace int __vdso_clock_gettime(clockid
 			goto fallback;
 		break;
 	case CLOCK_REALTIME_COARSE:
-		do_realtime_coarse(ts);
+		do_coarse(CLOCK_REALTIME_COARSE, ts);
 		break;
 	case CLOCK_MONOTONIC_COARSE:
-		do_monotonic_coarse(ts);
+		do_coarse(CLOCK_MONOTONIC_COARSE, ts);
 		break;
 	default:
 		goto fallback;

^ permalink raw reply

* [patch 08/11] x86/vdso: Replace the clockid switch case
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-vdso--Replace-the-clockid-switch-case.patch --]
[-- Type: text/plain, Size: 2110 bytes --]

Now that the time getter functions use the clockid as index into the
storage array for the base time access, the switch case can be replaced.

- Check for clockid >= MAX_CLOCKS and for negative clockid (CPU/FD) first
  and call the fallback function right away.

- After establishing that clockid is < MAX_CLOCKS, convert the clockid to a
  bitmask

- Check for the supported high resolution and coarse functions by anding
  the bitmask of supported clocks and check whether a bit is set.

This completely avoids jump tables, reduces the number of conditionals and
makes the VDSO extensible for other clock ids.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   38 ++++++++++++++++-------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -239,29 +239,27 @@ notrace static void do_coarse(clockid_t
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
-	switch (clock) {
-	case CLOCK_REALTIME:
-		if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
-			goto fallback;
-		break;
-	case CLOCK_MONOTONIC:
-		if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
-			goto fallback;
-		break;
-	case CLOCK_REALTIME_COARSE:
-		do_coarse(CLOCK_REALTIME_COARSE, ts);
-		break;
-	case CLOCK_MONOTONIC_COARSE:
-		do_coarse(CLOCK_MONOTONIC_COARSE, ts);
-		break;
-	default:
-		goto fallback;
-	}
+	unsigned int msk;
+
+	/* Sort out negative (CPU/FD) and invalid clocks */
+	if (unlikely((unsigned int) clock >= MAX_CLOCKS))
+		return vdso_fallback_gettime(clock, ts);
 
-	return 0;
-fallback:
+	/*
+	 * Convert the clockid to a bitmask and use it to check which
+	 * clocks are handled in the VDSO directly.
+	 */
+	msk = 1U << clock;
+	if (likely(msk & VGTOD_HRES)) {
+		if (do_hres(clock, ts) != VCLOCK_NONE)
+			return 0;
+	} else if (msk & VGTOD_COARSE) {
+		do_coarse(clock, ts);
+		return 0;
+	}
 	return vdso_fallback_gettime(clock, ts);
 }
+
 int clock_gettime(clockid_t, struct timespec *)
 	__attribute__((weak, alias("__vdso_clock_gettime")));

^ permalink raw reply

* [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-vdso--Simplify-the-invalid-vclock-case.patch --]
[-- Type: text/plain, Size: 5307 bytes --]

The code flow for the vclocks is convoluted as it requires the vclocks
which can be invalidated separately from the vsyscall_gtod_data sequence to
store the fact in a separate variable. That's inefficient.

Restructure the code so the vclock readout returns cycles and the
conversion to nanoseconds is handled at the call site.

If the clock gets invalidated or vclock is already VCLOCK_NONE, return
U64_MAX as the cycle value, which is invalid for all clocks and leave the
sequence loop immediately in that case by calling the fallback function
directly.

This allows to remove the gettimeofday fallback as it now uses the
clock_gettime() fallback and does the nanoseconds to microseconds
conversion in the same way as it does when the vclock is functional. It
does not make a difference whether the division by 1000 happens in the
kernel fallback or in userspace.

Generates way better code and gains a few cycles back.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   81 +++++++++--------------------------
 1 file changed, 21 insertions(+), 60 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -48,16 +48,6 @@ notrace static long vdso_fallback_gettim
 	return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-	long ret;
-
-	asm("syscall" : "=a" (ret) :
-	    "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
-	return ret;
-}
-
-
 #else
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -75,21 +65,6 @@ notrace static long vdso_fallback_gettim
 	return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-	long ret;
-
-	asm(
-		"mov %%ebx, %%edx \n"
-		"mov %2, %%ebx \n"
-		"call __kernel_vsyscall \n"
-		"mov %%edx, %%ebx \n"
-		: "=a" (ret)
-		: "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-		: "memory", "edx");
-	return ret;
-}
-
 #endif
 
 #ifdef CONFIG_PARAVIRT_CLOCK
@@ -98,7 +73,7 @@ static notrace const struct pvclock_vsys
 	return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
 }
 
-static notrace u64 vread_pvclock(int *mode)
+static notrace u64 vread_pvclock(void)
 {
 	const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
 	u64 ret;
@@ -130,10 +105,8 @@ static notrace u64 vread_pvclock(int *mo
 	do {
 		version = pvclock_read_begin(pvti);
 
-		if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
-			*mode = VCLOCK_NONE;
-			return 0;
-		}
+		if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
+			return U64_MAX;
 
 		ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
 	} while (pvclock_read_retry(pvti, version));
@@ -148,17 +121,12 @@ static notrace u64 vread_pvclock(int *mo
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-static notrace u64 vread_hvclock(int *mode)
+static notrace u64 vread_hvclock(void)
 {
 	const struct ms_hyperv_tsc_page *tsc_pg =
 		(const struct ms_hyperv_tsc_page *)&hvclock_page;
-	u64 current_tick = hv_read_tsc_page(tsc_pg);
-
-	if (current_tick != U64_MAX)
-		return current_tick;
 
-	*mode = VCLOCK_NONE;
-	return 0;
+	return hv_read_tsc_page(tsc_pg);
 }
 #endif
 
@@ -182,47 +150,42 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcyc(int mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	if (mode == VCLOCK_TSC)
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	else if (mode == VCLOCK_PVCLOCK)
+		return vread_pvclock();
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	else if (mode == VCLOCK_HVCLOCK)
+		return vread_hvclock();
 #endif
-	else
-		return 0;
-	v = cycles - gtod->cycle_last;
-	return v * gtod->mult;
+	return U64_MAX;
 }
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
 	struct vgtod_ts *base = &gtod->basetime[clk];
 	unsigned int seq;
-	int mode;
-	u64 ns;
+	u64 cycles, ns;
 
 	do {
 		seq = gtod_read_begin(gtod);
-		mode = gtod->vclock_mode;
 		ts->tv_sec = base->sec;
 		ns = base->nsec;
-		ns += vgetsns(&mode);
+		cycles = vgetcyc(gtod->vclock_mode);
+		if (unlikely((s64)cycles < 0))
+			return vdso_fallback_gettime(clk, ts);
+		ns += (cycles - gtod->cycle_last) * gtod->mult;
 		ns >>= gtod->shift;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 
 	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
 	ts->tv_nsec = ns;
 
-	return mode;
+	return 0;
 }
 
 notrace static void do_coarse(clockid_t clk, struct timespec *ts)
@@ -251,8 +214,7 @@ notrace int __vdso_clock_gettime(clockid
 	 */
 	msk = 1U << clock;
 	if (likely(msk & VGTOD_HRES)) {
-		if (do_hres(clock, ts) != VCLOCK_NONE)
-			return 0;
+		return do_hres(clock, ts);
 	} else if (msk & VGTOD_COARSE) {
 		do_coarse(clock, ts);
 		return 0;
@@ -268,8 +230,7 @@ notrace int __vdso_gettimeofday(struct t
 	if (likely(tv != NULL)) {
 		struct timespec *ts = (struct timespec *) tv;
 
-		if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
-			return vdso_fallback_gtod(tv, tz);
+		do_hres(CLOCK_REALTIME, ts);
 		tv->tv_usec /= 1000;
 	}
 	if (unlikely(tz != NULL)) {

^ permalink raw reply

* [patch 10/11] x86/vdso: Move cycle_last handling into the caller
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x86-vdso--Move-cycle_last-handling-into-the-caller.patch --]
[-- Type: text/plain, Size: 2774 bytes --]

Dereferencing gtod->cycle_last all over the place and foing the cycles <
last comparison in the vclock read functions generates horrible code. Doing
it at the call site is much better and gains a few cycles both for TSC and
pvclock.

Caveat: This adds the comparison to the hyperv vclock as well, but I have
no way to test that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   39 ++++++-----------------------------
 1 file changed, 7 insertions(+), 32 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -76,9 +76,8 @@ static notrace const struct pvclock_vsys
 static notrace u64 vread_pvclock(void)
 {
 	const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
-	u64 ret;
-	u64 last;
 	u32 version;
+	u64 ret;
 
 	/*
 	 * Note: The kernel and hypervisor must guarantee that cpu ID
@@ -111,13 +110,7 @@ static notrace u64 vread_pvclock(void)
 		ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
 	} while (pvclock_read_retry(pvti, version));
 
-	/* refer to vread_tsc() comment for rationale */
-	last = gtod->cycle_last;
-
-	if (likely(ret >= last))
-		return ret;
-
-	return last;
+	return ret;
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
@@ -130,30 +123,10 @@ static notrace u64 vread_hvclock(void)
 }
 #endif
 
-notrace static u64 vread_tsc(void)
-{
-	u64 ret = (u64)rdtsc_ordered();
-	u64 last = gtod->cycle_last;
-
-	if (likely(ret >= last))
-		return ret;
-
-	/*
-	 * GCC likes to generate cmov here, but this branch is extremely
-	 * predictable (it's just a function of time and the likely is
-	 * very likely) and there's a data dependence, so force GCC
-	 * to generate a branch instead.  I don't barrier() because
-	 * we don't actually need a barrier, and if this function
-	 * ever gets inlined it will generate worse code.
-	 */
-	asm volatile ("");
-	return last;
-}
-
 notrace static inline u64 vgetcyc(int mode)
 {
 	if (mode == VCLOCK_TSC)
-		return vread_tsc();
+		return (u64)rdtsc_ordered();
 #ifdef CONFIG_PARAVIRT_CLOCK
 	else if (mode == VCLOCK_PVCLOCK)
 		return vread_pvclock();
@@ -168,17 +141,19 @@ notrace static inline u64 vgetcyc(int mo
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
 	struct vgtod_ts *base = &gtod->basetime[clk];
+	u64 cycles, last, ns;
 	unsigned int seq;
-	u64 cycles, ns;
 
 	do {
 		seq = gtod_read_begin(gtod);
 		ts->tv_sec = base->sec;
 		ns = base->nsec;
+		last = gtod->cycle_last;
 		cycles = vgetcyc(gtod->vclock_mode);
 		if (unlikely((s64)cycles < 0))
 			return vdso_fallback_gettime(clk, ts);
-		ns += (cycles - gtod->cycle_last) * gtod->mult;
+		if (cycles > last)
+			ns += (cycles - last) * gtod->mult;
 		ns >>= gtod->shift;
 	} while (unlikely(gtod_read_retry(gtod, seq)));

^ permalink raw reply

* [patch 11/11] x66/vdso: Add CLOCK_TAI support
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125006.349747096@linutronix.de>

[-- Attachment #1: x66-vdso--Add-CLOCK_TAI-support.patch --]
[-- Type: text/plain, Size: 2323 bytes --]

With the storage array in place it's now trivial to support CLOCK_TAI in
the vdso. Instead of extending the array to accomodate CLOCK_TAI, make use
of the fact that:

 - CLOCK ids are set in stone
 - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
   the array slot 3 is unused
 - CLOCK_TAI is id 11 which results in 3 when masked with 0x3

Add the mask to the basetime array lookup and set up the CLOCK_TAI base
time in update_vsyscall().

The performance impact of the mask operation is within the noise.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c    |    2 +-
 arch/x86/entry/vsyscall/vsyscall_gtod.c |    4 ++++
 arch/x86/include/asm/vgtod.h            |    6 +++++-
 3 files changed, 10 insertions(+), 2 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -140,7 +140,7 @@ notrace static inline u64 vgetcyc(int mo
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
-	struct vgtod_ts *base = &gtod->basetime[clk];
+	struct vgtod_ts *base = &gtod->basetime[clk & VGTOD_HRES_MASK];
 	u64 cycles, last, ns;
 	unsigned int seq;
 
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
 	base->sec = tk->xtime_sec;
 	base->nsec = tk->tkr_mono.xtime_nsec;
 
+	base = &vdata->basetime[VGTOD_TAI];
+	base->sec = tk->xtime_sec + (s64)tk->tai_offset;
+	base->nsec = tk->tkr_mono.xtime_nsec;
+
 	base = &vdata->basetime[CLOCK_MONOTONIC];
 	base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
 	nsec = tk->tkr_mono.xtime_nsec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -19,9 +19,13 @@ struct vgtod_ts {
 };
 
 #define VGTOD_BASES	(CLOCK_MONOTONIC_COARSE + 1)
-#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
 #define VGTOD_COARSE	(BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
 
+/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */
+#define VGTOD_HRES_MASK	0x3
+#define VGTOD_TAI	(CLOCK_TAI & VGTOD_HRES_MASK)
+
 /*
  * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
  * so be carefull by modifying this structure.

^ permalink raw reply

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Thomas Gleixner @ 2018-09-14 13:05 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86, LKML,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <2f723b28-8f81-4b02-861c-42f756a8825a@redhat.com>

On Fri, 14 Sep 2018, Florian Weimer wrote:

> On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > implementation, which extended the clockid switch case and added yet
> > another slightly different copy of the same code.
> > 
> > Especially the extended switch case is problematic as the compiler tends to
> > generate a jump table which then requires to use retpolines.
> 
> Does vDSO code really have to use retpolines?  It's in userspace, after all.

Unless you have IBRS/STIPB enabled, you need user space ratpoutine as well.

Thanks,

	tglx

^ permalink raw reply

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Peter Zijlstra @ 2018-09-14 13:09 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Juergen Gross, Arnd Bergmann, Stephen Boyd, x86, LKML,
	virtualization, John Stultz, Andy Lutomirski, Paolo Bonzini,
	devel, Thomas Gleixner, Matt Rickard
In-Reply-To: <2f723b28-8f81-4b02-861c-42f756a8825a@redhat.com>

On Fri, Sep 14, 2018 at 02:56:46PM +0200, Florian Weimer wrote:
> On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > implementation, which extended the clockid switch case and added yet
> > another slightly different copy of the same code.
> > 
> > Especially the extended switch case is problematic as the compiler tends to
> > generate a jump table which then requires to use retpolines.
> 
> Does vDSO code really have to use retpolines?  It's in userspace, after all.

Userspace is equally susceptible to spectre-v2. Ideally we'd recompile
world with retpoline, but given the amount of inline asm in say things
like openssl and similar projects, validating that there are indeed no
indirect calls/jumps left is nontrivial.

There are currently pending patches to otherwise address user-user
spectre-v2 attacks.

^ permalink raw reply

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Thomas Gleixner @ 2018-09-14 13:19 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86, LKML,
	virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <63a0f67d-fdb1-e2fc-5d4d-4f3cfbf86fd1@redhat.com>

On Fri, 14 Sep 2018, Florian Weimer wrote:

> On 09/14/2018 03:05 PM, Thomas Gleixner wrote:
> > On Fri, 14 Sep 2018, Florian Weimer wrote:
> > 
> > > On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > > implementation, which extended the clockid switch case and added yet
> > > > another slightly different copy of the same code.
> > > > 
> > > > Especially the extended switch case is problematic as the compiler tends
> > > > to
> > > > generate a jump table which then requires to use retpolines.
> > > 
> > > Does vDSO code really have to use retpolines?  It's in userspace, after
> > > all.
> > 
> > Unless you have IBRS/STIPB enabled, you need user space ratpoutine as well.
> 
> I don't think this is a consensus position, and it obviously depends on the
> (sub)architecture.

It does, but we are not building kernels for specific micro architectures
nor do distros AFAIK.

But that aside, even with jump tables the generated code (ratpoutine
disabled) is not any better than with the current approach. It's even worse
and needs the same optimizations at the end and then the main difference is
that you have 3 copies of one function and 2 of the other. Not a win at
all.

Thanks,

	tglx

^ permalink raw reply

* Re: [patch 11/11] x66/vdso: Add CLOCK_TAI support
From: Andy Lutomirski @ 2018-09-14 14:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	LKML, virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125119.081037164@linutronix.de>



> On Sep 14, 2018, at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> With the storage array in place it's now trivial to support CLOCK_TAI in
> the vdso. Instead of extending the array to accomodate CLOCK_TAI, make use
> of the fact that:
> 
> - CLOCK ids are set in stone
> - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
>   the array slot 3 is unused
> - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
> 
> Add the mask to the basetime array lookup and set up the CLOCK_TAI base
> time in update_vsyscall().

That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere to assert that this actually works?

> 
> The performance impact of the mask operation is within the noise.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/entry/vdso/vclock_gettime.c    |    2 +-
> arch/x86/entry/vsyscall/vsyscall_gtod.c |    4 ++++
> arch/x86/include/asm/vgtod.h            |    6 +++++-
> 3 files changed, 10 insertions(+), 2 deletions(-)
> 
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -140,7 +140,7 @@ notrace static inline u64 vgetcyc(int mo
> 
> notrace static int do_hres(clockid_t clk, struct timespec *ts)
> {
> -    struct vgtod_ts *base = &gtod->basetime[clk];
> +    struct vgtod_ts *base = &gtod->basetime[clk & VGTOD_HRES_MASK];
>    u64 cycles, last, ns;
>    unsigned int seq;
> 
> --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
> +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
> @@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
>    base->sec = tk->xtime_sec;
>    base->nsec = tk->tkr_mono.xtime_nsec;
> 
> +    base = &vdata->basetime[VGTOD_TAI];
> +    base->sec = tk->xtime_sec + (s64)tk->tai_offset;
> +    base->nsec = tk->tkr_mono.xtime_nsec;
> +
>    base = &vdata->basetime[CLOCK_MONOTONIC];
>    base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
>    nsec = tk->tkr_mono.xtime_nsec;
> --- a/arch/x86/include/asm/vgtod.h
> +++ b/arch/x86/include/asm/vgtod.h
> @@ -19,9 +19,13 @@ struct vgtod_ts {
> };
> 
> #define VGTOD_BASES    (CLOCK_MONOTONIC_COARSE + 1)
> -#define VGTOD_HRES    (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
> +#define VGTOD_HRES    (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
> #define VGTOD_COARSE    (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
> 
> +/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */
> +#define VGTOD_HRES_MASK    0x3
> +#define VGTOD_TAI    (CLOCK_TAI & VGTOD_HRES_MASK)
> +
> /*
>  * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
>  * so be carefull by modifying this structure.
> 
> 
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Arnd Bergmann @ 2018-09-14 14:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Florian Weimer, Juergen Gross, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, Stephen Boyd, John Stultz, Deepa Dinamani,
	Andy Lutomirski, Paolo Bonzini, devel, matt
In-Reply-To: <20180914125006.349747096@linutronix.de>

On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> implementation, which extended the clockid switch case and added yet
> another slightly different copy of the same code.
>
> Especially the extended switch case is problematic as the compiler tends to
> generate a jump table which then requires to use retpolines. If jump tables
> are disabled it adds yet another conditional to the existing maze.
>
> This series takes a different approach by consolidating the almost
> identical functions into one implementation for high resolution clocks and
> one for the coarse grained clock ids by storing the base data for each
> clock id in an array which is indexed by the clock id.
>
> This completely eliminates the switch case and allows further
> simplifications of the code base, which at the end all together gain a few
> cycles performance or at least stay on par with todays code. The resulting
> performance depends heavily on the micro architecture and the compiler.

No objections from my side, just a few general remarks: Deepa
and I have discussed the vdso in the past for 2038. I have started
a patch that I'll have to redo on top of yours, but that is fine, your
cleanup is only going to help here.

More generally speaking, Deepa said that she would like to see
some generalization on the vdso before adding the time64_t
based variants. Your series probably makes x86 diverge more
from the others, which makes it harder to consolidate them again,
but we might just as well use your new implementation to base the
generic one on, and just move the other ones over to that.

A couple of architectures (s390, ia64, riscv, powerpc, arm64)
implement the vdso as assembler code at the moment, so they
won't be as easy to consolidate (other than outright replacing all
the code).

The other five:
arch/x86/entry/vdso/vclock_gettime.c
arch/sparc/vdso/vclock_gettime.c
arch/nds32/kernel/vdso/gettimeofday.c
arch/mips/vdso/gettimeofday.c
arch/arm/vdso/vgettimeofday.c

are basically all minor variations of the same code base and could be
consolidated to some degree.
Any suggestions here? Should we plan to do that consolitdation based on
your new version, or just add clock_gettime64 in arm32 and x86-32, and then
be done with it? The other ones will obviously still be fast for 32-bit time_t
and will have a working non-vdso sys_clock_getttime64().

I also wonder about clock_getres(): half the architectures seem to implement
it in vdso, but notably arm32 and x86 don't, and I had not expected it to be
performance critical given that the result is easily cached in user space.

         Arnd

^ permalink raw reply

* Re: [patch 11/11] x66/vdso: Add CLOCK_TAI support
From: Thomas Gleixner @ 2018-09-14 14:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	LKML, virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <ABB53D42-2A06-41B9-8A8C-40B39CD0289D@amacapital.net>

[-- Attachment #1: Type: text/plain, Size: 836 bytes --]

On Fri, 14 Sep 2018, Andy Lutomirski wrote:
> > On Sep 14, 2018, at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > With the storage array in place it's now trivial to support CLOCK_TAI in
> > the vdso. Instead of extending the array to accomodate CLOCK_TAI, make use
> > of the fact that:
> > 
> > - CLOCK ids are set in stone
> > - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
> >   the array slot 3 is unused
> > - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
> > 
> > Add the mask to the basetime array lookup and set up the CLOCK_TAI base
> > time in update_vsyscall().
> 
> That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere
> to assert that this actually works?

Sure, but changing any of the clock ids will cause more wreckage than that.

Thanks,

	tglx

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [patch 11/11] x66/vdso: Add CLOCK_TAI support
From: Andy Lutomirski @ 2018-09-14 14:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	LKML, virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <alpine.DEB.2.21.1809141625040.10480@nanos.tec.linutronix.de>



> On Sep 14, 2018, at 7:27 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Fri, 14 Sep 2018, Andy Lutomirski wrote:
>>> On Sep 14, 2018, at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> With the storage array in place it's now trivial to support CLOCK_TAI in
>>> the vdso. Instead of extending the array to accomodate CLOCK_TAI, make use
>>> of the fact that:
>>> 
>>> - CLOCK ids are set in stone
>>> - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
>>>  the array slot 3 is unused
>>> - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
>>> 
>>> Add the mask to the basetime array lookup and set up the CLOCK_TAI base
>>> time in update_vsyscall().
>> 
>> That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere
>> to assert that this actually works?
> 
> Sure, but changing any of the clock ids will cause more wreckage than that.
> 

I’m more concerned that we add a new one and break the magic masking. Maybe two start sharing the same slot. 

> Thanks,
> 
>    tglx
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [patch 10/11] x86/vdso: Move cycle_last handling into the caller
From: Vitaly Kuznetsov @ 2018-09-14 15:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	LKML, virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125118.998589817@linutronix.de>

Thomas Gleixner <tglx@linutronix.de> writes:

> Dereferencing gtod->cycle_last all over the place and foing the cycles <
> last comparison in the vclock read functions generates horrible code. Doing
> it at the call site is much better and gains a few cycles both for TSC and
> pvclock.
>
> Caveat: This adds the comparison to the hyperv vclock as well, but I have
> no way to test that.

While the change looks obviously correct for Hyper-V vclock too, by saying
that you encouraged me to give the whole series a shot. I did and turns
out Hyper-V vclock is broken: simple clock_gettime(CLOCK_REALTIME, )
test goes from 70 cpu cycles to 1000 (meaning that we're falling back to
syscall I guess). Bisecting now, stay tuned!

-- 
  Vitaly

^ permalink raw reply

* Re: [patch 02/11] x86/time: Implement clocksource_arch_init()
From: Vitaly Kuznetsov @ 2018-09-14 15:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
	LKML, virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
	Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125118.293161327@linutronix.de>

Thomas Gleixner <tglx@linutronix.de> writes:

> Runtime validate the VCLOCK_MODE in clocksource::archdata and disable
> VCLOCK if invalid, which disables the VDSO but keeps the system running.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/Kconfig       |    1 +
>  arch/x86/kernel/time.c |   16 ++++++++++++++++
>  2 files changed, 17 insertions(+)
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -48,6 +48,7 @@ config X86
>  	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
>  	select ANON_INODES
>  	select ARCH_CLOCKSOURCE_DATA
> +	select ARCH_CLOCKSOURCE_INIT
>  	select ARCH_DISCARD_MEMBLOCK
>  	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
>  	select ARCH_HAS_DEBUG_VIRTUAL
> --- a/arch/x86/kernel/time.c
> +++ b/arch/x86/kernel/time.c
> @@ -10,6 +10,7 @@
>   *
>   */
>
> +#include <linux/clocksource.h>
>  #include <linux/clockchips.h>
>  #include <linux/interrupt.h>
>  #include <linux/irq.h>
> @@ -105,3 +106,18 @@ void __init time_init(void)
>  {
>  	late_time_init = x86_late_time_init;
>  }
> +
> +/*
> + * Sanity check the vdso related archdata content.
> + */
> +void clocksource_arch_init(struct clocksource *cs)
> +{
> +	if (cs->archdata.vclock_mode == VCLOCK_NONE)
> +		return;
> +
> +	if (cs->archdata.vclock_mode >= VCLOCK_MAX) {

It should be '>' as VCLOCK_MAX == VCLOCK_HVCLOCK == 3

> +		pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n",
> +			cs->name, cs->archdata.vclock_mode);
> +		cs->archdata.vclock_mode = VCLOCK_NONE;
> +	}
> +}

-- 
  Vitaly

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox