From: Andrei Vagin <avagin@gmail.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>,
Thomas Gleixner <tglx@linutronix.de>
Cc: "linux-kselftest@vger.kernel.org"
<linux-kselftest@vger.kernel.org>,
Dmitry Safonov <dima@arista.com>,
"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
Jeff Dike <jdike@addtoit.com>, "x86@kernel.org" <x86@kernel.org>,
Dmitry Safonov <0x7f454c46@gmail.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Oleg Nesterov <oleg@redhat.com>,
"criu@openvz.org" <criu@openvz.org>,
Ingo Molnar <mingo@redhat.com>,
Alexey Dobriyan <adobriyan@gmail.com>,
Andy Lutomirski <luto@kernel.org>,
"H. Peter Anvin" <hpa@zytor.com>,
Cyrill Gorcunov <gorcunov@openvz.org>,
Christian Brauner <christian.brauner@ubuntu.com>,
Pavel Emelianov <xemul@virtuozzo.com>,
Shuah Khan <shuah@kernel.org>,
"containers@lists.linux-foundation.org"
<containers@lists.linux-foundation.org>,
Adrian Reber <adrian@lisas.de>
Subject: Re: [RFC 00/20] ns: Introduce Time Namespace
Date: Sat, 20 Oct 2018 20:54:36 -0700
Message-ID: <20181021035435.GA21328@gmail.com>
In-Reply-To: <20181021014121.GA23474@gmail.com>
On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx@linutronix.de> writes:
> >
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > > >> tick_sched_do_timer
> > > >>    tick_do_update_jiffies64
> > > >>       update_wall_time
> > > >>          timekeeping_advance
> > > >>             timekeeping_update
> > >>
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >>
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > > Please don't go there. timekeeping_update() is already heavy and walking
> > > > through a gazillion namespaces will just make it horrible.
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> >
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around? I have not dug
> > deeply enough into the code to see that yet.
> >
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > > a full walk of all armed timers, disarming, adjusting and requeuing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > > I haven't looked through Dmitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > Then it sounds like this will take some more digging.
> >
> > > Please pardon me for thinking out loud.
> >
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> >
> > Each time source needs to be sampled at least once per wrap-around
> > > and something incremented so that we don't lose time when looking
> > at that time source.
> >
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> >
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> > real time.
> >
> > > The problem is that the CLOCK_MONOTONIC values on different nodes in
> > > the cluster have no relationship to each other (except a synchronized length of
> > the second). So applications that migrate can see CLOCK_MONOTONIC
> > and CLOCK_BOOTTIME go backwards.
> >
> > This is the truly pressing problem and adding some kind of offset
> > sounds like it would be the solution. Possibly by allowing a boot
> > time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> >
> > > 2) Dealing with two separate time management domains. Say a machine
> > > that needs to deal with both something inside of Google, where they
> > > slew time to avoid leap seconds, and something in the outside
> > > world, where proper UTC time is kept as an offset from TAI with the
> > > occasional leap second.
> >
> > > In the latter case it would fundamentally require having seconds of
> > different length.
> >
>
> I want to add that the second case should be optional.
>
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I thought about this and decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
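
For illustration only, the "offsets for a few clocks" idea can be sketched
in user-space C. This is a minimal sketch under stated assumptions, not the
code from the series: the struct, the enum and the function names below are
hypothetical, and the host clock reading is passed in as a parameter rather
than read from the kernel. The namespace view of the monotonic and boottime
clocks is the host value plus a fixed offset, while the real-time clock is
passed through unchanged.

```c
#include <assert.h>
#include <time.h>

#define SK_NSEC_PER_SEC 1000000000L

/* Hypothetical clock ids for this sketch (not the kernel's clockid_t values). */
enum sketch_clock { SK_CLOCK_REALTIME, SK_CLOCK_MONOTONIC, SK_CLOCK_BOOTTIME };

/* Hypothetical per-namespace offsets; the real-time clock has none. */
struct sketch_timens_offsets {
	struct timespec monotonic;
	struct timespec boottime;
};

/*
 * Add a normalized offset to a normalized timespec.
 * "Normalized" means 0 <= tv_nsec < 1e9; a negative offset is
 * expressed with a negative tv_sec and a non-negative tv_nsec.
 */
static struct timespec sketch_timespec_add(struct timespec ts, struct timespec off)
{
	struct timespec res;

	assert(ts.tv_nsec >= 0 && ts.tv_nsec < SK_NSEC_PER_SEC);
	assert(off.tv_nsec >= 0 && off.tv_nsec < SK_NSEC_PER_SEC);

	res.tv_sec = ts.tv_sec + off.tv_sec;
	res.tv_nsec = ts.tv_nsec + off.tv_nsec;
	if (res.tv_nsec >= SK_NSEC_PER_SEC) {
		res.tv_nsec -= SK_NSEC_PER_SEC;
		res.tv_sec++;
	}
	return res;
}

/* Translate a host clock reading into the namespace's view of that clock. */
static struct timespec sketch_ns_view(const struct sketch_timens_offsets *ns,
				      enum sketch_clock id, struct timespec host)
{
	switch (id) {
	case SK_CLOCK_MONOTONIC:
		return sketch_timespec_add(host, ns->monotonic);
	case SK_CLOCK_BOOTTIME:
		return sketch_timespec_add(host, ns->boottime);
	default:
		/* The container keeps using the host real-time clock. */
		return host;
	}
}
```

With a monotonic offset of 100.6s, a host reading of 10.5s is seen as 111.1s
inside the namespace, and the real-time reading is returned as is.
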
>
> Eric and Tomas, what do you think about this? If you agree that these
Sorry Thomas, I mistyped your name.
> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
>
> I know that we need to:
>
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
>
> Anything else?
>
> Thanks,
> Andrei
>
> >
> > A pure 64bit nanosecond counter is good for 500 years. So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> >
> > This suggests that for ticks we can have two values:
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> >
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
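
For illustration only, the two-value scheme above (raw ticks plus a rollover
count) can be sketched as follows. This is a minimal sketch, assuming a
hypothetical 32-bit hardware counter and assuming it is sampled at least
once per wrap-around period; the names are made up for the example and this
is not how the kernel timekeeping code is written.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for extending a wrapping 32-bit counter to 64 bits. */
struct sketch_tick64 {
	uint32_t last;      /* last raw value read from the time source */
	uint64_t rollovers; /* number of times the raw counter has wrapped */
};

/*
 * Feed in a fresh raw reading and get back the extended 64-bit tick count.
 * A reading smaller than the previous one is taken to mean exactly one
 * wrap; this only holds if the counter is sampled at least once per
 * wrap-around period.
 */
static uint64_t sketch_tick64_update(struct sketch_tick64 *t, uint32_t raw)
{
	if (raw < t->last)
		t->rollovers++;
	t->last = raw;

	/* The rollover count must still fit in the upper 32 bits. */
	assert(t->rollovers <= UINT32_MAX);
	return (t->rollovers << 32) | raw;
}
```

Starting from a zeroed state, a reading of 0xFFFFFFF0 followed by a reading
of 0x10 yields 0x100000010 for the second call: one rollover was counted.
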
> >
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited. Just deal with the accounting needed to cope with
> > tick rollover.
> >
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over. With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> >
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then. If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
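
For illustration only, the "arm optimistically, recheck on expiry" approach
described above can be sketched like this. It is a minimal sketch under
stated assumptions: the clock change is modeled as a simple per-namespace
offset in nanoseconds (a stand-in for any change of time parameters), and
the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical timer state for this sketch. */
struct sketch_timer {
	uint64_t target_ns;    /* requested expiry on the namespace's clock */
	uint64_t hw_expiry_ns; /* when the underlying host timer is armed */
};

/* Arm the timer assuming the current offset stays unchanged. */
static void sketch_timer_arm(struct sketch_timer *t, uint64_t target_ns,
			     int64_t offset_ns)
{
	t->target_ns = target_ns;
	t->hw_expiry_ns = target_ns - (uint64_t)offset_ns;
}

/*
 * Called when the underlying host timer fires.
 * Returns 1 if the timer has really expired on the namespace's clock,
 * or 0 if the clock was changed in the meantime and the timer was
 * rearmed for the remaining time.
 */
static int sketch_timer_fire(struct sketch_timer *t, uint64_t host_now_ns,
			     int64_t offset_ns)
{
	uint64_t ns_now = host_now_ns + (uint64_t)offset_ns;

	/* The host timer must not fire before it was armed to. */
	assert(host_now_ns >= t->hw_expiry_ns);

	if (ns_now >= t->target_ns)
		return 1;

	/* Going off early: reschedule using the current offset. */
	t->hw_expiry_ns = t->target_ns - (uint64_t)offset_ns;
	return 0;
}
```

If a timer for namespace time 1000 is armed with offset 100 and the offset
is later reduced to 50, the first expiry at host time 900 is early and the
timer is rearmed for host time 950, where it is delivered.
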
> >
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> >
> > Not that I think a final implementation would necessarily look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lengths of second.
> >
> > Thomas, does it sound like I am completely out of touch with reality?
> >
> > It does though sound like it is going to take some serious digging
> > through the code to understand what everything does and how and why
> > everything works the way it does. Not something grafted on top with just
> > a cursory understanding of how the code works.
> >
> > Eric