Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Florian Weimer @ 2019-06-14 10:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <802638054.3032.1560506584705.JavaMail.zimbra@efficios.com>

* Mathieu Desnoyers:

> ----- On Jun 12, 2019, at 4:00 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
>
>> ----- On Jun 10, 2019, at 4:43 PM, carlos carlos@redhat.com wrote:
>> 
>>> On 6/6/19 7:57 AM, Florian Weimer wrote:
>>>> Let me ask the key question again: Does it matter if code observes the
>>>> rseq area first without kernel support, and then with kernel support?
>>>> If we don't expect any problems immediately, we do not need to worry
>>>> much about the constructor ordering right now.  I expect that over time,
>>>> fixing this properly will become easier.
>>> 
>>> I just wanted to chime in and say that splitting this into:
>>> 
>>> * Ownership (__rseq_handled)
>>> 
>>> * Initialization (__rseq_abi)
>>> 
>>> Makes sense to me.
>>> 
>>> I agree we need an answer to this question of ownership but not yet
>>> initialized, to owned and initialized.
>>> 
>>> I like the idea of having __rseq_handled in ld.so.
>> 
>> Very good, so I'll implement this approach. Sorry for the delayed
>> feedback, I am traveling this week.
>
> I had issues with cases where application or LD_PRELOAD library also
> define the __rseq_handled symbol. They appear not to see the same
> address as the one initialized by ld.so.

What exactly did you do?  How did you determine the addresses?  How is
__rseq_handled defined in ld.so?

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Mathieu Desnoyers @ 2019-06-14 10:03 UTC (permalink / raw)
  To: carlos, Florian Weimer
  Cc: Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <914051741.43025.1560348011775.JavaMail.zimbra@efficios.com>

----- On Jun 12, 2019, at 4:00 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Jun 10, 2019, at 4:43 PM, carlos carlos@redhat.com wrote:
> 
>> On 6/6/19 7:57 AM, Florian Weimer wrote:
>>> Let me ask the key question again: Does it matter if code observes the
>>> rseq area first without kernel support, and then with kernel support?
>>> If we don't expect any problems immediately, we do not need to worry
>>> much about the constructor ordering right now.  I expect that over time,
>>> fixing this properly will become easier.
>> 
>> I just wanted to chime in and say that splitting this into:
>> 
>> * Ownership (__rseq_handled)
>> 
>> * Initialization (__rseq_abi)
>> 
>> Makes sense to me.
>> 
>> I agree we need an answer to this question of ownership but not yet
>> initialized, to owned and initialized.
>> 
>> I like the idea of having __rseq_handled in ld.so.
> 
> Very good, so I'll implement this approach. Sorry for the delayed
> feedback, I am traveling this week.

I had issues with cases where application or LD_PRELOAD library also
define the __rseq_handled symbol. They appear not to see the same
address as the one initialized by ld.so.

I tried using the GL() macro in ld.so to set __rseq_handled, but it's
the wrong address compared to what the preload lib and application observe.

Any thoughts on how to solve this ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: Regression for MS_MOVE on kernel v5.1
From: Christian Brauner @ 2019-06-13 23:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linus Torvalds, Al Viro,
	Linux List Kernel Mailing, linux-fsdevel, Linux API,
	David Howells
In-Reply-To: <878su5tadf.fsf@xmission.com>

On Thu, Jun 13, 2019 at 04:59:24PM -0500, Eric W. Biederman wrote:
> Miklos Szeredi <miklos@szeredi.hu> writes:
> 
> > On Thu, Jun 13, 2019 at 8:35 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>
> >> Christian Brauner <christian@brauner.io> writes:
> >>
> >> > On Wed, Jun 12, 2019 at 06:00:39PM -1000, Linus Torvalds wrote:
> >> >> On Wed, Jun 12, 2019 at 12:54 PM Christian Brauner <christian@brauner.io> wrote:
> >> >> >
> >> >> > The commit changes the internal logic to lock mounts when propagating
> >> >> > mounts (user+)mount namespaces and - I believe - causes do_mount_move()
> >> >> > to fail at:
> >> >>
> >> >> You mean 'do_move_mount()'.
> >> >>
> >> >> > if (old->mnt.mnt_flags & MNT_LOCKED)
> >> >> >         goto out;
> >> >> >
> >> >> > If that's indeed the case we should either revert this commit (reverts
> >> >> > cleanly, just tested it) or find a fix.
> >> >>
> >> >> Hmm.. I'm not entirely sure of the logic here, and just looking at
> >> >> that commit 3bd045cc9c4b ("separate copying and locking mount tree on
> >> >> cross-userns copies") doesn't make me go "Ahh" either.
> >> >>
> >> >> Al? My gut feel is that we need to just revert, since this was in 5.1
> >> >> and it's getting reasonably late in 5.2 too. But maybe you go "guys,
> >> >> don't be silly, this is easily fixed with this one-liner".
> >> >
> >> > David and I have been staring at that code today for a while together.
> >> > I think I made some sense of it.
> >> > One thing we weren't absolutely sure is if the old MS_MOVE behavior was
> >> > intentional or a bug. If it is a bug we have a problem since we quite
> >> > heavily rely on this...
> >>
> >> It was intentional.
> >>
> >> The only mounts that are locked in propagation are the mounts that
> >> propagate together.  If you see the mounts come in as individuals you
> >> can always see/manipulate/work with the underlying mount.
> >>
> >> I can think of only a few ways for MNT_LOCKED to become set:
> >> a) unshare(CLONE_NEWNS)
> >> b) mount --rclone /path/to/mnt/tree /path/to/propagation/point
> >> c) mount --move /path/to/mnt/tree /path/to/propgation/point
> >>
> >> Nothing in the target namespace should be locked on the propgation point
> >> but all of the new mounts that came across as a unit should be locked
> >> together.
> >
> > Locked together means the root of the new mount tree doesn't have
> > MNT_LOCKED set, but all mounts below do have MNT_LOCKED, right?
> >
> > Isn't the bug here that the root mount gets MNT_LOCKED as well?

Yes, we suspected this as well. We just couldn't pinpoint where the
surgery would need to start.

> 
> Yes, and the code to remove MNT_LOCKED is still sitting there in
> propogate_one right after it calls copy_tree.  It should be a trivial
> matter of moving that change to after the lock_mnt_tree call.
> 
> Now that I have been elightened about anonymous mount namespaces
> I am suspecting that we want to take the user_namespace of the anonymous
> namespace into account when deciding to lock the mounts.
> 
> >> Then it breaking is definitely a regression that needs to be fixed.
> >>
> >> I believe the problematic change as made because the new mount
> >> api allows attaching floating mounts.  Or that was the plan last I
> >> looked.   Those floating mounts don't have a mnt_ns so will result
> >> in a NULL pointer dereference when they are attached.
> >
> > Well, it's called anonymous namespace.  So there *is* an mnt_ns, and
> > its lifetime is bound to the file returned by fsmount().
> 
> Interesting.  That has changed since I last saw the patches.
> 
> Below is what will probably be a straight forward fix for the regression.

Tested the patch just now applied on top of v5.1. It fixes the
regression.
Can you please send a proper patch, Eric?

Tested-by: Christian Brauner <christian@brauner.io>
Acked-by: Christian Brauner <christian@brauner.io>

> 
> Eric
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ffb13f0562b0..a39edeecbc46 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2105,6 +2105,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>                 /* Notice when we are propagating across user namespaces */
>                 if (child->mnt_parent->mnt_ns->user_ns != user_ns)
>                         lock_mnt_tree(child);
> +               child->mnt.mnt_flags &= ~MNT_LOCKED;
>                 commit_tree(child);
>         }
>         put_mountpoint(smp);
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 7ea6cfb65077..012be405fec0 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -262,7 +262,6 @@ static int propagate_one(struct mount *m)
>         child = copy_tree(last_source, last_source->mnt.mnt_root, type);
>         if (IS_ERR(child))
>                 return PTR_ERR(child);
> -       child->mnt.mnt_flags &= ~MNT_LOCKED;
>         mnt_set_mountpoint(m, mp, child);
>         last_dest = m;
>         last_source = child;
> 
> 

^ permalink raw reply

* Re: Regression for MS_MOVE on kernel v5.1
From: Eric W. Biederman @ 2019-06-13 21:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Christian Brauner, Linus Torvalds, Al Viro,
	Linux List Kernel Mailing, linux-fsdevel, Linux API,
	David Howells
In-Reply-To: <CAJfpegvZwDY+zoWjDTrPpMCS01rzQgeE-_z-QtGfvcRnoamzgg@mail.gmail.com>

Miklos Szeredi <miklos@szeredi.hu> writes:

> On Thu, Jun 13, 2019 at 8:35 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Christian Brauner <christian@brauner.io> writes:
>>
>> > On Wed, Jun 12, 2019 at 06:00:39PM -1000, Linus Torvalds wrote:
>> >> On Wed, Jun 12, 2019 at 12:54 PM Christian Brauner <christian@brauner.io> wrote:
>> >> >
>> >> > The commit changes the internal logic to lock mounts when propagating
>> >> > mounts (user+)mount namespaces and - I believe - causes do_mount_move()
>> >> > to fail at:
>> >>
>> >> You mean 'do_move_mount()'.
>> >>
>> >> > if (old->mnt.mnt_flags & MNT_LOCKED)
>> >> >         goto out;
>> >> >
>> >> > If that's indeed the case we should either revert this commit (reverts
>> >> > cleanly, just tested it) or find a fix.
>> >>
>> >> Hmm.. I'm not entirely sure of the logic here, and just looking at
>> >> that commit 3bd045cc9c4b ("separate copying and locking mount tree on
>> >> cross-userns copies") doesn't make me go "Ahh" either.
>> >>
>> >> Al? My gut feel is that we need to just revert, since this was in 5.1
>> >> and it's getting reasonably late in 5.2 too. But maybe you go "guys,
>> >> don't be silly, this is easily fixed with this one-liner".
>> >
>> > David and I have been staring at that code today for a while together.
>> > I think I made some sense of it.
>> > One thing we weren't absolutely sure is if the old MS_MOVE behavior was
>> > intentional or a bug. If it is a bug we have a problem since we quite
>> > heavily rely on this...
>>
>> It was intentional.
>>
>> The only mounts that are locked in propagation are the mounts that
>> propagate together.  If you see the mounts come in as individuals you
>> can always see/manipulate/work with the underlying mount.
>>
>> I can think of only a few ways for MNT_LOCKED to become set:
>> a) unshare(CLONE_NEWNS)
>> b) mount --rclone /path/to/mnt/tree /path/to/propagation/point
>> c) mount --move /path/to/mnt/tree /path/to/propgation/point
>>
>> Nothing in the target namespace should be locked on the propgation point
>> but all of the new mounts that came across as a unit should be locked
>> together.
>
> Locked together means the root of the new mount tree doesn't have
> MNT_LOCKED set, but all mounts below do have MNT_LOCKED, right?
>
> Isn't the bug here that the root mount gets MNT_LOCKED as well?

Yes, and the code to remove MNT_LOCKED is still sitting there in
propogate_one right after it calls copy_tree.  It should be a trivial
matter of moving that change to after the lock_mnt_tree call.

Now that I have been elightened about anonymous mount namespaces
I am suspecting that we want to take the user_namespace of the anonymous
namespace into account when deciding to lock the mounts.

>> Then it breaking is definitely a regression that needs to be fixed.
>>
>> I believe the problematic change as made because the new mount
>> api allows attaching floating mounts.  Or that was the plan last I
>> looked.   Those floating mounts don't have a mnt_ns so will result
>> in a NULL pointer dereference when they are attached.
>
> Well, it's called anonymous namespace.  So there *is* an mnt_ns, and
> its lifetime is bound to the file returned by fsmount().

Interesting.  That has changed since I last saw the patches.

Below is what will probably be a straight forward fix for the regression.

Eric

diff --git a/fs/namespace.c b/fs/namespace.c
index ffb13f0562b0..a39edeecbc46 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2105,6 +2105,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
                /* Notice when we are propagating across user namespaces */
                if (child->mnt_parent->mnt_ns->user_ns != user_ns)
                        lock_mnt_tree(child);
+               child->mnt.mnt_flags &= ~MNT_LOCKED;
                commit_tree(child);
        }
        put_mountpoint(smp);
diff --git a/fs/pnode.c b/fs/pnode.c
index 7ea6cfb65077..012be405fec0 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -262,7 +262,6 @@ static int propagate_one(struct mount *m)
        child = copy_tree(last_source, last_source->mnt.mnt_root, type);
        if (IS_ERR(child))
                return PTR_ERR(child);
-       child->mnt.mnt_flags &= ~MNT_LOCKED;
        mnt_set_mountpoint(m, mp, child);
        last_dest = m;
        last_source = child;

^ permalink raw reply related

* Re: Regression for MS_MOVE on kernel v5.1
From: Miklos Szeredi @ 2019-06-13 20:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Linus Torvalds, Al Viro,
	Linux List Kernel Mailing, linux-fsdevel, Linux API,
	David Howells
In-Reply-To: <871rzxwcz7.fsf@xmission.com>

On Thu, Jun 13, 2019 at 8:35 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Christian Brauner <christian@brauner.io> writes:
>
> > On Wed, Jun 12, 2019 at 06:00:39PM -1000, Linus Torvalds wrote:
> >> On Wed, Jun 12, 2019 at 12:54 PM Christian Brauner <christian@brauner.io> wrote:
> >> >
> >> > The commit changes the internal logic to lock mounts when propagating
> >> > mounts (user+)mount namespaces and - I believe - causes do_mount_move()
> >> > to fail at:
> >>
> >> You mean 'do_move_mount()'.
> >>
> >> > if (old->mnt.mnt_flags & MNT_LOCKED)
> >> >         goto out;
> >> >
> >> > If that's indeed the case we should either revert this commit (reverts
> >> > cleanly, just tested it) or find a fix.
> >>
> >> Hmm.. I'm not entirely sure of the logic here, and just looking at
> >> that commit 3bd045cc9c4b ("separate copying and locking mount tree on
> >> cross-userns copies") doesn't make me go "Ahh" either.
> >>
> >> Al? My gut feel is that we need to just revert, since this was in 5.1
> >> and it's getting reasonably late in 5.2 too. But maybe you go "guys,
> >> don't be silly, this is easily fixed with this one-liner".
> >
> > David and I have been staring at that code today for a while together.
> > I think I made some sense of it.
> > One thing we weren't absolutely sure is if the old MS_MOVE behavior was
> > intentional or a bug. If it is a bug we have a problem since we quite
> > heavily rely on this...
>
> It was intentional.
>
> The only mounts that are locked in propagation are the mounts that
> propagate together.  If you see the mounts come in as individuals you
> can always see/manipulate/work with the underlying mount.
>
> I can think of only a few ways for MNT_LOCKED to become set:
> a) unshare(CLONE_NEWNS)
> b) mount --rclone /path/to/mnt/tree /path/to/propagation/point
> c) mount --move /path/to/mnt/tree /path/to/propgation/point
>
> Nothing in the target namespace should be locked on the propgation point
> but all of the new mounts that came across as a unit should be locked
> together.

Locked together means the root of the new mount tree doesn't have
MNT_LOCKED set, but all mounts below do have MNT_LOCKED, right?

Isn't the bug here that the root mount gets MNT_LOCKED as well?

>
> Then it breaking is definitely a regression that needs to be fixed.
>
> I believe the problematic change as made because the new mount
> api allows attaching floating mounts.  Or that was the plan last I
> looked.   Those floating mounts don't have a mnt_ns so will result
> in a NULL pointer dereference when they are attached.

Well, it's called anonymous namespace.  So there *is* an mnt_ns, and
its lifetime is bound to the file returned by fsmount().

Thanks,
Miklos

^ permalink raw reply

* Re: Regression for MS_MOVE on kernel v5.1
From: Eric W. Biederman @ 2019-06-13 18:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Al Viro, Linux List Kernel Mailing, linux-fsdevel,
	Linux API, David Howells
In-Reply-To: <20190613132250.u65yawzvf4voifea@brauner.io>

Christian Brauner <christian@brauner.io> writes:

> On Wed, Jun 12, 2019 at 06:00:39PM -1000, Linus Torvalds wrote:
>> On Wed, Jun 12, 2019 at 12:54 PM Christian Brauner <christian@brauner.io> wrote:
>> >
>> > The commit changes the internal logic to lock mounts when propagating
>> > mounts (user+)mount namespaces and - I believe - causes do_mount_move()
>> > to fail at:
>> 
>> You mean 'do_move_mount()'.
>> 
>> > if (old->mnt.mnt_flags & MNT_LOCKED)
>> >         goto out;
>> >
>> > If that's indeed the case we should either revert this commit (reverts
>> > cleanly, just tested it) or find a fix.
>> 
>> Hmm.. I'm not entirely sure of the logic here, and just looking at
>> that commit 3bd045cc9c4b ("separate copying and locking mount tree on
>> cross-userns copies") doesn't make me go "Ahh" either.
>> 
>> Al? My gut feel is that we need to just revert, since this was in 5.1
>> and it's getting reasonably late in 5.2 too. But maybe you go "guys,
>> don't be silly, this is easily fixed with this one-liner".
>
> David and I have been staring at that code today for a while together.
> I think I made some sense of it.
> One thing we weren't absolutely sure is if the old MS_MOVE behavior was
> intentional or a bug. If it is a bug we have a problem since we quite
> heavily rely on this...

It was intentional.

The only mounts that are locked in propagation are the mounts that
propagate together.  If you see the mounts come in as individuals you
can always see/manipulate/work with the underlying mount.

I can think of only a few ways for MNT_LOCKED to become set:
a) unshare(CLONE_NEWNS)
b) mount --rclone /path/to/mnt/tree /path/to/propagation/point
c) mount --move /path/to/mnt/tree /path/to/propgation/point

Nothing in the target namespace should be locked on the propgation point
but all of the new mounts that came across as a unit should be locked
together.

> So this whole cross-user+mnt namespace propagation mechanism comes with
> a big hammer that Eric indeed did introduce a while back which is
> MNT_LOCKED (cf. [1] for the relevant commit).
>
> Afaict, MNT_LOCKED is (among other cases) supposed to prevent a user+mnt
> namespace pair to get access to a mount that is hidden underneath an
> additional mount. Consider the following scenario:
>
> sudo mount -t tmpfs tmpfs /mnt
> sudo mount --make-rshared /mnt
> sudo mount -t tmpfs tmpfs /mnt
> sudo mount --make-rshared /mnt
> unshare -U -m --map-root --propagation=unchanged
>
> umount /mnt
> # or
> mount --move -mnt /opt
>
> The last umount/MS_MOVE is supposed to fail since the mount is locked
> with MNT_LOCKED since umounting or MS_MOVing the mount would reveal the
> underlying mount which I didn't have access to prior to the creation of
> my user+mnt namespace pair.
> (Whether or not this is a reasonable security mechanism is a separate
> discussion.)
>
> But now consider the case where from the ancestor user+mnt namespace
> pair I do:
>
> # propagate the mount to the user+mount namespace pair                 
> sudo mount -t tmpfs tmpfs /mnt
> # switch to the child user+mnt namespace pair
> umount /mnt
> # or
> mount --move /mnt /opt
>
> That umount/MS_MOVE should work since that mount was propagated to the
> unprivileged task after the user+mnt namespace pair was created.
> Also, because I already had access to the underlying mount in the first
> place and second because this is literally the only way - we know of -
> to inject a mount cross mount namespaces and this is a must have feature
> that quite a lot of users rely on.

Then it breaking is definitely a regression that needs to be fixed.

I believe the problematic change as made because the new mount
api allows attaching floating mounts.  Or that was the plan last I
looked.   Those floating mounts don't have a mnt_ns so will result
in a NULL pointer dereference when they are attached.

So I suspect fixing this is not as simple as reverting a single patch.

Eric

^ permalink raw reply

* Re: [PATCH 02/13] uapi: General notification ring definitions [ver #4]
From: Randy Dunlap @ 2019-06-13 14:49 UTC (permalink / raw)
  To: David Howells
  Cc: Darrick J. Wong, viro, raven, linux-fsdevel, linux-api,
	linux-block, keyrings, linux-security-module, linux-kernel
In-Reply-To: <30226.1560432885@warthog.procyon.org.uk>

On 6/13/19 6:34 AM, David Howells wrote:
> Randy Dunlap <rdunlap@infradead.org> wrote:
> 
>> What is the problem with inline functions in UAPI headers?
> 
> It makes compiler problems more likely; it increases the potential for name
> collisions with userspace; it makes for more potential problems if the headers
> are imported into some other language; and it's not easy to fix a bug in one
> if userspace uses it, just in case fixing the bug breaks userspace.
> 
> Further, in this case, the first of Darrick's functions (calculating the
> length) is probably reasonable, but the second is not.  It should crank the
> tail pointer and then use that, but that requires 
> 
>>>> Also, weird multiline comment style.
>>>
>>> Not really.
>>
>> Yes really.
> 
> No.  It's not weird.  If anything, the default style is less good for several
> reasons.  I'm going to deal with this separately as I need to generate some
> stats first.
> 
> David
> 

OK, maybe you are objecting to the word "weird."  So the multi-line comment style
that you used is not the preferred Linux kernel multi-line comment style
(except in networking code) [Documentation/process/coding-style.rst] that has been
in effect for 20+ years AFAIK.


-- 
~Randy

^ permalink raw reply

* Re: [PATCH 02/13] uapi: General notification ring definitions [ver #4]
From: David Howells @ 2019-06-13 13:34 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: dhowells, Darrick J. Wong, viro, raven, linux-fsdevel, linux-api,
	linux-block, keyrings, linux-security-module, linux-kernel
In-Reply-To: <6b6f5bb0-1426-239b-ac9f-281e31ddcd04@infradead.org>

Randy Dunlap <rdunlap@infradead.org> wrote:

> What is the problem with inline functions in UAPI headers?

It makes compiler problems more likely; it increases the potential for name
collisions with userspace; it makes for more potential problems if the headers
are imported into some other language; and it's not easy to fix a bug in one
if userspace uses it, just in case fixing the bug breaks userspace.

Further, in this case, the first of Darrick's functions (calculating the
length) is probably reasonable, but the second is not.  It should crank the
tail pointer and then use that, but that requires 

> >> Also, weird multiline comment style.
> > 
> > Not really.
> 
> Yes really.

No.  It's not weird.  If anything, the default style is less good for several
reasons.  I'm going to deal with this separately as I need to generate some
stats first.

David

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Dave Martin @ 2019-06-13 13:26 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Florian Weimer, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <b8fb6626a6ae415fac4d5daa86225e4c68d56673.camel@intel.com>

On Wed, Jun 12, 2019 at 12:04:01PM -0700, Yu-cheng Yu wrote:
> On Wed, 2019-06-12 at 10:32 +0100, Dave Martin wrote:
> > On Tue, Jun 11, 2019 at 12:31:34PM -0700, Yu-cheng Yu wrote:
> > > On Tue, 2019-06-11 at 12:41 +0100, Dave Martin wrote:
> > > > On Mon, Jun 10, 2019 at 07:24:43PM +0200, Florian Weimer wrote:
> > > > > * Yu-cheng Yu:
> > > > > 
> > > > > > To me, looking at PT_GNU_PROPERTY and not trying to support anything
> > > > > > is a
> > > > > > logical choice.  And it breaks only a limited set of toolchains.
> > > > > > 
> > > > > > I will simplify the parser and leave this patch as-is for anyone who
> > > > > > wants
> > > > > > to
> > > > > > back-port.  Are there any objections or concerns?
> > > > > 
> > > > > Red Hat Enterprise Linux 8 does not use PT_GNU_PROPERTY and is probably
> > > > > the largest collection of CET-enabled binaries that exists today.
> > > > 
> > > > For clarity, RHEL is actively parsing these properties today?
> > > > 
> > > > > My hope was that we would backport the upstream kernel patches for CET,
> > > > > port the glibc dynamic loader to the new kernel interface, and be ready
> > > > > to run with CET enabled in principle (except that porting userspace
> > > > > libraries such as OpenSSL has not really started upstream, so many
> > > > > processes where CET is particularly desirable will still run without
> > > > > it).
> > > > > 
> > > > > I'm not sure if it is a good idea to port the legacy support if it's not
> > > > > part of the mainline kernel because it comes awfully close to creating
> > > > > our own private ABI.
> > > > 
> > > > I guess we can aim to factor things so that PT_NOTE scanning is
> > > > available as a fallback on arches for which the absence of
> > > > PT_GNU_PROPERTY is not authoritative.
> > > 
> > > We can probably check PT_GNU_PROPERTY first, and fallback (based on ld-linux
> > > version?) to PT_NOTE scanning?
> > 
> > For arm64, we can check for PT_GNU_PROPERTY and then give up
> > unconditionally.
> > 
> > For x86, we would fall back to PT_NOTE scanning, but this will add a bit
> > of cost to binaries that don't have NT_GNU_PROPERTY_TYPE_0.  The ld.so
> > version doesn't tell you what ELF ABI a given executable conforms to.
> > 
> > Since this sounds like it's largely a distro-specific issue, maybe there
> > could be a Kconfig option to turn the fallback PT_NOTE scanning on?
> 
> Yes, I will make it a Kconfig option.

OK, that works for me.  This would also help keep the PT_NOTE scanning
separate from the rest of the code.

For arm64 we could then unconditionally select/deselect that option,
where x86 could leave it configurable either way.

Cheers
---Dave

^ permalink raw reply

* Re: Regression for MS_MOVE on kernel v5.1
From: Christian Brauner @ 2019-06-13 13:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux List Kernel Mailing, linux-fsdevel, Linux API,
	David Howells, Eric W. Biederman
In-Reply-To: <CAHk-=wh2Khe1Lj-Pdu3o2cXxumL1hegg_1JZGJXki6cchg_Q2Q@mail.gmail.com>

On Wed, Jun 12, 2019 at 06:00:39PM -1000, Linus Torvalds wrote:
> On Wed, Jun 12, 2019 at 12:54 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > The commit changes the internal logic to lock mounts when propagating
> > mounts (user+)mount namespaces and - I believe - causes do_mount_move()
> > to fail at:
> 
> You mean 'do_move_mount()'.
> 
> > if (old->mnt.mnt_flags & MNT_LOCKED)
> >         goto out;
> >
> > If that's indeed the case we should either revert this commit (reverts
> > cleanly, just tested it) or find a fix.
> 
> Hmm.. I'm not entirely sure of the logic here, and just looking at
> that commit 3bd045cc9c4b ("separate copying and locking mount tree on
> cross-userns copies") doesn't make me go "Ahh" either.
> 
> Al? My gut feel is that we need to just revert, since this was in 5.1
> and it's getting reasonably late in 5.2 too. But maybe you go "guys,
> don't be silly, this is easily fixed with this one-liner".

David and I have been staring at that code today for a while together.
I think I made some sense of it.
One thing we weren't absolutely sure is if the old MS_MOVE behavior was
intentional or a bug. If it is a bug we have a problem since we quite
heavily rely on this...

So this whole cross-user+mnt namespace propagation mechanism comes with
a big hammer that Eric indeed did introduce a while back which is
MNT_LOCKED (cf. [1] for the relevant commit).

Afaict, MNT_LOCKED is (among other cases) supposed to prevent a user+mnt
namespace pair to get access to a mount that is hidden underneath an
additional mount. Consider the following scenario:

sudo mount -t tmpfs tmpfs /mnt
sudo mount --make-rshared /mnt
sudo mount -t tmpfs tmpfs /mnt
sudo mount --make-rshared /mnt
unshare -U -m --map-root --propagation=unchanged

umount /mnt
# or
mount --move -mnt /opt

The last umount/MS_MOVE is supposed to fail since the mount is locked
with MNT_LOCKED since umounting or MS_MOVing the mount would reveal the
underlying mount which I didn't have access to prior to the creation of
my user+mnt namespace pair.
(Whether or not this is a reasonable security mechanism is a separate
discussion.)

But now consider the case where from the ancestor user+mnt namespace
pair I do:

# propagate the mount to the user+mount namespace pair                 
sudo mount -t tmpfs tmpfs /mnt
# switch to the child user+mnt namespace pair
umount /mnt
# or
mount --move /mnt /opt

That umount/MS_MOVE should work since that mount was propagated to the
unprivileged task after the user+mnt namespace pair was created.
Also, because I already had access to the underlying mount in the first
place and second because this is literally the only way - we know of -
to inject a mount cross mount namespaces and this is a must have feature
that quite a lot of users rely on.

Christian

[1]: git show 5ff9d8a65ce80efb509ce4e8051394e9ed2cd942

^ permalink raw reply

* Re: Regression for MS_MOVE on kernel v5.1
From: David Howells @ 2019-06-13  9:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Eric W. Biederman, Christian Brauner, Al Viro,
	Linux List Kernel Mailing, linux-fsdevel, Linux API
In-Reply-To: <CAHk-=wh2Khe1Lj-Pdu3o2cXxumL1hegg_1JZGJXki6cchg_Q2Q@mail.gmail.com>

[Adding Eric to the cc list since he implemented MNT_LOCKED]

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > The commit changes the internal logic to lock mounts when propagating
> > mounts (user+)mount namespaces and - I believe - causes do_mount_move()
> > to fail at:
> 
> You mean 'do_move_mount()'.
> 
> > if (old->mnt.mnt_flags & MNT_LOCKED)
> >         goto out;
> >
> > If that's indeed the case we should either revert this commit (reverts
> > cleanly, just tested it) or find a fix.
> 
> Hmm.. I'm not entirely sure of the logic here, and just looking at
> that commit 3bd045cc9c4b ("separate copying and locking mount tree on
> cross-userns copies") doesn't make me go "Ahh" either.
> 
> Al? My gut feel is that we need to just revert, since this was in 5.1
> and it's getting reasonably late in 5.2 too. But maybe you go "guys,
> don't be silly, this is easily fixed with this one-liner".

^ permalink raw reply

* Re: [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Minchan Kim @ 2019-06-13  4:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, lizeb
In-Reply-To: <21cf2918-ba0e-aae1-a20e-36ee1ad4f704@intel.com>

On Mon, Jun 10, 2019 at 11:03:00AM -0700, Dave Hansen wrote:
> I'd really love to see the manpages for these new flags.  The devil is
> in the details of our promises to userspace.

I'm waiting comments from reviewers since I have fixed what they point
out from the previous version.

I will add manpage material in respin after the getting more feedback.
Thanks.

^ permalink raw reply

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-06-13  4:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton
In-Reply-To: <20190612172104.GA125771@google.com>

On Wed, Jun 12, 2019 at 01:21:04PM -0400, Joel Fernandes wrote:
> On Mon, Jun 10, 2019 at 07:09:04PM +0900, Minchan Kim wrote:
> > Hi Joel,
> > 
> > On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> > > On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range, it could
> > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > happens but data should be preserved for future use.  This could reduce
> > > > workingset eviction so it ends up increasing performance.
> > > > 
> > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > to be used in the near future. The hint can help kernel in deciding which
> > > > pages to evict early during memory pressure.
> > > > 
> > > > It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> > > > 
> > > > 	active file page -> inactive file LRU
> > > > 	active anon page -> inacdtive anon LRU
> > > > 
> > > > Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> > > > files's head because MADV_COLD is a little bit different symantic.
> > > > MADV_FREE means it's okay to discard when the memory pressure because
> > > > the content of the page is *garbage* so freeing such pages is almost zero
> > > > overhead since we don't need to swap out and access afterward causes just
> > > > minor fault. Thus, it would make sense to put those freeable pages in
> > > > inactive file LRU to compete other used-once pages. Even, it could
> > > > give a bonus to make them be reclaimed on swapless system. However,
> > > > MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> > > > in the end. So it's better to move inactive anon's LRU list, not file LRU.
> > > > Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> > > > if system doesn't have a swap device.
> > > > 
> > > > All of error rule is same with MADV_DONTNEED.
> > > > 
> > > > Note:
> > > > This hint works with only private pages(IOW, page_mapcount(page) < 2)
> > > > because shared page could have more chance to be accessed from other
> > > > processes sharing the page although the caller reset the reference bits.
> > > > It ends up preventing the reclaim of the page and wastes CPU cycle.
> > > > 
> > > > * RFCv2
> > > >  * add more description - mhocko
> > > > 
> > > > * RFCv1
> > > >  * renaming from MADV_COOL to MADV_COLD - hannes
> > > > 
> > > > * internal review
> > > >  * use clear_page_youn in deactivate_page - joelaf
> > > >  * Revise the description - surenb
> > > >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > > > 
> > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > ---
> > > >  include/linux/page-flags.h             |   1 +
> > > >  include/linux/page_idle.h              |  15 ++++
> > > >  include/linux/swap.h                   |   1 +
> > > >  include/uapi/asm-generic/mman-common.h |   1 +
> > > >  mm/internal.h                          |   2 +-
> > > >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> > > >  mm/oom_kill.c                          |   2 +-
> > > >  mm/swap.c                              |  43 +++++++++
> > > >  8 files changed, 176 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > > index 9f8712a4b1a5..58b06654c8dd 100644
> > > > --- a/include/linux/page-flags.h
> > > > +++ b/include/linux/page-flags.h
> > > > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> > > >  TESTPAGEFLAG(Young, young, PF_ANY)
> > > >  SETPAGEFLAG(Young, young, PF_ANY)
> > > >  TESTCLEARFLAG(Young, young, PF_ANY)
> > > > +CLEARPAGEFLAG(Young, young, PF_ANY)
> > > >  PAGEFLAG(Idle, idle, PF_ANY)
> > > >  #endif
> > > >  
> > > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > > index 1e894d34bdce..f3f43b317150 100644
> > > > --- a/include/linux/page_idle.h
> > > > +++ b/include/linux/page_idle.h
> > > > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> > > >  	SetPageYoung(page);
> > > >  }
> > > >  
> > > > +static inline void clear_page_young(struct page *page)
> > > > +{
> > > > +	ClearPageYoung(page);
> > > > +}
> > > > +
> > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > >  {
> > > >  	return TestClearPageYoung(page);
> > > > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> > > >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > >  }
> > > >  
> > > > +static void clear_page_young(struct page *page)
> > > > +{
> > > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > > +
> > > > +	if (unlikely(!page_ext))
> > > > +		return;
> > > > +
> > > > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > +}
> > > > +
> > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > >  {
> > > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index de2c67a33b7e..0ce997edb8bb 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> > > >  extern void lru_add_drain_all(void);
> > > >  extern void rotate_reclaimable_page(struct page *page);
> > > >  extern void deactivate_file_page(struct page *page);
> > > > +extern void deactivate_page(struct page *page);
> > > >  extern void mark_page_lazyfree(struct page *page);
> > > >  extern void swap_setup(void);
> > > >  
> > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > index bea0278f65ab..1190f4e7f7b9 100644
> > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > @@ -43,6 +43,7 @@
> > > >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > > >  #define MADV_WILLNEED	3		/* will need these pages */
> > > >  #define MADV_DONTNEED	4		/* don't need these pages */
> > > > +#define MADV_COLD	5		/* deactivatie these pages */
> > > >  
> > > >  /* common parameters: try to keep these consistent across architectures */
> > > >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 9eeaf2b95166..75a4f96ec0fb 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > > >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > > >  		unsigned long floor, unsigned long ceiling);
> > > >  
> > > > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > > > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> > > >  {
> > > >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> > > >  }
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 628022e674a7..ab158766858a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> > > >  	case MADV_REMOVE:
> > > >  	case MADV_WILLNEED:
> > > >  	case MADV_DONTNEED:
> > > > +	case MADV_COLD:
> > > >  	case MADV_FREE:
> > > >  		return 0;
> > > >  	default:
> > > > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > +				unsigned long end, struct mm_walk *walk)
> > > > +{
> > > > +	pte_t *orig_pte, *pte, ptent;
> > > > +	spinlock_t *ptl;
> > > > +	struct page *page;
> > > > +	struct vm_area_struct *vma = walk->vma;
> > > > +	unsigned long next;
> > > > +
> > > > +	next = pmd_addr_end(addr, end);
> > > > +	if (pmd_trans_huge(*pmd)) {
> > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > +		if (!ptl)
> > > > +			return 0;
> > > > +
> > > > +		if (is_huge_zero_pmd(*pmd))
> > > > +			goto huge_unlock;
> > > > +
> > > > +		page = pmd_page(*pmd);
> > > > +		if (page_mapcount(page) > 1)
> > > > +			goto huge_unlock;
> > > > +
> > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > +			int err;
> > > > +
> > > > +			get_page(page);
> > > > +			spin_unlock(ptl);
> > > > +			lock_page(page);
> > > > +			err = split_huge_page(page);
> > > > +			unlock_page(page);
> > > > +			put_page(page);
> > > > +			if (!err)
> > > > +				goto regular_page;
> > > > +			return 0;
> > > > +		}
> > > > +
> > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > +		deactivate_page(page);
> > > > +huge_unlock:
> > > > +		spin_unlock(ptl);
> > > > +		return 0;
> > > > +	}
> > > > +
> > > > +	if (pmd_trans_unstable(pmd))
> > > > +		return 0;
> > > > +
> > > > +regular_page:
> > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > +		ptent = *pte;
> > > > +
> > > > +		if (pte_none(ptent))
> > > > +			continue;
> > > > +
> > > > +		if (!pte_present(ptent))
> > > > +			continue;
> > > > +
> > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > +		if (!page)
> > > > +			continue;
> > > > +
> > > > +		if (page_mapcount(page) > 1)
> > > > +			continue;
> > > > +
> > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > 
> > > Wondering here how it interacts with idle page tracking. Here since young
> > > flag is cleared by the cold hint, page_referenced_one() or
> > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > if it was previously set since it does not know any more that a page was
> > > actively referenced.
> > 
> > ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
> > doesn't affect.

You said *young flag* in the comment, which made me confused. I thought you meant
PG_young flag but you mean PTE access bit.

> 
> Clearing of the young bit in the PTE does affect idle tracking.
> 
> Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> 
> > > bit was previously set, just so that page-idle tracking works smoothly when
> > > this hint is concurrently applied?
> > 
> > deactivate_page will remove PG_young bit so that the page will be reclaimed.
> > Do I miss your point?
> 
> Say a process had accessed PTE bit not set, then idle tracking is run and PG_Idle
> is set. Now the page is accessed from userspace thus setting the accessed PTE
> bit.  Now a remote process passes this process_madvise cold hint (I know your
> current series does not support remote process, but I am saying for future
> when you post this). Because you cleared the PTE accessed bit through the
> hint, idle tracking no longer will know that the page is referenced and the
> user gets confused because accessed page appears to be idle.

Right.

> 
> I think to fix this, what you should do is clear the PG_Idle flag if the
> young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> don't need to do anything.

I'm not sure. What does it make MADV_COLD special?
How about MADV_FREE|MADV_DONTNEED?
Why don't they clear PG_Idle if pte was young at tearing down pte? 
The page could be shared by other processes so if we miss to clear out
PG_idle in there, page idle tracking could miss the access history forever.

If it's not what you want, maybe we need to fix all places all at once.
However, I'm not sure. Rather than, I want to keep PG_idle in those hints
even though pte was accesssed because the process now gives strong hint
"The page is idle from now on". It's valid because he knows himself better than
others, even admin. IOW, he declare the page is not workingset any more.
What's the problem if page idle tracking feature miss it?
If other processs still have access bit of their pte for the page, page idle
tracking could find the page as non-idle so it's no problem, either.

^ permalink raw reply

* Re: Regression for MS_MOVE on kernel v5.1
From: Linus Torvalds @ 2019-06-13  4:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux List Kernel Mailing, linux-fsdevel, Linux API,
	David Howells
In-Reply-To: <20190612225431.p753mzqynxpsazb7@brauner.io>

On Wed, Jun 12, 2019 at 12:54 PM Christian Brauner <christian@brauner.io> wrote:
>
> The commit changes the internal logic to lock mounts when propagating
> mounts (user+)mount namespaces and - I believe - causes do_mount_move()
> to fail at:

You mean 'do_move_mount()'.

> if (old->mnt.mnt_flags & MNT_LOCKED)
>         goto out;
>
> If that's indeed the case we should either revert this commit (reverts
> cleanly, just tested it) or find a fix.

Hmm.. I'm not entirely sure of the logic here, and just looking at
that commit 3bd045cc9c4b ("separate copying and locking mount tree on
cross-userns copies") doesn't make me go "Ahh" either.

Al? My gut feel is that we need to just revert, since this was in 5.1
and it's getting reasonably late in 5.2 too. But maybe you go "guys,
don't be silly, this is easily fixed with this one-liner".

                      Linus

^ permalink raw reply

* Regression for MS_MOVE on kernel v5.1
From: Christian Brauner @ 2019-06-12 22:54 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, linux-fsdevel, linux-api, dhowells

Hey,

Sorry to be the bearer of bad news but I think I observed a pretty
gnarly regression for userspace with MS_MOVE from kernel v5.1 onwards.

When propagating mounts across mount namespaces owned by different user
namespaces it is not possible anymore to move the mount in the less
privileged mount namespace.
Here is a reproducer:

sudo mount -t tmpfs tmpfs /mnt
sudo --make-rshared /mnt

# create unprivileged user + mount namespace and preserve propagation
unshare -U -m --map-root --propagation=unchanged

# now change back to the original mount namespace in another terminal:
sudo mkdir /mnt/aaa
sudo mount -t tmpfs tmpfs /mnt/aaa

# now in the unprivileged user + mount namespace
mount --move /mnt/aaa /opt

This will work on kernels prior to 5.1 but will fail on kernels starting
with 5.1.
Unfortunately, this is a pretty big deal for userspace. In LXD - which I
maintain when not doing kernel stuff - we use this mechanism to inject
mounts into running unprivileged containers. Users started reporting
failures against our mount injection feature just a short while ago
(cf.  [1], [2]) and I just came around to looking into this today.

I tracked this down to commit:

commit 3bd045cc9c4be2049602b47505256b43908b4e2f
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed Jan 30 13:15:45 2019 -0500

    separate copying and locking mount tree on cross-userns copies

    Rather than having propagate_mnt() check doing unprivileged copies,
    lock them before commit_tree().

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

reverting it makes MS_MOVE to work correctly again.
The commit changes the internal logic to lock mounts when propagating
mounts (user+)mount namespaces and - I believe - causes do_mount_move()
to fail at:

if (old->mnt.mnt_flags & MNT_LOCKED)
        goto out;

If that's indeed the case we should either revert this commit (reverts
cleanly, just tested it) or find a fix.

Thanks!
Christian

[1]: https://github.com/lxc/lxd/issues/5788
[2]: https://github.com/lxc/lxd/issues/5836

^ permalink raw reply

* [PATCHv4 28/28] selftest/timens: Check that a right vdso is mapped after fork and exec
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrei Vagin, Dmitry Safonov, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Thomas Gleixner, Vincenzo Frascino,
	containers, criu, linux-api
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

From: Andrei Vagin <avagin@gmail.com>

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore |  1 +
 tools/testing/selftests/timens/Makefile   |  2 +-
 tools/testing/selftests/timens/exec.c     | 91 +++++++++++++++++++++++
 3 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/exec.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 16292e4d08a5..789f21e81028 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
 clock_nanosleep
+exec
 gettime_perf
 gettime_perf_cold
 procfs
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index ef65bf96b55c..9e0edf354906 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf exec
 
 uname_M := $(shell uname -m 2>/dev/null || echo not)
 ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
diff --git a/tools/testing/selftests/timens/exec.c b/tools/testing/selftests/timens/exec.c
new file mode 100644
index 000000000000..b3a05c41e202
--- /dev/null
+++ b/tools/testing/selftests/timens/exec.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+#include <string.h>
+
+#include "log.h"
+#include "timens.h"
+
+#define OFFSET (36000)
+
+int main(int argc, char *argv[])
+{
+	struct timespec now, tst;
+	int status, i;
+	pid_t pid;
+
+	if (argc > 1) {
+		if (sscanf(argv[1], "%ld", &now.tv_sec) != 1)
+			return pr_perror("sscanf");
+
+		for (i = 0; i < 2; i++) {
+			_gettime(CLOCK_MONOTONIC, &tst, i);
+			if (abs(tst.tv_sec - now.tv_sec) > 5)
+				return pr_fail("%ld %ld\n", now.tv_sec, tst.tv_sec);
+		}
+	}
+
+	nscheck();
+
+	clock_gettime(CLOCK_MONOTONIC, &now);
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("Can't unshare() timens");
+
+	if (_settime(CLOCK_MONOTONIC, OFFSET))
+		return 1;
+
+	for (i = 0; i < 2; i++) {
+		_gettime(CLOCK_MONOTONIC, &tst, i);
+		if (abs(tst.tv_sec - now.tv_sec) > 5)
+			return pr_fail("%ld %ld\n",
+					now.tv_sec, tst.tv_sec);
+	}
+
+	if (argc > 1)
+		return 0;
+
+	pid = fork();
+	if (pid < 0)
+		return pr_perror("fork");
+
+	if (pid == 0) {
+		char now_str[64];
+		char *cargv[] = {"exec", now_str, NULL};
+		char *cenv[] = {NULL};
+
+		/* Check that a child process is in the new timens. */
+		for (i = 0; i < 2; i++) {
+			_gettime(CLOCK_MONOTONIC, &tst, i);
+			if (abs(tst.tv_sec - now.tv_sec - OFFSET) > 5)
+				return pr_fail("%ld %ld\n",
+						now.tv_sec + OFFSET, tst.tv_sec);
+		}
+
+		/* Check that a proper vdso will be mapped after execve. */
+		snprintf(now_str, sizeof(now_str), "%ld", now.tv_sec + OFFSET);
+		execve("/proc/self/exe", cargv, cenv);
+		return pr_perror("execve");
+	}
+
+	if (waitpid(pid, &status, 0) != pid)
+		return pr_perror("waitpid");
+
+	if (status)
+		ksft_exit_fail();
+
+	ksft_test_result_pass("exec\n");
+	ksft_exit_pass();
+	return 0;
+}
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 27/28] selftests: Add a simple perf test for clock_gettime()
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrei Vagin, Dmitry Safonov, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Thomas Gleixner, Vincenzo Frascino,
	containers, criu, linux-api
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

From: Andrei Vagin <avagin@gmail.com>

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore     |  2 +
 tools/testing/selftests/timens/Makefile       |  8 +-
 tools/testing/selftests/timens/gettime_perf.c | 74 +++++++++++++++++++
 .../selftests/timens/gettime_perf_cold.c      | 63 ++++++++++++++++
 4 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/gettime_perf.c
 create mode 100644 tools/testing/selftests/timens/gettime_perf_cold.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 3b7eda8f35ce..16292e4d08a5 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,6 @@
 clock_nanosleep
+gettime_perf
+gettime_perf_cold
 procfs
 timens
 timer
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index ae1ffd24cc43..ef65bf96b55c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,10 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),x86_64)
+TEST_GEN_PROGS += gettime_perf_cold
+endif
 
 CFLAGS := -Wall -Werror
 LDFLAGS := -lrt
diff --git a/tools/testing/selftests/timens/gettime_perf.c b/tools/testing/selftests/timens/gettime_perf.c
new file mode 100644
index 000000000000..510d77a941d9
--- /dev/null
+++ b/tools/testing/selftests/timens/gettime_perf.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <time.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+#include "log.h"
+#include "timens.h"
+
+//#define TEST_SYSCALL
+
+static void test(clock_t clockid, char *clockstr, bool in_ns)
+{
+	struct timespec tp, start;
+	long i = 0;
+	const int timeout = 3;
+
+#ifndef TEST_SYSCALL
+	clock_gettime(clockid, &start);
+#else
+	syscall(__NR_clock_gettime, clockid, &start);
+#endif
+	tp = start;
+	for (tp = start; start.tv_sec + timeout > tp.tv_sec ||
+			 (start.tv_sec + timeout == tp.tv_sec &&
+			  start.tv_nsec > tp.tv_nsec); i++) {
+#ifndef TEST_SYSCALL
+		clock_gettime(clockid, &tp);
+#else
+		syscall(__NR_clock_gettime, clockid, &tp);
+#endif
+	}
+
+	ksft_test_result_pass("%s:\tclock: %10s\tcycles:\t%10ld\n",
+			      in_ns ? "ns" : "host", clockstr, i);
+}
+
+int main(int argc, char *argv[])
+{
+	time_t offset = 10;
+	int nsfd;
+
+	test(CLOCK_MONOTONIC, "monotonic", false);
+	test(CLOCK_BOOTTIME, "boottime", false);
+
+	nscheck();
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("Can't unshare() timens");
+
+	nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+	if (nsfd < 0)
+		return pr_perror("Can't open a time namespace");
+
+	if (_settime(CLOCK_MONOTONIC, offset))
+		return 1;
+	if (_settime(CLOCK_BOOTTIME, offset))
+		return 1;
+
+	if (setns(nsfd, CLONE_NEWTIME))
+		return pr_perror("setns");
+
+	test(CLOCK_MONOTONIC, "monotonic", true);
+	test(CLOCK_BOOTTIME, "boottime", true);
+
+	ksft_exit_pass();
+	return 0;
+}
diff --git a/tools/testing/selftests/timens/gettime_perf_cold.c b/tools/testing/selftests/timens/gettime_perf_cold.c
new file mode 100644
index 000000000000..f72db8a4c903
--- /dev/null
+++ b/tools/testing/selftests/timens/gettime_perf_cold.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <time.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <string.h>
+
+#include "log.h"
+#include "timens.h"
+
+static __inline__ unsigned long long rdtsc(void)
+{
+	unsigned hi, lo;
+
+	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
+	return ((unsigned long long) lo) | (((unsigned long long)hi) << 32);
+}
+
+static void test(clock_t clockid, char *clockstr)
+{
+	struct timespec tp;
+	long long s, e;
+
+	s = rdtsc();
+	clock_gettime(clockid, &tp);
+	e = rdtsc();
+	printf("%lld\n", e - s);
+	return;
+}
+
+int main(int argc, char **argv)
+{
+	time_t offset = 10;
+	int nsfd;
+
+	if (argc == 1) {
+		test(CLOCK_MONOTONIC, "monotonic");
+		return 0;
+	}
+	nscheck();
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("Can't unshare() timens");
+
+	nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+	if (nsfd < 0)
+		return pr_perror("Can't open a time namespace");
+
+	if (_settime(CLOCK_MONOTONIC, offset))
+		return 1;
+
+	if (setns(nsfd, CLONE_NEWTIME))
+		return pr_perror("setns");
+
+	test(CLOCK_MONOTONIC, "monotonic");
+	return 0;
+}
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 26/28] x86/vdso: Align VDSO functions by CPU L1 cache line
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrei Vagin, Dmitry Safonov, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Thomas Gleixner, Vincenzo Frascino,
	containers, criu, linux-api
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

From: Andrei Vagin <avagin@gmail.com>

After performance testing VDSO patches a noticeable 20% regression was
found on gettime_perf selftest with a cold cache.
As it turns to be, before time namespaces introduction, VDSO functions
were quite aligned to cache lines, but adding a new code to adjust
timens offset inside namespace created a small shift and vdso functions
become unaligned on cache lines.

Add align to vdso functions with gcc option to fix performance drop.

Coping the resulting numbers from cover letter:

Hot CPU cache (more gettime_perf.c cycles - the better):
        | before     | CONFIG_TIME_NS=n | host        | inside timens
--------|------------|------------------|-------------|-------------
cycles  | 139887013  | 139453003        | 139899785   | 128792458
diff (%)| 100        | 99.7             | 100         | 92

Cold cache (lesser tsc per gettime_perf_cold.c cycle - the better):
        | before     | CONFIG_TIME_NS=n | host        | inside timens
--------|------------|------------------|-------------|-------------
tsc     | 6748       | 6718             | 6862        | 12682
diff (%)| 100        | 99.6             | 101.7       | 188

Measured on Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 arch/x86/entry/vdso/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index b58d34120fd8..c7bfd62d1fc3 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -4,6 +4,7 @@
 #
 
 KBUILD_CFLAGS += $(DISABLE_LTO)
+KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_L1_CACHE_SHIFT)
 KASAN_SANITIZE			:= n
 UBSAN_SANITIZE			:= n
 OBJECT_FILES_NON_STANDARD	:= y
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 25/28] selftest/timens: Add timer offsets test
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrei Vagin, Dmitry Safonov, Adrian Reber, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov, Dmitry Safonov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
	x86
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

From: Andrei Vagin <avagin@openvz.org>

Check that timer_create() takes into account clock offsets.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   3 +-
 tools/testing/selftests/timens/timer.c    | 116 ++++++++++++++++++++++
 3 files changed, 119 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/timer.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 94ffdd9cead7..3b7eda8f35ce 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
 clock_nanosleep
 procfs
 timens
+timer
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index f96f50d1fef8..ae1ffd24cc43 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,5 +1,6 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs
 
 CFLAGS := -Wall -Werror
+LDFLAGS := -lrt
 
 include ../lib.mk
diff --git a/tools/testing/selftests/timens/timer.c b/tools/testing/selftests/timens/timer.c
new file mode 100644
index 000000000000..6e33cd54d397
--- /dev/null
+++ b/tools/testing/selftests/timens/timer.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <signal.h>
+#include <time.h>
+
+#include "log.h"
+#include "timens.h"
+
+int run_test(int clockid, struct timespec now)
+{
+	struct itimerspec new_value;
+	long long elapsed;
+	timer_t fd;
+	int i;
+
+	for (i = 0; i < 2; i++) {
+		struct sigevent sevp = {.sigev_notify = SIGEV_NONE};
+		int flags = 0;
+
+		new_value.it_value.tv_sec = 3600;
+		new_value.it_value.tv_nsec = 0;
+		new_value.it_interval.tv_sec = 1;
+		new_value.it_interval.tv_nsec = 0;
+
+		if (i == 1) {
+			new_value.it_value.tv_sec += now.tv_sec;
+			new_value.it_value.tv_nsec += now.tv_nsec;
+		}
+
+		if (timer_create(clockid, &sevp, &fd) == -1)
+			return pr_perror("timerfd_create");
+
+		if (i == 1)
+			flags |= TIMER_ABSTIME;
+		if (timer_settime(fd, flags, &new_value, NULL) == -1)
+			return pr_perror("timerfd_settime");
+
+		if (timer_gettime(fd, &new_value) == -1)
+			return pr_perror("timerfd_gettime");
+
+		elapsed = new_value.it_value.tv_sec;
+		if (abs(elapsed - 3600) > 60) {
+			ksft_test_result_fail("clockid: %d elapsed: %lld\n",
+					      clockid, elapsed);
+			return 1;
+		}
+	}
+
+	ksft_test_result_pass("clockid=%d\n", clockid);
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret, status, len, fd;
+	char buf[4096];
+	pid_t pid;
+	struct timespec btime_now, mtime_now;
+
+	nscheck();
+
+	clock_gettime(CLOCK_MONOTONIC, &mtime_now);
+	clock_gettime(CLOCK_BOOTTIME, &btime_now);
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("unshare");
+
+	len = snprintf(buf, sizeof(buf), "%d %d 0\n%d %d 0",
+			CLOCK_MONOTONIC, 70 * 24 * 3600,
+			CLOCK_BOOTTIME, 9 * 24 * 3600);
+	fd = open("/proc/self/timens_offsets", O_WRONLY);
+	if (fd < 0)
+		return pr_perror("/proc/self/timens_offsets");
+
+	if (write(fd, buf, len) != len)
+		return pr_perror("/proc/self/timens_offsets");
+
+	close(fd);
+	mtime_now.tv_sec += 70 * 24 * 3600;
+	btime_now.tv_sec += 9 * 24 * 3600;
+
+	pid = fork();
+	if (pid < 0)
+		return pr_perror("Unable to fork");
+	if (pid == 0) {
+		ret = 0;
+		ret |= run_test(CLOCK_BOOTTIME, btime_now);
+		ret |= run_test(CLOCK_MONOTONIC, mtime_now);
+		ret |= run_test(CLOCK_BOOTTIME_ALARM, btime_now);
+
+		if (ret)
+			ksft_exit_fail();
+		ksft_exit_pass();
+		return ret;
+	}
+
+	if (waitpid(pid, &status, 0) != pid)
+		return pr_perror("Unable to wait the child process");
+
+	if (WIFEXITED(status))
+		return WEXITSTATUS(status);
+
+	return 1;
+}
+
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 24/28] selftest/timens: Add procfs selftest
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dmitry Safonov, Adrian Reber, Andrei Vagin, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov, Dmitry Safonov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
	x86
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

Check that /proc/uptime is correct inside a new time namespace.

Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   2 +-
 tools/testing/selftests/timens/procfs.c   | 142 ++++++++++++++++++++++
 3 files changed, 144 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/procfs.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 9b6c8ddac2c8..94ffdd9cead7 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,3 +1,4 @@
 clock_nanosleep
+procfs
 timens
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index 76a1dc891184..f96f50d1fef8 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep
+TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/procfs.c b/tools/testing/selftests/timens/procfs.c
new file mode 100644
index 000000000000..89a24c134510
--- /dev/null
+++ b/tools/testing/selftests/timens/procfs.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <math.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+
+#include "log.h"
+#include "timens.h"
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define MAX_TEST_TIME_SEC		(60*5)
+#define DAY_IN_SEC			(60*60*24)
+#define TEN_DAYS_IN_SEC			(10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+static int child_ns, parent_ns;
+
+static int switch_ns(int fd)
+{
+	if (setns(fd, CLONE_NEWTIME))
+		return pr_perror("setns()");
+
+	return 0;
+}
+
+static int init_namespaces(void)
+{
+	char path[] = "/proc/self/ns/time_for_children";
+	struct stat st1, st2;
+
+	parent_ns = open(path, O_RDONLY);
+	if (parent_ns <= 0)
+		return pr_perror("Unable to open %s", path);
+
+	if (fstat(parent_ns, &st1))
+		return pr_perror("Unable to stat the parent timens");
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("Can't unshare() timens");
+
+	child_ns = open(path, O_RDONLY);
+	if (child_ns <= 0)
+		return pr_perror("Unable to open %s", path);
+
+	if (fstat(child_ns, &st2))
+		return pr_perror("Unable to stat the timens");
+
+	if (st1.st_ino == st2.st_ino)
+		return pr_err("The same child_ns after CLONE_NEWTIME");
+
+	if (_settime(CLOCK_BOOTTIME, TEN_DAYS_IN_SEC))
+		return -1;
+
+	return 0;
+}
+
+static int read_proc_uptime(struct timespec *uptime)
+{
+	unsigned long up_sec, up_nsec;
+	FILE *proc;
+
+	proc = fopen("/proc/uptime", "r");
+	if (proc == NULL) {
+		pr_perror("Unable to open /proc/uptime");
+		return -1;
+	}
+
+	if (fscanf(proc, "%lu.%02lu", &up_sec, &up_nsec) != 2) {
+		if (errno) {
+			pr_perror("fscanf");
+			return -errno;
+		}
+		pr_err("failed to parse /proc/uptime");
+		return -1;
+	}
+	fclose(proc);
+
+	uptime->tv_sec = up_sec;
+	uptime->tv_nsec = up_nsec;
+	return 0;
+}
+
+static int check_uptime(void)
+{
+	struct timespec uptime_new, uptime_old;
+	time_t uptime_expected;
+	double prec = MAX_TEST_TIME_SEC;
+
+	if (switch_ns(parent_ns))
+		return pr_err("switch_ns(%d)", parent_ns);
+
+	if (read_proc_uptime(&uptime_old))
+		return 1;
+
+	if (switch_ns(child_ns))
+		return pr_err("switch_ns(%d)", child_ns);
+
+	if (read_proc_uptime(&uptime_new))
+		return 1;
+
+	uptime_expected = uptime_old.tv_sec + TEN_DAYS_IN_SEC;
+	if (fabs(difftime(uptime_new.tv_sec, uptime_expected)) > prec) {
+		pr_fail("uptime in /proc/uptime: old %ld, new %ld [%ld]",
+			uptime_old.tv_sec, uptime_new.tv_sec,
+			uptime_old.tv_sec + TEN_DAYS_IN_SEC);
+		return 1;
+	}
+
+	ksft_test_result_pass("Passed for /proc/uptime\n");
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = 0;
+
+	nscheck();
+
+	if (init_namespaces())
+		return 1;
+
+	ret |= check_uptime();
+
+	if (ret)
+		ksft_exit_fail();
+	ksft_exit_pass();
+	return ret;
+}
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 23/28] selftest/timens: Add a test for clock_nanosleep()
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrei Vagin, Dmitry Safonov, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Thomas Gleixner, Vincenzo Frascino,
	containers, criu, linux-api
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

From: Andrei Vagin <avagin@gmail.com>

Check that clock_nanosleep() takes into account clock offsets.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore     |   1 +
 tools/testing/selftests/timens/Makefile       |   2 +-
 .../selftests/timens/clock_nanosleep.c        | 100 ++++++++++++++++++
 3 files changed, 102 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index b609f6ee9fb9..9b6c8ddac2c8 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,2 +1,3 @@
+clock_nanosleep
 timens
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index 66b90cd28e5c..76a1dc891184 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd
+TEST_GEN_PROGS := timens timerfd clock_nanosleep
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/clock_nanosleep.c b/tools/testing/selftests/timens/clock_nanosleep.c
new file mode 100644
index 000000000000..dfd4e3429c75
--- /dev/null
+++ b/tools/testing/selftests/timens/clock_nanosleep.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/timerfd.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include "log.h"
+#include "timens.h"
+
+static long long get_elapsed_time(int clockid, struct timespec *start)
+{
+	struct timespec curr;
+	long long secs, nsecs;
+
+	if (clock_gettime(clockid, &curr) == -1)
+		return pr_perror("clock_gettime");
+
+	secs = curr.tv_sec - start->tv_sec;
+	nsecs = curr.tv_nsec - start->tv_nsec;
+	if (nsecs < 0) {
+		secs--;
+		nsecs += 1000000000;
+	}
+	if (nsecs > 1000000000) {
+		secs++;
+		nsecs -= 1000000000;
+	}
+	return secs * 1000 + nsecs / 1000000;
+}
+
+int run_test(int clockid)
+{
+	long long elapsed;
+	int i;
+
+	for (i = 0; i < 2; i++) {
+		struct timespec now = {};
+		struct timespec start;
+
+		if (clock_gettime(clockid, &start) == -1)
+			return pr_perror("clock_gettime");
+
+
+		if (i == 1) {
+			now.tv_sec = start.tv_sec;
+			now.tv_nsec = start.tv_nsec;
+		}
+
+		now.tv_sec += 2;
+		clock_nanosleep(clockid, i ? TIMER_ABSTIME : 0, &now, NULL);
+
+		elapsed = get_elapsed_time(clockid, &start);
+		if (elapsed < 1900 || elapsed > 2100) {
+			pr_fail("clockid: %d abs: %d elapsed: %lld\n",
+				clockid, i, elapsed);
+			return 1;
+		}
+		ksft_test_result_pass("clockid: %d abs:%d\n", clockid, i);
+	}
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret, nsfd;
+
+	nscheck();
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("unshare");
+
+	if (_settime(CLOCK_MONOTONIC, 7 * 24 * 3600))
+		return 1;
+	if (_settime(CLOCK_BOOTTIME, 9 * 24 * 3600))
+		return 1;
+
+	nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+	if (nsfd < 0)
+		return pr_perror("Unable to open timens_for_children");
+
+	if (setns(nsfd, CLONE_NEWTIME))
+		return pr_perror("Unable to set timens");
+
+	ret = 0;
+	ret |= run_test(CLOCK_MONOTONIC);
+	ret |= run_test(CLOCK_BOOTTIME_ALARM);
+
+	if (ret)
+		ksft_exit_fail();
+	ksft_exit_pass();
+	return ret;
+}
+
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 22/28] selftest/timens: Add a test for timerfd
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrei Vagin, Dmitry Safonov, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Thomas Gleixner, Vincenzo Frascino,
	containers, criu, linux-api
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

From: Andrei Vagin <avagin@gmail.com>

Check that timerfd_create() takes into account clock offsets.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   2 +-
 tools/testing/selftests/timens/timerfd.c  | 127 ++++++++++++++++++++++
 3 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/timerfd.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 27a693229ce1..b609f6ee9fb9 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1 +1,2 @@
 timens
+timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index b877efb78974..66b90cd28e5c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens
+TEST_GEN_PROGS := timens timerfd
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/timerfd.c b/tools/testing/selftests/timens/timerfd.c
new file mode 100644
index 000000000000..c9816db4fe79
--- /dev/null
+++ b/tools/testing/selftests/timens/timerfd.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/timerfd.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include "log.h"
+#include "timens.h"
+
+static int tclock_gettime(clock_t clockid, struct timespec *now)
+{
+	if (clockid == CLOCK_BOOTTIME_ALARM)
+		clockid = CLOCK_BOOTTIME;
+	return clock_gettime(clockid, now);
+}
+
+int run_test(int clockid, struct timespec now)
+{
+	struct itimerspec new_value;
+	long long elapsed;
+	int fd, i;
+
+	if (tclock_gettime(clockid, &now))
+		return pr_perror("clock_gettime");
+
+	for (i = 0; i < 2; i++) {
+		int flags = 0;
+
+		new_value.it_value.tv_sec = 3600;
+		new_value.it_value.tv_nsec = 0;
+		new_value.it_interval.tv_sec = 1;
+		new_value.it_interval.tv_nsec = 0;
+
+		if (i == 1) {
+			new_value.it_value.tv_sec += now.tv_sec;
+			new_value.it_value.tv_nsec += now.tv_nsec;
+		}
+
+		fd = timerfd_create(clockid, 0);
+		if (fd == -1)
+			return pr_perror("timerfd_create");
+
+		if (i == 1)
+			flags |= TFD_TIMER_ABSTIME;
+
+		if (timerfd_settime(fd, flags, &new_value, NULL))
+			return pr_perror("timerfd_settime");
+
+		if (timerfd_gettime(fd, &new_value))
+			return pr_perror("timerfd_gettime");
+
+		elapsed = new_value.it_value.tv_sec;
+		if (abs(elapsed - 3600) > 60) {
+			ksft_test_result_fail("clockid: %d elapsed: %lld\n",
+					      clockid, elapsed);
+			return 1;
+		}
+
+		close(fd);
+	}
+
+	ksft_test_result_pass("clockid=%d\n", clockid);
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret, status, len, fd;
+	char buf[4096];
+	pid_t pid;
+	struct timespec btime_now, mtime_now;
+
+	nscheck();
+
+	clock_gettime(CLOCK_MONOTONIC, &mtime_now);
+	clock_gettime(CLOCK_BOOTTIME, &btime_now);
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("unshare");
+
+	len = snprintf(buf, sizeof(buf), "%d %d 0\n%d %d 0",
+			CLOCK_MONOTONIC, 70 * 24 * 3600,
+			CLOCK_BOOTTIME, 9 * 24 * 3600);
+	fd = open("/proc/self/timens_offsets", O_WRONLY);
+	if (fd < 0)
+		return pr_perror("/proc/self/timens_offsets");
+
+	if (write(fd, buf, len) != len)
+		return pr_perror("/proc/self/timens_offsets");
+
+	close(fd);
+	mtime_now.tv_sec += 70 * 24 * 3600;
+	btime_now.tv_sec += 9 * 24 * 3600;
+
+	pid = fork();
+	if (pid < 0)
+		return pr_perror("Unable to fork");
+	if (pid == 0) {
+		ret = 0;
+		ret |= run_test(CLOCK_BOOTTIME, btime_now);
+		ret |= run_test(CLOCK_MONOTONIC, mtime_now);
+		ret |= run_test(CLOCK_BOOTTIME_ALARM, btime_now);
+
+		if (ret)
+			ksft_exit_fail();
+		ksft_exit_pass();
+		return ret;
+	}
+
+	if (waitpid(pid, &status, 0) != pid)
+		return pr_perror("Unable to wait the child process");
+
+	if (WIFEXITED(status))
+		return WEXITSTATUS(status);
+
+	return 1;
+}
+
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 21/28] selftest/timens: Add Time Namespace test for supported clocks
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dmitry Safonov, Adrian Reber, Andrei Vagin, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov, Dmitry Safonov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
	x86
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

A test to check that all supported clocks work on host and inside
a new time namespace. Use both ways to get time: through VDSO and
by entering the kernel with implicit syscall.

Introduce a new timens directory in selftests framework for
the next timens tests.

Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   5 +
 tools/testing/selftests/timens/config     |   1 +
 tools/testing/selftests/timens/log.h      |  26 +++
 tools/testing/selftests/timens/timens.c   | 188 ++++++++++++++++++++++
 tools/testing/selftests/timens/timens.h   |  63 ++++++++
 7 files changed, 285 insertions(+)
 create mode 100644 tools/testing/selftests/timens/.gitignore
 create mode 100644 tools/testing/selftests/timens/Makefile
 create mode 100644 tools/testing/selftests/timens/config
 create mode 100644 tools/testing/selftests/timens/log.h
 create mode 100644 tools/testing/selftests/timens/timens.c
 create mode 100644 tools/testing/selftests/timens/timens.h

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..f71a59632192 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -47,6 +47,7 @@ TARGETS += splice
 TARGETS += static_keys
 TARGETS += sync
 TARGETS += sysctl
+TARGETS += timens
 ifneq (1, $(quicktest))
 TARGETS += timers
 endif
diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
new file mode 100644
index 000000000000..27a693229ce1
--- /dev/null
+++ b/tools/testing/selftests/timens/.gitignore
@@ -0,0 +1 @@
+timens
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
new file mode 100644
index 000000000000..b877efb78974
--- /dev/null
+++ b/tools/testing/selftests/timens/Makefile
@@ -0,0 +1,5 @@
+TEST_GEN_PROGS := timens
+
+CFLAGS := -Wall -Werror
+
+include ../lib.mk
diff --git a/tools/testing/selftests/timens/config b/tools/testing/selftests/timens/config
new file mode 100644
index 000000000000..4480620f6f49
--- /dev/null
+++ b/tools/testing/selftests/timens/config
@@ -0,0 +1 @@
+CONFIG_TIME_NS=y
diff --git a/tools/testing/selftests/timens/log.h b/tools/testing/selftests/timens/log.h
new file mode 100644
index 000000000000..db64df2a8483
--- /dev/null
+++ b/tools/testing/selftests/timens/log.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __SELFTEST_TIMENS_LOG_H__
+#define __SELFTEST_TIMENS_LOG_H__
+
+#define pr_msg(fmt, lvl, ...)						\
+	ksft_print_msg("[%s] (%s:%d)\t" fmt "\n",			\
+			lvl, __FILE__, __LINE__, ##__VA_ARGS__)
+
+#define pr_p(func, fmt, ...)	func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...)						\
+	({								\
+		ksft_test_result_error(fmt "\n", ##__VA_ARGS__);		\
+		-1;							\
+	})
+
+#define pr_fail(fmt, ...)					\
+	({							\
+		ksft_test_result_fail(fmt, ##__VA_ARGS__);	\
+		-1;						\
+	})
+
+#define pr_perror(fmt, ...)	pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+#endif
diff --git a/tools/testing/selftests/timens/timens.c b/tools/testing/selftests/timens/timens.c
new file mode 100644
index 000000000000..407e7a97882f
--- /dev/null
+++ b/tools/testing/selftests/timens/timens.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+#include <string.h>
+
+#include "log.h"
+#include "timens.h"
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define DAY_IN_SEC			(60*60*24)
+#define TEN_DAYS_IN_SEC			(10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define CLOCK_TYPES							\
+	ct(CLOCK_BOOTTIME, -1),						\
+	ct(CLOCK_BOOTTIME_ALARM, 1),					\
+	ct(CLOCK_MONOTONIC, -1),					\
+	ct(CLOCK_MONOTONIC_COARSE, 1),					\
+	ct(CLOCK_MONOTONIC_RAW, 1),					\
+
+
+struct test_clock {
+	clockid_t id;
+	char *name;
+	/*
+	 * off_id is -1 if a clock has own offset, or it contains an index
+	 * which contains a right offset of this clock.
+	 */
+	int off_id;
+	time_t offset;
+};
+
+#define ct(clock, off_id)	{ clock, #clock, off_id }
+static struct test_clock clocks[] = {
+	CLOCK_TYPES
+};
+#undef ct
+
+static int child_ns, parent_ns = -1;
+
+static int switch_ns(int fd)
+{
+	if (setns(fd, CLONE_NEWTIME)) {
+		pr_perror("setns()");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int init_namespaces(void)
+{
+	char path[] = "/proc/self/ns/time_for_children";
+	struct stat st1, st2;
+
+	if (parent_ns == -1) {
+		parent_ns = open(path, O_RDONLY);
+		if (parent_ns <= 0)
+			return pr_perror("Unable to open %s", path);
+	}
+
+	if (fstat(parent_ns, &st1))
+		return pr_perror("Unable to stat the parent timens");
+
+	if (unshare(CLONE_NEWTIME))
+		return pr_perror("Can't unshare() timens");
+
+	child_ns = open(path, O_RDONLY);
+	if (child_ns <= 0)
+		return pr_perror("Unable to open %s", path);
+
+	if (fstat(child_ns, &st2))
+		return pr_perror("Unable to stat the timens");
+
+	if (st1.st_ino == st2.st_ino)
+		return pr_perror("The same child_ns after CLONE_NEWTIME");
+
+	return 0;
+}
+
+static int test_gettime(clockid_t clock_index, bool raw_syscall, time_t offset)
+{
+	struct timespec child_ts_new, parent_ts_old, cur_ts;
+	char *entry = raw_syscall ? "syscall" : "vdso";
+	double precision = 0.0;
+
+	switch (clocks[clock_index].id) {
+	case CLOCK_MONOTONIC_COARSE:
+	case CLOCK_MONOTONIC_RAW:
+		precision = -2.0;
+		break;
+	}
+
+	if (switch_ns(parent_ns))
+		return pr_err("switch_ns(%d)", child_ns);
+
+	if (_gettime(clocks[clock_index].id, &parent_ts_old, raw_syscall))
+		return -1;
+
+	child_ts_new.tv_nsec = parent_ts_old.tv_nsec;
+	child_ts_new.tv_sec = parent_ts_old.tv_sec + offset;
+
+	if (switch_ns(child_ns))
+		return pr_err("switch_ns(%d)", child_ns);
+
+	if (_gettime(clocks[clock_index].id, &cur_ts, raw_syscall))
+		return -1;
+
+	if (difftime(cur_ts.tv_sec, child_ts_new.tv_sec) < precision) {
+		ksft_test_result_fail(
+			"Child's %s (%s) time has not changed: %lu -> %lu [%lu]\n",
+			clocks[clock_index].name, entry, parent_ts_old.tv_sec,
+			child_ts_new.tv_sec, cur_ts.tv_sec);
+		return -1;
+	}
+
+	if (switch_ns(parent_ns))
+		return pr_err("switch_ns(%d)", parent_ns);
+
+	if (_gettime(clocks[clock_index].id, &cur_ts, raw_syscall))
+		return -1;
+
+	if (difftime(cur_ts.tv_sec, parent_ts_old.tv_sec) > DAY_IN_SEC) {
+		ksft_test_result_fail(
+			"Parent's %s (%s) time has changed: %lu -> %lu [%lu]\n",
+			clocks[clock_index].name, entry, parent_ts_old.tv_sec,
+			child_ts_new.tv_sec, cur_ts.tv_sec);
+		/* Let's play nice and put it closer to original */
+		clock_settime(clocks[clock_index].id, &cur_ts);
+		return -1;
+	}
+
+	ksft_test_result_pass("Passed for %s (%s)\n",
+				clocks[clock_index].name, entry);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned int i;
+	time_t offset;
+	int ret = 0;
+
+	nscheck();
+
+
+	if (init_namespaces())
+		return 1;
+
+	/* Offsets have to be set before tasks enter the namespace. */
+	for (i = 0; i < ARRAY_SIZE(clocks); i++) {
+		if (clocks[i].off_id != -1)
+			continue;
+		offset = TEN_DAYS_IN_SEC + i * 1000;
+		clocks[i].offset = offset;
+		if (_settime(clocks[i].id, offset))
+			return 1;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(clocks); i++) {
+		if (clocks[i].off_id != -1)
+			offset = clocks[clocks[i].off_id].offset;
+		else
+			offset = clocks[i].offset;
+		ret |= test_gettime(i, true, offset);
+		ret |= test_gettime(i, false, offset);
+	}
+
+	if (ret)
+		ksft_exit_fail();
+
+	ksft_exit_pass();
+	return !!ret;
+}
diff --git a/tools/testing/selftests/timens/timens.h b/tools/testing/selftests/timens/timens.h
new file mode 100644
index 000000000000..77c127384810
--- /dev/null
+++ b/tools/testing/selftests/timens/timens.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TIMENS_H__
+#define __TIMENS_H__
+
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+
+#include "../kselftest.h"
+
+#ifndef CLONE_NEWTIME
+# define CLONE_NEWTIME	0x00000080
+#endif
+
+static inline int _settime(clockid_t clk_id, time_t offset)
+{
+	int fd, len;
+	char buf[4096];
+
+	if (clk_id == CLOCK_MONOTONIC_COARSE || clk_id == CLOCK_MONOTONIC_RAW)
+		clk_id = CLOCK_MONOTONIC;
+
+	len = snprintf(buf, sizeof(buf), "%d %ld 0", clk_id, offset);
+
+	fd = open("/proc/self/timens_offsets", O_WRONLY);
+	if (fd < 0)
+		return pr_perror("/proc/self/timens_offsets");
+
+	if (write(fd, buf, len) != len)
+		return pr_perror("/proc/self/timens_offsets");
+
+	close(fd);
+
+	return 0;
+}
+
+static inline int _gettime(clockid_t clk_id, struct timespec *res, bool raw_syscall)
+{
+	int err;
+
+	if (!raw_syscall) {
+		if (clock_gettime(clk_id, res)) {
+			pr_perror("clock_gettime(%d)", (int)clk_id);
+			return -1;
+		}
+		return 0;
+	}
+
+	err = syscall(SYS_clock_gettime, clk_id, res);
+	if (err)
+		pr_perror("syscall(SYS_clock_gettime(%d))", (int)clk_id);
+
+	return err;
+}
+
+static inline void nscheck(void)
+{
+	if (access("/proc/self/ns/time", F_OK) < 0)
+		ksft_exit_skip("Time namespaces are not supported\n");
+}
+
+#endif
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 20/28] timens/fs/proc: Introduce /proc/pid/timens_offsets
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrei Vagin, Dmitry Safonov, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Thomas Gleixner, Vincenzo Frascino,
	containers, criu, linux-api
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

From: Andrei Vagin <avagin@gmail.com>

API to set time namespace offsets for children processes, i.e.:
echo "clockid off_ses off_nsec" > /proc/self/timens_offsets

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 fs/proc/base.c                 |  95 ++++++++++++++++++++++++++++++
 include/linux/time_namespace.h |  10 ++++
 kernel/time_namespace.c        | 104 +++++++++++++++++++++++++++++++++
 3 files changed, 209 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9c8ca6cd3ce4..6a96b0543f69 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -94,6 +94,7 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/stat.h>
 #include <linux/posix-timers.h>
+#include <linux/time_namespace.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -1516,6 +1517,97 @@ static const struct file_operations proc_pid_sched_autogroup_operations = {
 
 #endif /* CONFIG_SCHED_AUTOGROUP */
 
+#ifdef CONFIG_TIME_NS
+static int timens_offsets_show(struct seq_file *m, void *v)
+{
+	struct task_struct *p;
+
+	p = get_proc_task(file_inode(m->file));
+	if (!p)
+		return -ESRCH;
+	proc_timens_show_offsets(p, m);
+
+	put_task_struct(p);
+
+	return 0;
+}
+
+static ssize_t
+timens_offsets_write(struct file *file, const char __user *buf,
+	    size_t count, loff_t *ppos)
+{
+	struct inode *inode = file_inode(file);
+	struct proc_timens_offset offsets[2];
+	char *kbuf = NULL, *pos, *next_line;
+	struct task_struct *p;
+	int ret, noffsets;
+
+	/* Only allow < page size writes at the beginning of the file */
+	if ((*ppos != 0) || (count >= PAGE_SIZE))
+		return -EINVAL;
+
+	/* Slurp in the user data */
+	kbuf = memdup_user_nul(buf, count);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+
+	/* Parse the user data */
+	ret = -EINVAL;
+	noffsets = 0;
+	for (pos = kbuf; pos; pos = next_line) {
+		struct proc_timens_offset *off = &offsets[noffsets];
+		int err;
+
+		/* Find the end of line and ensure we don't look past it */
+		next_line = strchr(pos, '\n');
+		if (next_line) {
+			*next_line = '\0';
+			next_line++;
+			if (*next_line == '\0')
+				next_line = NULL;
+		}
+
+		err = sscanf(pos, "%u %lld %lu", &off->clockid,
+				&off->val.tv_sec, &off->val.tv_nsec);
+		if (err != 3 || off->val.tv_nsec >= NSEC_PER_SEC)
+			goto out;
+		noffsets++;
+		if (noffsets == ARRAY_SIZE(offsets)) {
+			if (next_line)
+				count = next_line - kbuf;
+			break;
+		}
+	}
+
+	ret = -ESRCH;
+	p = get_proc_task(inode);
+	if (!p)
+		goto out;
+	ret = proc_timens_set_offset(file, p, offsets, noffsets);
+	put_task_struct(p);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	kfree(kbuf);
+	return ret;
+}
+
+static int timens_offsets_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, timens_offsets_show, inode);
+}
+
+static const struct file_operations proc_timens_offsets_operations = {
+	.open		= timens_offsets_open,
+	.read		= seq_read,
+	.write		= timens_offsets_write,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif /* CONFIG_TIME_NS */
+
 static ssize_t comm_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *offset)
 {
@@ -2982,6 +3074,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	REG("autogroup",  S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
+#endif
+#ifdef CONFIG_TIME_NS
+	REG("timens_offsets",  S_IRUGO|S_IWUSR, proc_timens_offsets_operations),
 #endif
 	REG("comm",      S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index d32b55fad953..8cd16dfea42d 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -40,6 +40,16 @@ static inline void put_time_ns(struct time_namespace *ns)
 	kref_put(&ns->kref, free_time_ns);
 }
 
+extern void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m);
+
+struct proc_timens_offset {
+	int clockid;
+	struct timespec64 val;
+};
+
+extern int proc_timens_set_offset(struct file *file, struct task_struct *p,
+				struct proc_timens_offset *offsets, int n);
+
 static inline void timens_add_monotonic(struct timespec64 *ts)
 {
         struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 2a2cab14ac29..a32adeabf9f0 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -13,6 +13,7 @@
 #include <linux/user_namespace.h>
 #include <linux/proc_ns.h>
 #include <linux/sched/task.h>
+#include <linux/seq_file.h>
 #include <linux/mm.h>
 #include <asm/vdso.h>
 
@@ -229,6 +230,109 @@ static struct user_namespace *timens_owner(struct ns_common *ns)
 	return to_time_ns(ns)->user_ns;
 }
 
+static void show_offset(struct seq_file *m, int clockid, struct timespec64 *ts)
+{
+	seq_printf(m, "%d %lld %ld\n", clockid, ts->tv_sec, ts->tv_nsec);
+}
+
+void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m)
+{
+	struct ns_common *ns;
+	struct time_namespace *time_ns;
+	struct timens_offsets *ns_offsets;
+
+	ns = timens_for_children_get(p);
+	if (!ns)
+		return;
+	time_ns = to_time_ns(ns);
+
+	if (!time_ns->offsets) {
+		put_time_ns(time_ns);
+		return;
+	}
+	ns_offsets = time_ns->offsets;
+
+	show_offset(m, CLOCK_MONOTONIC, &ns_offsets->monotonic);
+	show_offset(m, CLOCK_BOOTTIME, &ns_offsets->boottime);
+	put_time_ns(time_ns);
+}
+
+int proc_timens_set_offset(struct file *file, struct task_struct *p,
+			   struct proc_timens_offset *offsets, int noffsets)
+{
+	struct ns_common *ns;
+	struct time_namespace *time_ns;
+	struct timens_offsets *ns_offsets;
+	struct timespec64 *offset;
+	struct timespec64 tp;
+	int i, err;
+
+	ns = timens_for_children_get(p);
+	if (!ns)
+		return -ESRCH;
+	time_ns = to_time_ns(ns);
+
+	if (!time_ns->offsets || time_ns->initialized ||
+	    !file_ns_capable(file, time_ns->user_ns, CAP_SYS_TIME)) {
+		put_time_ns(time_ns);
+		return -EPERM;
+	}
+	ns_offsets = time_ns->offsets;
+
+	for (i = 0; i < noffsets; i++) {
+		struct proc_timens_offset *off = &offsets[i];
+
+		switch (off->clockid) {
+		case CLOCK_MONOTONIC:
+			ktime_get_ts64(&tp);
+			break;
+		case CLOCK_BOOTTIME:
+			ktime_get_boottime_ts64(&tp);
+			break;
+		default:
+			err = -EINVAL;
+			goto out;
+		}
+
+		err = -ERANGE;
+
+		if (off->val.tv_sec > KTIME_SEC_MAX || off->val.tv_sec < -KTIME_SEC_MAX)
+			goto out;
+
+		tp = timespec64_add(tp, off->val);
+		/*
+		 * KTIME_SEC_MAX is divided by 2 to be sure that KTIME_MAX is
+		 * still unreachable.
+		 */
+		if (tp.tv_sec < 0 || tp.tv_sec > KTIME_SEC_MAX / 2)
+			goto out;
+	}
+
+	err = 0;
+	/* don't report errors after this line */
+	for (i = 0; i < noffsets; i++) {
+		struct proc_timens_offset *off = &offsets[i];
+
+		switch (off->clockid) {
+		case CLOCK_MONOTONIC:
+			offset = &ns_offsets->monotonic;
+			break;
+		case CLOCK_BOOTTIME:
+			offset = &ns_offsets->boottime;
+			break;
+		default:
+			goto out;
+		}
+
+		*offset = off->val;
+	}
+
+out:
+	put_time_ns(time_ns);
+
+	return err;
+}
+
 const struct proc_ns_operations timens_operations = {
 	.name		= "time",
 	.type		= CLONE_NEWTIME,
-- 
2.22.0

^ permalink raw reply related

* [PATCHv4 19/28] timens: Add align for timens_offsets
From: Dmitry Safonov @ 2019-06-12 19:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dmitry Safonov, Adrian Reber, Andrei Vagin, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov, Dmitry Safonov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
	x86
In-Reply-To: <20190612192628.23797-1-dima@arista.com>

Align offsets so that time namespace will work for ia32 applications on
x86_64 host.

Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 include/linux/timens_offsets.h | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index e93aabaa5e45..05da1b0563ce 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -2,9 +2,17 @@
 #ifndef _LINUX_TIME_OFFSETS_H
 #define _LINUX_TIME_OFFSETS_H
 
+/*
+ * Time offsets need align as they're placed on VVAR page,
+ * which is used by x86_64 and ia32 VDSO code.
+ * On ia32 offset::tv_sec (u64) has align(4), so re-align offsets
+ * to the same positions as 64-bit offsets.
+ * On 64-bit big-endian systems VDSO should convert to timespec64
+ * to timespec because of a padding occurring between the fields.
+ */
 struct timens_offsets {
-	struct timespec64 monotonic;
-	struct timespec64 boottime;
+	struct timespec64 monotonic __aligned(8);
+	struct timespec64 boottime __aligned(8);
 };
 
 #endif
-- 
2.22.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox