Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andy Lutomirski @ 2014-10-07 22:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge Hallyn, Al Viro, Andrey Vagin, Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn, Rob Landley
In-Reply-To: <87zjd7r1z9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> I am squinting and looking this way and that but while I can imagine
> someone more clever than I can think up some unique property of rootfs
> that makes it a little more exploitable than just mounting a ramfs,
> but since you have to be root to exploit those properties I think the
> game is pretty much lost.

Yes.  rootfs might not be empty, it might have totally insane
permissions, and it's globally shared, which makes it into a wonderful
channel to pass things around that shouldn't be passed around.

Can non-root do this?  You'd need to be in a userns with a "/" that
isn't MNT_LOCKED.  Can this happen on any normal setup?

FWIW, I think we should unconditionally MNT_LOCKED the root on userns
unshare, even if it's the only mount.


>
>>> >> So it is only root (and not root in a container) who can get to the
>>> >> exposed rootfs.
>>> >>
>>> >> I have a vague memory someone actually had a real use in miminal systems
>>> >> for being able to get back to the rootfs and being able to use rootfs as
>>> >> the rootfs.  There was even a patch at that time that Andrew Morton was
>>> >> carrying for a time to allow unmounting root and get at rootfs, and to
>>> >> prevent the oops on rootfs unmount in some way.
>>> >>
>>> >> So not only do I not think it is a bug to get back too rootfs, I think
>>> >> it is a feature that some people have expressed at least half-way sane
>>> >> uses for.
>>> >
>>> > They can still do that if they want, using chroot :)
>>>
>>> It would take fchdir or fchroot and a directory file descriptor open on
>>> rootfs.  Frequently there is no appropriate directory file descriptor.
>>
>> ?  you can always escape if you're simply chrooted.  waterbuffalo :)
>
> filesystem type rootfs.
>
> Eric
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andy Lutomirski @ 2014-10-07 21:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Andrey Vagin, Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn, Rob Landley
In-Reply-To: <87zjd7pn0o.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Tue, Oct 7, 2014 at 2:50 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
>> On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>>
>>>> Why should MNT_LOCKED on submounts be enforced?
>>>>
>>>> Is it because, if you retain a reference to the detached tree, then
>>>> you can see under the submounts?
>>>
>>> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
>>> and all of it's submounts.  Which means you can see under the submounts
>>> if you have a reference to a detached mount.
>>>
>>>> If so, let's fix *that*.  Because
>>>> otherwise the whole model of pivot_root + detach will break.
>>>
>>> I am not certain what you are referring to.  pivot_root doesn't
>>> manipulate the mount tree so you can see under anything.
>>>
>>> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
>>> if there are any referenced mount points being detached that have a
>>> locked submount.
>>
>> Most of the container-using things do, roughly:
>>
>> Unshare userns and mountns
>> Mount some new stuff
>> pivot_root to the new stuff
>> MNT_DETACH the old.
>>
>> That last step will almost always fail if you make this change.
>
> I don't think so.
>
> I expect I could add full busy detection of normal umounts and those
> applications would not fail.
>
> What I am proposing is a more targeted version of busy detection that
> looks at each mount in the set that detach will unmount.  For each mount
> if it is busy with non-submount references and it has at least one
> locked submount fail the detach with -EBUSY.
>
> Do you really think we have userspace references to the one or more of the
> mounts under old?
>

I suspect that we have a userspace reference to old itself.

--Andy

> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Eric W. Biederman @ 2014-10-07 21:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Andrey Vagin, Linux FS Devel,
	linux-kernel@vger.kernel.org, Linux API, Andrey Vagin,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn,
	Rob Landley
In-Reply-To: <CALCETrWfZwbGCxnUAg0PnM=tN8MGRQkHrJVC42bVF7sdJKXLmw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>
>>> Why should MNT_LOCKED on submounts be enforced?
>>>
>>> Is it because, if you retain a reference to the detached tree, then
>>> you can see under the submounts?
>>
>> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
>> and all of it's submounts.  Which means you can see under the submounts
>> if you have a reference to a detached mount.
>>
>>> If so, let's fix *that*.  Because
>>> otherwise the whole model of pivot_root + detach will break.
>>
>> I am not certain what you are referring to.  pivot_root doesn't
>> manipulate the mount tree so you can see under anything.
>>
>> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
>> if there are any referenced mount points being detached that have a
>> locked submount.
>
> Most of the container-using things do, roughly:
>
> Unshare userns and mountns
> Mount some new stuff
> pivot_root to the new stuff
> MNT_DETACH the old.
>
> That last step will almost always fail if you make this change.

I don't think so.

I expect I could add full busy detection of normal umounts and those
applications would not fail.

What I am proposing is a more targeted version of busy detection that
looks at each mount in the set that detach will unmount.  For each mount
if it is busy with non-submount references and it has at least one
locked submount fail the detach with -EBUSY.

Do you really think we have userspace references to the one or more of the
mounts under old?

Eric

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Eric W. Biederman @ 2014-10-07 21:42 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Al Viro, Andrey Vagin, linux-fsdevel, linux-kernel, linux-api,
	Andrey Vagin, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn, Rob Landley
In-Reply-To: <20141007213257.GJ28519@ubuntumail>

Serge Hallyn <serge.hallyn@ubuntu.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):

>> What I meant is that it isn't about containers.   It is about something
>> root can do.  So this is not a "container" problem.
>
> Oh, ok.  
>
> Sorry, I'm getting the two thread confused anyway.  I'm going to bow out
> here until I can pay proper attention.

I think there is half a point here that may be legit, when you are using
mount namespaces to jail applications, there may be a problem
with umounting / and making it to the underlying rootfs filesystem.

I am squinting and looking this way and that but while I can imagine
someone more clever than I can think up some unique property of rootfs
that makes it a little more exploitable than just mounting a ramfs,
but since you have to be root to exploit those properties I think the
game is pretty much lost.

>> >> So it is only root (and not root in a container) who can get to the
>> >> exposed rootfs.
>> >> 
>> >> I have a vague memory someone actually had a real use in miminal systems
>> >> for being able to get back to the rootfs and being able to use rootfs as
>> >> the rootfs.  There was even a patch at that time that Andrew Morton was
>> >> carrying for a time to allow unmounting root and get at rootfs, and to
>> >> prevent the oops on rootfs unmount in some way.
>> >> 
>> >> So not only do I not think it is a bug to get back too rootfs, I think
>> >> it is a feature that some people have expressed at least half-way sane
>> >> uses for.
>> >
>> > They can still do that if they want, using chroot :)
>> 
>> It would take fchdir or fchroot and a directory file descriptor open on
>> rootfs.  Frequently there is no appropriate directory file descriptor.
>
> ?  you can always escape if you're simply chrooted.  waterbuffalo :)

filesystem type rootfs.

Eric


^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andy Lutomirski @ 2014-10-07 21:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Andrey Vagin, Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn, Rob Landley
In-Reply-To: <87siizshav.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
>> Why should MNT_LOCKED on submounts be enforced?
>>
>> Is it because, if you retain a reference to the detached tree, then
>> you can see under the submounts?
>
> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
> and all of it's submounts.  Which means you can see under the submounts
> if you have a reference to a detached mount.
>
>> If so, let's fix *that*.  Because
>> otherwise the whole model of pivot_root + detach will break.
>
> I am not certain what you are referring to.  pivot_root doesn't
> manipulate the mount tree so you can see under anything.
>
> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
> if there are any referenced mount points being detached that have a
> locked submount.

Most of the container-using things do, roughly:

Unshare userns and mountns
Mount some new stuff
pivot_root to the new stuff
MNT_DETACH the old.

That last step will almost always fail if you make this change.

--Andy

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Serge Hallyn @ 2014-10-07 21:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Al Viro, Andrey Vagin, Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn, Rob Landley
In-Reply-To: <CALCETrXgssZfi3BirQ=K7-vrPyEh5AzFX2pF+yj76Ngi0sf7Yw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> > Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:
> >
> > 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >>> > Another problem is that rootfs can't be hidden from a container, because
> >>> > rootfs can't be moved or umounted.
> >>>
> >>> ... which is a bug in mntns_install(), AFAICS.
> >>
> >> Ability to get to exposed rootfs, that is.
> >
> > The container side of this argument is pretty bogus.  It only applies
> > if user namespaces are not used for the container.
> >
> > So it is only root (and not root in a container) who can get to the
> > exposed rootfs.
> >
> > I have a vague memory someone actually had a real use in miminal systems
> > for being able to get back to the rootfs and being able to use rootfs as
> > the rootfs.  There was even a patch at that time that Andrew Morton was
> > carrying for a time to allow unmounting root and get at rootfs, and to
> > prevent the oops on rootfs unmount in some way.
> >
> > So not only do I not think it is a bug to get back too rootfs, I think
> > it is a feature that some people have expressed at least half-way sane
> > uses for.
> >
> >>> > Here is an example how to get access to rootfs:
> >>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> >>> > umount2("/", MNT_DETACH);
> >>> > setns(fd, CLONE_NEWNS)
> >>> >
> >>> > rootfs may contain data, which should not be avaliable in CT-s.
> >>>
> >>> Indeed.
> >>
> >> ... and it looks like the above is what your mangled reproducer in previous
> >> patch had been made of -
> >>       fd = open("/proc/self/ns/mnt", O_RDONLY)
> >>       umount2("/", MNT_DETACH);
> >>       setns(fd, CLONE_NEWNS)
> >>       umount2("/", MNT_DETACH);
> >>
> >> IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
> >> is wrong.
> >
> > IMO the bug is allowing us to unmount things that should never be unmounted.
> >
> > In a mount namespace created with just user namespace permissions we
> > can't get at rootfs because MNT_LOCKED is set on the root directory
> > and thus it can not be mounted.
> >
> > Further if anyone has permission to call chroot and chdir on any mount
> > in a mount namespace (that isn't currently covered) they can get at all
> > of them that are not currently covered.  A mount namespace where no one
> > can get at any uncovered filesystem seems to be the definition of
> > useless and ridiculous.
> >
> >
> > Now there is a bug in that MNT_DETACH today does not currently enforce
> > MNT_LOCKED on submounts of the mount point that is detached. I am
> > currently looking at how to construct the appropriate permission check
> > to prevent that.  Unfortunately I can not disallow MNT_DETACH with
> > submounts all together as that breaks too many legitimate uses.
> 
> Why should MNT_LOCKED on submounts be enforced?
> 
> Is it because, if you retain a reference to the detached tree, then
> you can see under the submounts?  If so, let's fix *that*.  Because
> otherwise the whole model of pivot_root + detach will break.
> 
> Also, damn it, we need change_the_ns_root instead of pivot_root.  I
> doubt that any container programs actually want to keep the old root
> attached after pivot_root.

Right I think that'll fix the problem we were having, and I think
Andrey said the same thing in another list a few days ago.

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Serge Hallyn @ 2014-10-07 21:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Andrey Vagin, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrey Vagin, Andrew Morton,
	Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn, Rob Landley
In-Reply-To: <87wq8bvbzg.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >> Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:
> >> 
> >> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >> >> > Another problem is that rootfs can't be hidden from a container, because
> >> >> > rootfs can't be moved or umounted.
> >> >> 
> >> >> ... which is a bug in mntns_install(), AFAICS.
> >> >
> >> > Ability to get to exposed rootfs, that is.
> >> 
> >> The container side of this argument is pretty bogus.  It only applies
> >> if user namespaces are not used for the container.
> >
> > User namespaces are still far too restricted for many container use
> > cases.  We can't say "we have user namespaces so now privileged
> > containers can be ignored".  Yes you never should have handed the
> > keys to a privileged container to an untrusted person, but we do
> > still try to protect the host from accidental damage due to a
> > privileged container.
> 
> What I meant is that it isn't about containers.   It is about something
> root can do.  So this is not a "container" problem.

Oh, ok.  

Sorry, I'm getting the two thread confused anyway.  I'm going to bow out
here until I can pay proper attention.

> >> So it is only root (and not root in a container) who can get to the
> >> exposed rootfs.
> >> 
> >> I have a vague memory someone actually had a real use in miminal systems
> >> for being able to get back to the rootfs and being able to use rootfs as
> >> the rootfs.  There was even a patch at that time that Andrew Morton was
> >> carrying for a time to allow unmounting root and get at rootfs, and to
> >> prevent the oops on rootfs unmount in some way.
> >> 
> >> So not only do I not think it is a bug to get back too rootfs, I think
> >> it is a feature that some people have expressed at least half-way sane
> >> uses for.
> >
> > They can still do that if they want, using chroot :)
> 
> It would take fchdir or fchroot and a directory file descriptor open on
> rootfs.  Frequently there is no appropriate directory file descriptor.

?  you can always escape if you're simply chrooted.  waterbuffalo :)

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Eric W. Biederman @ 2014-10-07 21:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Andrey Vagin, Linux FS Devel,
	linux-kernel@vger.kernel.org, Linux API, Andrey Vagin,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn,
	Rob Landley
In-Reply-To: <CALCETrXgssZfi3BirQ=K7-vrPyEh5AzFX2pF+yj76Ngi0sf7Yw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> Why should MNT_LOCKED on submounts be enforced?
>
> Is it because, if you retain a reference to the detached tree, then
> you can see under the submounts?  

Yes. MNT_DETACH is a recursive operation that detaches all of the mount
and all of it's submounts.  Which means you can see under the submounts
if you have a reference to a detached mount.

> If so, let's fix *that*.  Because
> otherwise the whole model of pivot_root + detach will break.

I am not certain what you are referring to.  pivot_root doesn't
manipulate the mount tree so you can see under anything.

What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
if there are any referenced mount points being detached that have a
locked submount.

> Also, damn it, we need change_the_ns_root instead of pivot_root.  I
> doubt that any container programs actually want to keep the old root
> attached after pivot_root.

Shrug.  Except for chroot_fs_refs() pivot_root is a cheap.

I'm not particularly in favor of merging pivot_root and umount2.  The
number of weird cases in the current api are high.  A merged piece of
code would just make them higher.

I am hoping that one more round of bug fixing will at least get the bugs
for having unprivileged mounts fixed in the current API.

Eric

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andy Lutomirski @ 2014-10-07 21:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Andrey Vagin, Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn, Rob Landley
In-Reply-To: <87r3yjy64e.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:
>
> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>>> > Another problem is that rootfs can't be hidden from a container, because
>>> > rootfs can't be moved or umounted.
>>>
>>> ... which is a bug in mntns_install(), AFAICS.
>>
>> Ability to get to exposed rootfs, that is.
>
> The container side of this argument is pretty bogus.  It only applies
> if user namespaces are not used for the container.
>
> So it is only root (and not root in a container) who can get to the
> exposed rootfs.
>
> I have a vague memory someone actually had a real use in miminal systems
> for being able to get back to the rootfs and being able to use rootfs as
> the rootfs.  There was even a patch at that time that Andrew Morton was
> carrying for a time to allow unmounting root and get at rootfs, and to
> prevent the oops on rootfs unmount in some way.
>
> So not only do I not think it is a bug to get back too rootfs, I think
> it is a feature that some people have expressed at least half-way sane
> uses for.
>
>>> > Here is an example how to get access to rootfs:
>>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
>>> > umount2("/", MNT_DETACH);
>>> > setns(fd, CLONE_NEWNS)
>>> >
>>> > rootfs may contain data, which should not be avaliable in CT-s.
>>>
>>> Indeed.
>>
>> ... and it looks like the above is what your mangled reproducer in previous
>> patch had been made of -
>>       fd = open("/proc/self/ns/mnt", O_RDONLY)
>>       umount2("/", MNT_DETACH);
>>       setns(fd, CLONE_NEWNS)
>>       umount2("/", MNT_DETACH);
>>
>> IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
>> is wrong.
>
> IMO the bug is allowing us to unmount things that should never be unmounted.
>
> In a mount namespace created with just user namespace permissions we
> can't get at rootfs because MNT_LOCKED is set on the root directory
> and thus it can not be mounted.
>
> Further if anyone has permission to call chroot and chdir on any mount
> in a mount namespace (that isn't currently covered) they can get at all
> of them that are not currently covered.  A mount namespace where no one
> can get at any uncovered filesystem seems to be the definition of
> useless and ridiculous.
>
>
> Now there is a bug in that MNT_DETACH today does not currently enforce
> MNT_LOCKED on submounts of the mount point that is detached. I am
> currently looking at how to construct the appropriate permission check
> to prevent that.  Unfortunately I can not disallow MNT_DETACH with
> submounts all together as that breaks too many legitimate uses.

Why should MNT_LOCKED on submounts be enforced?

Is it because, if you retain a reference to the detached tree, then
you can see under the submounts?  If so, let's fix *that*.  Because
otherwise the whole model of pivot_root + detach will break.

Also, damn it, we need change_the_ns_root instead of pivot_root.  I
doubt that any container programs actually want to keep the old root
attached after pivot_root.

--Andy

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Eric W. Biederman @ 2014-10-07 20:52 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Al Viro, Andrey Vagin, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrey Vagin, Andrew Morton,
	Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn, Rob Landley
In-Reply-To: <20141007204627.GI28519@ubuntumail>

Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:
>> 
>> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>> >> > Another problem is that rootfs can't be hidden from a container, because
>> >> > rootfs can't be moved or umounted.
>> >> 
>> >> ... which is a bug in mntns_install(), AFAICS.
>> >
>> > Ability to get to exposed rootfs, that is.
>> 
>> The container side of this argument is pretty bogus.  It only applies
>> if user namespaces are not used for the container.
>
> User namespaces are still far too restricted for many container use
> cases.  We can't say "we have user namespaces so now privileged
> containers can be ignored".  Yes you never should have handed the
> keys to a privileged container to an untrusted person, but we do
> still try to protect the host from accidental damage due to a
> privileged container.

What I meant is that it isn't about containers.   It is about something
root can do.  So this is not a "container" problem.

>> So it is only root (and not root in a container) who can get to the
>> exposed rootfs.
>> 
>> I have a vague memory someone actually had a real use in miminal systems
>> for being able to get back to the rootfs and being able to use rootfs as
>> the rootfs.  There was even a patch at that time that Andrew Morton was
>> carrying for a time to allow unmounting root and get at rootfs, and to
>> prevent the oops on rootfs unmount in some way.
>> 
>> So not only do I not think it is a bug to get back too rootfs, I think
>> it is a feature that some people have expressed at least half-way sane
>> uses for.
>
> They can still do that if they want, using chroot :)

It would take fchdir or fchroot and a directory file descriptor open on
rootfs.  Frequently there is no appropriate directory file descriptor.

Eric

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Serge Hallyn @ 2014-10-07 20:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Andrey Vagin, linux-fsdevel, linux-kernel, linux-api,
	Andrey Vagin, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn, Rob Landley
In-Reply-To: <87r3yjy64e.fsf@x220.int.ebiederm.org>

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >> > Another problem is that rootfs can't be hidden from a container, because
> >> > rootfs can't be moved or umounted.
> >> 
> >> ... which is a bug in mntns_install(), AFAICS.
> >
> > Ability to get to exposed rootfs, that is.
> 
> The container side of this argument is pretty bogus.  It only applies
> if user namespaces are not used for the container.

User namespaces are still far too restricted for many container use
cases.  We can't say "we have user namespaces so now privileged
containers can be ignored".  Yes you never should have handed the
keys to a privileged container to an untrusted person, but we do
still try to protect the host from accidental damage due to a
privileged container.

> So it is only root (and not root in a container) who can get to the
> exposed rootfs.
> 
> I have a vague memory someone actually had a real use in miminal systems
> for being able to get back to the rootfs and being able to use rootfs as
> the rootfs.  There was even a patch at that time that Andrew Morton was
> carrying for a time to allow unmounting root and get at rootfs, and to
> prevent the oops on rootfs unmount in some way.
> 
> So not only do I not think it is a bug to get back too rootfs, I think
> it is a feature that some people have expressed at least half-way sane
> uses for.

They can still do that if they want, using chroot :)

> >> > Here is an example how to get access to rootfs:
> >> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> >> > umount2("/", MNT_DETACH);
> >> > setns(fd, CLONE_NEWNS)
> >> > 
> >> > rootfs may contain data, which should not be avaliable in CT-s.
> >> 
> >> Indeed.
> >
> > ... and it looks like the above is what your mangled reproducer in previous
> > patch had been made of -
> > 	fd = open("/proc/self/ns/mnt", O_RDONLY)
> > 	umount2("/", MNT_DETACH);
> > 	setns(fd, CLONE_NEWNS)
> > 	umount2("/", MNT_DETACH);
> >
> > IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
> > is wrong.
> 
> IMO the bug is allowing us to unmount things that should never be unmounted.
> 
> In a mount namespace created with just user namespace permissions we
> can't get at rootfs because MNT_LOCKED is set on the root directory
> and thus it can not be mounted.
> 
> Further if anyone has permission to call chroot and chdir on any mount
> in a mount namespace (that isn't currently covered) they can get at all
> of them that are not currently covered.  A mount namespace where no one
> can get at any uncovered filesystem seems to be the definition of
> useless and ridiculous.
> 
> 
> Now there is a bug in that MNT_DETACH today does not currently enforce
> MNT_LOCKED on submounts of the mount point that is detached. I am
> currently looking at how to construct the appropriate permission check
> to prevent that.  Unfortunately I can not disallow MNT_DETACH with
> submounts all together as that breaks too many legitimate uses.
> 
> That failure to enforce MNT_LOCKED is my mistake. I had a naive notion
> that submounts would remain mounted after a mount detach and I misread
> the code when I did the original work.  My mistake. 


^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Eric W. Biederman @ 2014-10-07 20:45 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrey Vagin, Alexander Viro,
	Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn,
	Rob Landley
In-Reply-To: <1412683977-29543-1-git-send-email-avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> writes:

> From: Andrey Vagin <avagin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>
> Currently when we create a new container with a separate root,
> we need to clone the current mount namespace with all mounts and then
> clean up it by using pivot_root(). A big part of mountpoints are cloned
> only to be umounted.

Is the motivation performance?  Because if that is the motivation we
need numbers.

> Another problem is that rootfs can't be hidden from a container, because
> rootfs can't be moved or umounted.
>
> Here is an example how to get access to rootfs:
> fd = open("/proc/self/ns/mnt", O_RDONLY)
> umount2("/", MNT_DETACH);
> setns(fd, CLONE_NEWNS)
>
> rootfs may contain data, which should not be avaliable in CT-s.

Well don't give those containers CAP_SYS_ADMIN.  If you aren't using
user namespaces there is no expectation of safety from those kinds of
problems.  Getting at rootfs is perfectly valid for root.

> I suggest to add ability to create a mount namespace with specified
> mount points. A current task root can be used as a root for the new
> mount namespace.

I really don't think you are going to like the result because you will
loose access to /proc and /sys.

> With this patch you can call chroot(ct->rootfs) and
> unshare(UNSHARE_NEWNS2) to get a clean mount namespace.

That is a little bit of an ugly way to smuggle a parameter into the
creation of a mount namespace.  Further I am pretty certain this patch
totally breaks the setting of new_fs->root and new_fs->pwd.


In net my opinion is that the code doesn't work and does not provide
sufficient justification for a new system call.

> UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone()
> syscall doesn't have unused flags.
>
> Here is an example how it looks like:
> $ cat ../../unshare.c
>
> int main(int argc, char **argv)
> {
	/* You left out
	 * mount --bind /some/root/path /some/root/path
         * chroot /some/root/path
         */

> 	if (unshare(UNSHARE_NEWNS2))
> 		return 1;
>
> 	execl("/bin/bash", "/bin/bash", NULL);
> 	return 1;
> }
> $ mount --bind test/ubuntu/ test/ubuntu/
> $ cd test/ubuntu/
> $ chroot .
> $ ./unshare2
> $ mount -t proc proc proc
> $ cat /proc/self/mountinfo
> 55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 rw,errors=remount-ro,data=ordered
> 56 55 0:3 / /proc rw,relatime - proc proc rw
>
> Cc: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> Cc: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Cyrill Gorcunov <gorcunov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> Cc: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Cc: Rob Landley <rob-VoJi6FS/r0vR7s880joybQ@public.gmane.org>
> Signed-off-by: Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> ---
>  fs/namespace.c             | 16 ++++++++++++++--
>  include/uapi/linux/sched.h |  8 ++++++++
>  kernel/fork.c              | 11 ++++++++---
>  kernel/nsproxy.c           |  2 +-
>  4 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 730c50e..f50a848 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
>  
>  	BUG_ON(!ns);
>  
> -	if (likely(!(flags & CLONE_NEWNS))) {
> +	if (likely(!(flags & (CLONE_NEWNS | UNSHARE_NEWNS2)))) {
>  		get_mnt_ns(ns);
>  		return ns;
>  	}
>  
> -	old = ns->root;
> +	if (flags & CLONE_NEWNS)
> +		old = ns->root;
> +	else { /* UNSHARE_NEWNS2 */
> +		struct path root;
> +
> +		get_fs_root(current->fs, &root);
> +		if (root.mnt->mnt_root != root.dentry) {
> +			path_put(&root);
> +			return ERR_PTR(-EINVAL); /* not a mountpoint */
> +		}
> +		old = real_mount(root.mnt);
> +		path_put(&root);
> +	}
>  
>  	new_ns = alloc_mnt_ns(user_ns);
>  	if (IS_ERR(new_ns))
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 34f9d73..8092e50 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -31,6 +31,14 @@
>  #define CLONE_IO		0x80000000	/* Clone io context */
>  
>  /*
> + * Following flags can be used only with unshare(), because
> + * they are intersected with CSIGNAL
> + */
> +#define UNSHARE_NEWNS2		0x00000001	/* Clone mnt namespace starting with the current task root. */
> +
> +#define UNSHARE_FLAGS		(UNSHARE_NEWNS2)
> +
> +/*
>   * Scheduling policies
>   */
>  #define SCHED_NORMAL		0
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0cf9cdb..52f1fc0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	retval = copy_mm(clone_flags, p);
>  	if (retval)
>  		goto bad_fork_cleanup_signal;
> -	retval = copy_namespaces(clone_flags, p);
> +
> +	/*
> +	 * CSIGNAL and UNSHARE_FLAGS are intersected, but
> +	 * UNSHARE_FLAGS can't be used with clone().
> +	 */
> +	retval = copy_namespaces(clone_flags & ~UNSHARE_FLAGS, p);
>  	if (retval)
>  		goto bad_fork_cleanup_mm;
>  	retval = copy_io(clone_flags, p);
> @@ -1790,7 +1795,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> -				CLONE_NEWUSER|CLONE_NEWPID))
> +				CLONE_NEWUSER|CLONE_NEWPID|UNSHARE_FLAGS))
It seems confusing to use UNSHARE_FLAGS here.
>  		return -EINVAL;
>  	/*
>  	 * Not implemented, but pretend it works if there is nothing to
> @@ -1880,7 +1885,7 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>  	/*
>  	 * If unsharing namespace, must also unshare filesystem information.
>  	 */
> -	if (unshare_flags & CLONE_NEWNS)
> +	if (unshare_flags & (CLONE_NEWNS | UNSHARE_NEWNS2))
>  		unshare_flags |= CLONE_FS;
>  
>  	err = check_unshare_flags(unshare_flags);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a29e836 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -180,7 +180,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>  	int err = 0;
>  
>  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			       CLONE_NEWNET | CLONE_NEWPID)))
> +			       CLONE_NEWNET | CLONE_NEWPID | UNSHARE_FLAGS)))

It is inappropriate to assume that all unshare flags will be namespaces.

>  		return 0;
>  
>  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Eric W. Biederman @ 2014-10-07 20:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrey Vagin, Andrew Morton,
	Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn, Rob Landley
In-Reply-To: <20141007133339.GH7996-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>> > Another problem is that rootfs can't be hidden from a container, because
>> > rootfs can't be moved or umounted.
>> 
>> ... which is a bug in mntns_install(), AFAICS.
>
> Ability to get to exposed rootfs, that is.

The container side of this argument is pretty bogus.  It only applies
if user namespaces are not used for the container.

So it is only root (and not root in a container) who can get to the
exposed rootfs.

I have a vague memory someone actually had a real use in miminal systems
for being able to get back to the rootfs and being able to use rootfs as
the rootfs.  There was even a patch at that time that Andrew Morton was
carrying for a time to allow unmounting root and get at rootfs, and to
prevent the oops on rootfs unmount in some way.

So not only do I not think it is a bug to get back too rootfs, I think
it is a feature that some people have expressed at least half-way sane
uses for.

>> > Here is an example how to get access to rootfs:
>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
>> > umount2("/", MNT_DETACH);
>> > setns(fd, CLONE_NEWNS)
>> > 
>> > rootfs may contain data, which should not be avaliable in CT-s.
>> 
>> Indeed.
>
> ... and it looks like the above is what your mangled reproducer in previous
> patch had been made of -
> 	fd = open("/proc/self/ns/mnt", O_RDONLY)
> 	umount2("/", MNT_DETACH);
> 	setns(fd, CLONE_NEWNS)
> 	umount2("/", MNT_DETACH);
>
> IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
> is wrong.

IMO the bug is allowing us to unmount things that should never be unmounted.

In a mount namespace created with just user namespace permissions we
can't get at rootfs because MNT_LOCKED is set on the root directory
and thus it can not be mounted.

Further if anyone has permission to call chroot and chdir on any mount
in a mount namespace (that isn't currently covered) they can get at all
of them that are not currently covered.  A mount namespace where no one
can get at any uncovered filesystem seems to be the definition of
useless and ridiculous.

Now there is a bug in that MNT_DETACH today does not currently enforce
MNT_LOCKED on submounts of the mount point that is detached. I am
currently looking at how to construct the appropriate permission check
to prevent that.  Unfortunately I can not disallow MNT_DETACH with
submounts all together as that breaks too many legitimate uses.

That failure to enforce MNT_LOCKED is my mistake. I had a naive notion
that submounts would remain mounted after a mount detach and I misread
the code when I did the original work.  My mistake. 

Eric

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andrew Vagin @ 2014-10-07 19:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrey Vagin, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrey Vagin, Andrew Morton,
	Eric W. Biederman, Cyrill Gorcunov, Pavel Emelyanov, Serge Hallyn,
	Rob Landley
In-Reply-To: <20141007133339.GH7996-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>

On Tue, Oct 07, 2014 at 02:33:39PM +0100, Al Viro wrote:
> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> > On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> > > Another problem is that rootfs can't be hidden from a container, because
> > > rootfs can't be moved or umounted.
> > 
> > ... which is a bug in mntns_install(), AFAICS.
> 
> Ability to get to exposed rootfs, that is.
> 
> > > Here is an example how to get access to rootfs:
> > > fd = open("/proc/self/ns/mnt", O_RDONLY)
> > > umount2("/", MNT_DETACH);
> > > setns(fd, CLONE_NEWNS)
> > > 
> > > rootfs may contain data, which should not be avaliable in CT-s.
> > 
> > Indeed.
> 
> ... and it looks like the above is what your mangled reproducer in previous
> patch had been made of -
> 	fd = open("/proc/self/ns/mnt", O_RDONLY)
> 	umount2("/", MNT_DETACH);
> 	setns(fd, CLONE_NEWNS)
> 	umount2("/", MNT_DETACH);
> 
> IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
> is wrong.

How can we move in another namespace without this "switch"? If the task
root and cwd isn't switched, they points to the old tree. In this case
how can we open something from the new tree?

Do you mean that root should not be changed if it points on a detached
mount or on a mount from another mount namespace? This looks reasonable.

^ permalink raw reply

* RE: [PATCH 3/8] iio: Add support for DA9150 GPADC
From: Jonathan Cameron @ 2014-10-07 19:36 UTC (permalink / raw)
  To: Opensource [Adam Thomson], Lee Jones, Samuel Ortiz,
	linux-iio@vger.kernel.org, Sebastian Reichel,
	Dmitry Eremin-Solenikov, David Woodhouse,
	linux-pm@vger.kernel.org, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, Grant Likely,
	devicetree@vger.kernel.org, Andrew Morton, Joe Perches,
	linux-api@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, Support Opensource
In-Reply-To: <2E89032DDAA8B9408CB92943514A0337AB518D23@SW-EX-MBX01.diasemi.com>



On October 7, 2014 3:55:55 PM GMT+01:00, "Opensource [Adam Thomson]" <Adam.Thomson.Opensource@diasemi.com> wrote:
>On September 27, 2014 11:50, Jonathan Cameron wrote:
>
>> On 23/09/14 11:53, Adam Thomson wrote:
>> > This patch adds support for DA9150 Charger & Fuel-Gauge IC GPADC.
>
>> > +
>> > +static inline int da9150_gpadc_gpio_6v_voltage_now(int raw_val)
>> > +{
>> > +	/* Convert to mV */
>> > +	return (6 * ((raw_val * 1000) + 500)) / 1024;
>> These could all be expressed as raw values with offsets
>> and scales (and that would be preferred).
>> E.g. This one has offset 500000 and scale 6000/1024 or even
>> better use IIO_VAL_FRACTIONAL_LOG2 for scale with val1 = 6000
>> and val2 = (log_2 1024) = 10.
>> 
>
>What you've suggested isn't correct. The problem here is that the
>offset is
>added first to the raw ADC reading, without factoring the ADC value
>accordingly
>to match the factor of the offset. If we take the original equation
>provided for
>this channel of the ADC, the offset is actually 0.5 which should be
>added to the
>raw ADC value. This doesn't fit into the implementation in the kernel
>as we
>can't use floating point. If we multiply the offset but not the raw ADC
>value,
>then add them before applying the scale factor, then the result is
>wrong at the
>end. Basically you need a scale for the raw ADC value to match the
>offset scale
>so you can achieve the correct results, which is what my calculation
>does.
>But that seems impossible with the current raw|offset|scale method.
Oops got that wrong.  The fixed point maths to fix the in kernel interface isn't exactly
 difficult but indeed it does not handle this currently.
>
>> > +	ret = iio_map_array_register(indio_dev,
>da9150_gpadc_default_maps);
>> > +	if (ret) {
>> > +		dev_err(dev, "Failed to register IIO maps: %d\n", ret);
>> > +		return ret;
>> > +	}
>> I'd suggest doing the devm_request_thread_irq before the
>iio_map_array
>> stuff.  This is purely to avoid the order during remove not being
>> obviously correct as it isn't the reverse of during probe.
>
>Ok, should still work ok that way so can update.
>
>> > +static int da9150_gpadc_remove(struct platform_device *pdev)
>> > +{
>> > +	struct iio_dev *indio_dev = platform_get_drvdata(pdev);
>> > +
>> > +	iio_map_array_unregister(indio_dev);
>> Twice in one day.  I'm definitely thinking we should add a
>> devm version of iio_map_array_register...
>
>I assume you mean here that iio_device_unregister() should come first?
>Will
>update.
Nope just that such a new function might be useful.

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
From: Trond Myklebust @ 2014-10-07 19:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Thanos Makatos, Jens Axboe,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-api@vger.kernel.org, jlayton@poochiereds.net,
	bfields@fieldses.org
In-Reply-To: <20141007191624.GD30038@quack.suse.cz>

On Tue, Oct 7, 2014 at 3:16 PM, Jan Kara <jack@suse.cz> wrote:
> On Tue 07-10-14 12:30:59, Dave Chinner wrote:
>> On Mon, Oct 06, 2014 at 04:30:19PM +0200, Jan Kara wrote:
>> > On Mon 06-10-14 11:33:23, Thanos Makatos wrote:
>> > > > > Trond also had a comment that if we extended the ioctl to work for all
>> > > > > inodes (not just blkdev) and allowed some additional flags of what
>> > > > > needs to be invalidated, the new ioctl would be also useful to NFS
>> > > > > userspace - see Trond's email at
>> > > > >
>> > > > > http://www.spinics.net/lists/linux-fsdevel/msg78917.html
>> > > > >
>> > > > > and the following thread. I would prefer to cover that usecase when we
>> > > > > are introducing new invalidation ioctl. Have you considered that Thanos?
>> > > >
>> > > > Sure, though I don't really know how to do it. I'll start by looking at the code
>> > > > flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you
>> > > > already have a rough idea how to do that.
>> > >
>> > > I realise I haven't clearly understood what the semantics of this new ioctl
>> > > should be.
>> > >
>> > > My initial goal was to implement an ioctl that would _completely_ invalidate
>> > > the buffer cache of a block device when there is no file-system involved.
>> > > Unless I'm mistaken the patch I posted achieves this goal.
>> >   Yes.
>> >
>> > > We now want to extend this patch to take care of cached metadata, which seems
>> > > to be of particular importance for NFS, and I suspect that this piece of
>> > > functionality will still be applicable to any kind of file-system, correct?
>> >   So most notably they want the ioctl to work not only for block devices
>> > but also for any regular file. That's easily doable - you just call
>> > filemap_write_and_wait() and invalidate_inode_pages2() in the ioctl handler
>> > for regular files.
>> >
>> > Also they wanted to be able to specify a range of a mapping to invalidate -
>> > that's easily doable as well. Finally they wanted a 'flags' argument so you
>> > can additionally ask fs to invalidate also some metadata. How invalidation
>> > is done will be a fs specific thing and for now I guess we don't need to go
>> > into details. NFS guys can sort that out when they decide to implement it.
>> > So in the beginning we can just have u64 flags argument and in
>> > it a single 'INVAL_DATA' flag meaning that invalidation of data in a given
>> > range is requested. Later NFS guys can add further flags.
>>
>> Why do we need a new ioctl to do this? fadvise64() seems like it's
>> the exact fit for "FADV_INVALIDATE_[META]DATA" flags...
>   Well, fadvise() is currently a hint to kernel. In this case we would
> really like the call to do the invalidation and return error if it fails
> for some reason. So I'm not sure fadvise() is a perfect fit. But I wouldn't
> be strongly opposed to it either.
>

fadvise is about giving programs the ability to "announce an intention
to access file data in a specific pattern in the future, thus allowing
the kernel to perform appropriate optimizations" according to the
manpage.

Cache invalidation and revalidation, OTOH, is about ensuring meta/data
consistency between the disk and inode/page cache.

I'm not seeing a perfect match. :-)

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
From: Jan Kara @ 2014-10-07 19:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Thanos Makatos, Jens Axboe,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jlayton-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org,
	bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org
In-Reply-To: <20141007013059.GL2301@dastard>

On Tue 07-10-14 12:30:59, Dave Chinner wrote:
> On Mon, Oct 06, 2014 at 04:30:19PM +0200, Jan Kara wrote:
> > On Mon 06-10-14 11:33:23, Thanos Makatos wrote:
> > > > > Trond also had a comment that if we extended the ioctl to work for all
> > > > > inodes (not just blkdev) and allowed some additional flags of what
> > > > > needs to be invalidated, the new ioctl would be also useful to NFS
> > > > > userspace - see Trond's email at
> > > > >
> > > > > http://www.spinics.net/lists/linux-fsdevel/msg78917.html
> > > > >
> > > > > and the following thread. I would prefer to cover that usecase when we
> > > > > are introducing new invalidation ioctl. Have you considered that Thanos?
> > > > 
> > > > Sure, though I don't really know how to do it. I'll start by looking at the code
> > > > flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you
> > > > already have a rough idea how to do that.
> > > 
> > > I realise I haven't clearly understood what the semantics of this new ioctl
> > > should be.
> > > 
> > > My initial goal was to implement an ioctl that would _completely_ invalidate
> > > the buffer cache of a block device when there is no file-system involved.
> > > Unless I'm mistaken the patch I posted achieves this goal.
> >   Yes.
> > 
> > > We now want to extend this patch to take care of cached metadata, which seems
> > > to be of particular importance for NFS, and I suspect that this piece of
> > > functionality will still be applicable to any kind of file-system, correct?
> >   So most notably they want the ioctl to work not only for block devices
> > but also for any regular file. That's easily doable - you just call
> > filemap_write_and_wait() and invalidate_inode_pages2() in the ioctl handler
> > for regular files.
> > 
> > Also they wanted to be able to specify a range of a mapping to invalidate -
> > that's easily doable as well. Finally they wanted a 'flags' argument so you
> > can additionally ask fs to invalidate also some metadata. How invalidation
> > is done will be a fs specific thing and for now I guess we don't need to go
> > into details. NFS guys can sort that out when they decide to implement it.
> > So in the beginning we can just have u64 flags argument and in
> > it a single 'INVAL_DATA' flag meaning that invalidation of data in a given
> > range is requested. Later NFS guys can add further flags.
> 
> Why do we need a new ioctl to do this? fadvise64() seems like it's
> the exact fit for "FADV_INVALIDATE_[META]DATA" flags...
  Well, fadvise() is currently a hint to kernel. In this case we would
really like the call to do the invalidation and return error if it fails
for some reason. So I'm not sure fadvise() is a perfect fit. But I wouldn't
be strongly opposed to it either.
								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v2 2/7] tpm: two-phase chip management functions
From: Jarkko Sakkinen @ 2014-10-07 19:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141007182756.GA10774-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Tue, Oct 07, 2014 at 12:27:56PM -0600, Jason Gunthorpe wrote:
> On Tue, Oct 07, 2014 at 09:04:17PM +0300, Jarkko Sakkinen wrote:
> 
> > > Did you test compile all the drivers? One of my git commits on github
> > > has some hackery to make that possible on x86.
> > 
> > Yeah, I compiled all the drivers:
> > 
> > $ grep CONFIG_TCG .config
> > CONFIG_TCG_TPM=m
> > CONFIG_TCG_TIS=m
> > CONFIG_TCG_TIS_I2C_ATMEL=m
> > CONFIG_TCG_TIS_I2C_INFINEON=m
> > CONFIG_TCG_TIS_I2C_NUVOTON=m
> > CONFIG_TCG_NSC=m
> > CONFIG_TCG_ATMEL=m
> > CONFIG_TCG_INFINEON=m
> > CONFIG_TCG_ST33_I2C=m
> > CONFIG_TCG_XEN=m
> > CONFIG_TCG_CRB=m
> 
> CONFIG_TCG_IBMVTPM is missing, apply this:
> 
> https://github.com/jgunthorpe/linux/commit/74afd69de603783858283f9b337e1694988964f8

Oops, how did that go unnoticed. Will do.

> Jason

/Jarkko

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v2 2/7] tpm: two-phase chip management functions
From: Jason Gunthorpe @ 2014-10-07 18:27 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141007180417.GA29459-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On Tue, Oct 07, 2014 at 09:04:17PM +0300, Jarkko Sakkinen wrote:

> > Did you test compile all the drivers? One of my git commits on github
> > has some hackery to make that possible on x86.
> 
> Yeah, I compiled all the drivers:
> 
> $ grep CONFIG_TCG .config
> CONFIG_TCG_TPM=m
> CONFIG_TCG_TIS=m
> CONFIG_TCG_TIS_I2C_ATMEL=m
> CONFIG_TCG_TIS_I2C_INFINEON=m
> CONFIG_TCG_TIS_I2C_NUVOTON=m
> CONFIG_TCG_NSC=m
> CONFIG_TCG_ATMEL=m
> CONFIG_TCG_INFINEON=m
> CONFIG_TCG_ST33_I2C=m
> CONFIG_TCG_XEN=m
> CONFIG_TCG_CRB=m

CONFIG_TCG_IBMVTPM is missing, apply this:

https://github.com/jgunthorpe/linux/commit/74afd69de603783858283f9b337e1694988964f8

Jason

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v2 2/7] tpm: two-phase chip management functions
From: Jarkko Sakkinen @ 2014-10-07 18:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141007175017.GA10432-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Hi

And thanks for the feedback. Change requests look very reasonable.

On Tue, Oct 07, 2014 at 11:50:17AM -0600, Jason Gunthorpe wrote:
> On Tue, Oct 07, 2014 at 08:01:12PM +0300, Jarkko Sakkinen wrote:
> > Added tpm_chip_alloc() and tpm_chip_register() where tpm_chip_alloc()
> > reserves memory resources and tpm_chip_register() initializes the
> > device driver. This way it is possible to alter struct tpm_chip
> > attributes before passing it to tpm_chip_register().
> 
> This looks broadly reasonable to me
> 
> Please add a note to the commit that this is known to still have
> problems with resource reference counting, but they are less severe
> than what existed before, and this is only an interm step.
> 
> > +/**
> > + * tpm_chip_alloc() - allocate a new struct tpm_chip instance
> 
> This is using devm so it should be called 'tpmm_chip_alloc()' for
> clarity
> 
> 
> I know that was there before, but it sure is racy:
> 
> > +	chip->dev_num = find_first_zero_bit(dev_mask, TPM_NUM_DEVICES);
> [..]
> > +	set_bit(chip->dev_num, dev_mask);
> 
> Someday it should use IDR.
> 
> 
> > @@ -896,18 +872,7 @@ void tpm_remove_hardware(struct device *dev)
> >  		return;
> >  	}
> >  
> > -	spin_lock(&driver_lock);
> > -	list_del_rcu(&chip->list);
> > -	spin_unlock(&driver_lock);
> > -	synchronize_rcu();
> > -
> > -	tpm_dev_del_device(chip);
> > -	tpm_sysfs_del_device(chip);
> > -	tpm_remove_ppi(&dev->kobj);
> > -	tpm_bios_log_teardown(chip->bios_dir);
> > -
> > -	/* write it this way to be explicit (chip->dev == dev) */
> > -	put_device(chip->dev);
> > +	tpm_chip_unregister(chip);
> >  }
> >  EXPORT_SYMBOL_GPL(tpm_remove_hardware);
> 
> This can move to tpm-chip too, same with tpm_register_hardware
> 
> > @@ -714,15 +709,10 @@ static int tpm_tis_i2c_remove(struct i2c_client *client)
> >  	struct tpm_chip *chip = tpm_dev.chip;
> >  	release_locality(chip, chip->vendor.locality, 1);
> >  
> > -	/* close file handles */
> > -	tpm_dev_vendor_release(chip);
> > -
> >  	/* remove hardware */
> >  	tpm_remove_hardware(chip->dev);
> 
> Wrong ordering here, tpm_remove_hardware should always be first -
> drivers should not tear down internal state before calling it, so
> release_locality should be second.
> 
> Noting that since we use devm the kfree will not happen until
> remove returns, so the chip pointer is still valid.
> 
> >  	/* reset these pointers, otherwise we oops */
> > -	chip->dev->release = NULL;
> > -	chip->release = NULL;
> >  	tpm_dev.client = NULL;
> 
> The comment can go too
> 
> Note: tpm_dev should be driver private data, but that is not your
> problem..
> 
> Did you test compile all the drivers? One of my git commits on github
> has some hackery to make that possible on x86.

Yeah, I compiled all the drivers:

$ grep CONFIG_TCG .config
CONFIG_TCG_TPM=m
CONFIG_TCG_TIS=m
CONFIG_TCG_TIS_I2C_ATMEL=m
CONFIG_TCG_TIS_I2C_INFINEON=m
CONFIG_TCG_TIS_I2C_NUVOTON=m
CONFIG_TCG_NSC=m
CONFIG_TCG_ATMEL=m
CONFIG_TCG_INFINEON=m
CONFIG_TCG_ST33_I2C=m
CONFIG_TCG_XEN=m
CONFIG_TCG_CRB=m

> Jason

/Jarkko

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v2 6/7] tpm: TPM 2.0 CRB Interface
From: Jason Gunthorpe @ 2014-10-07 17:54 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1412701277-27794-7-git-send-email-jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

On Tue, Oct 07, 2014 at 08:01:16PM +0300, Jarkko Sakkinen wrote:

> +	rc = tpm_chip_register(chip);
> +	if (rc)
> +		return -ENODEV;

Again, wrong order, this needs to be last in the probe function

Jason

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v2 3/7] tpm: clean up tpm_tis driver life-cycle
From: Jason Gunthorpe @ 2014-10-07 17:53 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1412701277-27794-4-git-send-email-jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

On Tue, Oct 07, 2014 at 08:01:13PM +0300, Jarkko Sakkinen wrote:

> +	chip = tpm_chip_alloc(dev, &tpm_tis);
> +	if (!chip)
>  		return -ENODEV;

Needs to use ERR_PTR
  
> +	rc = tpm_chip_register(chip);
> +	if (rc)
> +		return -ENODEV;

Wrong ordering, this needs to be last in the probe function

Return rc not -ENODEV
  
> +static void tpm_tis_chip_remove(struct tpm_chip *chip)
> +{
> +	iowrite32(~TPM_GLOBAL_INT_ENABLE &
> +		  ioread32(chip->vendor.iobase +
> +			   TPM_INT_ENABLE(chip->vendor.
> +					  locality)),
> +		  chip->vendor.iobase +
> +		  TPM_INT_ENABLE(chip->vendor.locality));
> +	release_locality(chip, chip->vendor.locality, 1);
> +	if (chip->vendor.irq)
> +		free_irq(chip->vendor.irq, chip);
> +
> +	tpm_chip_unregister(chip);
> +}

Wrong ordering, tpm_chip_unregister needs to be first

> +	chip = dev_get_drvdata(&pdev->dev);
> +	tpm_tis_chip_remove(chip);
>  	platform_device_unregister(pdev);

I'm under the impression devm does not work outside a device driver
context, so adding devm breaks force mode in this driver. Do you see
differently?

AFAIK the two options are to fix force mode so that it attaches the
dummy platform driver (that is what it is for after all) or remove
force mode.

Jason

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v2 2/7] tpm: two-phase chip management functions
From: Jason Gunthorpe @ 2014-10-07 17:50 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst, linux-api, tpmdd-devel,
	linux-kernel
In-Reply-To: <1412701277-27794-3-git-send-email-jarkko.sakkinen@linux.intel.com>

On Tue, Oct 07, 2014 at 08:01:12PM +0300, Jarkko Sakkinen wrote:
> Added tpm_chip_alloc() and tpm_chip_register() where tpm_chip_alloc()
> reserves memory resources and tpm_chip_register() initializes the
> device driver. This way it is possible to alter struct tpm_chip
> attributes before passing it to tpm_chip_register().

This looks broadly reasonable to me

Please add a note to the commit that this is known to still have
problems with resource reference counting, but they are less severe
than what existed before, and this is only an interm step.

> +/**
> + * tpm_chip_alloc() - allocate a new struct tpm_chip instance

This is using devm so it should be called 'tpmm_chip_alloc()' for
clarity


I know that was there before, but it sure is racy:

> +	chip->dev_num = find_first_zero_bit(dev_mask, TPM_NUM_DEVICES);
[..]
> +	set_bit(chip->dev_num, dev_mask);

Someday it should use IDR.


> @@ -896,18 +872,7 @@ void tpm_remove_hardware(struct device *dev)
>  		return;
>  	}
>  
> -	spin_lock(&driver_lock);
> -	list_del_rcu(&chip->list);
> -	spin_unlock(&driver_lock);
> -	synchronize_rcu();
> -
> -	tpm_dev_del_device(chip);
> -	tpm_sysfs_del_device(chip);
> -	tpm_remove_ppi(&dev->kobj);
> -	tpm_bios_log_teardown(chip->bios_dir);
> -
> -	/* write it this way to be explicit (chip->dev == dev) */
> -	put_device(chip->dev);
> +	tpm_chip_unregister(chip);
>  }
>  EXPORT_SYMBOL_GPL(tpm_remove_hardware);

This can move to tpm-chip too, same with tpm_register_hardware

> @@ -714,15 +709,10 @@ static int tpm_tis_i2c_remove(struct i2c_client *client)
>  	struct tpm_chip *chip = tpm_dev.chip;
>  	release_locality(chip, chip->vendor.locality, 1);
>  
> -	/* close file handles */
> -	tpm_dev_vendor_release(chip);
> -
>  	/* remove hardware */
>  	tpm_remove_hardware(chip->dev);

Wrong ordering here, tpm_remove_hardware should always be first -
drivers should not tear down internal state before calling it, so
release_locality should be second.

Noting that since we use devm the kfree will not happen until
remove returns, so the chip pointer is still valid.

>  	/* reset these pointers, otherwise we oops */
> -	chip->dev->release = NULL;
> -	chip->release = NULL;
>  	tpm_dev.client = NULL;

The comment can go too

Note: tpm_dev should be driver private data, but that is not your
problem..

Did you test compile all the drivers? One of my git commits on github
has some hackery to make that possible on x86.

Jason

^ permalink raw reply

* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
From: Dr. David Alan Gilbert @ 2014-10-07 17:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Linus Torvalds, Andrea Arcangeli, qemu-devel, KVM list,
	Linux Kernel Mailing List, linux-mm, Linux API,
	Andres Lagar-Cavilla, Dave Hansen, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Christopher Covington, Johannes Weiner,
	Android Kernel Team, Robert Love, Dmitry Adamushko <dmi>
In-Reply-To: <54341F77.6060400@redhat.com>

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> Il 07/10/2014 19:07, Dr. David Alan Gilbert ha scritto:
> >> > 
> >> > So I'd *much* rather have a "write()" style interface (ie _copying_
> >> > bytes from user space into a newly allocated page that gets mapped)
> >> > than a "remap page" style interface
> > Something like that might work for the postcopy case; it doesn't work
> > for some of the other uses that need to stop a page being changed by the
> > guest, but then need to somehow get a copy of that page internally to QEMU,
> > and perhaps provide it back later.
> 
> I cannot parse this.  Which uses do you have in mind?  Is it for
> QEMU-specific or is it for other applications of userfaults?

> As long as the page is atomically mapped, I'm not sure what the
> difference from remap_anon_pages are (as far as the destination page is
> concerned).  Are you thinking of having userfaults enabled on the source
> as well?

What I'm talking about here is when I want to stop a page being accessed by the
guest, do something with the data in qemu, and give it back to the guest sometime
later.

The main example is: Memory pools for guests where you swap RAM between a series of
VM hosts.  You have to take the page out, send it over the wire, sometime later if
the guest tries to access it you userfault and pull it back.
(There's at least one or two companies selling something like this, and at least
one Linux based implementations with their own much more involved kernel hacks)

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
From: Paolo Bonzini @ 2014-10-07 17:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Linus Torvalds
  Cc: Andrea Arcangeli, qemu-devel, KVM list, Linux Kernel Mailing List,
	linux-mm, Linux API, Andres Lagar-Cavilla, Dave Hansen,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown
In-Reply-To: <20141007170731.GO2404@work-vm>

Il 07/10/2014 19:07, Dr. David Alan Gilbert ha scritto:
>> > 
>> > So I'd *much* rather have a "write()" style interface (ie _copying_
>> > bytes from user space into a newly allocated page that gets mapped)
>> > than a "remap page" style interface
> Something like that might work for the postcopy case; it doesn't work
> for some of the other uses that need to stop a page being changed by the
> guest, but then need to somehow get a copy of that page internally to QEMU,
> and perhaps provide it back later.

I cannot parse this.  Which uses do you have in mind?  Is it for
QEMU-specific or is it for other applications of userfaults?

As long as the page is atomically mapped, I'm not sure what the
difference from remap_anon_pages are (as far as the destination page is
concerned).  Are you thinking of having userfaults enabled on the source
as well?

Paolo

> remap_anon_pages worked for those cases
> as well; I can't think of another current way of doing it in userspace.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox