[Ocfs2-devel] [PATCH v8] fs: clear file privilege bits when mmap writing

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jan Kara <jack@suse.cz>
To: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kees Cook <keescook@chromium.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andy Lutomirski <luto@amacapital.net>,
	yalin wang <yalin.wang2010@gmail.com>, Willy Tarreau <w@1wt.eu>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	mfasheh@suse.de, ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] [PATCH v8] fs: clear file privilege bits when mmap writing
Date: Fri, 15 Jan 2016 11:17:59 +0100	[thread overview]
Message-ID: <20160115101759.GC15950@quack.suse.cz> (raw)
In-Reply-To: <CALYGNiNn7m=AS9Zopa1g+TT5key=bEi7pKzGz3LzazQfm28qUw@mail.gmail.com>

On Thu 14-01-16 10:35:17, Konstantin Khlebnikov wrote:
> On Wed, Jan 13, 2016 at 11:33 PM, Kees Cook <keescook@chromium.org> wrote:
> > On Wed, Jan 13, 2016 at 12:23 PM, Konstantin Khlebnikov
> > <koct9i@gmail.com> wrote:
> >> On Wed, Jan 13, 2016 at 7:09 PM, Kees Cook <keescook@chromium.org> wrote:
> >>> On Wed, Jan 13, 2016 at 1:03 AM, Jan Kara <jack@suse.cz> wrote:
> >>>> On Tue 12-01-16 11:09:04, Kees Cook wrote:
> >>>>> Normally, when a user can modify a file that has setuid or setgid bits,
> >>>>> those bits are cleared when they are not the file owner or a member
> >>>>> of the group. This is enforced when using write and truncate but not
> >>>>> when writing to a shared mmap on the file. This could allow the file
> >>>>> writer to gain privileges by changing a binary without losing the
> >>>>> setuid/setgid/caps bits.
> >>>>>
> >>>>> Changing the bits requires holding inode->i_mutex, so it cannot be done
> >>>>> during the page fault (due to mmap_sem being held during the fault).
> >>>>> Instead, clear the bits if PROT_WRITE is being used at mmap open time,
> >>>>> or added at mprotect time.
> >>>>>
> >>>>> Since we can't do the check in the right place inside mmap (due to
> >>>>> holding mmap_sem), we have to do it before holding mmap_sem, which
> >>>>> means duplicating some checks, which have to be available to the non-MMU
> >>>>> builds too.
> >>>>>
> >>>>> When walking VMAs during mprotect, we need to drop mmap_sem (while
> >>>>> holding a file reference) and restart the walk after clearing privileges.
> >>>>
> >>>> ...
> >>>>
> >>>>> @@ -375,6 +376,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
> >>>>>
> >>>>>       vm_flags = calc_vm_prot_bits(prot);
> >>>>>
> >>>>> +restart:
> >>>>>       down_write(&current->mm->mmap_sem);
> >>>>>
> >>>>>       vma = find_vma(current->mm, start);
> >>>>> @@ -416,6 +418,28 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
> >>>>>                       goto out;
> >>>>>               }
> >>>>>
> >>>>> +             /*
> >>>>> +              * If we're adding write permissions to a shared file,
> >>>>> +              * we must clear privileges (like done at mmap time),
> >>>>> +              * but we have to juggle the locks to avoid holding
> >>>>> +              * mmap_sem while holding i_mutex.
> >>>>> +              */
> >>>>> +             if ((vma->vm_flags & VM_SHARED) && vma->vm_file &&
> >>>>> +                 (newflags & VM_WRITE) && !(vma->vm_flags & VM_WRITE) &&
> >>>>> +                 !IS_NOSEC(file_inode(vma->vm_file))) {
> >>>>
> >>>> This code assumes that IS_NOSEC gets set for inode once file_remove_privs()
> >>>> is called. However that is not true for two reasons:
> >>>>
> >>>> 1) When you are root, SUID bit doesn't get cleared and thus you cannot set
> >>>> IS_NOSEC.
> >>>>
> >>>> 2) Some filesystems do not have MS_NOSEC set and for those IS_NOSEC is
> >>>> never true.
> >>>>
> >>>> So in these cases you'll loop forever.
> >>>
> >>> UUuugh.
> >>>
> >>>>
> >>>> You can check SUID bits without i_mutex so that could be done without
> >>>> dropping mmap_sem but you cannot easily call security_inode_need_killpriv()
> >>>> without i_mutex as that checks extended attributes (IMA) and that needs
> >>>> i_mutex to be held to avoid races with someone else changing the attributes
> >>>> under you.
> >>>
> >>> Yeah, that's why I changed this from Konstantin's original suggestion.
> >>>
> >>>> Honestly, I don't see a way of implementing this in mprotect() which would
> >>>> be reasonably elegant.
> >>>
> >>> Konstantin, any thoughts here?
> >>
> >> Getxattr works fine without i_mutex: sys_getxattr/vfs_getxattr doesn't lock it.
> >> If somebody changes xattrs under us we'll end up in race anyway.
> >> But this still safe: setxattrs are sychronized.
> >
> > So I can swap my IS_NOSEC for your original file_needs_remove_privs()?
> > Are the LSM hooks expecting to be called under mm_sem? (Looks like
> > only common_caps implements that, though.)
> 
> getxattr should nests inside mmap_sem safely: it has sort of
> "readpage" semantics,
> actually ext4 uses it when inlines content of tiny files into xattr.

First, sorry Kees for misleading you. Somehow I missed that i_mutex is not
actually acquired for getxattr() calls.

I have checked and lots of filesystems have dedicated xattr semaphore which
should be safe to nest inside mmap_sem. There are filesystems like ocfs2 or
gfs2 which use their equivalent of i_mutex for xattr locking so there we
would create lock inversion when calling file_needs_remove_privs() from
under mmap_sem.

That being said at least OCFS2 has other issues with this xattr locking
scheme and they are working on changing things AFAIK. Mark can you perhaps
comment?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

WARNING: multiple messages have this Message-ID (diff)

From: Jan Kara <jack@suse.cz>
To: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kees Cook <keescook@chromium.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andy Lutomirski <luto@amacapital.net>,
	yalin wang <yalin.wang2010@gmail.com>, Willy Tarreau <w@1wt.eu>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	mfasheh@suse.de, ocfs2-devel@oss.oracle.com
Subject: Re: [PATCH v8] fs: clear file privilege bits when mmap writing
Date: Fri, 15 Jan 2016 11:17:59 +0100	[thread overview]
Message-ID: <20160115101759.GC15950@quack.suse.cz> (raw)
In-Reply-To: <CALYGNiNn7m=AS9Zopa1g+TT5key=bEi7pKzGz3LzazQfm28qUw@mail.gmail.com>

On Thu 14-01-16 10:35:17, Konstantin Khlebnikov wrote:
> On Wed, Jan 13, 2016 at 11:33 PM, Kees Cook <keescook@chromium.org> wrote:
> > On Wed, Jan 13, 2016 at 12:23 PM, Konstantin Khlebnikov
> > <koct9i@gmail.com> wrote:
> >> On Wed, Jan 13, 2016 at 7:09 PM, Kees Cook <keescook@chromium.org> wrote:
> >>> On Wed, Jan 13, 2016 at 1:03 AM, Jan Kara <jack@suse.cz> wrote:
> >>>> On Tue 12-01-16 11:09:04, Kees Cook wrote:
> >>>>> Normally, when a user can modify a file that has setuid or setgid bits,
> >>>>> those bits are cleared when they are not the file owner or a member
> >>>>> of the group. This is enforced when using write and truncate but not
> >>>>> when writing to a shared mmap on the file. This could allow the file
> >>>>> writer to gain privileges by changing a binary without losing the
> >>>>> setuid/setgid/caps bits.
> >>>>>
> >>>>> Changing the bits requires holding inode->i_mutex, so it cannot be done
> >>>>> during the page fault (due to mmap_sem being held during the fault).
> >>>>> Instead, clear the bits if PROT_WRITE is being used at mmap open time,
> >>>>> or added at mprotect time.
> >>>>>
> >>>>> Since we can't do the check in the right place inside mmap (due to
> >>>>> holding mmap_sem), we have to do it before holding mmap_sem, which
> >>>>> means duplicating some checks, which have to be available to the non-MMU
> >>>>> builds too.
> >>>>>
> >>>>> When walking VMAs during mprotect, we need to drop mmap_sem (while
> >>>>> holding a file reference) and restart the walk after clearing privileges.
> >>>>
> >>>> ...
> >>>>
> >>>>> @@ -375,6 +376,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
> >>>>>
> >>>>>       vm_flags = calc_vm_prot_bits(prot);
> >>>>>
> >>>>> +restart:
> >>>>>       down_write(&current->mm->mmap_sem);
> >>>>>
> >>>>>       vma = find_vma(current->mm, start);
> >>>>> @@ -416,6 +418,28 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
> >>>>>                       goto out;
> >>>>>               }
> >>>>>
> >>>>> +             /*
> >>>>> +              * If we're adding write permissions to a shared file,
> >>>>> +              * we must clear privileges (like done at mmap time),
> >>>>> +              * but we have to juggle the locks to avoid holding
> >>>>> +              * mmap_sem while holding i_mutex.
> >>>>> +              */
> >>>>> +             if ((vma->vm_flags & VM_SHARED) && vma->vm_file &&
> >>>>> +                 (newflags & VM_WRITE) && !(vma->vm_flags & VM_WRITE) &&
> >>>>> +                 !IS_NOSEC(file_inode(vma->vm_file))) {
> >>>>
> >>>> This code assumes that IS_NOSEC gets set for inode once file_remove_privs()
> >>>> is called. However that is not true for two reasons:
> >>>>
> >>>> 1) When you are root, SUID bit doesn't get cleared and thus you cannot set
> >>>> IS_NOSEC.
> >>>>
> >>>> 2) Some filesystems do not have MS_NOSEC set and for those IS_NOSEC is
> >>>> never true.
> >>>>
> >>>> So in these cases you'll loop forever.
> >>>
> >>> UUuugh.
> >>>
> >>>>
> >>>> You can check SUID bits without i_mutex so that could be done without
> >>>> dropping mmap_sem but you cannot easily call security_inode_need_killpriv()
> >>>> without i_mutex as that checks extended attributes (IMA) and that needs
> >>>> i_mutex to be held to avoid races with someone else changing the attributes
> >>>> under you.
> >>>
> >>> Yeah, that's why I changed this from Konstantin's original suggestion.
> >>>
> >>>> Honestly, I don't see a way of implementing this in mprotect() which would
> >>>> be reasonably elegant.
> >>>
> >>> Konstantin, any thoughts here?
> >>
> >> Getxattr works fine without i_mutex: sys_getxattr/vfs_getxattr doesn't lock it.
> >> If somebody changes xattrs under us we'll end up in race anyway.
> >> But this still safe: setxattrs are sychronized.
> >
> > So I can swap my IS_NOSEC for your original file_needs_remove_privs()?
> > Are the LSM hooks expecting to be called under mm_sem? (Looks like
> > only common_caps implements that, though.)
> 
> getxattr should nests inside mmap_sem safely: it has sort of
> "readpage" semantics,
> actually ext4 uses it when inlines content of tiny files into xattr.

First, sorry Kees for misleading you. Somehow I missed that i_mutex is not
actually acquired for getxattr() calls.

I have checked and lots of filesystems have dedicated xattr semaphore which
should be safe to nest inside mmap_sem. There are filesystems like ocfs2 or
gfs2 which use their equivalent of i_mutex for xattr locking so there we
would create lock inversion when calling file_needs_remove_privs() from
under mmap_sem.

That being said at least OCFS2 has other issues with this xattr locking
scheme and they are working on changing things AFAIK. Mark can you perhaps
comment?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Jan Kara <jack@suse.cz>
To: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kees Cook <keescook@chromium.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andy Lutomirski <luto@amacapital.net>,
	yalin wang <yalin.wang2010@gmail.com>, Willy Tarreau <w@1wt.eu>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	mfasheh@suse.de, ocfs2-devel@oss.oracle.com
Subject: Re: [PATCH v8] fs: clear file privilege bits when mmap writing
Date: Fri, 15 Jan 2016 11:17:59 +0100	[thread overview]
Message-ID: <20160115101759.GC15950@quack.suse.cz> (raw)
In-Reply-To: <CALYGNiNn7m=AS9Zopa1g+TT5key=bEi7pKzGz3LzazQfm28qUw@mail.gmail.com>

On Thu 14-01-16 10:35:17, Konstantin Khlebnikov wrote:
> On Wed, Jan 13, 2016 at 11:33 PM, Kees Cook <keescook@chromium.org> wrote:
> > On Wed, Jan 13, 2016 at 12:23 PM, Konstantin Khlebnikov
> > <koct9i@gmail.com> wrote:
> >> On Wed, Jan 13, 2016 at 7:09 PM, Kees Cook <keescook@chromium.org> wrote:
> >>> On Wed, Jan 13, 2016 at 1:03 AM, Jan Kara <jack@suse.cz> wrote:
> >>>> On Tue 12-01-16 11:09:04, Kees Cook wrote:
> >>>>> Normally, when a user can modify a file that has setuid or setgid bits,
> >>>>> those bits are cleared when they are not the file owner or a member
> >>>>> of the group. This is enforced when using write and truncate but not
> >>>>> when writing to a shared mmap on the file. This could allow the file
> >>>>> writer to gain privileges by changing a binary without losing the
> >>>>> setuid/setgid/caps bits.
> >>>>>
> >>>>> Changing the bits requires holding inode->i_mutex, so it cannot be done
> >>>>> during the page fault (due to mmap_sem being held during the fault).
> >>>>> Instead, clear the bits if PROT_WRITE is being used at mmap open time,
> >>>>> or added at mprotect time.
> >>>>>
> >>>>> Since we can't do the check in the right place inside mmap (due to
> >>>>> holding mmap_sem), we have to do it before holding mmap_sem, which
> >>>>> means duplicating some checks, which have to be available to the non-MMU
> >>>>> builds too.
> >>>>>
> >>>>> When walking VMAs during mprotect, we need to drop mmap_sem (while
> >>>>> holding a file reference) and restart the walk after clearing privileges.
> >>>>
> >>>> ...
> >>>>
> >>>>> @@ -375,6 +376,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
> >>>>>
> >>>>>       vm_flags = calc_vm_prot_bits(prot);
> >>>>>
> >>>>> +restart:
> >>>>>       down_write(&current->mm->mmap_sem);
> >>>>>
> >>>>>       vma = find_vma(current->mm, start);
> >>>>> @@ -416,6 +418,28 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
> >>>>>                       goto out;
> >>>>>               }
> >>>>>
> >>>>> +             /*
> >>>>> +              * If we're adding write permissions to a shared file,
> >>>>> +              * we must clear privileges (like done at mmap time),
> >>>>> +              * but we have to juggle the locks to avoid holding
> >>>>> +              * mmap_sem while holding i_mutex.
> >>>>> +              */
> >>>>> +             if ((vma->vm_flags & VM_SHARED) && vma->vm_file &&
> >>>>> +                 (newflags & VM_WRITE) && !(vma->vm_flags & VM_WRITE) &&
> >>>>> +                 !IS_NOSEC(file_inode(vma->vm_file))) {
> >>>>
> >>>> This code assumes that IS_NOSEC gets set for inode once file_remove_privs()
> >>>> is called. However that is not true for two reasons:
> >>>>
> >>>> 1) When you are root, SUID bit doesn't get cleared and thus you cannot set
> >>>> IS_NOSEC.
> >>>>
> >>>> 2) Some filesystems do not have MS_NOSEC set and for those IS_NOSEC is
> >>>> never true.
> >>>>
> >>>> So in these cases you'll loop forever.
> >>>
> >>> UUuugh.
> >>>
> >>>>
> >>>> You can check SUID bits without i_mutex so that could be done without
> >>>> dropping mmap_sem but you cannot easily call security_inode_need_killpriv()
> >>>> without i_mutex as that checks extended attributes (IMA) and that needs
> >>>> i_mutex to be held to avoid races with someone else changing the attributes
> >>>> under you.
> >>>
> >>> Yeah, that's why I changed this from Konstantin's original suggestion.
> >>>
> >>>> Honestly, I don't see a way of implementing this in mprotect() which would
> >>>> be reasonably elegant.
> >>>
> >>> Konstantin, any thoughts here?
> >>
> >> Getxattr works fine without i_mutex: sys_getxattr/vfs_getxattr doesn't lock it.
> >> If somebody changes xattrs under us we'll end up in race anyway.
> >> But this still safe: setxattrs are sychronized.
> >
> > So I can swap my IS_NOSEC for your original file_needs_remove_privs()?
> > Are the LSM hooks expecting to be called under mm_sem? (Looks like
> > only common_caps implements that, though.)
> 
> getxattr should nests inside mmap_sem safely: it has sort of
> "readpage" semantics,
> actually ext4 uses it when inlines content of tiny files into xattr.

First, sorry Kees for misleading you. Somehow I missed that i_mutex is not
actually acquired for getxattr() calls.

I have checked and lots of filesystems have dedicated xattr semaphore which
should be safe to nest inside mmap_sem. There are filesystems like ocfs2 or
gfs2 which use their equivalent of i_mutex for xattr locking so there we
would create lock inversion when calling file_needs_remove_privs() from
under mmap_sem.

That being said at least OCFS2 has other issues with this xattr locking
scheme and they are working on changing things AFAIK. Mark can you perhaps
comment?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

next prev parent reply	other threads:[~2016-01-15 10:17 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-12 19:09 [PATCH v8] fs: clear file privilege bits when mmap writing Kees Cook
2016-01-12 19:09 ` Kees Cook
2016-01-13  9:03 ` Jan Kara
2016-01-13  9:03   ` Jan Kara
2016-01-13 16:09   ` Kees Cook
2016-01-13 16:09     ` Kees Cook
2016-01-13 20:23     ` Konstantin Khlebnikov
2016-01-13 20:23       ` Konstantin Khlebnikov
2016-01-13 20:33       ` Kees Cook
2016-01-13 20:33         ` Kees Cook
2016-01-14  7:35         ` Konstantin Khlebnikov
2016-01-14  7:35           ` Konstantin Khlebnikov
2016-01-15 10:17           ` Jan Kara [this message]
2016-01-15 10:17             ` Jan Kara
2016-01-15 10:17             ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160115101759.GC15950@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=keescook@chromium.org \
    --cc=koct9i@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@amacapital.net \
    --cc=mfasheh@suse.de \
    --cc=ocfs2-devel@oss.oracle.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=w@1wt.eu \
    --cc=yalin.wang2010@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.