Linux Documentation
 help / color / mirror / Atom feed
* Re: [PATCH v12 0/5] overlayfs override_creds=off
From: Casey Schaufler @ 2019-07-30 21:37 UTC (permalink / raw)
  To: Mark Salyzyn, linux-kernel
  Cc: kernel-team, Miklos Szeredi, Jonathan Corbet, Vivek Goyal,
	Eric W . Biederman, Amir Goldstein, Randy Dunlap, Stephen Smalley,
	linux-unionfs, linux-doc, Linux Security Module list
In-Reply-To: <20190730172904.79146-1-salyzyn@android.com>

On 7/30/2019 10:28 AM, Mark Salyzyn wrote:
> Patch series:

Please add linux-security-module@vger.kernel.org to the CC
for all changes affecting handling of security xattrs.

>
> overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh
> Add flags option to get xattr method paired to __vfs_getxattr
> overlayfs: handle XATTR_NOSECURITY flag for get xattr method
> overlayfs: internal getxattr operations without sepolicy checking
> overlayfs: override_creds=off option bypass creator_cred
>
> The first four patches address fundamental security issues that should
> be solved regardless of the override_creds=off feature.
> on them).
>
> The fifth adds the feature depends on these other fixes.
>
> By default, all access to the upper, lower and work directories is the
> recorded mounter's MAC and DAC credentials.  The incoming accesses are
> checked against the caller's credentials.
>
> If the principles of least privilege are applied for sepolicy, the
> mounter's credentials might not overlap the credentials of the caller's
> when accessing the overlayfs filesystem.  For example, a file that a
> lower DAC privileged caller can execute, is MAC denied to the
> generally higher DAC privileged mounter, to prevent an attack vector.
>
> We add the option to turn off override_creds in the mount options; all
> subsequent operations after mount on the filesystem will be only the
> caller's credentials.  The module boolean parameter and mount option
> override_creds is also added as a presence check for this "feature",
> existence of /sys/module/overlay/parameters/overlay_creds
>
> Signed-off-by: Mark Salyzyn <salyzyn@android.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Stephen Smalley <sds@tycho.nsa.gov>
> Cc: linux-unionfs@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
>
> ---
> v12:
> - Restore squished out patch 2 and 3 in the series,
>   then change algorithm to add flags argument.
>   Per-thread flag is a large security surface.
>
> v11:
> - Squish out v10 introduced patch 2 and 3 in the series,
>   then and use per-thread flag instead for nesting.
> - Switch name to ovl_do_vds_getxattr for __vds_getxattr wrapper.
> - Add sb argument to ovl_revert_creds to match future work.
>
> v10:
> - Return NULL on CAP_DAC_READ_SEARCH
> - Add __get xattr method to solve sepolicy logging issue
> - Drop unnecessary sys_admin sepolicy checking for administrative
>   driver internal xattr functions.
>
> v6:
> - Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
> - Do better with the documentation, drop rationalizations.
> - pr_warn message adjusted to report consequences.
>
> v5:
> - beefed up the caveats in the Documentation
> - Is dependent on
>   "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
>   "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
> - Added prwarn when override_creds=off
>
> v4:
> - spelling and grammar errors in text
>
> v3:
> - Change name from caller_credentials / creator_credentials to the
>   boolean override_creds.
> - Changed from creator to mounter credentials.
> - Updated and fortified the documentation.
> - Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS
>
> v2:
> - Forward port changed attr to stat, resulting in a build error.
> - altered commit message.
>

^ permalink raw reply

* Re: [PATCH v2 25/26] docs: rcu: convert some articles from html to ReST
From: Paul E. McKenney @ 2019-07-30 21:22 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, Jonathan Corbet, rcu, linux-doc
In-Reply-To: <8444797277eea7be474f40625bb190775a9cee33.1564145354.git.mchehab+samsung@kernel.org>

On Fri, Jul 26, 2019 at 09:51:35AM -0300, Mauro Carvalho Chehab wrote:
> There are 4 RCU articles that are written on html format.
> 
> The way they are, they can't be part of the Linux Kernel
> documentation body nor share the styles and pdf output.
> 
> So, convert them to ReST format.
> 
> This way, make htmldocs and make pdfdocs will produce a
> documentation output that will be like the original ones, but
> will be part of the Linux Kernel documentation body.
> 
> Part of the conversion was done with the help of pandoc, but
> the result had some broken things that had to be manually
> fixed.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

I am having some trouble applying these, at least in part due to UTF-8
sequences, for example double left quotation mark.  These end up being
"=E2=80=9C", with a few space characters turned into "=20".

Any advice on how to apply these?  Should I just pull commits from
somewhere?

							Thanx, Paul

^ permalink raw reply

* [PATCH v2 1/2] idr: Document calling context for IDA APIs mustn't use locks
From: Stephen Boyd @ 2019-07-30 21:20 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, Greg KH, Tri Vo, Jonathan Corbet, linux-doc

The documentation for these functions indicates that callers don't need
to hold a lock while calling them, but that documentation is only in one
place under "IDA Usage". Let's state the same information on each IDA
function so that it's clear what the calling context requires.
Furthermore, let's document ida_simple_get() with the same information
so that callers know how this API works.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Tri Vo <trong@android.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
---
 include/linux/idr.h | 9 ++++++---
 lib/idr.c           | 9 ++++++---
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 4ec8986e5dfb..5bb026007044 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -263,7 +263,8 @@ void ida_destroy(struct ida *ida);
  *
  * Allocate an ID between 0 and %INT_MAX, inclusive.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * locking in your code.
  * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
  * or %-ENOSPC if there are no free IDs.
  */
@@ -280,7 +281,8 @@ static inline int ida_alloc(struct ida *ida, gfp_t gfp)
  *
  * Allocate an ID between @min and %INT_MAX, inclusive.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * locking in your code.
  * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
  * or %-ENOSPC if there are no free IDs.
  */
@@ -297,7 +299,8 @@ static inline int ida_alloc_min(struct ida *ida, unsigned int min, gfp_t gfp)
  *
  * Allocate an ID between 0 and @max, inclusive.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * locking in your code.
  * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
  * or %-ENOSPC if there are no free IDs.
  */
diff --git a/lib/idr.c b/lib/idr.c
index 66a374892482..dbd25696162e 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -381,7 +381,8 @@ EXPORT_SYMBOL(idr_replace);
  * Allocate an ID between @min and @max, inclusive.  The allocated ID will
  * not exceed %INT_MAX, even if @max is larger.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * locking in your code.
  * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
  * or %-ENOSPC if there are no free IDs.
  */
@@ -488,7 +489,8 @@ EXPORT_SYMBOL(ida_alloc_range);
  * @ida: IDA handle.
  * @id: Previously allocated ID.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * locking in your code.
  */
 void ida_free(struct ida *ida, unsigned int id)
 {
@@ -540,7 +542,8 @@ EXPORT_SYMBOL(ida_free);
  * or freed.  If the IDA is already empty, there is no need to call this
  * function.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * locking in your code.
  */
 void ida_destroy(struct ida *ida)
 {
-- 
Sent by a computer through tubes


^ permalink raw reply related

* [PATCH v2 2/2] idr: Document that ida_simple_{get,remove}() are deprecated
From: Stephen Boyd @ 2019-07-30 21:20 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, Greg KH, Tri Vo, Jonathan Corbet, linux-doc
In-Reply-To: <20190730212048.164657-1-swboyd@chromium.org>

These two functions are deprecated. Users should call ida_alloc() or
ida_free() respectively instead. Add documentation to this effect until
the macro can be removed.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Tri Vo <trong@android.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
---
 include/linux/idr.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 5bb026007044..12f7233c7adb 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -314,6 +314,10 @@ static inline void ida_init(struct ida *ida)
 	xa_init_flags(&ida->xa, IDA_INIT_FLAGS);
 }
 
+/*
+ * ida_simple_get() and ida_simple_remove() are deprecated. Use
+ * ida_alloc() and ida_free() instead respectively.
+ */
 #define ida_simple_get(ida, start, end, gfp)	\
 			ida_alloc_range(ida, start, (end) - 1, gfp)
 #define ida_simple_remove(ida, id)	ida_free(ida, id)
-- 
Sent by a computer through tubes


^ permalink raw reply related

* Re: [PATCH] idr: Document calling context for IDA APIs mustn't use locks
From: Stephen Boyd @ 2019-07-30 21:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, Greg KH, Tri Vo, linux-doc, Jonathan Corbet
In-Reply-To: <20190730211250.GD4700@bombadil.infradead.org>

Quoting Matthew Wilcox (2019-07-30 14:12:50)
> On Tue, Jul 30, 2019 at 02:07:52PM -0700, Stephen Boyd wrote:
> > The documentation for these functions indicates that callers don't need
> > to hold a lock while calling them, but that documentation is only in one
> > place under "IDA Usage". Let's state the same information on each IDA
> > function so that it's clear what the calling context requires.
> > Furthermore, let's document ida_simple_get() with the same information
> > so that callers know how this API works.
> 
> I don't want people to use ida_simple_get() any more.  Use ida_alloc()
> instead.

Fair enough. I'll document it as deprecated in another patch.

> 
> > - * Context: Any context.
> > + * Context: Any context. It is safe to call this function without
> > + * synchronisation in your code.
> 
> I prefer "without locking" to "without synchronisation" ...
> 

Ok. Resending shortly.

^ permalink raw reply

* Re: [PATCH] idr: Document calling context for IDA APIs mustn't use locks
From: Matthew Wilcox @ 2019-07-30 21:12 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, Greg KH, Tri Vo, linux-doc, Jonathan Corbet
In-Reply-To: <20190730210752.157700-1-swboyd@chromium.org>

On Tue, Jul 30, 2019 at 02:07:52PM -0700, Stephen Boyd wrote:
> The documentation for these functions indicates that callers don't need
> to hold a lock while calling them, but that documentation is only in one
> place under "IDA Usage". Let's state the same information on each IDA
> function so that it's clear what the calling context requires.
> Furthermore, let's document ida_simple_get() with the same information
> so that callers know how this API works.

I don't want people to use ida_simple_get() any more.  Use ida_alloc()
instead.

> - * Context: Any context.
> + * Context: Any context. It is safe to call this function without
> + * synchronisation in your code.

I prefer "without locking" to "without synchronisation" ...


^ permalink raw reply

* [PATCH] idr: Document calling context for IDA APIs mustn't use locks
From: Stephen Boyd @ 2019-07-30 21:07 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, Greg KH, Tri Vo, linux-doc, Jonathan Corbet

The documentation for these functions indicates that callers don't need
to hold a lock while calling them, but that documentation is only in one
place under "IDA Usage". Let's state the same information on each IDA
function so that it's clear what the calling context requires.
Furthermore, let's document ida_simple_get() with the same information
so that callers know how this API works.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Tri Vo <trong@android.com>
Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
---

See Greg's comment[1] for the reason why this patch is created.

[1] https://lkml.kernel.org/r/20190730064657.GA1213@kroah.com

 include/linux/idr.h | 23 ++++++++++++++++++++---
 lib/idr.c           |  9 ++++++---
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 4ec8986e5dfb..b591ecbba3f4 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -263,7 +263,8 @@ void ida_destroy(struct ida *ida);
  *
  * Allocate an ID between 0 and %INT_MAX, inclusive.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * synchronisation in your code.
  * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
  * or %-ENOSPC if there are no free IDs.
  */
@@ -280,7 +281,8 @@ static inline int ida_alloc(struct ida *ida, gfp_t gfp)
  *
  * Allocate an ID between @min and %INT_MAX, inclusive.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * synchronisation in your code.
  * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
  * or %-ENOSPC if there are no free IDs.
  */
@@ -297,7 +299,8 @@ static inline int ida_alloc_min(struct ida *ida, unsigned int min, gfp_t gfp)
  *
  * Allocate an ID between 0 and @max, inclusive.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * synchronisation in your code.
  * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
  * or %-ENOSPC if there are no free IDs.
  */
@@ -311,6 +314,20 @@ static inline void ida_init(struct ida *ida)
 	xa_init_flags(&ida->xa, IDA_INIT_FLAGS);
 }
 
+/**
+ * ida_simple_get() - Allocate an unused ID between (start, end].
+ * @ida: IDA handle.
+ * @start: ID to start from (inclusive)
+ * @end: ID to stop at (exclusive). Use 0 to indicate %INT_MAX.
+ * @gfp: Memory allocation flags.
+ *
+ * Allocate an ID between (start, end].
+ *
+ * Context: Any context. It is safe to call this function without
+ * synchronisation in your code.
+ * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
+ * or %-ENOSPC if there are no free IDs.
+ */
 #define ida_simple_get(ida, start, end, gfp)	\
 			ida_alloc_range(ida, start, (end) - 1, gfp)
 #define ida_simple_remove(ida, id)	ida_free(ida, id)
diff --git a/lib/idr.c b/lib/idr.c
index 66a374892482..e8a5f47c0c78 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -381,7 +381,8 @@ EXPORT_SYMBOL(idr_replace);
  * Allocate an ID between @min and @max, inclusive.  The allocated ID will
  * not exceed %INT_MAX, even if @max is larger.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * synchronisation in your code.
  * Return: The allocated ID, or %-ENOMEM if memory could not be allocated,
  * or %-ENOSPC if there are no free IDs.
  */
@@ -488,7 +489,8 @@ EXPORT_SYMBOL(ida_alloc_range);
  * @ida: IDA handle.
  * @id: Previously allocated ID.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * synchronisation in your code.
  */
 void ida_free(struct ida *ida, unsigned int id)
 {
@@ -540,7 +542,8 @@ EXPORT_SYMBOL(ida_free);
  * or freed.  If the IDA is already empty, there is no need to call this
  * function.
  *
- * Context: Any context.
+ * Context: Any context. It is safe to call this function without
+ * synchronisation in your code.
  */
 void ida_destroy(struct ida *ida)
 {
-- 
Sent by a computer through tubes


^ permalink raw reply related

* Re: [PATCH 1/1] psi: do not require setsched permission from the trigger creator
From: Suren Baghdasaryan @ 2019-07-30 17:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, lizefan, Johannes Weiner, axboe, Dennis Zhou,
	Dennis Zhou, Andrew Morton, linux-mm, linux-doc, LKML,
	kernel-team, Nick Kralevich, Thomas Gleixner
In-Reply-To: <20190730081122.GH31381@hirez.programming.kicks-ass.net>

On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote:
> > When a process creates a new trigger by writing into /proc/pressure/*
> > files, permissions to write such a file should be used to determine whether
> > the process is allowed to do so or not. Current implementation would also
> > require such a process to have setsched capability. Setting of psi trigger
> > thread's scheduling policy is an implementation detail and should not be
> > exposed to the user level. Remove the permission check by using _nocheck
> > version of the function.
> >
> > Suggested-by: Nick Kralevich <nnk@google.com>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  kernel/sched/psi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 7acc632c3b82..ed9a1d573cb1 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> >                       mutex_unlock(&group->trigger_lock);
> >                       return ERR_CAST(kworker);
> >               }
> > -             sched_setscheduler(kworker->task, SCHED_FIFO, &param);
> > +             sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
>
> ARGGH, wtf is there a FIFO-99!! thread here at all !?

We need psi poll_kworker to be an rt-priority thread so that psi
notifications are delivered to the userspace without delay even when
the CPUs are very congested. Otherwise it's easy to delay psi
notifications by running a simple CPU hogger executing "chrt -f 50 dd
if=/dev/zero of=/dev/null". Because these notifications are
time-critical for reacting to memory shortages we can't allow for such
delays.
Notice that this kworker is created only if userspace creates a psi
trigger. So unless you are using psi triggers you will never see this
kthread created.

> >               kthread_init_delayed_work(&group->poll_work,
> >                               psi_poll_work);
> >               rcu_assign_pointer(group->poll_kworker, kworker);
> > --
> > 2.22.0.709.g102302147b-goog
> >
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply

* [PATCH v12 5/5] overlayfs: override_creds=off option bypass creator_cred
From: Mark Salyzyn @ 2019-07-30 17:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc
In-Reply-To: <20190730172904.79146-1-salyzyn@android.com>

By default, all access to the upper, lower and work directories is the
recorded mounter's MAC and DAC credentials.  The incoming accesses are
checked against the caller's credentials.

If the principles of least privilege are applied, the mounter's
credentials might not overlap the credentials of the caller's when
accessing the overlayfs filesystem.  For example, a file that a lower
DAC privileged caller can execute, is MAC denied to the generally
higher DAC privileged mounter, to prevent an attack vector.

We add the option to turn off override_creds in the mount options; all
subsequent operations after mount on the filesystem will be only the
caller's credentials.  The module boolean parameter and mount option
override_creds is also added as a presence check for this "feature",
existence of /sys/module/overlay/parameters/override_creds.

It was not always this way.  Circa 4.6 there was no recorded mounter's
credentials, instead privileged access to upper or work directories
were temporarily increased to perform the operations.  The MAC
(selinux) policies were caller's in all cases.  override_creds=off
partially returns us to this older access model minus the insecure
temporary credential increases.  This is to permit use in a system
with non-overlapping security models for each executable including
the agent that mounts the overlayfs filesystem.  In Android
this is the case since init, which performs the mount operations,
has a minimal MAC set of privileges to reduce any attack surface,
and services that use the content have a different set of MAC
privileges (eg: read, for vendor labelled configuration, execute for
vendor libraries and modules).  The caveats are not a problem in
the Android usage model, however they should be fixed for
completeness and for general use in time.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com
---
v12:
- Rebase

v11:
- add sb argument to ovl_revert_creds to match future work

v10:
- Rebase (and expand because of increased revert_cred usage)

v9:
- Add to the caveats

v8:
- drop pr_warn message after straw poll to remove it.
- added a use case in the commit message

v7:
- change name of internal parameter to ovl_override_creds_def
- report override_creds only if different than default

v6:
- Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
- Do better with the documentation.
- pr_warn message adjusted to report consequences.

v5:
- beefed up the caveats in the Documentation
- Is dependent on
  "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
  "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
- Added prwarn when override_creds=off

v4:
- spelling and grammar errors in text

v3:
- Change name from caller_credentials / creator_credentials to the
  boolean override_creds.
- Changed from creator to mounter credentials.
- Updated and fortified the documentation.
- Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS

v2:
- Forward port changed attr to stat, resulting in a build error.
- altered commit message.

a
---
 Documentation/filesystems/overlayfs.txt | 23 +++++++++++++++++++++++
 fs/overlayfs/copy_up.c                  |  2 +-
 fs/overlayfs/dir.c                      | 11 ++++++-----
 fs/overlayfs/file.c                     | 20 ++++++++++----------
 fs/overlayfs/inode.c                    | 18 +++++++++---------
 fs/overlayfs/namei.c                    |  6 +++---
 fs/overlayfs/overlayfs.h                |  1 +
 fs/overlayfs/ovl_entry.h                |  1 +
 fs/overlayfs/readdir.c                  |  4 ++--
 fs/overlayfs/super.c                    | 22 +++++++++++++++++++++-
 fs/overlayfs/util.c                     | 12 ++++++++++--
 11 files changed, 87 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
index 1da2f1668f08..d48125076602 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.txt
@@ -102,6 +102,29 @@ Only the lists of names from directories are merged.  Other content
 such as metadata and extended attributes are reported for the upper
 directory only.  These attributes of the lower directory are hidden.
 
+credentials
+-----------
+
+By default, all access to the upper, lower and work directories is the
+recorded mounter's MAC and DAC credentials.  The incoming accesses are
+checked against the caller's credentials.
+
+In the case where caller MAC or DAC credentials do not overlap, a
+use case available in older versions of the driver, the
+override_creds mount flag can be turned off and help when the use
+pattern has caller with legitimate credentials where the mounter
+does not.  Several unintended side effects will occur though.  The
+caller without certain key capabilities or lower privilege will not
+always be able to delete files or directories, create nodes, or
+search some restricted directories.  The ability to search and read
+a directory entry is spotty as a result of the cache mechanism not
+retesting the credentials because of the assumption, a privileged
+caller can fill cache, then a lower privilege can read the directory
+cache.  The uneven security model where cache, upperdir and workdir
+are opened at privilege, but accessed without creating a form of
+privilege escalation, should only be used with strict understanding
+of the side effects and of the security policies.
+
 whiteouts and opaque directories
 --------------------------------
 
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index b801c6353100..1c1b9415e533 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -886,7 +886,7 @@ int ovl_copy_up_flags(struct dentry *dentry, int flags)
 		dput(parent);
 		dput(next);
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	return err;
 }
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 702aa63f6774..49b8ffc1294f 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -563,7 +563,8 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
 		override_cred->fsgid = inode->i_gid;
 		if (!attr->hardlink) {
 			err = security_dentry_create_files_as(dentry,
-					attr->mode, &dentry->d_name, old_cred,
+					attr->mode, &dentry->d_name,
+					old_cred ? old_cred : current_cred(),
 					override_cred);
 			if (err) {
 				put_cred(override_cred);
@@ -579,7 +580,7 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
 			err = ovl_create_over_whiteout(dentry, inode, attr);
 	}
 out_revert_creds:
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	return err;
 }
 
@@ -655,7 +656,7 @@ static int ovl_set_link_redirect(struct dentry *dentry)
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	err = ovl_set_redirect(dentry, false);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	return err;
 }
@@ -851,7 +852,7 @@ static int ovl_do_remove(struct dentry *dentry, bool is_dir)
 		err = ovl_remove_upper(dentry, is_dir, &list);
 	else
 		err = ovl_remove_and_whiteout(dentry, &list);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	if (!err) {
 		if (is_dir)
 			clear_nlink(dentry->d_inode);
@@ -1221,7 +1222,7 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 out_unlock:
 	unlock_rename(new_upperdir, old_upperdir);
 out_revert_creds:
-	revert_creds(old_cred);
+	ovl_revert_creds(old->d_sb, old_cred);
 	if (update_nlink)
 		ovl_nlink_end(new);
 out_drop_write:
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index e235a635d9ec..627a303c95da 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -32,7 +32,7 @@ static struct file *ovl_open_realfile(const struct file *file,
 	old_cred = ovl_override_creds(inode->i_sb);
 	realfile = open_with_fake_path(&file->f_path, flags, realinode,
 				       current_cred());
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
 		 file, file, ovl_whatisit(inode, realinode), file->f_flags,
@@ -176,7 +176,7 @@ static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
 
 	old_cred = ovl_override_creds(inode->i_sb);
 	ret = vfs_llseek(real.file, offset, whence);
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	file->f_pos = real.file->f_pos;
 	inode_unlock(inode);
@@ -242,7 +242,7 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
 			    ovl_iocb_to_rwf(iocb));
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	ovl_file_accessed(file);
 
@@ -278,7 +278,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
 			     ovl_iocb_to_rwf(iocb));
 	file_end_write(real.file);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	/* Update size */
 	ovl_copyattr(ovl_inode_real(inode), inode);
@@ -305,7 +305,7 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	if (file_inode(real.file) == ovl_inode_upper(file_inode(file))) {
 		old_cred = ovl_override_creds(file_inode(file)->i_sb);
 		ret = vfs_fsync_range(real.file, start, end, datasync);
-		revert_creds(old_cred);
+		ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 	}
 
 	fdput(real);
@@ -329,7 +329,7 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = call_mmap(vma->vm_file, vma);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	if (ret) {
 		/* Drop reference count from new vm_file value */
@@ -357,7 +357,7 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_fallocate(real.file, mode, offset, len);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	/* Update size */
 	ovl_copyattr(ovl_inode_real(inode), inode);
@@ -379,7 +379,7 @@ static int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_fadvise(real.file, offset, len, advice);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	fdput(real);
 
@@ -399,7 +399,7 @@ static long ovl_real_ioctl(struct file *file, unsigned int cmd,
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_ioctl(real.file, cmd, arg);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	fdput(real);
 
@@ -589,7 +589,7 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 						flags);
 		break;
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	/* Update size */
 	ovl_copyattr(ovl_inode_real(inode_out), inode_out);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index ce66f4050557..bc46f45f65ba 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -61,7 +61,7 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 		inode_lock(upperdentry->d_inode);
 		old_cred = ovl_override_creds(dentry->d_sb);
 		err = notify_change(upperdentry, attr, NULL);
-		revert_creds(old_cred);
+		ovl_revert_creds(dentry->d_sb, old_cred);
 		if (!err)
 			ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
 		inode_unlock(upperdentry->d_inode);
@@ -257,7 +257,7 @@ int ovl_getattr(const struct path *path, struct kstat *stat,
 		stat->nlink = dentry->d_inode->i_nlink;
 
 out:
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	return err;
 }
@@ -291,7 +291,7 @@ int ovl_permission(struct inode *inode, int mask)
 		mask |= MAY_READ;
 	}
 	err = inode_permission(realinode, mask);
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	return err;
 }
@@ -308,7 +308,7 @@ static const char *ovl_get_link(struct dentry *dentry,
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	p = vfs_get_link(ovl_dentry_real(dentry), done);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	return p;
 }
 
@@ -351,7 +351,7 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
 		WARN_ON(flags != XATTR_REPLACE);
 		err = vfs_removexattr(realdentry, name);
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	/* copy c/mtime */
 	ovl_copyattr(d_inode(realdentry), inode);
@@ -376,7 +376,7 @@ int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
 				     value, size);
 	else
 		res = vfs_getxattr(realdentry, name, value, size);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	return res;
 }
 
@@ -400,7 +400,7 @@ ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size)
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	res = vfs_listxattr(realdentry, list, size);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	if (res <= 0 || size == 0)
 		return res;
 
@@ -435,7 +435,7 @@ struct posix_acl *ovl_get_acl(struct inode *inode, int type)
 
 	old_cred = ovl_override_creds(inode->i_sb);
 	acl = get_acl(realinode, type);
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	return acl;
 }
@@ -473,7 +473,7 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		filemap_write_and_wait(realinode->i_mapping);
 
 	err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	return err;
 }
diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index a4a452c489fa..bab1f97dc201 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -1079,7 +1079,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 			goto out_free_oe;
 	}
 
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	if (origin_path) {
 		dput(origin_path->dentry);
 		kfree(origin_path);
@@ -1106,7 +1106,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	kfree(upperredirect);
 out:
 	kfree(d.redirect);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	return ERR_PTR(err);
 }
 
@@ -1160,7 +1160,7 @@ bool ovl_lower_positive(struct dentry *dentry)
 			dput(this);
 		}
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	return positive;
 }
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 9d26d8758513..ad1a11e7ecbd 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -205,6 +205,7 @@ int ovl_want_write(struct dentry *dentry);
 void ovl_drop_write(struct dentry *dentry);
 struct dentry *ovl_workdir(struct dentry *dentry);
 const struct cred *ovl_override_creds(struct super_block *sb);
+void ovl_revert_creds(struct super_block *sb, const struct cred *oldcred);
 ssize_t ovl_do_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
 			    size_t size);
 struct super_block *ovl_same_sb(struct super_block *sb);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 28a2d12a1029..2637c5aadf7f 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -17,6 +17,7 @@ struct ovl_config {
 	bool nfs_export;
 	int xino;
 	bool metacopy;
+	bool override_creds;
 };
 
 struct ovl_sb {
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 47a91c9733a5..874a1b3ff99a 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -286,7 +286,7 @@ static int ovl_check_whiteouts(struct dentry *dir, struct ovl_readdir_data *rdd)
 		}
 		inode_unlock(dir->d_inode);
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(rdd->dentry->d_sb, old_cred);
 
 	return err;
 }
@@ -918,7 +918,7 @@ int ovl_check_empty_dir(struct dentry *dentry, struct list_head *list)
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	err = ovl_dir_read_merged(dentry, list, &root);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	if (err)
 		return err;
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 6f041e1fceda..2c1278451f38 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -53,6 +53,11 @@ module_param_named(xino_auto, ovl_xino_auto_def, bool, 0644);
 MODULE_PARM_DESC(xino_auto,
 		 "Auto enable xino feature");
 
+static bool __read_mostly ovl_override_creds_def = true;
+module_param_named(override_creds, ovl_override_creds_def, bool, 0644);
+MODULE_PARM_DESC(ovl_override_creds_def,
+		 "Use mounter's credentials for accesses");
+
 static void ovl_entry_stack_free(struct ovl_entry *oe)
 {
 	unsigned int i;
@@ -362,6 +367,9 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	if (ofs->config.metacopy != ovl_metacopy_def)
 		seq_printf(m, ",metacopy=%s",
 			   ofs->config.metacopy ? "on" : "off");
+	if (ofs->config.override_creds != ovl_override_creds_def)
+		seq_show_option(m, "override_creds",
+				ofs->config.override_creds ? "on" : "off");
 	return 0;
 }
 
@@ -402,6 +410,8 @@ enum {
 	OPT_XINO_AUTO,
 	OPT_METACOPY_ON,
 	OPT_METACOPY_OFF,
+	OPT_OVERRIDE_CREDS_ON,
+	OPT_OVERRIDE_CREDS_OFF,
 	OPT_ERR,
 };
 
@@ -420,6 +430,8 @@ static const match_table_t ovl_tokens = {
 	{OPT_XINO_AUTO,			"xino=auto"},
 	{OPT_METACOPY_ON,		"metacopy=on"},
 	{OPT_METACOPY_OFF,		"metacopy=off"},
+	{OPT_OVERRIDE_CREDS_ON,		"override_creds=on"},
+	{OPT_OVERRIDE_CREDS_OFF,	"override_creds=off"},
 	{OPT_ERR,			NULL}
 };
 
@@ -478,6 +490,7 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 	config->redirect_mode = kstrdup(ovl_redirect_mode_def(), GFP_KERNEL);
 	if (!config->redirect_mode)
 		return -ENOMEM;
+	config->override_creds = ovl_override_creds_def;
 
 	while ((p = ovl_next_opt(&opt)) != NULL) {
 		int token;
@@ -558,6 +571,14 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 			config->metacopy = false;
 			break;
 
+		case OPT_OVERRIDE_CREDS_ON:
+			config->override_creds = true;
+			break;
+
+		case OPT_OVERRIDE_CREDS_OFF:
+			config->override_creds = false;
+			break;
+
 		default:
 			pr_err("overlayfs: unrecognized mount option \"%s\" or missing value\n", p);
 			return -EINVAL;
@@ -1674,7 +1695,6 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 		       ovl_dentry_lower(root_dentry), NULL);
 
 	sb->s_root = root_dentry;
-
 	return 0;
 
 out_free_oe:
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index f80b95423043..4720a7a6fea3 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -37,9 +37,17 @@ const struct cred *ovl_override_creds(struct super_block *sb)
 {
 	struct ovl_fs *ofs = sb->s_fs_info;
 
+	if (!ofs->config.override_creds)
+		return NULL;
 	return override_creds(ofs->creator_cred);
 }
 
+void ovl_revert_creds(struct super_block *sb, const struct cred *old_cred)
+{
+	if (old_cred)
+		revert_creds(old_cred);
+}
+
 ssize_t ovl_do_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
 			    size_t size)
 {
@@ -797,7 +805,7 @@ int ovl_nlink_start(struct dentry *dentry)
 	 * value relative to the upper inode nlink in an upper inode xattr.
 	 */
 	err = ovl_set_nlink_upper(dentry);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 out:
 	if (err)
@@ -815,7 +823,7 @@ void ovl_nlink_end(struct dentry *dentry)
 
 		old_cred = ovl_override_creds(dentry->d_sb);
 		ovl_cleanup_index(dentry);
-		revert_creds(old_cred);
+		ovl_revert_creds(dentry->d_sb, old_cred);
 	}
 
 	ovl_inode_unlock(inode);
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v12 4/5] overlayfs: internal getxattr operations without sepolicy checking
From: Mark Salyzyn @ 2019-07-30 17:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc
In-Reply-To: <20190730172904.79146-1-salyzyn@android.com>

Check impure, opaque, origin & meta xattr with no sepolicy audit
(using __vfs_getxattr) since these operations are internal to
overlayfs operations and do not disclose any data.  This became
an issue for credential override off since sys_admin would have
been required by the caller; whereas would have been inherently
present for the creator since it performed the mount.

This is a change in operations since we do not check in the new
ovl_do_vfs_getxattr function if the credential override is off or
not.  Reasoning is that the sepolicy check is unnecessary overhead,
especially since the check can be expensive.

Because for override credentials off, this affects _everyone_ that
underneath performs private xattr calls without the appropriate
sepolicy permissions and sys_admin capability.  Providing blanket
support for sys_admin would be bad for all possible callers.

For the override credentials on, this will affect only the mounter,
should it lack sepolicy permissions. Not considered a security
problem since mounting by definition has sys_admin capabilities,
but sepolicy contexts would still need to be crafted.

It should be noted that there is precedence, __vfs_getxattr is used
in other filesystems for their own internal trusted xattr management.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com
---
v12 - rebase

v11 - switch name to ovl_do_vfs_getxattr, fortify comment

v10 - added to patch series
---
 fs/overlayfs/namei.c     | 12 +++++++-----
 fs/overlayfs/overlayfs.h |  2 ++
 fs/overlayfs/util.c      | 24 +++++++++++++++---------
 3 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index 9702f0d5309d..a4a452c489fa 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -106,10 +106,11 @@ int ovl_check_fh_len(struct ovl_fh *fh, int fh_len)
 
 static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
 {
-	int res, err;
+	ssize_t res;
+	int err;
 	struct ovl_fh *fh = NULL;
 
-	res = vfs_getxattr(dentry, name, NULL, 0);
+	res = ovl_do_vfs_getxattr(dentry, name, NULL, 0);
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return NULL;
@@ -123,7 +124,7 @@ static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
 	if (!fh)
 		return ERR_PTR(-ENOMEM);
 
-	res = vfs_getxattr(dentry, name, fh, res);
+	res = ovl_do_vfs_getxattr(dentry, name, fh, res);
 	if (res < 0)
 		goto fail;
 
@@ -141,10 +142,11 @@ static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
 	return NULL;
 
 fail:
-	pr_warn_ratelimited("overlayfs: failed to get origin (%i)\n", res);
+	pr_warn_ratelimited("overlayfs: failed to get origin (%zi)\n", res);
 	goto out;
 invalid:
-	pr_warn_ratelimited("overlayfs: invalid origin (%*phN)\n", res, fh);
+	pr_warn_ratelimited("overlayfs: invalid origin (%*phN)\n",
+			    (int)res, fh);
 	goto out;
 }
 
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index ab3d031c422b..9d26d8758513 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -205,6 +205,8 @@ int ovl_want_write(struct dentry *dentry);
 void ovl_drop_write(struct dentry *dentry);
 struct dentry *ovl_workdir(struct dentry *dentry);
 const struct cred *ovl_override_creds(struct super_block *sb);
+ssize_t ovl_do_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
+			    size_t size);
 struct super_block *ovl_same_sb(struct super_block *sb);
 int ovl_can_decode_fh(struct super_block *sb);
 struct dentry *ovl_indexdir(struct super_block *sb);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index f5678a3f8350..f80b95423043 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -40,6 +40,12 @@ const struct cred *ovl_override_creds(struct super_block *sb)
 	return override_creds(ofs->creator_cred);
 }
 
+ssize_t ovl_do_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
+			    size_t size)
+{
+	return __vfs_getxattr(dentry, d_inode(dentry), name, buf, size);
+}
+
 struct super_block *ovl_same_sb(struct super_block *sb)
 {
 	struct ovl_fs *ofs = sb->s_fs_info;
@@ -537,9 +543,9 @@ void ovl_copy_up_end(struct dentry *dentry)
 
 bool ovl_check_origin_xattr(struct dentry *dentry)
 {
-	int res;
+	ssize_t res;
 
-	res = vfs_getxattr(dentry, OVL_XATTR_ORIGIN, NULL, 0);
+	res = ovl_do_vfs_getxattr(dentry, OVL_XATTR_ORIGIN, NULL, 0);
 
 	/* Zero size value means "copied up but origin unknown" */
 	if (res >= 0)
@@ -550,13 +556,13 @@ bool ovl_check_origin_xattr(struct dentry *dentry)
 
 bool ovl_check_dir_xattr(struct dentry *dentry, const char *name)
 {
-	int res;
+	ssize_t res;
 	char val;
 
 	if (!d_is_dir(dentry))
 		return false;
 
-	res = vfs_getxattr(dentry, name, &val, 1);
+	res = ovl_do_vfs_getxattr(dentry, name, &val, 1);
 	if (res == 1 && val == 'y')
 		return true;
 
@@ -837,13 +843,13 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
 /* err < 0, 0 if no metacopy xattr, 1 if metacopy xattr found */
 int ovl_check_metacopy_xattr(struct dentry *dentry)
 {
-	int res;
+	ssize_t res;
 
 	/* Only regular files can have metacopy xattr */
 	if (!S_ISREG(d_inode(dentry)->i_mode))
 		return 0;
 
-	res = vfs_getxattr(dentry, OVL_XATTR_METACOPY, NULL, 0);
+	res = ovl_do_vfs_getxattr(dentry, OVL_XATTR_METACOPY, NULL, 0);
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return 0;
@@ -852,7 +858,7 @@ int ovl_check_metacopy_xattr(struct dentry *dentry)
 
 	return 1;
 out:
-	pr_warn_ratelimited("overlayfs: failed to get metacopy (%i)\n", res);
+	pr_warn_ratelimited("overlayfs: failed to get metacopy (%zi)\n", res);
 	return res;
 }
 
@@ -878,7 +884,7 @@ ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 	ssize_t res;
 	char *buf = NULL;
 
-	res = vfs_getxattr(dentry, name, NULL, 0);
+	res = ovl_do_vfs_getxattr(dentry, name, NULL, 0);
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return -ENODATA;
@@ -890,7 +896,7 @@ ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 		if (!buf)
 			return -ENOMEM;
 
-		res = vfs_getxattr(dentry, name, buf, res);
+		res = ovl_do_vfs_getxattr(dentry, name, buf, res);
 		if (res < 0)
 			goto fail;
 	}
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v12 3/5] overlayfs: handle XATTR_NOSECURITY flag for get xattr method
From: Mark Salyzyn @ 2019-07-30 17:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc
In-Reply-To: <20190730172904.79146-1-salyzyn@android.com>

Because of the overlayfs getxattr recursion, the incoming inode fails
to update the selinux sid resulting in avc denials being reported
against a target context of u:object_r:unlabeled:s0.

Solution is to respond to the XATTR_NOSECURITY flag in get xattr
method that calls the __vfs_getxattr handler instead so that the
context can be read in, rather than being denied with an -EACCES
when vfs_getxattr handler is called.

For the use case where access is to be blocked by the security layer.

The path then would be security(dentry) -> __vfs_getxattr(dentry) ->
handler->get(dentry...XATTR_NOSECURITY) ->
__vfs_getxattr(realdentry) -> lower_handler->get(realdentry) which
would report back through the chain data and success as expected,
the logging security layer at the top would have the data to
determine the access permissions and report back to the logs and
the caller that the target context was blocked.

For selinux this would solve the cosmetic issue of the selinux log
and allow audit2allow to correctly report the rule needed to address
the access problem.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com
---
v12 - Added back to patch series as get xattr with flag option.

v11 - Squashed out of patch series and replaced with per-thread flag
      solution.

v10 - Added to patch series as __get xattr method.
---
 fs/overlayfs/inode.c     | 8 ++++++--
 fs/overlayfs/overlayfs.h | 2 +-
 fs/overlayfs/super.c     | 7 ++++---
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 7663aeb85fa3..ce66f4050557 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -363,7 +363,7 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
 }
 
 int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
-		  void *value, size_t size)
+		  void *value, size_t size, int flags)
 {
 	ssize_t res;
 	const struct cred *old_cred;
@@ -371,7 +371,11 @@ int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
 		ovl_i_dentry_upper(inode) ?: ovl_dentry_lower(dentry);
 
 	old_cred = ovl_override_creds(dentry->d_sb);
-	res = vfs_getxattr(realdentry, name, value, size);
+	if (flags & XATTR_NOSECURITY)
+		res = __vfs_getxattr(realdentry, d_inode(realdentry), name,
+				     value, size);
+	else
+		res = vfs_getxattr(realdentry, name, value, size);
 	revert_creds(old_cred);
 	return res;
 }
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 6934bcf030f0..ab3d031c422b 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -356,7 +356,7 @@ int ovl_permission(struct inode *inode, int mask);
 int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
 		  const void *value, size_t size, int flags);
 int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
-		  void *value, size_t size);
+		  void *value, size_t size, int flags);
 ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
 struct posix_acl *ovl_get_acl(struct inode *inode, int type);
 int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 57df03f3259f..6f041e1fceda 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -856,7 +856,7 @@ ovl_posix_acl_xattr_get(const struct xattr_handler *handler,
 			struct dentry *dentry, struct inode *inode,
 			const char *name, void *buffer, size_t size, int flags)
 {
-	return ovl_xattr_get(dentry, inode, handler->name, buffer, size);
+	return ovl_xattr_get(dentry, inode, handler->name, buffer, size, flags);
 }
 
 static int __maybe_unused
@@ -919,7 +919,8 @@ ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
 
 static int ovl_own_xattr_get(const struct xattr_handler *handler,
 			     struct dentry *dentry, struct inode *inode,
-			     const char *name, void *buffer, size_t size)
+			     const char *name, void *buffer, size_t size,
+			     int flags)
 {
 	return -EOPNOTSUPP;
 }
@@ -937,7 +938,7 @@ static int ovl_other_xattr_get(const struct xattr_handler *handler,
 			       const char *name, void *buffer, size_t size,
 			       int flags)
 {
-	return ovl_xattr_get(dentry, inode, name, buffer, size);
+	return ovl_xattr_get(dentry, inode, name, buffer, size, flags);
 }
 
 static int ovl_other_xattr_set(const struct xattr_handler *handler,
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v12 1/5] overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh
From: Mark Salyzyn @ 2019-07-30 17:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc
In-Reply-To: <20190730172904.79146-1-salyzyn@android.com>

Assumption never checked, should fail if the mounter creds are not
sufficient.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com
---
v11 + v12 - rebase

v10:
- return NULL rather than ERR_PTR(-EPERM)
- did _not_ add it ovl_can_decode_fh() because of changes since last
  review, suspect needs to be added to ovl_lower_uuid_ok()?

v8 + v9:
- rebase

v7:
- This time for realz

v6:
- rebase

v5:
- dependency of "overlayfs: override_creds=off option bypass creator_cred"
---
 fs/overlayfs/namei.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index e9717c2f7d45..9702f0d5309d 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -161,6 +161,9 @@ struct dentry *ovl_decode_real_fh(struct ovl_fh *fh, struct vfsmount *mnt,
 	if (!uuid_equal(&fh->uuid, &mnt->mnt_sb->s_uuid))
 		return NULL;
 
+	if (!capable(CAP_DAC_READ_SEARCH))
+		return NULL;
+
 	bytes = (fh->len - offsetof(struct ovl_fh, fid));
 	real = exportfs_decode_fh(mnt, (struct fid *)fh->fid,
 				  bytes >> 2, (int)fh->type,
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v12 0/5] overlayfs override_creds=off
From: Mark Salyzyn @ 2019-07-30 17:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc

Patch series:

overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh
Add flags option to get xattr method paired to __vfs_getxattr
overlayfs: handle XATTR_NOSECURITY flag for get xattr method
overlayfs: internal getxattr operations without sepolicy checking
overlayfs: override_creds=off option bypass creator_cred

The first four patches address fundamental security issues that should
be solved regardless of the override_creds=off feature.
on them).

The fifth adds the feature depends on these other fixes.

By default, all access to the upper, lower and work directories is the
recorded mounter's MAC and DAC credentials.  The incoming accesses are
checked against the caller's credentials.

If the principles of least privilege are applied for sepolicy, the
mounter's credentials might not overlap the credentials of the caller's
when accessing the overlayfs filesystem.  For example, a file that a
lower DAC privileged caller can execute, is MAC denied to the
generally higher DAC privileged mounter, to prevent an attack vector.

We add the option to turn off override_creds in the mount options; all
subsequent operations after mount on the filesystem will be only the
caller's credentials.  The module boolean parameter and mount option
override_creds is also added as a presence check for this "feature",
existence of /sys/module/overlay/parameters/overlay_creds

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

---
v12:
- Restore squished out patch 2 and 3 in the series,
  then change algorithm to add flags argument.
  Per-thread flag is a large security surface.

v11:
- Squish out v10 introduced patch 2 and 3 in the series,
  then and use per-thread flag instead for nesting.
- Switch name to ovl_do_vds_getxattr for __vds_getxattr wrapper.
- Add sb argument to ovl_revert_creds to match future work.

v10:
- Return NULL on CAP_DAC_READ_SEARCH
- Add __get xattr method to solve sepolicy logging issue
- Drop unnecessary sys_admin sepolicy checking for administrative
  driver internal xattr functions.

v6:
- Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
- Do better with the documentation, drop rationalizations.
- pr_warn message adjusted to report consequences.

v5:
- beefed up the caveats in the Documentation
- Is dependent on
  "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
  "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
- Added prwarn when override_creds=off

v4:
- spelling and grammar errors in text

v3:
- Change name from caller_credentials / creator_credentials to the
  boolean override_creds.
- Changed from creator to mounter credentials.
- Updated and fortified the documentation.
- Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS

v2:
- Forward port changed attr to stat, resulting in a build error.
- altered commit message.

^ permalink raw reply

* Re: [PATCH v10 3/5] overlayfs: add __get xattr method
From: Mark Salyzyn @ 2019-07-30 16:54 UTC (permalink / raw)
  To: Stephen Smalley, Amir Goldstein
  Cc: linux-kernel, kernel-team, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Randy Dunlap, overlayfs,
	linux-doc
In-Reply-To: <e83ceef6-70ae-fd9e-2087-50baf2fbd402@tycho.nsa.gov>

On 7/30/19 8:55 AM, Stephen Smalley wrote:
> On 7/26/19 2:30 PM, Mark Salyzyn wrote:
>> On 7/25/19 10:04 PM, Amir Goldstein wrote:
>>> On Thu, Jul 25, 2019 at 7:22 PM Mark Salyzyn <salyzyn@android.com> 
>>> wrote:
>>>> On 7/25/19 8:43 AM, Amir Goldstein wrote:
>>>>> On Thu, Jul 25, 2019 at 6:03 PM Mark Salyzyn <salyzyn@android.com> 
>>>>> wrote:
>>>>>> On 7/24/19 10:48 PM, Amir Goldstein wrote:
>>>>>>> On Wed, Jul 24, 2019 at 10:57 PM Mark Salyzyn 
>>>>>>> <salyzyn@android.com> wrote:
>>>>>>>> Because of the overlayfs getxattr recursion, the incoming inode 
>>>>>>>> fails
>>>>>>>> to update the selinux sid resulting in avc denials being reported
>>>>>>>> against a target context of u:object_r:unlabeled:s0.
>>>>>>> This description is too brief for me to understand the root 
>>>>>>> problem.
>>>>>>> What's wring with the overlayfs getxattr recursion w.r.t the 
>>>>>>> selinux
>>>>>>> security model?
>>>>>> __vfs_getxattr (the way the security layer acquires the target sid
>>>>>> without recursing back to security to check the access permissions)
>>>>>> calls get xattr method, which in overlayfs calls vfs_getxattr on the
>>>>>> lower layer (which then recurses back to security to check 
>>>>>> permissions)
>>>>>> and reports back -EACCES if there was a denial (which is OK) and 
>>>>>> _no_
>>>>>> sid copied to caller's inode security data, bubbles back to the 
>>>>>> security
>>>>>> layer caller, which reports an invalid avc: message for
>>>>>> u:object_r:unlabeled:s0 (the uninitialized sid instead of the sid 
>>>>>> for
>>>>>> the lower filesystem target). The blocked access is 100% valid, 
>>>>>> it is
>>>>>> supposed to be blocked. This does however result in a cosmetic issue
>>>>>> that makes it impossible to use audit2allow to construct a rule that
>>>>>> would be usable to fix the access problem.
>>>>>>
>>>>> Ahhh you are talking about getting the security.selinux.* xattrs?
>>>>> I was under the impression (Vivek please correct me if I wrong)
>>>>> that overlayfs objects cannot have individual security labels and
>>>> They can, and we _need_ them for Android's use cases, upper and lower
>>>> filesystems.
>>>>
>>>> Some (most?) union filesystems (like Android's sdcardfs) set sepolicy
>>>> from the mount options, we did not need this adjustment there of 
>>>> course.
>>>>
>>>>> the only way to label overlayfs objects is by mount options on the
>>>>> entire mount? Or is this just for lower layer objects?
>>>>>
>>>>> Anyway, the API I would go for is adding a @flags argument to
>>>>> get() which can take XATTR_NOSECURITY akin to
>>>>> FMODE_NONOTIFY, GFP_NOFS, meant to avoid recursions.
>>>> I do like it better (with the following 7 stages of grief below), best
>>>> for the future.
>>>>
>>>> The change in this handler's API will affect all filesystem drivers
>>>> (well, my change affects the ABI, so it is not as-if I saved the world
>>>> from a module recompile) touching all filesystem sources with an even
>>>> larger audience of stakeholders. Larger audience of stakeholders, the
>>>> harder to get the change in ;-/. This is also concerning since I would
>>>> like this change to go to stable 4.4, 4.9, 4.14 and 4.19 where this
>>>> regression got introduced. I can either craft specific stable 
>>>> patches or
>>>> just let it go and deal with them in the android-common distributions
>>>> rather than seeking stable merged down. ABI/API breaks are a 
>>>> problem for
>>>> stable anyway ...
>>>>
>>> Use the memalloc_nofs_save/restore design pattern will avoid all that
>>> grief.
>>> As a matter of fact, this issue could and should be handled inside 
>>> security
>>> subsystem without bothering any other subsystem.
>>> LSM have per task context right? That context could carry the recursion
>>> flags to know that the getxattr call is made by the security 
>>> subsystem itself.
>>> The problem is not limited to union filesystems.
>>> In general its a stacking issue. ecryptfs is also a stacking fs, 
>>> out-of-tree
>>> shiftfs as well. But it doesn't end there.
>>> A filesystem on top of a loop device inside another filesystem could
>>> also maybe result in security hook recursion (not sure if in practice).
>>>
>>> Thanks,
>>> Amir.
>>
>> Good point, back to Stephen Smalley?
>>
>> There are four __vfs_getxattr calls inside security, not sure I see 
>> any natural way to determine the recursion in security/selinux I can 
>> beg/borrow/steal from; but I get the strange feeling that it is 
>> better to detect recursion in __vfs_getxattr in this manner, and 
>> switch out checking in vfs_getxattr since it is localized to just 
>> fs/xattr.c. selinux might not be the only user of __vfs_getxattr 
>> nature ...
>>
>> I have implemented and tested the solution where we add a flag to the 
>> .get method, it works. I would be tempted to submit that instead in 
>> case someone in the future can imagine using that flag argument to 
>> solve other problem(s) (if you build it, they will come).
>>
>> <flips coin>
>>
>> Will add a new per-process flag that __vfs_getxattr and vfs_getxattr 
>> plays with and see how it works and what it looks like.
>
> As you say, SELinux is not the only user of __vfs_getxattr; in 
> addition to the other security modules, there is the integrity/evm 
> subsystem and ecryptfs.  Further, __vfs_getxattr does not merely skip 
> LSM/SELinux-related processing; it also skips xattr_permission().  As 
> such, I don't believe this is something that can be solved entirely 
> within the security subsystem.
>
> Not excited about a process flag to implicitly disable LSM/SELinux and 
> other security-related processing on a code path; potential for abuse 
> is high.

So you will not like my solution in "[PATCH v11 2/5] fs: __vfs_getxattr 
nesting paradigm"sent out this morning; so adding the flag option and 
widespread touching of _all_ the filesystem xattr.c/acl.c/inode.c/etc 
files to the calls is probably the easiest to stomach with the lowest 
attack surface.

Any other ideas (with less impact to tons of API/ABI/filesystems) that 
we have not thought about before I spin a v12 patch set?

-- Mark


^ permalink raw reply

* Re: [PATCH v10 3/5] overlayfs: add __get xattr method
From: Stephen Smalley @ 2019-07-30 15:55 UTC (permalink / raw)
  To: Mark Salyzyn, Amir Goldstein
  Cc: linux-kernel, kernel-team, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Randy Dunlap, overlayfs,
	linux-doc
In-Reply-To: <f56cd45d-2926-094e-7f02-e2ca972214ba@android.com>

On 7/26/19 2:30 PM, Mark Salyzyn wrote:
> On 7/25/19 10:04 PM, Amir Goldstein wrote:
>> On Thu, Jul 25, 2019 at 7:22 PM Mark Salyzyn <salyzyn@android.com> wrote:
>>> On 7/25/19 8:43 AM, Amir Goldstein wrote:
>>>> On Thu, Jul 25, 2019 at 6:03 PM Mark Salyzyn <salyzyn@android.com> 
>>>> wrote:
>>>>> On 7/24/19 10:48 PM, Amir Goldstein wrote:
>>>>>> On Wed, Jul 24, 2019 at 10:57 PM Mark Salyzyn 
>>>>>> <salyzyn@android.com> wrote:
>>>>>>> Because of the overlayfs getxattr recursion, the incoming inode 
>>>>>>> fails
>>>>>>> to update the selinux sid resulting in avc denials being reported
>>>>>>> against a target context of u:object_r:unlabeled:s0.
>>>>>> This description is too brief for me to understand the root problem.
>>>>>> What's wring with the overlayfs getxattr recursion w.r.t the selinux
>>>>>> security model?
>>>>> __vfs_getxattr (the way the security layer acquires the target sid
>>>>> without recursing back to security to check the access permissions)
>>>>> calls get xattr method, which in overlayfs calls vfs_getxattr on the
>>>>> lower layer (which then recurses back to security to check 
>>>>> permissions)
>>>>> and reports back -EACCES if there was a denial (which is OK) and _no_
>>>>> sid copied to caller's inode security data, bubbles back to the 
>>>>> security
>>>>> layer caller, which reports an invalid avc: message for
>>>>> u:object_r:unlabeled:s0 (the uninitialized sid instead of the sid for
>>>>> the lower filesystem target). The blocked access is 100% valid, it is
>>>>> supposed to be blocked. This does however result in a cosmetic issue
>>>>> that makes it impossible to use audit2allow to construct a rule that
>>>>> would be usable to fix the access problem.
>>>>>
>>>> Ahhh you are talking about getting the security.selinux.* xattrs?
>>>> I was under the impression (Vivek please correct me if I wrong)
>>>> that overlayfs objects cannot have individual security labels and
>>> They can, and we _need_ them for Android's use cases, upper and lower
>>> filesystems.
>>>
>>> Some (most?) union filesystems (like Android's sdcardfs) set sepolicy
>>> from the mount options, we did not need this adjustment there of course.
>>>
>>>> the only way to label overlayfs objects is by mount options on the
>>>> entire mount? Or is this just for lower layer objects?
>>>>
>>>> Anyway, the API I would go for is adding a @flags argument to
>>>> get() which can take XATTR_NOSECURITY akin to
>>>> FMODE_NONOTIFY, GFP_NOFS, meant to avoid recursions.
>>> I do like it better (with the following 7 stages of grief below), best
>>> for the future.
>>>
>>> The change in this handler's API will affect all filesystem drivers
>>> (well, my change affects the ABI, so it is not as-if I saved the world
>>> from a module recompile) touching all filesystem sources with an even
>>> larger audience of stakeholders. Larger audience of stakeholders, the
>>> harder to get the change in ;-/. This is also concerning since I would
>>> like this change to go to stable 4.4, 4.9, 4.14 and 4.19 where this
>>> regression got introduced. I can either craft specific stable patches or
>>> just let it go and deal with them in the android-common distributions
>>> rather than seeking stable merged down. ABI/API breaks are a problem for
>>> stable anyway ...
>>>
>> Use the memalloc_nofs_save/restore design pattern will avoid all that
>> grief.
>> As a matter of fact, this issue could and should be handled inside 
>> security
>> subsystem without bothering any other subsystem.
>> LSM have per task context right? That context could carry the recursion
>> flags to know that the getxattr call is made by the security subsystem 
>> itself.
>> The problem is not limited to union filesystems.
>> In general its a stacking issue. ecryptfs is also a stacking fs, 
>> out-of-tree
>> shiftfs as well. But it doesn't end there.
>> A filesystem on top of a loop device inside another filesystem could
>> also maybe result in security hook recursion (not sure if in practice).
>>
>> Thanks,
>> Amir.
> 
> Good point, back to Stephen Smalley?
> 
> There are four __vfs_getxattr calls inside security, not sure I see any 
> natural way to determine the recursion in security/selinux I can 
> beg/borrow/steal from; but I get the strange feeling that it is better 
> to detect recursion in __vfs_getxattr in this manner, and switch out 
> checking in vfs_getxattr since it is localized to just fs/xattr.c. 
> selinux might not be the only user of __vfs_getxattr nature ...
> 
> I have implemented and tested the solution where we add a flag to the 
> .get method, it works. I would be tempted to submit that instead in case 
> someone in the future can imagine using that flag argument to solve 
> other problem(s) (if you build it, they will come).
> 
> <flips coin>
> 
> Will add a new per-process flag that __vfs_getxattr and vfs_getxattr 
> plays with and see how it works and what it looks like.

As you say, SELinux is not the only user of __vfs_getxattr; in addition 
to the other security modules, there is the integrity/evm subsystem and 
ecryptfs.  Further, __vfs_getxattr does not merely skip 
LSM/SELinux-related processing; it also skips xattr_permission().  As 
such, I don't believe this is something that can be solved entirely 
within the security subsystem.

Not excited about a process flag to implicitly disable LSM/SELinux and 
other security-related processing on a code path; potential for abuse is 
high.

^ permalink raw reply

* [PATCH v11 4/4] overlayfs: override_creds=off option bypass creator_cred
From: Mark Salyzyn @ 2019-07-30 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc
In-Reply-To: <20190730155227.41468-1-salyzyn@android.com>

By default, all access to the upper, lower and work directories is the
recorded mounter's MAC and DAC credentials.  The incoming accesses are
checked against the caller's credentials.

If the principles of least privilege are applied, the mounter's
credentials might not overlap the credentials of the caller's when
accessing the overlayfs filesystem.  For example, a file that a lower
DAC privileged caller can execute, is MAC denied to the generally
higher DAC privileged mounter, to prevent an attack vector.

We add the option to turn off override_creds in the mount options; all
subsequent operations after mount on the filesystem will be only the
caller's credentials.  The module boolean parameter and mount option
override_creds is also added as a presence check for this "feature",
existence of /sys/module/overlay/parameters/override_creds.

It was not always this way.  Circa 4.6 there was no recorded mounter's
credentials, instead privileged access to upper or work directories
were temporarily increased to perform the operations.  The MAC
(selinux) policies were caller's in all cases.  override_creds=off
partially returns us to this older access model minus the insecure
temporary credential increases.  This is to permit use in a system
with non-overlapping security models for each executable including
the agent that mounts the overlayfs filesystem.  In Android
this is the case since init, which performs the mount operations,
has a minimal MAC set of privileges to reduce any attack surface,
and services that use the content have a different set of MAC
privileges (eg: read, for vendor labelled configuration, execute for
vendor libraries and modules).  The caveats are not a problem in
the Android usage model, however they should be fixed for
completeness and for general use in time.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com
---
v11:
- Add sb argument to ovl_revert_creds to match future work.

v10:
- Rebase (and expand because of increased revert_cred usage)

v9:
- Add to the caveats

v8:
- drop pr_warn message after straw poll to remove it.
- added a use case in the commit message

v7:
- change name of internal parameter to ovl_override_creds_def
- report override_creds only if different than default

v6:
- Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
- Do better with the documentation.
- pr_warn message adjusted to report consequences.

v5:
- beefed up the caveats in the Documentation
- Is dependent on
  "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
  "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
- Added prwarn when override_creds=off

v4:
- spelling and grammar errors in text

v3:
- Change name from caller_credentials / creator_credentials to the
  boolean override_creds.
- Changed from creator to mounter credentials.
- Updated and fortified the documentation.
- Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS

v2:
- Forward port changed attr to stat, resulting in a build error.
- altered commit message.

a
---
 Documentation/filesystems/overlayfs.txt | 23 +++++++++++++++++++++++
 fs/overlayfs/copy_up.c                  |  2 +-
 fs/overlayfs/dir.c                      | 11 ++++++-----
 fs/overlayfs/file.c                     | 20 ++++++++++----------
 fs/overlayfs/inode.c                    | 18 +++++++++---------
 fs/overlayfs/namei.c                    |  6 +++---
 fs/overlayfs/overlayfs.h                |  1 +
 fs/overlayfs/ovl_entry.h                |  1 +
 fs/overlayfs/readdir.c                  |  4 ++--
 fs/overlayfs/super.c                    | 22 +++++++++++++++++++++-
 fs/overlayfs/util.c                     | 12 ++++++++++--
 11 files changed, 87 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
index 1da2f1668f08..d48125076602 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.txt
@@ -102,6 +102,29 @@ Only the lists of names from directories are merged.  Other content
 such as metadata and extended attributes are reported for the upper
 directory only.  These attributes of the lower directory are hidden.
 
+credentials
+-----------
+
+By default, all access to the upper, lower and work directories is the
+recorded mounter's MAC and DAC credentials.  The incoming accesses are
+checked against the caller's credentials.
+
+In the case where caller MAC or DAC credentials do not overlap, a
+use case available in older versions of the driver, the
+override_creds mount flag can be turned off and help when the use
+pattern has caller with legitimate credentials where the mounter
+does not.  Several unintended side effects will occur though.  The
+caller without certain key capabilities or lower privilege will not
+always be able to delete files or directories, create nodes, or
+search some restricted directories.  The ability to search and read
+a directory entry is spotty as a result of the cache mechanism not
+retesting the credentials because of the assumption, a privileged
+caller can fill cache, then a lower privilege can read the directory
+cache.  The uneven security model where cache, upperdir and workdir
+are opened at privilege, but accessed without creating a form of
+privilege escalation, should only be used with strict understanding
+of the side effects and of the security policies.
+
 whiteouts and opaque directories
 --------------------------------
 
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index b801c6353100..1c1b9415e533 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -886,7 +886,7 @@ int ovl_copy_up_flags(struct dentry *dentry, int flags)
 		dput(parent);
 		dput(next);
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	return err;
 }
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 702aa63f6774..49b8ffc1294f 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -563,7 +563,8 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
 		override_cred->fsgid = inode->i_gid;
 		if (!attr->hardlink) {
 			err = security_dentry_create_files_as(dentry,
-					attr->mode, &dentry->d_name, old_cred,
+					attr->mode, &dentry->d_name,
+					old_cred ? old_cred : current_cred(),
 					override_cred);
 			if (err) {
 				put_cred(override_cred);
@@ -579,7 +580,7 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
 			err = ovl_create_over_whiteout(dentry, inode, attr);
 	}
 out_revert_creds:
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	return err;
 }
 
@@ -655,7 +656,7 @@ static int ovl_set_link_redirect(struct dentry *dentry)
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	err = ovl_set_redirect(dentry, false);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	return err;
 }
@@ -851,7 +852,7 @@ static int ovl_do_remove(struct dentry *dentry, bool is_dir)
 		err = ovl_remove_upper(dentry, is_dir, &list);
 	else
 		err = ovl_remove_and_whiteout(dentry, &list);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	if (!err) {
 		if (is_dir)
 			clear_nlink(dentry->d_inode);
@@ -1221,7 +1222,7 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 out_unlock:
 	unlock_rename(new_upperdir, old_upperdir);
 out_revert_creds:
-	revert_creds(old_cred);
+	ovl_revert_creds(old->d_sb, old_cred);
 	if (update_nlink)
 		ovl_nlink_end(new);
 out_drop_write:
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index e235a635d9ec..627a303c95da 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -32,7 +32,7 @@ static struct file *ovl_open_realfile(const struct file *file,
 	old_cred = ovl_override_creds(inode->i_sb);
 	realfile = open_with_fake_path(&file->f_path, flags, realinode,
 				       current_cred());
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
 		 file, file, ovl_whatisit(inode, realinode), file->f_flags,
@@ -176,7 +176,7 @@ static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
 
 	old_cred = ovl_override_creds(inode->i_sb);
 	ret = vfs_llseek(real.file, offset, whence);
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	file->f_pos = real.file->f_pos;
 	inode_unlock(inode);
@@ -242,7 +242,7 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
 			    ovl_iocb_to_rwf(iocb));
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	ovl_file_accessed(file);
 
@@ -278,7 +278,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
 			     ovl_iocb_to_rwf(iocb));
 	file_end_write(real.file);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	/* Update size */
 	ovl_copyattr(ovl_inode_real(inode), inode);
@@ -305,7 +305,7 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	if (file_inode(real.file) == ovl_inode_upper(file_inode(file))) {
 		old_cred = ovl_override_creds(file_inode(file)->i_sb);
 		ret = vfs_fsync_range(real.file, start, end, datasync);
-		revert_creds(old_cred);
+		ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 	}
 
 	fdput(real);
@@ -329,7 +329,7 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = call_mmap(vma->vm_file, vma);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	if (ret) {
 		/* Drop reference count from new vm_file value */
@@ -357,7 +357,7 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_fallocate(real.file, mode, offset, len);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	/* Update size */
 	ovl_copyattr(ovl_inode_real(inode), inode);
@@ -379,7 +379,7 @@ static int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_fadvise(real.file, offset, len, advice);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	fdput(real);
 
@@ -399,7 +399,7 @@ static long ovl_real_ioctl(struct file *file, unsigned int cmd,
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_ioctl(real.file, cmd, arg);
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	fdput(real);
 
@@ -589,7 +589,7 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 						flags);
 		break;
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(file_inode(file)->i_sb, old_cred);
 
 	/* Update size */
 	ovl_copyattr(ovl_inode_real(inode_out), inode_out);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 7663aeb85fa3..420143dac15f 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -61,7 +61,7 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 		inode_lock(upperdentry->d_inode);
 		old_cred = ovl_override_creds(dentry->d_sb);
 		err = notify_change(upperdentry, attr, NULL);
-		revert_creds(old_cred);
+		ovl_revert_creds(dentry->d_sb, old_cred);
 		if (!err)
 			ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
 		inode_unlock(upperdentry->d_inode);
@@ -257,7 +257,7 @@ int ovl_getattr(const struct path *path, struct kstat *stat,
 		stat->nlink = dentry->d_inode->i_nlink;
 
 out:
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	return err;
 }
@@ -291,7 +291,7 @@ int ovl_permission(struct inode *inode, int mask)
 		mask |= MAY_READ;
 	}
 	err = inode_permission(realinode, mask);
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	return err;
 }
@@ -308,7 +308,7 @@ static const char *ovl_get_link(struct dentry *dentry,
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	p = vfs_get_link(ovl_dentry_real(dentry), done);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	return p;
 }
 
@@ -351,7 +351,7 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
 		WARN_ON(flags != XATTR_REPLACE);
 		err = vfs_removexattr(realdentry, name);
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	/* copy c/mtime */
 	ovl_copyattr(d_inode(realdentry), inode);
@@ -372,7 +372,7 @@ int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	res = vfs_getxattr(realdentry, name, value, size);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	return res;
 }
 
@@ -396,7 +396,7 @@ ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size)
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	res = vfs_listxattr(realdentry, list, size);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	if (res <= 0 || size == 0)
 		return res;
 
@@ -431,7 +431,7 @@ struct posix_acl *ovl_get_acl(struct inode *inode, int type)
 
 	old_cred = ovl_override_creds(inode->i_sb);
 	acl = get_acl(realinode, type);
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	return acl;
 }
@@ -469,7 +469,7 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		filemap_write_and_wait(realinode->i_mapping);
 
 	err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
-	revert_creds(old_cred);
+	ovl_revert_creds(inode->i_sb, old_cred);
 
 	return err;
 }
diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index a4a452c489fa..bab1f97dc201 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -1079,7 +1079,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 			goto out_free_oe;
 	}
 
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	if (origin_path) {
 		dput(origin_path->dentry);
 		kfree(origin_path);
@@ -1106,7 +1106,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	kfree(upperredirect);
 out:
 	kfree(d.redirect);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	return ERR_PTR(err);
 }
 
@@ -1160,7 +1160,7 @@ bool ovl_lower_positive(struct dentry *dentry)
 			dput(this);
 		}
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 	return positive;
 }
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 9c7c72af1550..dea2253f6bdb 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -205,6 +205,7 @@ int ovl_want_write(struct dentry *dentry);
 void ovl_drop_write(struct dentry *dentry);
 struct dentry *ovl_workdir(struct dentry *dentry);
 const struct cred *ovl_override_creds(struct super_block *sb);
+void ovl_revert_creds(struct super_block *sb, const struct cred *oldcred);
 ssize_t ovl_do_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
 			    size_t size);
 struct super_block *ovl_same_sb(struct super_block *sb);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 28a2d12a1029..2637c5aadf7f 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -17,6 +17,7 @@ struct ovl_config {
 	bool nfs_export;
 	int xino;
 	bool metacopy;
+	bool override_creds;
 };
 
 struct ovl_sb {
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 47a91c9733a5..874a1b3ff99a 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -286,7 +286,7 @@ static int ovl_check_whiteouts(struct dentry *dir, struct ovl_readdir_data *rdd)
 		}
 		inode_unlock(dir->d_inode);
 	}
-	revert_creds(old_cred);
+	ovl_revert_creds(rdd->dentry->d_sb, old_cred);
 
 	return err;
 }
@@ -918,7 +918,7 @@ int ovl_check_empty_dir(struct dentry *dentry, struct list_head *list)
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	err = ovl_dir_read_merged(dentry, list, &root);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 	if (err)
 		return err;
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index b368e2e102fa..9a16dd120025 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -53,6 +53,11 @@ module_param_named(xino_auto, ovl_xino_auto_def, bool, 0644);
 MODULE_PARM_DESC(xino_auto,
 		 "Auto enable xino feature");
 
+static bool __read_mostly ovl_override_creds_def = true;
+module_param_named(override_creds, ovl_override_creds_def, bool, 0644);
+MODULE_PARM_DESC(ovl_override_creds_def,
+		 "Use mounter's credentials for accesses");
+
 static void ovl_entry_stack_free(struct ovl_entry *oe)
 {
 	unsigned int i;
@@ -362,6 +367,9 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	if (ofs->config.metacopy != ovl_metacopy_def)
 		seq_printf(m, ",metacopy=%s",
 			   ofs->config.metacopy ? "on" : "off");
+	if (ofs->config.override_creds != ovl_override_creds_def)
+		seq_show_option(m, "override_creds",
+				ofs->config.override_creds ? "on" : "off");
 	return 0;
 }
 
@@ -402,6 +410,8 @@ enum {
 	OPT_XINO_AUTO,
 	OPT_METACOPY_ON,
 	OPT_METACOPY_OFF,
+	OPT_OVERRIDE_CREDS_ON,
+	OPT_OVERRIDE_CREDS_OFF,
 	OPT_ERR,
 };
 
@@ -420,6 +430,8 @@ static const match_table_t ovl_tokens = {
 	{OPT_XINO_AUTO,			"xino=auto"},
 	{OPT_METACOPY_ON,		"metacopy=on"},
 	{OPT_METACOPY_OFF,		"metacopy=off"},
+	{OPT_OVERRIDE_CREDS_ON,		"override_creds=on"},
+	{OPT_OVERRIDE_CREDS_OFF,	"override_creds=off"},
 	{OPT_ERR,			NULL}
 };
 
@@ -478,6 +490,7 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 	config->redirect_mode = kstrdup(ovl_redirect_mode_def(), GFP_KERNEL);
 	if (!config->redirect_mode)
 		return -ENOMEM;
+	config->override_creds = ovl_override_creds_def;
 
 	while ((p = ovl_next_opt(&opt)) != NULL) {
 		int token;
@@ -558,6 +571,14 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 			config->metacopy = false;
 			break;
 
+		case OPT_OVERRIDE_CREDS_ON:
+			config->override_creds = true;
+			break;
+
+		case OPT_OVERRIDE_CREDS_OFF:
+			config->override_creds = false;
+			break;
+
 		default:
 			pr_err("overlayfs: unrecognized mount option \"%s\" or missing value\n", p);
 			return -EINVAL;
@@ -1672,7 +1693,6 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 		       ovl_dentry_lower(root_dentry), NULL);
 
 	sb->s_root = root_dentry;
-
 	return 0;
 
 out_free_oe:
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index f80b95423043..4720a7a6fea3 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -37,9 +37,17 @@ const struct cred *ovl_override_creds(struct super_block *sb)
 {
 	struct ovl_fs *ofs = sb->s_fs_info;
 
+	if (!ofs->config.override_creds)
+		return NULL;
 	return override_creds(ofs->creator_cred);
 }
 
+void ovl_revert_creds(struct super_block *sb, const struct cred *old_cred)
+{
+	if (old_cred)
+		revert_creds(old_cred);
+}
+
 ssize_t ovl_do_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
 			    size_t size)
 {
@@ -797,7 +805,7 @@ int ovl_nlink_start(struct dentry *dentry)
 	 * value relative to the upper inode nlink in an upper inode xattr.
 	 */
 	err = ovl_set_nlink_upper(dentry);
-	revert_creds(old_cred);
+	ovl_revert_creds(dentry->d_sb, old_cred);
 
 out:
 	if (err)
@@ -815,7 +823,7 @@ void ovl_nlink_end(struct dentry *dentry)
 
 		old_cred = ovl_override_creds(dentry->d_sb);
 		ovl_cleanup_index(dentry);
-		revert_creds(old_cred);
+		ovl_revert_creds(dentry->d_sb, old_cred);
 	}
 
 	ovl_inode_unlock(inode);
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v11 3/4] overlayfs: internal getxattr operations without sepolicy checking
From: Mark Salyzyn @ 2019-07-30 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc
In-Reply-To: <20190730155227.41468-1-salyzyn@android.com>

Check impure, opaque, origin & meta xattr with no sepolicy audit
(using __vfs_getxattr) since these operations are internal to
overlayfs operations and do not disclose any data.  This became
an issue for credential override off since sys_admin would have
been required by the caller; whereas would have been inherently
present for the creator since it performed the mount.

This is a change in operations since we do not check in the new
ovl_do_vfs_getxattr function if the credential override is off or
not.  Reasoning is that the sepolicy check is unnecessary overhead,
especially since the check can be expensive.

Because for override credentials off, this affects _everyone_ that
underneath performs private xattr calls without the appropriate
sepolicy permissions and sys_admin capability.  Providing blanket
support for sys_admin would be bad for all possible callers.

For the override credentials on, this will affect only the mounter,
should it lack sepolicy permissions. Not considered a security
problem since mounting by definition has sys_admin capabilities,
but sepolicy contexts would still need to be crafted.

It should be noted that there is precedence, __vfs_getxattr is used
in other filesystems for their own internal trusted xattr management.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com
---
v11 - Switch name to ovl_do_vfs_getxattr, fortify comment.

v10 - Added to patch series.
---
 fs/overlayfs/namei.c     | 12 +++++++-----
 fs/overlayfs/overlayfs.h |  2 ++
 fs/overlayfs/util.c      | 24 +++++++++++++++---------
 3 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index 9702f0d5309d..a4a452c489fa 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -106,10 +106,11 @@ int ovl_check_fh_len(struct ovl_fh *fh, int fh_len)
 
 static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
 {
-	int res, err;
+	ssize_t res;
+	int err;
 	struct ovl_fh *fh = NULL;
 
-	res = vfs_getxattr(dentry, name, NULL, 0);
+	res = ovl_do_vfs_getxattr(dentry, name, NULL, 0);
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return NULL;
@@ -123,7 +124,7 @@ static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
 	if (!fh)
 		return ERR_PTR(-ENOMEM);
 
-	res = vfs_getxattr(dentry, name, fh, res);
+	res = ovl_do_vfs_getxattr(dentry, name, fh, res);
 	if (res < 0)
 		goto fail;
 
@@ -141,10 +142,11 @@ static struct ovl_fh *ovl_get_fh(struct dentry *dentry, const char *name)
 	return NULL;
 
 fail:
-	pr_warn_ratelimited("overlayfs: failed to get origin (%i)\n", res);
+	pr_warn_ratelimited("overlayfs: failed to get origin (%zi)\n", res);
 	goto out;
 invalid:
-	pr_warn_ratelimited("overlayfs: invalid origin (%*phN)\n", res, fh);
+	pr_warn_ratelimited("overlayfs: invalid origin (%*phN)\n",
+			    (int)res, fh);
 	goto out;
 }
 
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 6934bcf030f0..9c7c72af1550 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -205,6 +205,8 @@ int ovl_want_write(struct dentry *dentry);
 void ovl_drop_write(struct dentry *dentry);
 struct dentry *ovl_workdir(struct dentry *dentry);
 const struct cred *ovl_override_creds(struct super_block *sb);
+ssize_t ovl_do_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
+			    size_t size);
 struct super_block *ovl_same_sb(struct super_block *sb);
 int ovl_can_decode_fh(struct super_block *sb);
 struct dentry *ovl_indexdir(struct super_block *sb);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index f5678a3f8350..f80b95423043 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -40,6 +40,12 @@ const struct cred *ovl_override_creds(struct super_block *sb)
 	return override_creds(ofs->creator_cred);
 }
 
+ssize_t ovl_do_vfs_getxattr(struct dentry *dentry, const char *name, void *buf,
+			    size_t size)
+{
+	return __vfs_getxattr(dentry, d_inode(dentry), name, buf, size);
+}
+
 struct super_block *ovl_same_sb(struct super_block *sb)
 {
 	struct ovl_fs *ofs = sb->s_fs_info;
@@ -537,9 +543,9 @@ void ovl_copy_up_end(struct dentry *dentry)
 
 bool ovl_check_origin_xattr(struct dentry *dentry)
 {
-	int res;
+	ssize_t res;
 
-	res = vfs_getxattr(dentry, OVL_XATTR_ORIGIN, NULL, 0);
+	res = ovl_do_vfs_getxattr(dentry, OVL_XATTR_ORIGIN, NULL, 0);
 
 	/* Zero size value means "copied up but origin unknown" */
 	if (res >= 0)
@@ -550,13 +556,13 @@ bool ovl_check_origin_xattr(struct dentry *dentry)
 
 bool ovl_check_dir_xattr(struct dentry *dentry, const char *name)
 {
-	int res;
+	ssize_t res;
 	char val;
 
 	if (!d_is_dir(dentry))
 		return false;
 
-	res = vfs_getxattr(dentry, name, &val, 1);
+	res = ovl_do_vfs_getxattr(dentry, name, &val, 1);
 	if (res == 1 && val == 'y')
 		return true;
 
@@ -837,13 +843,13 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
 /* err < 0, 0 if no metacopy xattr, 1 if metacopy xattr found */
 int ovl_check_metacopy_xattr(struct dentry *dentry)
 {
-	int res;
+	ssize_t res;
 
 	/* Only regular files can have metacopy xattr */
 	if (!S_ISREG(d_inode(dentry)->i_mode))
 		return 0;
 
-	res = vfs_getxattr(dentry, OVL_XATTR_METACOPY, NULL, 0);
+	res = ovl_do_vfs_getxattr(dentry, OVL_XATTR_METACOPY, NULL, 0);
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return 0;
@@ -852,7 +858,7 @@ int ovl_check_metacopy_xattr(struct dentry *dentry)
 
 	return 1;
 out:
-	pr_warn_ratelimited("overlayfs: failed to get metacopy (%i)\n", res);
+	pr_warn_ratelimited("overlayfs: failed to get metacopy (%zi)\n", res);
 	return res;
 }
 
@@ -878,7 +884,7 @@ ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 	ssize_t res;
 	char *buf = NULL;
 
-	res = vfs_getxattr(dentry, name, NULL, 0);
+	res = ovl_do_vfs_getxattr(dentry, name, NULL, 0);
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return -ENODATA;
@@ -890,7 +896,7 @@ ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 		if (!buf)
 			return -ENOMEM;
 
-		res = vfs_getxattr(dentry, name, buf, res);
+		res = ovl_do_vfs_getxattr(dentry, name, buf, res);
 		if (res < 0)
 			goto fail;
 	}
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v11 2/4] fs: __vfs_getxattr nesting paradigm
From: Mark Salyzyn @ 2019-07-30 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc, Alexander Viro,
	Ingo Molnar, Peter Zijlstra, linux-fsdevel
In-Reply-To: <20190730155227.41468-1-salyzyn@android.com>

Add a per-thread PF_NO_SECURITY flag that ensures that nested calls
that result in vfs_getxattr do not fall under security framework
scrutiny.  Use cases include selinux when acquiring the xattr data
to evaluate security, and internal trusted xattr data soleley managed
by the filesystem drivers.

This handles the case of a union filesystem driver that is being
requested by the security layer to report back the data that is the
target label or context embedded into wrapped filesystem's xattr.

For the use case where access is to be blocked by the security layer.

The path then could be security(dentry) -> __vfs_getxattr(dentry) ->
handler->get(dentry) -> __vfs_getxattr(lower_dentry) ->
lower_handler->get(lower_dentry) which would report back through the
chain data and success as expected, but the logging security layer at
the top would have the data to determine the access permissions and
report back the target context that was blocked.

Without the nesting check, the path on a union filesystem would be
the errant security(dentry) -> __vfs_getxattr(dentry) ->
handler->get(dentry) -> vfs_getxattr(lower_dentry) -> *nested*
security(lower_dentry, log off) -> lower_handler->get(lower_dentry)
which would report back through the chain no data, and -EACCES.

For selinux for both cases, this would translate to a correctly
determined blocked access. In the first corrected case a correct avc
log would be reported, in the second legacy case an incorrect avc log
would be reported against an uninitialized u:object_r:unlabeled:s0
context making the logs cosmetically useless for audit2allow.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com
---
v11 - squish out v10 introduced patch 2 and 3 in the series,
      then use per-thread flag instead for nesting.
---
 fs/xattr.c            | 10 +++++++++-
 include/linux/sched.h |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 90dd78f0eb27..46ebd5014e01 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -302,13 +302,19 @@ __vfs_getxattr(struct dentry *dentry, struct inode *inode, const char *name,
 	       void *value, size_t size)
 {
 	const struct xattr_handler *handler;
+	ssize_t ret;
+	unsigned int flags;
 
 	handler = xattr_resolve_name(inode, &name);
 	if (IS_ERR(handler))
 		return PTR_ERR(handler);
 	if (!handler->get)
 		return -EOPNOTSUPP;
-	return handler->get(handler, dentry, inode, name, value, size);
+	flags = current->flags;
+	current->flags |= PF_NO_SECURITY;
+	ret = handler->get(handler, dentry, inode, name, value, size);
+	current_restore_flags(flags, PF_NO_SECURITY);
+	return ret;
 }
 EXPORT_SYMBOL(__vfs_getxattr);
 
@@ -318,6 +324,8 @@ vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
 	struct inode *inode = dentry->d_inode;
 	int error;
 
+	if (unlikely(current->flags & PF_NO_SECURITY))
+		goto nolsm;
 	error = xattr_permission(inode, name, MAY_READ);
 	if (error)
 		return error;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8dc1811487f5..5cda3ff89d4e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1468,6 +1468,7 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_NO_SECURITY		0x20000000	/* nested security context */
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v11 1/4] overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh
From: Mark Salyzyn @ 2019-07-30 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc
In-Reply-To: <20190730155227.41468-1-salyzyn@android.com>

Assumption never checked, should fail if the mounter creds are not
sufficient.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com
---
v11 - Rebase

v10:
- return NULL rather than ERR_PTR(-EPERM)
- did _not_ add it ovl_can_decode_fh() because of changes since last
  review, suspect needs to be added to ovl_lower_uuid_ok()?

v8 + v9:
- rebase

v7:
- This time for realz

v6:
- rebase

v5:
- dependency of "overlayfs: override_creds=off option bypass creator_cred"
---
 fs/overlayfs/namei.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index e9717c2f7d45..9702f0d5309d 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -161,6 +161,9 @@ struct dentry *ovl_decode_real_fh(struct ovl_fh *fh, struct vfsmount *mnt,
 	if (!uuid_equal(&fh->uuid, &mnt->mnt_sb->s_uuid))
 		return NULL;
 
+	if (!capable(CAP_DAC_READ_SEARCH))
+		return NULL;
+
 	bytes = (fh->len - offsetof(struct ovl_fh, fid));
 	real = exportfs_decode_fh(mnt, (struct fid *)fh->fid,
 				  bytes >> 2, (int)fh->type,
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v11 0/4] overlayfs override_creds=off
From: Mark Salyzyn @ 2019-07-30 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Mark Salyzyn, Miklos Szeredi, Jonathan Corbet,
	Vivek Goyal, Eric W . Biederman, Amir Goldstein, Randy Dunlap,
	Stephen Smalley, linux-unionfs, linux-doc

Patch series:

overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh
fs: __vfs_getxattr nesting paradigm
overlayfs: internal getxattr operations without sepolicy checking
overlayfs: override_creds=off option bypass creator_cred

The first three patches address fundamental security issues that should
be solved regardless of the override_creds=off feature.
on them).

The fourth adds the feature depends on these other fixes.

By default, all access to the upper, lower and work directories is the
recorded mounter's MAC and DAC credentials.  The incoming accesses are
checked against the caller's credentials.

If the principles of least privilege are applied for sepolicy, the
mounter's credentials might not overlap the credentials of the caller's
when accessing the overlayfs filesystem.  For example, a file that a
lower DAC privileged caller can execute, is MAC denied to the
generally higher DAC privileged mounter, to prevent an attack vector.

We add the option to turn off override_creds in the mount options; all
subsequent operations after mount on the filesystem will be only the
caller's credentials.  The module boolean parameter and mount option
override_creds is also added as a presence check for this "feature",
existence of /sys/module/overlay/parameters/overlay_creds

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

---
v11:
- Squish out v10 introduced patch 2 and 3 in the series,
  then and use per-thread flag instead for nesting.
- Switch name to ovl_do_vds_getxattr for __vds_getxattr wrapper.
- Add sb argument to ovl_revert_creds to match future work.

v10:
- Return NULL on CAP_DAC_READ_SEARCH
- Add __get xattr method to solve sepolicy logging issue
- Drop unnecessary sys_admin sepolicy checking for administrative
  driver internal xattr functions.

v6:
- Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
- Do better with the documentation, drop rationalizations.
- pr_warn message adjusted to report consequences.

v5:
- beefed up the caveats in the Documentation
- Is dependent on
  "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
  "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
- Added prwarn when override_creds=off

v4:
- spelling and grammar errors in text

v3:
- Change name from caller_credentials / creator_credentials to the
  boolean override_creds.
- Changed from creator to mounter credentials.
- Updated and fortified the documentation.
- Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS

v2:
- Forward port changed attr to stat, resulting in a build error.
- altered commit message.

^ permalink raw reply

* Re: [PATCH v6 1/2] arm64: Define Documentation/arm64/tagged-address-abi.rst
From: Kevin Brodsky @ 2019-07-30 14:48 UTC (permalink / raw)
  To: Vincenzo Frascino, linux-arm-kernel, linux-doc, linux-mm,
	linux-arch, linux-kselftest, linux-kernel
  Cc: Szabolcs Nagy, Catalin Marinas, Will Deacon, Andrey Konovalov
In-Reply-To: <fb2e7693-9fc9-da47-0c8d-a8367cf8060f@arm.com>

On 30/07/2019 15:24, Vincenzo Frascino wrote:
> Hi Kevin,
>
> On 7/30/19 2:57 PM, Kevin Brodsky wrote:
>> On 30/07/2019 14:25, Vincenzo Frascino wrote:
>>> Hi Kevin,
>>>
>>> On 7/30/19 11:32 AM, Kevin Brodsky wrote:
>>>> Some more comments. Mostly minor wording issues, except the prctl() exclusion at
>>>> the end.
>>>>
>>>> On 25/07/2019 14:50, Vincenzo Frascino wrote:
>>>>> On arm64 the TCR_EL1.TBI0 bit has been always enabled hence
>>>>> the userspace (EL0) is allowed to set a non-zero value in the
>>>>> top byte but the resulting pointers are not allowed at the
>>>>> user-kernel syscall ABI boundary.
>>>>>
>>>>> With the relaxed ABI proposed through this document, it is now possible
>>>>> to pass tagged pointers to the syscalls, when these pointers are in
>>>>> memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap().
>>>>>
>>>>> This change in the ABI requires a mechanism to requires the userspace
>>>>> to opt-in to such an option.
>>>>>
>>>>> Specify and document the way in which sysctl and prctl() can be used
>>>>> in combination to allow the userspace to opt-in this feature.
>>>>>
>>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>>>> Cc: Will Deacon <will.deacon@arm.com>
>>>>> CC: Andrey Konovalov <andreyknvl@google.com>
>>>>> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
>>>>> Acked-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
>>>>> ---
>>>>>     Documentation/arm64/tagged-address-abi.rst | 148 +++++++++++++++++++++
>>>>>     1 file changed, 148 insertions(+)
>>>>>     create mode 100644 Documentation/arm64/tagged-address-abi.rst
>>>>>
>>>>> diff --git a/Documentation/arm64/tagged-address-abi.rst
>>>>> b/Documentation/arm64/tagged-address-abi.rst
>>>>> new file mode 100644
>>>>> index 000000000000..a8ecb991de82
>>>>> --- /dev/null
>>>>> +++ b/Documentation/arm64/tagged-address-abi.rst
>>>>> @@ -0,0 +1,148 @@
>>>>> +========================
>>>>> +ARM64 TAGGED ADDRESS ABI
>>>>> +========================
>>>>> +
>>>>> +Author: Vincenzo Frascino <vincenzo.frascino@arm.com>
>>>>> +
>>>>> +Date: 25 July 2019
>>>>> +
>>>>> +This document describes the usage and semantics of the Tagged Address
>>>>> +ABI on arm64.
>>>>> +
>>>>> +1. Introduction
>>>>> +---------------
>>>>> +
>>>>> +On arm64 the TCR_EL1.TBI0 bit has always been enabled on the kernel, hence
>>>>> +the userspace (EL0) is entitled to perform a user memory access through a
>>>>> +64-bit pointer with a non-zero top byte but the resulting pointers are not
>>>>> +allowed at the user-kernel syscall ABI boundary.
>>>>> +
>>>>> +This document describes a relaxation of the ABI that makes it possible to
>>>>> +to pass tagged pointers to the syscalls, when these pointers are in memory
>>>> One too many "to" (at the end the previous line).
>>>>
>>> Yep will fix in v7.
>>>
>>>>> +ranges obtained as described in section 2.
>>>>> +
>>>>> +Since it is not desirable to relax the ABI to allow tagged user addresses
>>>>> +into the kernel indiscriminately, arm64 provides a new sysctl interface
>>>>> +(/proc/sys/abi/tagged_addr) that is used to prevent the applications from
>>>>> +enabling the relaxed ABI and a new prctl() interface that can be used to
>>>>> +enable or disable the relaxed ABI.
>>>>> +A detailed description of the newly introduced mechanisms will be provided
>>>>> +in section 2.
>>>>> +
>>>>> +2. ARM64 Tagged Address ABI
>>>>> +---------------------------
>>>>> +
>>>>> +From the kernel syscall interface perspective, we define, for the purposes
>>>>> +of this document, a "valid tagged pointer" as a pointer that either has a
>>>>> +zero value set in the top byte or has a non-zero value, is in memory ranges
>>>>> +privately owned by a userspace process and is obtained in one of the
>>>>> +following ways:
>>>>> +- mmap() done by the process itself, where either:
>>>>> +
>>>>> +  - flags have **MAP_PRIVATE** and **MAP_ANONYMOUS**
>>>>> +  - flags have **MAP_PRIVATE** and the file descriptor refers to a regular
>>>>> +    file or **/dev/zero**
>>>>> +
>>>>> +- brk() system call done by the process itself (i.e. the heap area between
>>>>> +  the initial location of the program break at process creation and its
>>>>> +  current location).
>>>>> +- any memory mapped by the kernel in the process's address space during
>>>>> +  creation and with the same restrictions as for mmap() (e.g. data, bss,
>>>>> +  stack).
>>>>> +
>>>>> +The ARM64 Tagged Address ABI is an opt-in feature, and an application can
>>>>> +control it using the following:
>>>>> +
>>>>> +- **/proc/sys/abi/tagged_addr**: a new sysctl interface that can be used to
>>>>> +  prevent the applications from enabling the access to the relaxed ABI.
>>>>> +  The sysctl supports the following configuration options:
>>>>> +
>>>>> +  - **0**: Disable the access to the ARM64 Tagged Address ABI for all
>>>>> +    the applications.
>>>>> +  - **1** (Default): Enable the access to the ARM64 Tagged Address ABI for
>>>>> +    all the applications.
>>>>> +
>>>>> +   If the access to the ARM64 Tagged Address ABI is disabled at a certain
>>>>> +   point in time, all the applications that were using tagging before this
>>>>> +   event occurs, will continue to use tagging.
>>>> "tagging" may be misinterpreted here. I would be more explicit by saying that
>>>> the tagged address ABI remains enabled in processes that opted in before the
>>>> access got disabled.
>>>>
>>> Assuming that ARM64 Tagged Address ABI gives access to "tagging" and since it is
>>> what this document is talking about, I do not see how it can be misinterpreted ;)
>> "tagging" is a confusing term ("using tagging" even more so), it could be
>> interpreted as memory tagging (especially in the presence of MTE). This document
>> does not use "tagging" anywhere else, which is good. Let's stick to the same
>> name for the ABI throughout the document, repetition is less problematic than
>> vague wording.
>>
> This document does not cover MTE, it covers the "ARM64 Tagged Address ABI" hence
> "tagging" has a precise semantical meaning in this context. Still I do not see
> how it can be confused.
>
>>>>> +- **prctl()s**:
>>>>> +
>>>>> +  - **PR_SET_TAGGED_ADDR_CTRL**: Invoked by a process, can be used to
>>>>> enable or
>>>>> +    disable its access to the ARM64 Tagged Address ABI.
>>>> I still find the wording confusing, because "access to the ABI" is not used
>>>> consistently. The "tagged_addr" sysctl enables *access to the ABI*, that's fine.
>>>> However, PR_SET_TAGGED_ADDR_CTRL enables *the ABI itself* (which is only
>>>> possible if access to the ABI is enabled).
>>>>
>>> As it stands, it enables or disables the ABI itself when used with
>>> PR_TAGGED_ADDR_ENABLE, or can enable other things in future. IMHO the only thing
>>> that these features have in common is the access to the ABI which is granted by
>>> this prctl().
>> I see your point, you could have other bits controlling other aspects. However,
>> I would really avoid saying that this prctl is used to enable or disable access
>> to the new ABI, because it isn't (either you have access to the new ABI and this
>> prctl can be used, or you don't and this prctl will fail).
>>
> What is the system wide evidence that the access to the ABI is denied? Or what
> is the system wide evidence that it is granted?
>
> In other words, is it enough for a process to have the sysctl set (system wide)
> to know that the the ABI is enabled and have granted access to it? or does it
> need to do something else?

I think we really have a wording problem here, which is why this part of the document 
and this discussion is confusing.

tagged_addr=1 (system-wide) allows processes to enable the tagged address ABI by 
calling prctl(PR_SET_TAGGED_ADDR_CTRL). It does not alter the state of any running 
process, and does not enable the ABI by default for new processes either. Conversely, 
when tagged_addr=0, that prctl() is always denied.

The current description of the sysctl and prctl does not make that clear. I think 
that it would be much more obvious by reorganising that section as such:
- prctl() first, the current wording is fine.
- sysctl() second, described *only* in terms of the prctl() (denying 
PR_SET_TAGGED_ADDR_CTRL or not), and nothing else, to avoid wording issues.

It's certainly not the only way to do it, but that would be much clearer to me :)

Kevin

>>>>> +
>>>>> +    The (unsigned int) arg2 argument is a bit mask describing the control mode
>>>>> +    used:
>>>>> +
>>>>> +    - **PR_TAGGED_ADDR_ENABLE**: Enable ARM64 Tagged Address ABI.
>>>>> +
>>>>> +    The prctl(PR_SET_TAGGED_ADDR_CTRL, ...) will return -EINVAL if the ARM64
>>>>> +    Tagged Address ABI is not available.
>>>> For clarity, it would be good to mention that one possible reason for the ABI
>>>> not to be available is tagged_addr == 0.
>>>>
>>> The logical implication is already quite clear tagged_addr == 0 (Disabled) =>
>>> Tagged Address ABI not available => return -EINVAL. I do not see the need to
>>> repeat the concept twice.
>>>
>>>>> +
>>>>> +    The arguments arg3, arg4, and arg5 are ignored.
>>>>> +  - **PR_GET_TAGGED_ADDR_CTRL**: can be used to check the status of the Tagged
>>>>> +    Address ABI.
>>>>> +
>>>>> +    The arguments arg2, arg3, arg4, and arg5 are ignored.
>>>>> +
>>>>> +The ABI properties set by the mechanisms described above are inherited by
>>>>> threads
>>>>> +of the same application and fork()'ed children but cleared by execve().
>>>>> +
>>>>> +When a process has successfully opted into the new ABI by invoking
>>>>> +PR_SET_TAGGED_ADDR_CTRL prctl(), this guarantees the following behaviours:
>>>>> +
>>>>> + - Every currently available syscall, except the cases mentioned in section
>>>>> 3, can
>>>>> +   accept any valid tagged pointer. The same rule is applicable to any syscall
>>>>> +   introduced in the future.
>>>> I thought Catalin wanted to drop this guarantee?
>>>>
>>> The guarantee is changed and explicitly includes the syscalls that can be added
>>> in the future. IMHO since we are defining an ABI, we cannot leave that topic in
>>> an uncharted territory, we need to address it.
>> It makes sense to me, just wanted to be sure that Catalin is on the same page.
>>
>>>>> + - If a non valid tagged pointer is passed to a syscall then the behaviour
>>>>> +   is undefined.
>>>>> + - Every valid tagged pointer is expected to work as an untagged one.
>>>>> + - The kernel preserves any valid tagged pointer and returns it to the
>>>>> +   userspace unchanged (i.e. on syscall return) in all the cases except the
>>>>> +   ones documented in the "Preserving tags" section of tagged-pointers.txt.
>>>>> +
>>>>> +A definition of the meaning of tagged pointers on arm64 can be found in:
>>>>> +Documentation/arm64/tagged-pointers.txt.
>>>>> +
>>>>> +3. ARM64 Tagged Address ABI Exceptions
>>>>> +--------------------------------------
>>>>> +
>>>>> +The behaviours described in section 2, with particular reference to the
>>>>> +acceptance by the syscalls of any valid tagged pointer are not applicable
>>>>> +to the following cases:
>>>>> +
>>>>> + - mmap() addr parameter.
>>>>> + - mremap() new_address parameter.
>>>>> + - prctl(PR_SET_MM, PR_SET_MM_MAP, ...) struct prctl_mm_map fields.
>>>>> + - prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE, ...) struct prctl_mm_map fields.
>>>> All the PR_SET_MM options that specify pointers (PR_SET_MM_START_CODE,
>>>> PR_SET_MM_END_CODE, ...) should be excluded as well. AFAICT (but don't take my
>>>> word for it), that's all of them except PR_SET_MM_EXE_FILE. Conversely,
>>>> PR_SET_MM_MAP_SIZE should not be excluded (it does not pass a prctl_mm_map
>>>> struct, and the pointer to unsigned int can be tagged).
>>>>
>>> Agreed, I clearly misread the prctl() man page here. Fill fix in v7.
>>> PR_SET_MM_MAP_SIZE _returns_  struct prctl_mm_map, does not take it as a
>>> parameter.
>> OK. About PR_SET_MM_MAP_SIZE, it neither takes nor returns struct prctl_mm_map.
>> It writes the size of prctl_map to the int pointed to by arg3, and does nothing
>> else. Therefore, there's no need to exclude it.
>>
> Agreed, I missed the word size in my reply: s/_returns_  struct
> prctl_mm_map/_returns_  the size of struct prctl_mm_map/
>
>> BTW I've just realised that the man page is wrong about PR_SET_MM_MAP_SIZE, the
>> pointer to int is passed in arg3, not arg4. Anyone knows where to report that?
>>
>> Thanks,
>> Kevin
>>
>>> Vincenzo
>>>
>>>> Kevin
>>>>
>>>>> +
>>>>> +Any attempt to use non-zero tagged pointers will lead to undefined behaviour.
>>>>> +
>>>>> +4. Example of correct usage
>>>>> +---------------------------
>>>>> +.. code-block:: c
>>>>> +
>>>>> +   void main(void)
>>>>> +   {
>>>>> +           static int tbi_enabled = 0;
>>>>> +           unsigned long tag = 0;
>>>>> +
>>>>> +           char *ptr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
>>>>> +                            MAP_ANONYMOUS, -1, 0);
>>>>> +
>>>>> +           if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
>>>>> +                     0, 0, 0) == 0)
>>>>> +                   tbi_enabled = 1;
>>>>> +
>>>>> +           if (ptr == (void *)-1) /* MAP_FAILED */
>>>>> +                   return -1;
>>>>> +
>>>>> +           if (tbi_enabled)
>>>>> +                   tag = rand() & 0xff;
>>>>> +
>>>>> +           ptr = (char *)((unsigned long)ptr | (tag << TAG_SHIFT));
>>>>> +
>>>>> +           *ptr = 'a';
>>>>> +
>>>>> +           ...
>>>>> +   }
>>>>> +
>>>> _______________________________________________
>>>> linux-arm-kernel mailing list
>>>> linux-arm-kernel@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


^ permalink raw reply

* Re: [PATCH v6 1/2] arm64: Define Documentation/arm64/tagged-address-abi.rst
From: Vincenzo Frascino @ 2019-07-30 14:24 UTC (permalink / raw)
  To: Kevin Brodsky, linux-arm-kernel, linux-doc, linux-mm, linux-arch,
	linux-kselftest, linux-kernel
  Cc: Szabolcs Nagy, Catalin Marinas, Will Deacon, Andrey Konovalov
In-Reply-To: <6eba1250-c0a2-0a51-c8c2-0e77e6241f29@arm.com>

Hi Kevin,

On 7/30/19 2:57 PM, Kevin Brodsky wrote:
> On 30/07/2019 14:25, Vincenzo Frascino wrote:
>> Hi Kevin,
>>
>> On 7/30/19 11:32 AM, Kevin Brodsky wrote:
>>> Some more comments. Mostly minor wording issues, except the prctl() exclusion at
>>> the end.
>>>
>>> On 25/07/2019 14:50, Vincenzo Frascino wrote:
>>>> On arm64 the TCR_EL1.TBI0 bit has been always enabled hence
>>>> the userspace (EL0) is allowed to set a non-zero value in the
>>>> top byte but the resulting pointers are not allowed at the
>>>> user-kernel syscall ABI boundary.
>>>>
>>>> With the relaxed ABI proposed through this document, it is now possible
>>>> to pass tagged pointers to the syscalls, when these pointers are in
>>>> memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap().
>>>>
>>>> This change in the ABI requires a mechanism to requires the userspace
>>>> to opt-in to such an option.
>>>>
>>>> Specify and document the way in which sysctl and prctl() can be used
>>>> in combination to allow the userspace to opt-in this feature.
>>>>
>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>>> Cc: Will Deacon <will.deacon@arm.com>
>>>> CC: Andrey Konovalov <andreyknvl@google.com>
>>>> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
>>>> Acked-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
>>>> ---
>>>>    Documentation/arm64/tagged-address-abi.rst | 148 +++++++++++++++++++++
>>>>    1 file changed, 148 insertions(+)
>>>>    create mode 100644 Documentation/arm64/tagged-address-abi.rst
>>>>
>>>> diff --git a/Documentation/arm64/tagged-address-abi.rst
>>>> b/Documentation/arm64/tagged-address-abi.rst
>>>> new file mode 100644
>>>> index 000000000000..a8ecb991de82
>>>> --- /dev/null
>>>> +++ b/Documentation/arm64/tagged-address-abi.rst
>>>> @@ -0,0 +1,148 @@
>>>> +========================
>>>> +ARM64 TAGGED ADDRESS ABI
>>>> +========================
>>>> +
>>>> +Author: Vincenzo Frascino <vincenzo.frascino@arm.com>
>>>> +
>>>> +Date: 25 July 2019
>>>> +
>>>> +This document describes the usage and semantics of the Tagged Address
>>>> +ABI on arm64.
>>>> +
>>>> +1. Introduction
>>>> +---------------
>>>> +
>>>> +On arm64 the TCR_EL1.TBI0 bit has always been enabled on the kernel, hence
>>>> +the userspace (EL0) is entitled to perform a user memory access through a
>>>> +64-bit pointer with a non-zero top byte but the resulting pointers are not
>>>> +allowed at the user-kernel syscall ABI boundary.
>>>> +
>>>> +This document describes a relaxation of the ABI that makes it possible to
>>>> +to pass tagged pointers to the syscalls, when these pointers are in memory
>>> One too many "to" (at the end the previous line).
>>>
>> Yep will fix in v7.
>>
>>>> +ranges obtained as described in section 2.
>>>> +
>>>> +Since it is not desirable to relax the ABI to allow tagged user addresses
>>>> +into the kernel indiscriminately, arm64 provides a new sysctl interface
>>>> +(/proc/sys/abi/tagged_addr) that is used to prevent the applications from
>>>> +enabling the relaxed ABI and a new prctl() interface that can be used to
>>>> +enable or disable the relaxed ABI.
>>>> +A detailed description of the newly introduced mechanisms will be provided
>>>> +in section 2.
>>>> +
>>>> +2. ARM64 Tagged Address ABI
>>>> +---------------------------
>>>> +
>>>> +From the kernel syscall interface perspective, we define, for the purposes
>>>> +of this document, a "valid tagged pointer" as a pointer that either has a
>>>> +zero value set in the top byte or has a non-zero value, is in memory ranges
>>>> +privately owned by a userspace process and is obtained in one of the
>>>> +following ways:
>>>> +- mmap() done by the process itself, where either:
>>>> +
>>>> +  - flags have **MAP_PRIVATE** and **MAP_ANONYMOUS**
>>>> +  - flags have **MAP_PRIVATE** and the file descriptor refers to a regular
>>>> +    file or **/dev/zero**
>>>> +
>>>> +- brk() system call done by the process itself (i.e. the heap area between
>>>> +  the initial location of the program break at process creation and its
>>>> +  current location).
>>>> +- any memory mapped by the kernel in the process's address space during
>>>> +  creation and with the same restrictions as for mmap() (e.g. data, bss,
>>>> +  stack).
>>>> +
>>>> +The ARM64 Tagged Address ABI is an opt-in feature, and an application can
>>>> +control it using the following:
>>>> +
>>>> +- **/proc/sys/abi/tagged_addr**: a new sysctl interface that can be used to
>>>> +  prevent the applications from enabling the access to the relaxed ABI.
>>>> +  The sysctl supports the following configuration options:
>>>> +
>>>> +  - **0**: Disable the access to the ARM64 Tagged Address ABI for all
>>>> +    the applications.
>>>> +  - **1** (Default): Enable the access to the ARM64 Tagged Address ABI for
>>>> +    all the applications.
>>>> +
>>>> +   If the access to the ARM64 Tagged Address ABI is disabled at a certain
>>>> +   point in time, all the applications that were using tagging before this
>>>> +   event occurs, will continue to use tagging.
>>> "tagging" may be misinterpreted here. I would be more explicit by saying that
>>> the tagged address ABI remains enabled in processes that opted in before the
>>> access got disabled.
>>>
>> Assuming that ARM64 Tagged Address ABI gives access to "tagging" and since it is
>> what this document is talking about, I do not see how it can be misinterpreted ;)
> 
> "tagging" is a confusing term ("using tagging" even more so), it could be
> interpreted as memory tagging (especially in the presence of MTE). This document
> does not use "tagging" anywhere else, which is good. Let's stick to the same
> name for the ABI throughout the document, repetition is less problematic than
> vague wording.
> 

This document does not cover MTE, it covers the "ARM64 Tagged Address ABI" hence
"tagging" has a precise semantical meaning in this context. Still I do not see
how it can be confused.

>>
>>>> +- **prctl()s**:
>>>> +
>>>> +  - **PR_SET_TAGGED_ADDR_CTRL**: Invoked by a process, can be used to
>>>> enable or
>>>> +    disable its access to the ARM64 Tagged Address ABI.
>>> I still find the wording confusing, because "access to the ABI" is not used
>>> consistently. The "tagged_addr" sysctl enables *access to the ABI*, that's fine.
>>> However, PR_SET_TAGGED_ADDR_CTRL enables *the ABI itself* (which is only
>>> possible if access to the ABI is enabled).
>>>
>> As it stands, it enables or disables the ABI itself when used with
>> PR_TAGGED_ADDR_ENABLE, or can enable other things in future. IMHO the only thing
>> that these features have in common is the access to the ABI which is granted by
>> this prctl().
> 
> I see your point, you could have other bits controlling other aspects. However,
> I would really avoid saying that this prctl is used to enable or disable access
> to the new ABI, because it isn't (either you have access to the new ABI and this
> prctl can be used, or you don't and this prctl will fail).
> 

What is the system wide evidence that the access to the ABI is denied? Or what
is the system wide evidence that it is granted?

In other words, is it enough for a process to have the sysctl set (system wide)
to know that the the ABI is enabled and have granted access to it? or does it
need to do something else?

>>
>>>> +
>>>> +    The (unsigned int) arg2 argument is a bit mask describing the control mode
>>>> +    used:
>>>> +
>>>> +    - **PR_TAGGED_ADDR_ENABLE**: Enable ARM64 Tagged Address ABI.
>>>> +
>>>> +    The prctl(PR_SET_TAGGED_ADDR_CTRL, ...) will return -EINVAL if the ARM64
>>>> +    Tagged Address ABI is not available.
>>> For clarity, it would be good to mention that one possible reason for the ABI
>>> not to be available is tagged_addr == 0.
>>>
>> The logical implication is already quite clear tagged_addr == 0 (Disabled) =>
>> Tagged Address ABI not available => return -EINVAL. I do not see the need to
>> repeat the concept twice.
>>
>>>> +
>>>> +    The arguments arg3, arg4, and arg5 are ignored.
>>>> +  - **PR_GET_TAGGED_ADDR_CTRL**: can be used to check the status of the Tagged
>>>> +    Address ABI.
>>>> +
>>>> +    The arguments arg2, arg3, arg4, and arg5 are ignored.
>>>> +
>>>> +The ABI properties set by the mechanisms described above are inherited by
>>>> threads
>>>> +of the same application and fork()'ed children but cleared by execve().
>>>> +
>>>> +When a process has successfully opted into the new ABI by invoking
>>>> +PR_SET_TAGGED_ADDR_CTRL prctl(), this guarantees the following behaviours:
>>>> +
>>>> + - Every currently available syscall, except the cases mentioned in section
>>>> 3, can
>>>> +   accept any valid tagged pointer. The same rule is applicable to any syscall
>>>> +   introduced in the future.
>>> I thought Catalin wanted to drop this guarantee?
>>>
>> The guarantee is changed and explicitly includes the syscalls that can be added
>> in the future. IMHO since we are defining an ABI, we cannot leave that topic in
>> an uncharted territory, we need to address it.
> 
> It makes sense to me, just wanted to be sure that Catalin is on the same page.
> 
>>
>>>> + - If a non valid tagged pointer is passed to a syscall then the behaviour
>>>> +   is undefined.
>>>> + - Every valid tagged pointer is expected to work as an untagged one.
>>>> + - The kernel preserves any valid tagged pointer and returns it to the
>>>> +   userspace unchanged (i.e. on syscall return) in all the cases except the
>>>> +   ones documented in the "Preserving tags" section of tagged-pointers.txt.
>>>> +
>>>> +A definition of the meaning of tagged pointers on arm64 can be found in:
>>>> +Documentation/arm64/tagged-pointers.txt.
>>>> +
>>>> +3. ARM64 Tagged Address ABI Exceptions
>>>> +--------------------------------------
>>>> +
>>>> +The behaviours described in section 2, with particular reference to the
>>>> +acceptance by the syscalls of any valid tagged pointer are not applicable
>>>> +to the following cases:
>>>> +
>>>> + - mmap() addr parameter.
>>>> + - mremap() new_address parameter.
>>>> + - prctl(PR_SET_MM, PR_SET_MM_MAP, ...) struct prctl_mm_map fields.
>>>> + - prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE, ...) struct prctl_mm_map fields.
>>> All the PR_SET_MM options that specify pointers (PR_SET_MM_START_CODE,
>>> PR_SET_MM_END_CODE, ...) should be excluded as well. AFAICT (but don't take my
>>> word for it), that's all of them except PR_SET_MM_EXE_FILE. Conversely,
>>> PR_SET_MM_MAP_SIZE should not be excluded (it does not pass a prctl_mm_map
>>> struct, and the pointer to unsigned int can be tagged).
>>>
>> Agreed, I clearly misread the prctl() man page here. Fill fix in v7.
>> PR_SET_MM_MAP_SIZE _returns_  struct prctl_mm_map, does not take it as a
>> parameter.
> 
> OK. About PR_SET_MM_MAP_SIZE, it neither takes nor returns struct prctl_mm_map.
> It writes the size of prctl_map to the int pointed to by arg3, and does nothing
> else. Therefore, there's no need to exclude it.
> 

Agreed, I missed the word size in my reply: s/_returns_  struct
prctl_mm_map/_returns_  the size of struct prctl_mm_map/

> BTW I've just realised that the man page is wrong about PR_SET_MM_MAP_SIZE, the
> pointer to int is passed in arg3, not arg4. Anyone knows where to report that?
> 
> Thanks,
> Kevin
> 
>> Vincenzo
>>
>>> Kevin
>>>
>>>> +
>>>> +Any attempt to use non-zero tagged pointers will lead to undefined behaviour.
>>>> +
>>>> +4. Example of correct usage
>>>> +---------------------------
>>>> +.. code-block:: c
>>>> +
>>>> +   void main(void)
>>>> +   {
>>>> +           static int tbi_enabled = 0;
>>>> +           unsigned long tag = 0;
>>>> +
>>>> +           char *ptr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
>>>> +                            MAP_ANONYMOUS, -1, 0);
>>>> +
>>>> +           if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
>>>> +                     0, 0, 0) == 0)
>>>> +                   tbi_enabled = 1;
>>>> +
>>>> +           if (ptr == (void *)-1) /* MAP_FAILED */
>>>> +                   return -1;
>>>> +
>>>> +           if (tbi_enabled)
>>>> +                   tag = rand() & 0xff;
>>>> +
>>>> +           ptr = (char *)((unsigned long)ptr | (tag << TAG_SHIFT));
>>>> +
>>>> +           *ptr = 'a';
>>>> +
>>>> +           ...
>>>> +   }
>>>> +
>>> _______________________________________________
>>> linux-arm-kernel mailing list
>>> linux-arm-kernel@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

-- 
Regards,
Vincenzo

^ permalink raw reply

* Re: [PATCH v6 1/2] arm64: Define Documentation/arm64/tagged-address-abi.rst
From: Kevin Brodsky @ 2019-07-30 13:57 UTC (permalink / raw)
  To: Vincenzo Frascino, linux-arm-kernel, linux-doc, linux-mm,
	linux-arch, linux-kselftest, linux-kernel
  Cc: Szabolcs Nagy, Catalin Marinas, Will Deacon, Andrey Konovalov
In-Reply-To: <c45df19e-8f48-7f4e-3eae-ada54cb6f707@arm.com>

On 30/07/2019 14:25, Vincenzo Frascino wrote:
> Hi Kevin,
>
> On 7/30/19 11:32 AM, Kevin Brodsky wrote:
>> Some more comments. Mostly minor wording issues, except the prctl() exclusion at
>> the end.
>>
>> On 25/07/2019 14:50, Vincenzo Frascino wrote:
>>> On arm64 the TCR_EL1.TBI0 bit has been always enabled hence
>>> the userspace (EL0) is allowed to set a non-zero value in the
>>> top byte but the resulting pointers are not allowed at the
>>> user-kernel syscall ABI boundary.
>>>
>>> With the relaxed ABI proposed through this document, it is now possible
>>> to pass tagged pointers to the syscalls, when these pointers are in
>>> memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap().
>>>
>>> This change in the ABI requires a mechanism to requires the userspace
>>> to opt-in to such an option.
>>>
>>> Specify and document the way in which sysctl and prctl() can be used
>>> in combination to allow the userspace to opt-in this feature.
>>>
>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: Will Deacon <will.deacon@arm.com>
>>> CC: Andrey Konovalov <andreyknvl@google.com>
>>> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
>>> Acked-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
>>> ---
>>>    Documentation/arm64/tagged-address-abi.rst | 148 +++++++++++++++++++++
>>>    1 file changed, 148 insertions(+)
>>>    create mode 100644 Documentation/arm64/tagged-address-abi.rst
>>>
>>> diff --git a/Documentation/arm64/tagged-address-abi.rst
>>> b/Documentation/arm64/tagged-address-abi.rst
>>> new file mode 100644
>>> index 000000000000..a8ecb991de82
>>> --- /dev/null
>>> +++ b/Documentation/arm64/tagged-address-abi.rst
>>> @@ -0,0 +1,148 @@
>>> +========================
>>> +ARM64 TAGGED ADDRESS ABI
>>> +========================
>>> +
>>> +Author: Vincenzo Frascino <vincenzo.frascino@arm.com>
>>> +
>>> +Date: 25 July 2019
>>> +
>>> +This document describes the usage and semantics of the Tagged Address
>>> +ABI on arm64.
>>> +
>>> +1. Introduction
>>> +---------------
>>> +
>>> +On arm64 the TCR_EL1.TBI0 bit has always been enabled on the kernel, hence
>>> +the userspace (EL0) is entitled to perform a user memory access through a
>>> +64-bit pointer with a non-zero top byte but the resulting pointers are not
>>> +allowed at the user-kernel syscall ABI boundary.
>>> +
>>> +This document describes a relaxation of the ABI that makes it possible to
>>> +to pass tagged pointers to the syscalls, when these pointers are in memory
>> One too many "to" (at the end the previous line).
>>
> Yep will fix in v7.
>
>>> +ranges obtained as described in section 2.
>>> +
>>> +Since it is not desirable to relax the ABI to allow tagged user addresses
>>> +into the kernel indiscriminately, arm64 provides a new sysctl interface
>>> +(/proc/sys/abi/tagged_addr) that is used to prevent the applications from
>>> +enabling the relaxed ABI and a new prctl() interface that can be used to
>>> +enable or disable the relaxed ABI.
>>> +A detailed description of the newly introduced mechanisms will be provided
>>> +in section 2.
>>> +
>>> +2. ARM64 Tagged Address ABI
>>> +---------------------------
>>> +
>>> +From the kernel syscall interface perspective, we define, for the purposes
>>> +of this document, a "valid tagged pointer" as a pointer that either has a
>>> +zero value set in the top byte or has a non-zero value, is in memory ranges
>>> +privately owned by a userspace process and is obtained in one of the
>>> +following ways:
>>> +- mmap() done by the process itself, where either:
>>> +
>>> +  - flags have **MAP_PRIVATE** and **MAP_ANONYMOUS**
>>> +  - flags have **MAP_PRIVATE** and the file descriptor refers to a regular
>>> +    file or **/dev/zero**
>>> +
>>> +- brk() system call done by the process itself (i.e. the heap area between
>>> +  the initial location of the program break at process creation and its
>>> +  current location).
>>> +- any memory mapped by the kernel in the process's address space during
>>> +  creation and with the same restrictions as for mmap() (e.g. data, bss,
>>> +  stack).
>>> +
>>> +The ARM64 Tagged Address ABI is an opt-in feature, and an application can
>>> +control it using the following:
>>> +
>>> +- **/proc/sys/abi/tagged_addr**: a new sysctl interface that can be used to
>>> +  prevent the applications from enabling the access to the relaxed ABI.
>>> +  The sysctl supports the following configuration options:
>>> +
>>> +  - **0**: Disable the access to the ARM64 Tagged Address ABI for all
>>> +    the applications.
>>> +  - **1** (Default): Enable the access to the ARM64 Tagged Address ABI for
>>> +    all the applications.
>>> +
>>> +   If the access to the ARM64 Tagged Address ABI is disabled at a certain
>>> +   point in time, all the applications that were using tagging before this
>>> +   event occurs, will continue to use tagging.
>> "tagging" may be misinterpreted here. I would be more explicit by saying that
>> the tagged address ABI remains enabled in processes that opted in before the
>> access got disabled.
>>
> Assuming that ARM64 Tagged Address ABI gives access to "tagging" and since it is
> what this document is talking about, I do not see how it can be misinterpreted ;)

"tagging" is a confusing term ("using tagging" even more so), it could be interpreted 
as memory tagging (especially in the presence of MTE). This document does not use 
"tagging" anywhere else, which is good. Let's stick to the same name for the ABI 
throughout the document, repetition is less problematic than vague wording.

>
>>> +- **prctl()s**:
>>> +
>>> +  - **PR_SET_TAGGED_ADDR_CTRL**: Invoked by a process, can be used to enable or
>>> +    disable its access to the ARM64 Tagged Address ABI.
>> I still find the wording confusing, because "access to the ABI" is not used
>> consistently. The "tagged_addr" sysctl enables *access to the ABI*, that's fine.
>> However, PR_SET_TAGGED_ADDR_CTRL enables *the ABI itself* (which is only
>> possible if access to the ABI is enabled).
>>
> As it stands, it enables or disables the ABI itself when used with
> PR_TAGGED_ADDR_ENABLE, or can enable other things in future. IMHO the only thing
> that these features have in common is the access to the ABI which is granted by
> this prctl().

I see your point, you could have other bits controlling other aspects. However, I 
would really avoid saying that this prctl is used to enable or disable access to the 
new ABI, because it isn't (either you have access to the new ABI and this prctl can 
be used, or you don't and this prctl will fail).

>
>>> +
>>> +    The (unsigned int) arg2 argument is a bit mask describing the control mode
>>> +    used:
>>> +
>>> +    - **PR_TAGGED_ADDR_ENABLE**: Enable ARM64 Tagged Address ABI.
>>> +
>>> +    The prctl(PR_SET_TAGGED_ADDR_CTRL, ...) will return -EINVAL if the ARM64
>>> +    Tagged Address ABI is not available.
>> For clarity, it would be good to mention that one possible reason for the ABI
>> not to be available is tagged_addr == 0.
>>
> The logical implication is already quite clear tagged_addr == 0 (Disabled) =>
> Tagged Address ABI not available => return -EINVAL. I do not see the need to
> repeat the concept twice.
>
>>> +
>>> +    The arguments arg3, arg4, and arg5 are ignored.
>>> +  - **PR_GET_TAGGED_ADDR_CTRL**: can be used to check the status of the Tagged
>>> +    Address ABI.
>>> +
>>> +    The arguments arg2, arg3, arg4, and arg5 are ignored.
>>> +
>>> +The ABI properties set by the mechanisms described above are inherited by
>>> threads
>>> +of the same application and fork()'ed children but cleared by execve().
>>> +
>>> +When a process has successfully opted into the new ABI by invoking
>>> +PR_SET_TAGGED_ADDR_CTRL prctl(), this guarantees the following behaviours:
>>> +
>>> + - Every currently available syscall, except the cases mentioned in section
>>> 3, can
>>> +   accept any valid tagged pointer. The same rule is applicable to any syscall
>>> +   introduced in the future.
>> I thought Catalin wanted to drop this guarantee?
>>
> The guarantee is changed and explicitly includes the syscalls that can be added
> in the future. IMHO since we are defining an ABI, we cannot leave that topic in
> an uncharted territory, we need to address it.

It makes sense to me, just wanted to be sure that Catalin is on the same page.

>
>>> + - If a non valid tagged pointer is passed to a syscall then the behaviour
>>> +   is undefined.
>>> + - Every valid tagged pointer is expected to work as an untagged one.
>>> + - The kernel preserves any valid tagged pointer and returns it to the
>>> +   userspace unchanged (i.e. on syscall return) in all the cases except the
>>> +   ones documented in the "Preserving tags" section of tagged-pointers.txt.
>>> +
>>> +A definition of the meaning of tagged pointers on arm64 can be found in:
>>> +Documentation/arm64/tagged-pointers.txt.
>>> +
>>> +3. ARM64 Tagged Address ABI Exceptions
>>> +--------------------------------------
>>> +
>>> +The behaviours described in section 2, with particular reference to the
>>> +acceptance by the syscalls of any valid tagged pointer are not applicable
>>> +to the following cases:
>>> +
>>> + - mmap() addr parameter.
>>> + - mremap() new_address parameter.
>>> + - prctl(PR_SET_MM, PR_SET_MM_MAP, ...) struct prctl_mm_map fields.
>>> + - prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE, ...) struct prctl_mm_map fields.
>> All the PR_SET_MM options that specify pointers (PR_SET_MM_START_CODE,
>> PR_SET_MM_END_CODE, ...) should be excluded as well. AFAICT (but don't take my
>> word for it), that's all of them except PR_SET_MM_EXE_FILE. Conversely,
>> PR_SET_MM_MAP_SIZE should not be excluded (it does not pass a prctl_mm_map
>> struct, and the pointer to unsigned int can be tagged).
>>
> Agreed, I clearly misread the prctl() man page here. Fill fix in v7.
> PR_SET_MM_MAP_SIZE _returns_  struct prctl_mm_map, does not take it as a parameter.

OK. About PR_SET_MM_MAP_SIZE, it neither takes nor returns struct prctl_mm_map. It 
writes the size of prctl_map to the int pointed to by arg3, and does nothing else. 
Therefore, there's no need to exclude it.

BTW I've just realised that the man page is wrong about PR_SET_MM_MAP_SIZE, the 
pointer to int is passed in arg3, not arg4. Anyone knows where to report that?

Thanks,
Kevin

> Vincenzo
>
>> Kevin
>>
>>> +
>>> +Any attempt to use non-zero tagged pointers will lead to undefined behaviour.
>>> +
>>> +4. Example of correct usage
>>> +---------------------------
>>> +.. code-block:: c
>>> +
>>> +   void main(void)
>>> +   {
>>> +           static int tbi_enabled = 0;
>>> +           unsigned long tag = 0;
>>> +
>>> +           char *ptr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
>>> +                            MAP_ANONYMOUS, -1, 0);
>>> +
>>> +           if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
>>> +                     0, 0, 0) == 0)
>>> +                   tbi_enabled = 1;
>>> +
>>> +           if (ptr == (void *)-1) /* MAP_FAILED */
>>> +                   return -1;
>>> +
>>> +           if (tbi_enabled)
>>> +                   tag = rand() & 0xff;
>>> +
>>> +           ptr = (char *)((unsigned long)ptr | (tag << TAG_SHIFT));
>>> +
>>> +           *ptr = 'a';
>>> +
>>> +           ...
>>> +   }
>>> +
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


^ permalink raw reply

* Re: [PATCH v6 1/2] arm64: Define Documentation/arm64/tagged-address-abi.rst
From: Vincenzo Frascino @ 2019-07-30 13:25 UTC (permalink / raw)
  To: Kevin Brodsky, linux-arm-kernel, linux-doc, linux-mm, linux-arch,
	linux-kselftest, linux-kernel
  Cc: Szabolcs Nagy, Catalin Marinas, Will Deacon, Andrey Konovalov
In-Reply-To: <52fa2cfc-f7a6-af6f-0dc2-f9ea0e41ac3c@arm.com>

Hi Kevin,

On 7/30/19 11:32 AM, Kevin Brodsky wrote:
> Some more comments. Mostly minor wording issues, except the prctl() exclusion at
> the end.
> 
> On 25/07/2019 14:50, Vincenzo Frascino wrote:
>> On arm64 the TCR_EL1.TBI0 bit has been always enabled hence
>> the userspace (EL0) is allowed to set a non-zero value in the
>> top byte but the resulting pointers are not allowed at the
>> user-kernel syscall ABI boundary.
>>
>> With the relaxed ABI proposed through this document, it is now possible
>> to pass tagged pointers to the syscalls, when these pointers are in
>> memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap().
>>
>> This change in the ABI requires a mechanism to requires the userspace
>> to opt-in to such an option.
>>
>> Specify and document the way in which sysctl and prctl() can be used
>> in combination to allow the userspace to opt-in this feature.
>>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> CC: Andrey Konovalov <andreyknvl@google.com>
>> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
>> Acked-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
>> ---
>>   Documentation/arm64/tagged-address-abi.rst | 148 +++++++++++++++++++++
>>   1 file changed, 148 insertions(+)
>>   create mode 100644 Documentation/arm64/tagged-address-abi.rst
>>
>> diff --git a/Documentation/arm64/tagged-address-abi.rst
>> b/Documentation/arm64/tagged-address-abi.rst
>> new file mode 100644
>> index 000000000000..a8ecb991de82
>> --- /dev/null
>> +++ b/Documentation/arm64/tagged-address-abi.rst
>> @@ -0,0 +1,148 @@
>> +========================
>> +ARM64 TAGGED ADDRESS ABI
>> +========================
>> +
>> +Author: Vincenzo Frascino <vincenzo.frascino@arm.com>
>> +
>> +Date: 25 July 2019
>> +
>> +This document describes the usage and semantics of the Tagged Address
>> +ABI on arm64.
>> +
>> +1. Introduction
>> +---------------
>> +
>> +On arm64 the TCR_EL1.TBI0 bit has always been enabled on the kernel, hence
>> +the userspace (EL0) is entitled to perform a user memory access through a
>> +64-bit pointer with a non-zero top byte but the resulting pointers are not
>> +allowed at the user-kernel syscall ABI boundary.
>> +
>> +This document describes a relaxation of the ABI that makes it possible to
>> +to pass tagged pointers to the syscalls, when these pointers are in memory
> 
> One too many "to" (at the end the previous line).
> 

Yep will fix in v7.

>> +ranges obtained as described in section 2.
>> +
>> +Since it is not desirable to relax the ABI to allow tagged user addresses
>> +into the kernel indiscriminately, arm64 provides a new sysctl interface
>> +(/proc/sys/abi/tagged_addr) that is used to prevent the applications from
>> +enabling the relaxed ABI and a new prctl() interface that can be used to
>> +enable or disable the relaxed ABI.
>> +A detailed description of the newly introduced mechanisms will be provided
>> +in section 2.
>> +
>> +2. ARM64 Tagged Address ABI
>> +---------------------------
>> +
>> +From the kernel syscall interface perspective, we define, for the purposes
>> +of this document, a "valid tagged pointer" as a pointer that either has a
>> +zero value set in the top byte or has a non-zero value, is in memory ranges
>> +privately owned by a userspace process and is obtained in one of the
>> +following ways:
>> +- mmap() done by the process itself, where either:
>> +
>> +  - flags have **MAP_PRIVATE** and **MAP_ANONYMOUS**
>> +  - flags have **MAP_PRIVATE** and the file descriptor refers to a regular
>> +    file or **/dev/zero**
>> +
>> +- brk() system call done by the process itself (i.e. the heap area between
>> +  the initial location of the program break at process creation and its
>> +  current location).
>> +- any memory mapped by the kernel in the process's address space during
>> +  creation and with the same restrictions as for mmap() (e.g. data, bss,
>> +  stack).
>> +
>> +The ARM64 Tagged Address ABI is an opt-in feature, and an application can
>> +control it using the following:
>> +
>> +- **/proc/sys/abi/tagged_addr**: a new sysctl interface that can be used to
>> +  prevent the applications from enabling the access to the relaxed ABI.
>> +  The sysctl supports the following configuration options:
>> +
>> +  - **0**: Disable the access to the ARM64 Tagged Address ABI for all
>> +    the applications.
>> +  - **1** (Default): Enable the access to the ARM64 Tagged Address ABI for
>> +    all the applications.
>> +
>> +   If the access to the ARM64 Tagged Address ABI is disabled at a certain
>> +   point in time, all the applications that were using tagging before this
>> +   event occurs, will continue to use tagging.
> 
> "tagging" may be misinterpreted here. I would be more explicit by saying that
> the tagged address ABI remains enabled in processes that opted in before the
> access got disabled.
> 

Assuming that ARM64 Tagged Address ABI gives access to "tagging" and since it is
what this document is talking about, I do not see how it can be misinterpreted ;)

>> +- **prctl()s**:
>> +
>> +  - **PR_SET_TAGGED_ADDR_CTRL**: Invoked by a process, can be used to enable or
>> +    disable its access to the ARM64 Tagged Address ABI.
> 
> I still find the wording confusing, because "access to the ABI" is not used
> consistently. The "tagged_addr" sysctl enables *access to the ABI*, that's fine.
> However, PR_SET_TAGGED_ADDR_CTRL enables *the ABI itself* (which is only
> possible if access to the ABI is enabled).
> 

As it stands, it enables or disables the ABI itself when used with
PR_TAGGED_ADDR_ENABLE, or can enable other things in future. IMHO the only thing
that these features have in common is the access to the ABI which is granted by
this prctl().

>> +
>> +    The (unsigned int) arg2 argument is a bit mask describing the control mode
>> +    used:
>> +
>> +    - **PR_TAGGED_ADDR_ENABLE**: Enable ARM64 Tagged Address ABI.
>> +
>> +    The prctl(PR_SET_TAGGED_ADDR_CTRL, ...) will return -EINVAL if the ARM64
>> +    Tagged Address ABI is not available.
> 
> For clarity, it would be good to mention that one possible reason for the ABI
> not to be available is tagged_addr == 0.
> 

The logical implication is already quite clear tagged_addr == 0 (Disabled) =>
Tagged Address ABI not available => return -EINVAL. I do not see the need to
repeat the concept twice.

>> +
>> +    The arguments arg3, arg4, and arg5 are ignored.
>> +  - **PR_GET_TAGGED_ADDR_CTRL**: can be used to check the status of the Tagged
>> +    Address ABI.
>> +
>> +    The arguments arg2, arg3, arg4, and arg5 are ignored.
>> +
>> +The ABI properties set by the mechanisms described above are inherited by
>> threads
>> +of the same application and fork()'ed children but cleared by execve().
>> +
>> +When a process has successfully opted into the new ABI by invoking
>> +PR_SET_TAGGED_ADDR_CTRL prctl(), this guarantees the following behaviours:
>> +
>> + - Every currently available syscall, except the cases mentioned in section
>> 3, can
>> +   accept any valid tagged pointer. The same rule is applicable to any syscall
>> +   introduced in the future.
> 
> I thought Catalin wanted to drop this guarantee?
> 

The guarantee is changed and explicitly includes the syscalls that can be added
in the future. IMHO since we are defining an ABI, we cannot leave that topic in
an uncharted territory, we need to address it.

>> + - If a non valid tagged pointer is passed to a syscall then the behaviour
>> +   is undefined.
>> + - Every valid tagged pointer is expected to work as an untagged one.
>> + - The kernel preserves any valid tagged pointer and returns it to the
>> +   userspace unchanged (i.e. on syscall return) in all the cases except the
>> +   ones documented in the "Preserving tags" section of tagged-pointers.txt.
>> +
>> +A definition of the meaning of tagged pointers on arm64 can be found in:
>> +Documentation/arm64/tagged-pointers.txt.
>> +
>> +3. ARM64 Tagged Address ABI Exceptions
>> +--------------------------------------
>> +
>> +The behaviours described in section 2, with particular reference to the
>> +acceptance by the syscalls of any valid tagged pointer are not applicable
>> +to the following cases:
>> +
>> + - mmap() addr parameter.
>> + - mremap() new_address parameter.
>> + - prctl(PR_SET_MM, PR_SET_MM_MAP, ...) struct prctl_mm_map fields.
>> + - prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE, ...) struct prctl_mm_map fields.
> 
> All the PR_SET_MM options that specify pointers (PR_SET_MM_START_CODE,
> PR_SET_MM_END_CODE, ...) should be excluded as well. AFAICT (but don't take my
> word for it), that's all of them except PR_SET_MM_EXE_FILE. Conversely,
> PR_SET_MM_MAP_SIZE should not be excluded (it does not pass a prctl_mm_map
> struct, and the pointer to unsigned int can be tagged).
> 

Agreed, I clearly misread the prctl() man page here. Fill fix in v7.
PR_SET_MM_MAP_SIZE _returns_  struct prctl_mm_map, does not take it as a parameter.

Vincenzo

> Kevin
> 
>> +
>> +Any attempt to use non-zero tagged pointers will lead to undefined behaviour.
>> +
>> +4. Example of correct usage
>> +---------------------------
>> +.. code-block:: c
>> +
>> +   void main(void)
>> +   {
>> +           static int tbi_enabled = 0;
>> +           unsigned long tag = 0;
>> +
>> +           char *ptr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
>> +                            MAP_ANONYMOUS, -1, 0);
>> +
>> +           if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
>> +                     0, 0, 0) == 0)
>> +                   tbi_enabled = 1;
>> +
>> +           if (ptr == (void *)-1) /* MAP_FAILED */
>> +                   return -1;
>> +
>> +           if (tbi_enabled)
>> +                   tag = rand() & 0xff;
>> +
>> +           ptr = (char *)((unsigned long)ptr | (tag << TAG_SHIFT));
>> +
>> +           *ptr = 'a';
>> +
>> +           ...
>> +   }
>> +
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
Regards,
Vincenzo

^ permalink raw reply

* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Joel Fernandes @ 2019-07-30 13:06 UTC (permalink / raw)
  To: LKML
  Cc: Alexey Dobriyan, Andrew Morton, Brendan Gregg, Christian Hansen,
	Daniel Colascione, Florian Mayer, John Dias, Joel Fernandes,
	Jonathan Corbet, Kees Cook, kernel-team, Linux API,
	open list:DOCUMENTATION, Linux FS Devel, linux-mm, Michal Hocko,
	Mike Rapoport, Minchan Kim, Namhyung Kim, Roman Gushchin,
	Stephen Rothwell, Suren Baghdasaryan, Todd Kjos, Vladimir Davydov,
	Vlastimil Babka, Wei Wang
In-Reply-To: <20190726152319.134152-1-joel@joelfernandes.org>

On Fri, Jul 26, 2019 at 11:23 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
>
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
>
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> ---
> v2->v3:
> Fixed a bug where I was doing a kfree that is not needed due to not
> needing to do GFP_ATOMIC allocations.
>
> v1->v2:
> Mark swap ptes as idle (Minchan)
> Avoid need for GFP_ATOMIC (Andrew)
> Get rid of idle_page_list lock by moving list to stack

I believe all suggestions have been addressed.  Do these look good now?

thanks,

 - Joel



> Internal review -> v1:
> Fixes from Suren.
> Corrections to change log, docs (Florian, Sandeep)
>
>  fs/proc/base.c            |   3 +
>  fs/proc/internal.h        |   1 +
>  fs/proc/task_mmu.c        |  57 +++++++
>  include/linux/page_idle.h |   4 +
>  mm/page_idle.c            | 340 +++++++++++++++++++++++++++++++++-----
>  5 files changed, 360 insertions(+), 45 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 77eb628ecc7f..a58dd74606e9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>         REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
>         REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox