Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Kirill A. Shutemov @ 2019-10-28 12:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <1572171452-7958-2-git-send-email-rppt@kernel.org>

On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
> 
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.

I probably blind, but I don't see where you manipulate direct map...

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
From: Linus Torvalds @ 2019-10-28 12:27 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-fsdevel, Linux-MM, Linux Kernel Mailing List, Linux API,
	Michal Hocko, Alexander Viro, Johannes Weiner, Andrew Morton,
	Roman Gushchin
In-Reply-To: <157225848971.557.16257813537984792761.stgit@buzz>

On Mon, Oct 28, 2019 at 11:28 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> This implements fcntl() for getting amount of resident memory in cache.
> Kernel already maintains counter for each inode, this patch just exposes
> it into userspace. Returned size is in kilobytes like values in procfs.

This doesn't actually explain why anybody would want it, and what the
usage scenario is.

             Linus

^ permalink raw reply

* Re: [PATCH ghak90 V7 04/21] audit: convert to contid list to check for orch/engine ownership
From: Neil Horman @ 2019-10-28 12:20 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, Paul Moore, sgrubb, omosnace,
	dhowells, simo, eparis, serge, ebiederm, dwalsh, mpatel
In-Reply-To: <20191025200019.vfd66aygccpf5yoe@madcap2.tricolour.ca>

On Fri, Oct 25, 2019 at 04:00:19PM -0400, Richard Guy Briggs wrote:
> On 2019-09-26 10:46, Neil Horman wrote:
> > On Wed, Sep 18, 2019 at 09:22:21PM -0400, Richard Guy Briggs wrote:
> > > Store the audit container identifier in a refcounted kernel object that
> > > is added to the master list of audit container identifiers.  This will
> > > allow multiple container orchestrators/engines to work on the same
> > > machine without danger of inadvertantly re-using an existing identifier.
> > > It will also allow an orchestrator to inject a process into an existing
> > > container by checking if the original container owner is the one
> > > injecting the task.  A hash table list is used to optimize searches.
> > > 
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > ---
> > >  include/linux/audit.h | 26 ++++++++++++++--
> > >  kernel/audit.c        | 86 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > >  kernel/audit.h        |  8 +++++
> > >  3 files changed, 112 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > > index f2e3b81f2942..e317807cdd3e 100644
> > > --- a/include/linux/audit.h
> > > +++ b/include/linux/audit.h
> > > @@ -95,10 +95,18 @@ struct audit_ntp_data {
> > >  struct audit_ntp_data {};
> > >  #endif
> > >  
> > > +struct audit_cont {
> > > +	struct list_head	list;
> > > +	u64			id;
> > > +	struct task_struct	*owner;
> > > +	refcount_t              refcount;
> > > +	struct rcu_head         rcu;
> > > +};
> > > +
> > >  struct audit_task_info {
> > >  	kuid_t			loginuid;
> > >  	unsigned int		sessionid;
> > > -	u64			contid;
> > > +	struct audit_cont	*cont;
> > >  #ifdef CONFIG_AUDITSYSCALL
> > >  	struct audit_context	*ctx;
> > >  #endif
> > > @@ -203,11 +211,15 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
> > >  
> > >  static inline u64 audit_get_contid(struct task_struct *tsk)
> > >  {
> > > -	if (!tsk->audit)
> > > +	if (!tsk->audit || !tsk->audit->cont)
> > >  		return AUDIT_CID_UNSET;
> > > -	return tsk->audit->contid;
> > > +	return tsk->audit->cont->id;
> > >  }
> > >  
> > > +extern struct audit_cont *audit_cont(struct task_struct *tsk);
> > > +
> > > +extern void audit_cont_put(struct audit_cont *cont);
> > > +
> > I see that you manual increment this refcount at various call sites, why
> > no corresponding audit_contid_hold function?
> 
> I was trying to avoid the get function due to having one site where I
> needed the pointer for later but didn't need a refcount to it so that I
> could release the refcount it if it was replaced by another cont object.
> A hold function would just contain one line that would call the
> refcount_inc().  If I did convert things over to a get function, it
> would hide some of this extra conditional code in the main calling
> function, but in one place I could just call put immediately to
> neutralize that unneeded refcount.
> 
Ok, but this pattern:

static inline u64 __audit_contid_get(struct audit_cont *c) {
        return c->id;
}

audit_contid_get(struct audit_cont *c) {
        refcount_hold(c)
        return __audit_contid_get(c)
}

Squares that up, doesn't it?  It gives you an internal non refcount
holding version then to use.

> Would you see any issue with that extra get/put refcount that would only
> happen in the case of changing a contid in a nesting situation?
> 
No, I personally wouldn't have an issue with it, but the above would
make it pretty readable I think

> > Neil
> > 
> > >  extern u32 audit_enabled;
> > >  
> > >  extern int audit_signal_info(int sig, struct task_struct *t);
> > > @@ -277,6 +289,14 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
> > >  	return AUDIT_CID_UNSET;
> > >  }
> > >  
> > > +static inline struct audit_cont *audit_cont(struct task_struct *tsk)
> > > +{
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline void audit_cont_put(struct audit_cont *cont)
> > > +{ }
> > > +
> > >  #define audit_enabled AUDIT_OFF
> > >  
> > >  static inline int audit_signal_info(int sig, struct task_struct *t)
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index a36ea57cbb61..ea0899130cc1 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -137,6 +137,8 @@ struct audit_net {
> > >  
> > >  /* Hash for inode-based rules */
> > >  struct list_head audit_inode_hash[AUDIT_INODE_BUCKETS];
> > > +/* Hash for contid-based rules */
> > > +struct list_head audit_contid_hash[AUDIT_CONTID_BUCKETS];
> > >  
> > >  static struct kmem_cache *audit_buffer_cache;
> > >  
> > > @@ -204,6 +206,8 @@ struct audit_reply {
> > >  
> > >  static struct kmem_cache *audit_task_cache;
> > >  
> > > +static DEFINE_SPINLOCK(audit_contid_list_lock);
> > > +
> > >  void __init audit_task_init(void)
> > >  {
> > >  	audit_task_cache = kmem_cache_create("audit_task",
> > > @@ -231,7 +235,9 @@ int audit_alloc(struct task_struct *tsk)
> > >  	}
> > >  	info->loginuid = audit_get_loginuid(current);
> > >  	info->sessionid = audit_get_sessionid(current);
> > > -	info->contid = audit_get_contid(current);
> > > +	info->cont = audit_cont(current);
> > > +	if (info->cont)
> > > +		refcount_inc(&info->cont->refcount);
> > >  	tsk->audit = info;
> > >  
> > >  	ret = audit_alloc_syscall(tsk);
> > > @@ -246,7 +252,7 @@ int audit_alloc(struct task_struct *tsk)
> > >  struct audit_task_info init_struct_audit = {
> > >  	.loginuid = INVALID_UID,
> > >  	.sessionid = AUDIT_SID_UNSET,
> > > -	.contid = AUDIT_CID_UNSET,
> > > +	.cont = NULL,
> > >  #ifdef CONFIG_AUDITSYSCALL
> > >  	.ctx = NULL,
> > >  #endif
> > > @@ -266,6 +272,9 @@ void audit_free(struct task_struct *tsk)
> > >  	/* Freeing the audit_task_info struct must be performed after
> > >  	 * audit_log_exit() due to need for loginuid and sessionid.
> > >  	 */
> > > +	spin_lock(&audit_contid_list_lock); 
> > > +	audit_cont_put(tsk->audit->cont);
> > > +	spin_unlock(&audit_contid_list_lock); 
> > >  	info = tsk->audit;
> > >  	tsk->audit = NULL;
> > >  	kmem_cache_free(audit_task_cache, info);
> > > @@ -1657,6 +1666,9 @@ static int __init audit_init(void)
> > >  	for (i = 0; i < AUDIT_INODE_BUCKETS; i++)
> > >  		INIT_LIST_HEAD(&audit_inode_hash[i]);
> > >  
> > > +	for (i = 0; i < AUDIT_CONTID_BUCKETS; i++)
> > > +		INIT_LIST_HEAD(&audit_contid_hash[i]);
> > > +
> > >  	mutex_init(&audit_cmd_mutex.lock);
> > >  	audit_cmd_mutex.owner = NULL;
> > >  
> > > @@ -2356,6 +2368,32 @@ int audit_signal_info(int sig, struct task_struct *t)
> > >  	return audit_signal_info_syscall(t);
> > >  }
> > >  
> > > +struct audit_cont *audit_cont(struct task_struct *tsk)
> > > +{
> > > +	if (!tsk->audit || !tsk->audit->cont)
> > > +		return NULL;
> > > +	return tsk->audit->cont;
> > > +}
> > > +
> > > +/* audit_contid_list_lock must be held by caller */
> > > +void audit_cont_put(struct audit_cont *cont)
> > > +{
> > > +	if (!cont)
> > > +		return;
> > > +	if (refcount_dec_and_test(&cont->refcount)) {
> > > +		put_task_struct(cont->owner);
> > > +		list_del_rcu(&cont->list);
> > > +		kfree_rcu(cont, rcu);
> > > +	}
> > > +}
> > > +
> > > +static struct task_struct *audit_cont_owner(struct task_struct *tsk)
> > > +{
> > > +	if (tsk->audit && tsk->audit->cont)
> > > +		return tsk->audit->cont->owner;
> > > +	return NULL;
> > > +}
> > > +
> > >  /*
> > >   * audit_set_contid - set current task's audit contid
> > >   * @task: target task
> > > @@ -2382,9 +2420,12 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > >  	}
> > >  	oldcontid = audit_get_contid(task);
> > >  	read_lock(&tasklist_lock);
> > > -	/* Don't allow the audit containerid to be unset */
> > > +	/* Don't allow the contid to be unset */
> > >  	if (!audit_contid_valid(contid))
> > >  		rc = -EINVAL;
> > > +	/* Don't allow the contid to be set to the same value again */
> > > +	else if (contid == oldcontid) {
> > > +		rc = -EADDRINUSE;
> > >  	/* if we don't have caps, reject */
> > >  	else if (!capable(CAP_AUDIT_CONTROL))
> > >  		rc = -EPERM;
> > > @@ -2397,8 +2438,43 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > >  	else if (audit_contid_set(task))
> > >  		rc = -ECHILD;
> > >  	read_unlock(&tasklist_lock);
> > > -	if (!rc)
> > > -		task->audit->contid = contid;
> > > +	if (!rc) {
> > > +		struct audit_cont *oldcont = audit_cont(task);
> > > +		struct audit_cont *cont = NULL;
> > > +		struct audit_cont *newcont = NULL;
> > > +		int h = audit_hash_contid(contid);
> > > +
> > > +		spin_lock(&audit_contid_list_lock);
> > > +		list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
> > > +			if (cont->id == contid) {
> > > +				/* task injection to existing container */
> > > +				if (current == cont->owner) {
> > > +					refcount_inc(&cont->refcount);
> > > +					newcont = cont;
> > > +				} else {
> > > +					rc = -ENOTUNIQ;
> > > +					goto conterror;
> > > +				}
> > > +			}
> > > +		if (!newcont) {
> > > +			newcont = kmalloc(sizeof(struct audit_cont), GFP_ATOMIC);
> > > +			if (newcont) {
> > > +				INIT_LIST_HEAD(&newcont->list);
> > > +				newcont->id = contid;
> > > +				get_task_struct(current);
> > > +				newcont->owner = current;
> > > +				refcount_set(&newcont->refcount, 1);
> > > +				list_add_rcu(&newcont->list, &audit_contid_hash[h]);
> > > +			} else {
> > > +				rc = -ENOMEM;
> > > +				goto conterror;
> > > +			}
> > > +		}
> > > +		task->audit->cont = newcont;
> > > +		audit_cont_put(oldcont);
> > > +conterror:
> > > +		spin_unlock(&audit_contid_list_lock);
> > > +	}
> > >  	task_unlock(task);
> > >  
> > >  	if (!audit_enabled)
> > > diff --git a/kernel/audit.h b/kernel/audit.h
> > > index 16bd03b88e0d..e4a31aa92dfe 100644
> > > --- a/kernel/audit.h
> > > +++ b/kernel/audit.h
> > > @@ -211,6 +211,14 @@ static inline int audit_hash_ino(u32 ino)
> > >  	return (ino & (AUDIT_INODE_BUCKETS-1));
> > >  }
> > >  
> > > +#define AUDIT_CONTID_BUCKETS	32
> > > +extern struct list_head audit_contid_hash[AUDIT_CONTID_BUCKETS];
> > > +
> > > +static inline int audit_hash_contid(u64 contid)
> > > +{
> > > +	return (contid & (AUDIT_CONTID_BUCKETS-1));
> > > +}
> > > +
> > >  /* Indicates that audit should log the full pathname. */
> > >  #define AUDIT_NAME_FULL -1
> > >  
> > > -- 
> > > 1.8.3.1
> > > 
> > > 
> 
> - RGB
> 
> --
> Richard Guy Briggs <rgb@redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635
> 
> 

^ permalink raw reply

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
From: Florian Weimer @ 2019-10-28 11:46 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin
In-Reply-To: <157225848971.557.16257813537984792761.stgit@buzz>

* Konstantin Khlebnikov:

> This implements fcntl() for getting amount of resident memory in cache.
> Kernel already maintains counter for each inode, this patch just exposes
> it into userspace. Returned size is in kilobytes like values in procfs.

I think this needs a 32-bit compat implementation which clamps the
returned value to INT_MAX.

^ permalink raw reply

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
From: Konstantin Khlebnikov @ 2019-10-28 11:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin
In-Reply-To: <20191028111034.GS2963@bombadil.infradead.org>

On 28/10/2019 14.10, Matthew Wilcox wrote:
> On Mon, Oct 28, 2019 at 01:28:09PM +0300, Konstantin Khlebnikov wrote:
>> +	if (dax_mapping(mapping))
>> +		pages = READ_ONCE(mapping->nrexceptional);
>> +	else
>> +		pages = READ_ONCE(mapping->nrpages);
> 
> I'm not sure this is the right calculation for DAX files.  We haven't
> allocated any memory for DAX; we're just accessing storage directly.
> The entries in the page caache are just translation from file offset to
> physical address.
> 

Yep, makes sense. If RSS declared as memory usage then this chunk must do
pages = READ_ONCE(mapping->nrpages) unconditionally and report 0 for DAX.

^ permalink raw reply

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
From: Matthew Wilcox @ 2019-10-28 11:10 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin
In-Reply-To: <157225848971.557.16257813537984792761.stgit@buzz>

On Mon, Oct 28, 2019 at 01:28:09PM +0300, Konstantin Khlebnikov wrote:
> +	if (dax_mapping(mapping))
> +		pages = READ_ONCE(mapping->nrexceptional);
> +	else
> +		pages = READ_ONCE(mapping->nrpages);

I'm not sure this is the right calculation for DAX files.  We haven't
allocated any memory for DAX; we're just accessing storage directly.
The entries in the page caache are just translation from file offset to
physical address.

^ permalink raw reply

* [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
From: Konstantin Khlebnikov @ 2019-10-28 10:28 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel, linux-api
  Cc: Michal Hocko, Alexander Viro, Johannes Weiner, Andrew Morton,
	Linus Torvalds, Roman Gushchin

This implements fcntl() for getting amount of resident memory in cache.
Kernel already maintains counter for each inode, this patch just exposes
it into userspace. Returned size is in kilobytes like values in procfs.

Alternatively this could be implemented via mapping file and collecting
map of cached pages with mincore(). Which is much slower and O(n*log n).

Syscall fincore() never was implemented in Linux.
This fcntl() covers one of its use-cases with minimal footprint.

Unlike to mincore() this fcntl counts all pages, including allocated but
not read yet (non-uptodate) and pages beyond end of file.

This employs same security model as mincore() and requires one of:
- file is opened for writing
- current user owns inode
- current user could open inode for writing

Usage:
resident_kb = fcntl(fd, F_GET_RSS);

Error codes:
-EINVAL		- not supported
-EPERM		- not writable / owner
-ENODATA	- special inode without cache

Notes:
Range of pages could be evicted from cache using POSIX_FADV_DONTNEED.
Populating with POSIX_FADV_WILLNEED is asynchronous and limited with
disk read_ahead_kb and max_sectors_kb. It seems most effective way to
read data into cache synchronously is a sendfile() into /dev/null.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/fcntl.c                 |   30 ++++++++++++++++++++++++++++++
 include/uapi/linux/fcntl.h |    5 +++++
 2 files changed, 35 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..b241d3c925db 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -25,6 +25,8 @@
 #include <linux/user_namespace.h>
 #include <linux/memfd.h>
 #include <linux/compat.h>
+#include <linux/dax.h>
+#include <linux/hugetlb.h>
 
 #include <linux/poll.h>
 #include <asm/siginfo.h>
@@ -319,6 +321,31 @@ static long fcntl_rw_hint(struct file *file, unsigned int cmd,
 	}
 }
 
+static long fcntl_get_rss(struct file *filp)
+{
+	struct address_space *mapping = filp->f_mapping;
+	unsigned long pages;
+
+	if (!mapping)
+		return -ENODATA;
+
+	/* The same limitations as for sys_mincore() */
+	if (!(filp->f_mode & FMODE_WRITE) &&
+	    !inode_owner_or_capable(mapping->host) &&
+	    inode_permission(mapping->host, MAY_WRITE))
+		return -EPERM;
+
+	if (dax_mapping(mapping))
+		pages = READ_ONCE(mapping->nrexceptional);
+	else
+		pages = READ_ONCE(mapping->nrpages);
+
+	if (is_file_hugepages(filp))
+		pages <<= huge_page_order(hstate_file(filp));
+
+	return pages << (PAGE_SHIFT - 10);	/* page -> kb */
+}
+
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		struct file *filp)
 {
@@ -426,6 +453,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_FILE_RW_HINT:
 		err = fcntl_rw_hint(filp, cmd, arg);
 		break;
+	case F_GET_RSS:
+		err = fcntl_get_rss(filp);
+		break;
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..d467f1dbfc67 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -54,6 +54,11 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/*
+ * Get amount of resident memory in file cache in kilobytes.
+ */
+#define F_GET_RSS		(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.

^ permalink raw reply related

* [PATCH v5 2/2] fpga: dfl: fme: add performance reporting support
From: Wu Hao @ 2019-10-28  3:31 UTC (permalink / raw)
  To: mdf, linux-fpga, linux-kernel
  Cc: linux-api, will, mark.rutland, atull, gregkh, Wu Hao, Luwei Kang,
	Xu Yilun
In-Reply-To: <1572233488-26533-1-git-send-email-hao.wu@intel.com>

This patch adds support for performance reporting private feature
for FPGA Management Engine (FME). Now it supports several different
performance counters, including 'basic', 'cache', 'fabric', 'vtd'
and 'vtd_sip'. It allows user to use standard linux tools to access
these performance counters.

e.g. List all events by "perf list"

  perf list | grep fme

  fme0/cache_read_hit/                         [Kernel PMU event]
  fme0/cache_read_miss/                        [Kernel PMU event]
  ...

  fme0/fab_mmio_read/                          [Kernel PMU event]
  fme0/fab_mmio_write/                         [Kernel PMU event]
  ...

  fme0/fab_port_mmio_read,portid=?/            [Kernel PMU event]
  fme0/fab_port_mmio_write,portid=?/           [Kernel PMU event]
  ...

  fme0/vtd_port_devtlb_1g_fill,portid=?/       [Kernel PMU event]
  fme0/vtd_port_devtlb_2m_fill,portid=?/       [Kernel PMU event]
  ...

  fme0/vtd_sip_iotlb_1g_hit/                   [Kernel PMU event]
  fme0/vtd_sip_iotlb_1g_miss/                  [Kernel PMU event]
  ...

  fme0/clock                                   [Kernel PMU event]
  ...

e.g. check increased counter value after run one application using
"perf stat" command.

 perf stat -e fme0/fab_mmio_read/,fme0/fab_mmio_write/, ./test

 Performance counter stats for './test':

                 1      fme0/fab_mmio_read/
                 2      fme0/fab_mmio_write/

       1.009496520 seconds time elapsed

Please note that fabric counters support both fab_* and fab_port_*, but
actually they are sharing one set of performance counters in hardware.
If user wants to monitor overall data events on fab_* then fab_port_*
can't be supported at the same time, see example below:

perf stat -e fme0/fab_mmio_read/,fme0/fab_port_mmio_write,portid=0/

 Performance counter stats for 'system wide':

                 0      fme0/fab_mmio_read/
   <not supported>      fme0/fab_port_mmio_write,portid=0/

       2.141064085 seconds time elapsed

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
---
v3: replace scnprintf with sprintf in sysfs interfaces.
    update sysfs doc kernel version and date.
    fix sysfs doc issue for fabric counter.
    refine PERF_OBJ_ATTR_* macro, doesn't count on __ATTR anymore.
    introduce PERF_OBJ_ATTR_F_* macro, as it needs to use different
    filenames for some of the sysfs attributes.
    remove kobject_del when destroy kobject, kobject_put is enough.
    do sysfs_remove_groups first when destroying perf_obj.
    WARN_ON_ONCE in case internal parms are wrong in read_*_count().
v4: rework this patch to use standard perf API as user interfaces.
v5: rebase and clean up.
---
 drivers/fpga/Kconfig        |   2 +-
 drivers/fpga/Makefile       |   1 +
 drivers/fpga/dfl-fme-main.c |   4 +
 drivers/fpga/dfl-fme-perf.c | 869 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/fpga/dfl-fme.h      |   2 +
 drivers/fpga/dfl.h          |   2 +
 6 files changed, 879 insertions(+), 1 deletion(-)
 create mode 100644 drivers/fpga/dfl-fme-perf.c

diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
index 72380e1..b2408a7 100644
--- a/drivers/fpga/Kconfig
+++ b/drivers/fpga/Kconfig
@@ -156,7 +156,7 @@ config FPGA_DFL
 
 config FPGA_DFL_FME
 	tristate "FPGA DFL FME Driver"
-	depends on FPGA_DFL && HWMON
+	depends on FPGA_DFL && HWMON && PERF_EVENTS
 	help
 	  The FPGA Management Engine (FME) is a feature device implemented
 	  under Device Feature List (DFL) framework. Select this option to
diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
index 4865b74..d8e21df 100644
--- a/drivers/fpga/Makefile
+++ b/drivers/fpga/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_FPGA_DFL_FME_REGION)	+= dfl-fme-region.o
 obj-$(CONFIG_FPGA_DFL_AFU)		+= dfl-afu.o
 
 dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o dfl-fme-error.o
+dfl-fme-objs += dfl-fme-perf.o
 dfl-afu-objs := dfl-afu-main.o dfl-afu-region.o dfl-afu-dma-region.o
 dfl-afu-objs += dfl-afu-error.o
 
diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
index 7c930e6..53a1447 100644
--- a/drivers/fpga/dfl-fme-main.c
+++ b/drivers/fpga/dfl-fme-main.c
@@ -580,6 +580,10 @@ static int fme_power_mgmt_init(struct platform_device *pdev,
 		.ops = &fme_power_mgmt_ops,
 	},
 	{
+		.id_table = fme_perf_id_table,
+		.ops = &fme_perf_ops,
+	},
+	{
 		.ops = NULL,
 	},
 };
diff --git a/drivers/fpga/dfl-fme-perf.c b/drivers/fpga/dfl-fme-perf.c
new file mode 100644
index 0000000..441acd8
--- /dev/null
+++ b/drivers/fpga/dfl-fme-perf.c
@@ -0,0 +1,869 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Driver for FPGA Management Engine (FME) Global Performance Reporting
+ *
+ * Copyright 2019 Intel Corporation, Inc.
+ *
+ * Authors:
+ *   Kang Luwei <luwei.kang@intel.com>
+ *   Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ *   Wu Hao <hao.wu@intel.com>
+ *   Xu Yilun <yilun.xu@intel.com>
+ *   Joseph Grecco <joe.grecco@intel.com>
+ *   Enno Luebbers <enno.luebbers@intel.com>
+ *   Tim Whisonant <tim.whisonant@intel.com>
+ *   Ananda Ravuri <ananda.ravuri@intel.com>
+ *   Mitchel, Henry <henry.mitchel@intel.com>
+ */
+
+#include <linux/perf_event.h>
+#include "dfl.h"
+#include "dfl-fme.h"
+
+/*
+ * Performance Counter Registers for Cache.
+ *
+ * Cache Events are listed below as CACHE_EVNT_*.
+ */
+#define CACHE_CTRL			0x8
+#define CACHE_RESET_CNTR		BIT_ULL(0)
+#define CACHE_FREEZE_CNTR		BIT_ULL(8)
+#define CACHE_CTRL_EVNT			GENMASK_ULL(19, 16)
+#define CACHE_EVNT_RD_HIT		0x0
+#define CACHE_EVNT_WR_HIT		0x1
+#define CACHE_EVNT_RD_MISS		0x2
+#define CACHE_EVNT_WR_MISS		0x3
+#define CACHE_EVNT_RSVD			0x4
+#define CACHE_EVNT_HOLD_REQ		0x5
+#define CACHE_EVNT_DATA_WR_PORT_CONTEN	0x6
+#define CACHE_EVNT_TAG_WR_PORT_CONTEN	0x7
+#define CACHE_EVNT_TX_REQ_STALL		0x8
+#define CACHE_EVNT_RX_REQ_STALL		0x9
+#define CACHE_EVNT_EVICTIONS		0xa
+#define CACHE_CHANNEL_SEL		BIT_ULL(20)
+#define CACHE_CHANNEL_RD		0
+#define CACHE_CHANNEL_WR		1
+#define CACHE_CNTR0			0x10
+#define CACHE_CNTR1			0x18
+#define CACHE_CNTR_EVNT_CNTR		GENMASK_ULL(47, 0)
+#define CACHE_CNTR_EVNT			GENMASK_ULL(63, 60)
+
+/*
+ * Performance Counter Registers for Fabric.
+ *
+ * Fabric Events are listed below as FAB_EVNT_*
+ */
+#define FAB_CTRL			0x20
+#define FAB_RESET_CNTR			BIT_ULL(0)
+#define FAB_FREEZE_CNTR			BIT_ULL(8)
+#define FAB_CTRL_EVNT			GENMASK_ULL(19, 16)
+#define FAB_EVNT_PCIE0_RD		0x0
+#define FAB_EVNT_PCIE0_WR		0x1
+#define FAB_EVNT_PCIE1_RD		0x2
+#define FAB_EVNT_PCIE1_WR		0x3
+#define FAB_EVNT_UPI_RD			0x4
+#define FAB_EVNT_UPI_WR			0x5
+#define FAB_EVNT_MMIO_RD		0x6
+#define FAB_EVNT_MMIO_WR		0x7
+#define FAB_PORT_ID			GENMASK_ULL(21, 20)
+#define FAB_PORT_FILTER			BIT_ULL(23)
+#define FAB_PORT_FILTER_DISABLE		0
+#define FAB_PORT_FILTER_ENABLE		1
+#define FAB_CNTR			0x28
+#define FAB_CNTR_EVNT_CNTR		GENMASK_ULL(59, 0)
+#define FAB_CNTR_EVNT			GENMASK_ULL(63, 60)
+
+/*
+ * Performance Counter Registers for Clock.
+ *
+ * Clock Counter can't be reset or frozen by SW.
+ */
+#define CLK_CNTR			0x30
+#define BASIC_EVNT_CLK			0x0
+
+/*
+ * Performance Counter Registers for IOMMU / VT-D.
+ *
+ * VT-D Events are listed below as VTD_EVNT_* and VTD_SIP_EVNT_*
+ */
+#define VTD_CTRL			0x38
+#define VTD_RESET_CNTR			BIT_ULL(0)
+#define VTD_FREEZE_CNTR			BIT_ULL(8)
+#define VTD_CTRL_EVNT			GENMASK_ULL(19, 16)
+#define VTD_EVNT_AFU_MEM_RD_TRANS	0x0
+#define VTD_EVNT_AFU_MEM_WR_TRANS	0x1
+#define VTD_EVNT_AFU_DEVTLB_RD_HIT	0x2
+#define VTD_EVNT_AFU_DEVTLB_WR_HIT	0x3
+#define VTD_EVNT_DEVTLB_4K_FILL		0x4
+#define VTD_EVNT_DEVTLB_2M_FILL		0x5
+#define VTD_EVNT_DEVTLB_1G_FILL		0x6
+#define VTD_CNTR			0x40
+#define VTD_CNTR_EVNT_CNTR		GENMASK_ULL(47, 0)
+#define VTD_CNTR_EVNT			GENMASK_ULL(63, 60)
+
+#define VTD_SIP_CTRL			0x48
+#define VTD_SIP_RESET_CNTR		BIT_ULL(0)
+#define VTD_SIP_FREEZE_CNTR		BIT_ULL(8)
+#define VTD_SIP_CTRL_EVNT		GENMASK_ULL(19, 16)
+#define VTD_SIP_EVNT_IOTLB_4K_HIT	0x0
+#define VTD_SIP_EVNT_IOTLB_2M_HIT	0x1
+#define VTD_SIP_EVNT_IOTLB_1G_HIT	0x2
+#define VTD_SIP_EVNT_SLPWC_L3_HIT	0x3
+#define VTD_SIP_EVNT_SLPWC_L4_HIT	0x4
+#define VTD_SIP_EVNT_RCC_HIT		0x5
+#define VTD_SIP_EVNT_IOTLB_4K_MISS	0x6
+#define VTD_SIP_EVNT_IOTLB_2M_MISS	0x7
+#define VTD_SIP_EVNT_IOTLB_1G_MISS	0x8
+#define VTD_SIP_EVNT_SLPWC_L3_MISS	0x9
+#define VTD_SIP_EVNT_SLPWC_L4_MISS	0xa
+#define VTD_SIP_EVNT_RCC_MISS		0xb
+#define VTD_SIP_CNTR			0X50
+#define VTD_SIP_CNTR_EVNT_CNTR		GENMASK_ULL(47, 0)
+#define VTD_SIP_CNTR_EVNT		GENMASK_ULL(63, 60)
+
+#define PERF_TIMEOUT			30
+
+#define PERF_MAX_PORT_NUM		1
+
+/**
+ * struct fme_perf_priv - priv data structure for fme perf driver
+ *
+ * @dev: parent device.
+ * @ioaddr: mapped base address of mmio region.
+ * @pmu: pmu data structure for fme perf counters.
+ * @id: id of this fme performance report private feature.
+ * @fab_users: current user number on fabric counters.
+ * @fab_port_id: used to indicate current working mode of fabric counters.
+ * @fab_lock: lock to protect fabric counters working mode.
+ * @events_group: events attribute group for fme perf pmu.
+ * @attr_groups: attribute groups for fme perf pmu.
+ */
+struct fme_perf_priv {
+	struct device *dev;
+	void __iomem *ioaddr;
+	struct pmu pmu;
+	u64 id;
+
+	u32 fab_users;
+	u32 fab_port_id;
+	spinlock_t fab_lock;
+
+	struct attribute_group events_group;
+	const struct attribute_group *attr_groups[4];
+};
+
+/**
+ * struct fme_perf_event_attr - fme perf event attribute
+ *
+ * @attr: device attribute of fme perf event.
+ * @event_id: id of fme perf event.
+ * @event_type: type of fme perf event.
+ * @is_port_event: indicate if this is a port based event.
+ * @data: private data for fme perf event.
+ */
+struct fme_perf_event_attr {
+	struct device_attribute attr;
+	u32 event_id;
+	u32 event_type;
+	bool is_port_event;
+	u64 data;
+};
+
+/**
+ * struct fme_perf_event_ops - callbacks for fme perf events
+ *
+ * @event_init: callback invoked during event init.
+ * @event_destroy: callback invoked during event destroy.
+ * @read_counter: callback to read hardware counters.
+ */
+struct fme_perf_event_ops {
+	int (*event_init)(struct fme_perf_priv *priv, u32 event,
+			  u32 port_id, u64 data);
+	void (*event_destroy)(struct fme_perf_priv *priv, u32 event,
+			      u32 port_id, u64 data);
+	u64 (*read_counter)(struct fme_perf_priv *priv, u32 event,
+			    u32 port_id, u64 data);
+};
+
+/**
+ * struct fme_perf_event_group - fme perf groups
+ *
+ * @ev_attrs: fme perf event attributes.
+ * @num: events number in this group.
+ * @ops: same callbacks shared by all fme perf events in this group.
+ */
+struct fme_perf_event_group {
+	struct fme_perf_event_attr *ev_attrs;
+	unsigned int num;
+	struct fme_perf_event_ops *ops;
+};
+
+#define to_fme_perf_priv(_pmu)	container_of(_pmu, struct fme_perf_priv, pmu)
+
+static cpumask_t fme_perf_cpumask = CPU_MASK_CPU0;
+
+static ssize_t cpumask_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	return cpumap_print_to_pagebuf(true, buf, &fme_perf_cpumask);
+}
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *fme_perf_cpumask_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL,
+};
+
+static struct attribute_group fme_perf_cpumask_group = {
+	.attrs = fme_perf_cpumask_attrs,
+};
+
+#define FME_EVENT_MASK		GENMASK_ULL(11, 0)
+#define FME_EVTYPE_MASK		GENMASK_ULL(15, 12)
+#define FME_EVTYPE_BASIC	0
+#define FME_EVTYPE_CACHE	1
+#define FME_EVTYPE_FABRIC	2
+#define FME_EVTYPE_VTD		3
+#define FME_EVTYPE_VTD_SIP	4
+#define FME_EVTYPE_MAX		FME_EVTYPE_VTD_SIP
+#define FME_PORTID_MASK		GENMASK_ULL(23, 16)
+#define FME_PORTID_ROOT		(0xffU)
+
+PMU_FORMAT_ATTR(event,		"config:0-11");
+PMU_FORMAT_ATTR(evtype,		"config:12-15");
+PMU_FORMAT_ATTR(portid,		"config:16-23");
+
+static struct attribute *fme_perf_format_attrs[] = {
+	&format_attr_event.attr,
+	&format_attr_evtype.attr,
+	&format_attr_portid.attr,
+	NULL,
+};
+
+static struct attribute_group fme_perf_format_group = {
+	.name = "format",
+	.attrs = fme_perf_format_attrs,
+};
+
+static ssize_t fme_perf_event_sysfs_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *page)
+{
+	struct fme_perf_event_attr *ev_attr =
+		container_of(attr, struct fme_perf_event_attr, attr);
+	char *buf = page;
+
+	buf += sprintf(buf, "event=0x%02x", ev_attr->event_id);
+	buf += sprintf(buf, ",evtype=0x%02x", ev_attr->event_type);
+
+	if (ev_attr->is_port_event)
+		buf += sprintf(buf, ",portid=?\n");
+	else
+		buf += sprintf(buf, ",portid=0x%02x\n", FME_PORTID_ROOT);
+
+	return (ssize_t)(buf - page);
+}
+
+#define FME_EVENT_ATTR(_name) \
+	__ATTR(_name, 0444, fme_perf_event_sysfs_show, NULL)
+
+#define FME_EVENT_BASIC(_name, _event) {		\
+	.attr = FME_EVENT_ATTR(_name),			\
+	.event_id = _event,				\
+	.event_type = FME_EVTYPE_BASIC,			\
+	.is_port_event = false,				\
+}
+
+/* data is used to save hardware channel information for cache events */
+#define FME_EVENT_CACHE(_name, _event, _data) {		\
+	.attr = FME_EVENT_ATTR(cache_##_name),		\
+	.event_id = _event,				\
+	.event_type = FME_EVTYPE_CACHE,			\
+	.is_port_event = false,				\
+	.data = _data,					\
+}
+
+#define FME_EVENT_FABRIC(_name, _event) {		\
+	.attr = FME_EVENT_ATTR(fab_##_name),		\
+	.event_id = _event,				\
+	.event_type = FME_EVTYPE_FABRIC,		\
+	.is_port_event = false,				\
+}
+
+#define FME_EVENT_FABRIC_PORT(_name, _event) {		\
+	.attr = FME_EVENT_ATTR(fab_port_##_name),	\
+	.event_id = _event,				\
+	.event_type = FME_EVTYPE_FABRIC,		\
+	.is_port_event = true,				\
+}
+
+#define FME_EVENT_VTD_PORT(_name, _event) {		\
+	.attr = FME_EVENT_ATTR(vtd_port_##_name),	\
+	.event_id = _event,				\
+	.event_type = FME_EVTYPE_VTD,			\
+	.is_port_event = true,				\
+}
+
+#define FME_EVENT_VTD_SIP(_name, _event) {		\
+	.attr = FME_EVENT_ATTR(vtd_sip_##_name),	\
+	.event_id = _event,				\
+	.event_type = FME_EVTYPE_VTD_SIP,		\
+	.is_port_event = false,				\
+}
+
+static struct fme_perf_event_attr fme_perf_basic_events[] = {
+	FME_EVENT_BASIC(clock, BASIC_EVNT_CLK),
+};
+
+static struct fme_perf_event_attr fme_perf_cache_events[] = {
+	FME_EVENT_CACHE(read_hit,     CACHE_EVNT_RD_HIT,    CACHE_CHANNEL_RD),
+	FME_EVENT_CACHE(read_miss,    CACHE_EVNT_RD_MISS,   CACHE_CHANNEL_RD),
+	FME_EVENT_CACHE(write_hit,    CACHE_EVNT_WR_HIT,    CACHE_CHANNEL_WR),
+	FME_EVENT_CACHE(write_miss,   CACHE_EVNT_WR_MISS,   CACHE_CHANNEL_WR),
+	FME_EVENT_CACHE(hold_request, CACHE_EVNT_HOLD_REQ,  CACHE_CHANNEL_RD),
+	FME_EVENT_CACHE(data_write_port_contention,
+			CACHE_EVNT_DATA_WR_PORT_CONTEN, CACHE_CHANNEL_WR),
+	FME_EVENT_CACHE(tag_write_port_contention,
+			CACHE_EVNT_TAG_WR_PORT_CONTEN,  CACHE_CHANNEL_WR),
+	FME_EVENT_CACHE(tx_req_stall, CACHE_EVNT_TX_REQ_STALL,
+			CACHE_CHANNEL_RD),
+	FME_EVENT_CACHE(rx_req_stall, CACHE_EVNT_RX_REQ_STALL,
+			CACHE_CHANNEL_RD),
+	FME_EVENT_CACHE(eviction,  CACHE_EVNT_EVICTIONS, CACHE_CHANNEL_RD),
+};
+
+static struct fme_perf_event_attr fme_perf_fab_events[] = {
+	FME_EVENT_FABRIC(pcie0_read,  FAB_EVNT_PCIE0_RD),
+	FME_EVENT_FABRIC(pcie0_write, FAB_EVNT_PCIE0_WR),
+	FME_EVENT_FABRIC(pcie1_read,  FAB_EVNT_PCIE1_RD),
+	FME_EVENT_FABRIC(pcie1_write, FAB_EVNT_PCIE1_WR),
+	FME_EVENT_FABRIC(upi_read,    FAB_EVNT_UPI_RD),
+	FME_EVENT_FABRIC(upi_write,   FAB_EVNT_UPI_WR),
+	FME_EVENT_FABRIC(mmio_read,   FAB_EVNT_MMIO_RD),
+	FME_EVENT_FABRIC(mmio_write,  FAB_EVNT_MMIO_WR),
+
+	FME_EVENT_FABRIC_PORT(pcie0_read,  FAB_EVNT_PCIE0_RD),
+	FME_EVENT_FABRIC_PORT(pcie0_write, FAB_EVNT_PCIE0_WR),
+	FME_EVENT_FABRIC_PORT(pcie1_read,  FAB_EVNT_PCIE1_RD),
+	FME_EVENT_FABRIC_PORT(pcie1_write, FAB_EVNT_PCIE1_WR),
+	FME_EVENT_FABRIC_PORT(upi_read,    FAB_EVNT_UPI_RD),
+	FME_EVENT_FABRIC_PORT(upi_write,   FAB_EVNT_UPI_WR),
+	FME_EVENT_FABRIC_PORT(mmio_read,   FAB_EVNT_MMIO_RD),
+	FME_EVENT_FABRIC_PORT(mmio_write,  FAB_EVNT_MMIO_WR),
+};
+
+static struct fme_perf_event_attr fme_perf_vtd_events[] = {
+	FME_EVENT_VTD_PORT(read_transaction,  VTD_EVNT_AFU_MEM_RD_TRANS),
+	FME_EVENT_VTD_PORT(write_transaction, VTD_EVNT_AFU_MEM_WR_TRANS),
+	FME_EVENT_VTD_PORT(devtlb_read_hit,   VTD_EVNT_AFU_DEVTLB_RD_HIT),
+	FME_EVENT_VTD_PORT(devtlb_write_hit,  VTD_EVNT_AFU_DEVTLB_WR_HIT),
+	FME_EVENT_VTD_PORT(devtlb_4k_fill,    VTD_EVNT_DEVTLB_4K_FILL),
+	FME_EVENT_VTD_PORT(devtlb_2m_fill,    VTD_EVNT_DEVTLB_2M_FILL),
+	FME_EVENT_VTD_PORT(devtlb_1g_fill,    VTD_EVNT_DEVTLB_1G_FILL),
+};
+
+static struct fme_perf_event_attr fme_perf_vtd_sip_events[] = {
+	FME_EVENT_VTD_SIP(iotlb_4k_hit,  VTD_SIP_EVNT_IOTLB_4K_HIT),
+	FME_EVENT_VTD_SIP(iotlb_2m_hit,  VTD_SIP_EVNT_IOTLB_2M_HIT),
+	FME_EVENT_VTD_SIP(iotlb_1g_hit,  VTD_SIP_EVNT_IOTLB_1G_HIT),
+	FME_EVENT_VTD_SIP(slpwc_l3_hit,  VTD_SIP_EVNT_SLPWC_L3_HIT),
+	FME_EVENT_VTD_SIP(slpwc_l4_hit,  VTD_SIP_EVNT_SLPWC_L4_HIT),
+	FME_EVENT_VTD_SIP(rcc_hit,       VTD_SIP_EVNT_RCC_HIT),
+	FME_EVENT_VTD_SIP(iotlb_4k_miss, VTD_SIP_EVNT_IOTLB_4K_MISS),
+	FME_EVENT_VTD_SIP(iotlb_2m_miss, VTD_SIP_EVNT_IOTLB_2M_MISS),
+	FME_EVENT_VTD_SIP(iotlb_1g_miss, VTD_SIP_EVNT_IOTLB_1G_MISS),
+	FME_EVENT_VTD_SIP(slpwc_l3_miss, VTD_SIP_EVNT_SLPWC_L3_MISS),
+	FME_EVENT_VTD_SIP(slpwc_l4_miss, VTD_SIP_EVNT_SLPWC_L4_MISS),
+	FME_EVENT_VTD_SIP(rcc_miss,      VTD_SIP_EVNT_RCC_MISS),
+};
+
+static u64 basic_read_event_counter(struct fme_perf_priv *priv,
+				    u32 event, u32 port_id, u64 data)
+{
+	void __iomem *base = priv->ioaddr;
+
+	if (event == BASIC_EVNT_CLK)
+		return readq(base + CLK_CNTR);
+
+	return 0;
+}
+
+static struct fme_perf_event_ops fme_perf_basic_ops = {
+	.read_counter = basic_read_event_counter,
+};
+
+static u64 cache_read_event_counter(struct fme_perf_priv *priv,
+				    u32 event, u32 port_id, u64 data)
+{
+	void __iomem *base = priv->ioaddr;
+	u8 channel = (u8)data;
+	u64 v, count;
+
+	/* set channel access type and cache event code. */
+	v = readq(base + CACHE_CTRL);
+	v &= ~(CACHE_CHANNEL_SEL | CACHE_CTRL_EVNT);
+	v |= FIELD_PREP(CACHE_CHANNEL_SEL, channel);
+	v |= FIELD_PREP(CACHE_CTRL_EVNT, event);
+	writeq(v, base + CACHE_CTRL);
+
+	if (readq_poll_timeout_atomic(base + CACHE_CNTR0, v,
+				      FIELD_GET(CACHE_CNTR_EVNT, v) == event,
+				      1, PERF_TIMEOUT)) {
+		dev_err(priv->dev, "timeout, unmatched cache event code in counter register.\n");
+		return 0;
+	}
+
+	v = readq(base + CACHE_CNTR0);
+	count = FIELD_GET(CACHE_CNTR_EVNT_CNTR, v);
+	v = readq(base + CACHE_CNTR1);
+	count += FIELD_GET(CACHE_CNTR_EVNT_CNTR, v);
+
+	return count;
+}
+
+static struct fme_perf_event_ops fme_perf_cache_ops = {
+	.read_counter = cache_read_event_counter,
+};
+
+static int fabric_event_init(struct fme_perf_priv *priv,
+			     u32 event, u32 port_id, u64 data)
+{
+	void __iomem *base = priv->ioaddr;
+	int ret = 0;
+	u64 v;
+
+	/*
+	 * as fabric counter set only can be in either overall or port mode.
+	 * In overall mode, it counts overall data for FPGA, and in port mode,
+	 * it is configured to monitor on one individual port.
+	 *
+	 * so every time, a new event is initialized, driver checks
+	 * current working mode and if someone is using this counter set.
+	 */
+	spin_lock(&priv->fab_lock);
+	if (priv->fab_users && priv->fab_port_id != port_id) {
+		dev_dbg(priv->dev, "conflict fabric event monitoring mode.\n");
+		ret = -EOPNOTSUPP;
+		goto exit;
+	}
+
+	priv->fab_users++;
+
+	/*
+	 * skip if current working mode matches, otherwise change the working
+	 * mode per input port_id, to monitor overall data or another port.
+	 */
+	if (priv->fab_port_id == port_id)
+		goto exit;
+
+	priv->fab_port_id = port_id;
+
+	v = readq(base + FAB_CTRL);
+	v &= ~(FAB_PORT_FILTER | FAB_PORT_ID);
+
+	if (port_id == FME_PORTID_ROOT) {
+		v |= FIELD_PREP(FAB_PORT_FILTER, FAB_PORT_FILTER_DISABLE);
+	} else {
+		v |= FIELD_PREP(FAB_PORT_FILTER, FAB_PORT_FILTER_ENABLE);
+		v |= FIELD_PREP(FAB_PORT_ID, port_id);
+	}
+	writeq(v, base + FAB_CTRL);
+
+exit:
+	spin_unlock(&priv->fab_lock);
+	return ret;
+}
+
+static void fabric_event_destroy(struct fme_perf_priv *priv,
+				 u32 event, u32 port_id, u64 data)
+{
+	spin_lock(&priv->fab_lock);
+	priv->fab_users--;
+	spin_unlock(&priv->fab_lock);
+}
+
+static u64 fabric_read_event_counter(struct fme_perf_priv *priv,
+				     u32 event, u32 port_id, u64 data)
+{
+	void __iomem *base = priv->ioaddr;
+	u64 v;
+
+	v = readq(base + FAB_CTRL);
+	v &= ~FAB_CTRL_EVNT;
+	v |= FIELD_PREP(FAB_CTRL_EVNT, event);
+	writeq(v, base + FAB_CTRL);
+
+	if (readq_poll_timeout_atomic(base + FAB_CNTR, v,
+				      FIELD_GET(FAB_CNTR_EVNT, v) == event,
+				      1, PERF_TIMEOUT)) {
+		dev_err(priv->dev, "timeout, unmatched fab event code in counter register.\n");
+		return 0;
+	}
+
+	v = readq(base + FAB_CNTR);
+	return FIELD_GET(FAB_CNTR_EVNT_CNTR, v);
+}
+
+static struct fme_perf_event_ops fme_perf_fab_ops = {
+	.event_init = fabric_event_init,
+	.event_destroy = fabric_event_destroy,
+	.read_counter = fabric_read_event_counter,
+};
+
+static u64 vtd_read_event_counter(struct fme_perf_priv *priv,
+				  u32 event, u32 port_id, u64 data)
+{
+	void __iomem *base = priv->ioaddr;
+	u64 v;
+
+	event += port_id;
+
+	v = readq(base + VTD_CTRL);
+	v &= ~VTD_CTRL_EVNT;
+	v |= FIELD_PREP(VTD_CTRL_EVNT, event);
+	writeq(v, base + VTD_CTRL);
+
+	if (readq_poll_timeout_atomic(base + VTD_CNTR, v,
+				      FIELD_GET(VTD_CNTR_EVNT, v) == event,
+				      1, PERF_TIMEOUT)) {
+		dev_err(priv->dev, "timeout, unmatched vtd event code in counter register.\n");
+		return 0;
+	}
+
+	v = readq(base + VTD_CNTR);
+	return FIELD_GET(VTD_CNTR_EVNT_CNTR, v);
+}
+
+static struct fme_perf_event_ops fme_perf_vtd_ops = {
+	.read_counter = vtd_read_event_counter,
+};
+
+static u64 vtd_sip_read_event_counter(struct fme_perf_priv *priv,
+				      u32 event, u32 port_id, u64 data)
+{
+	void __iomem *base = priv->ioaddr;
+	u64 v;
+
+	v = readq(base + VTD_SIP_CTRL);
+	v &= ~VTD_SIP_CTRL_EVNT;
+	v |= FIELD_PREP(VTD_SIP_CTRL_EVNT, event);
+	writeq(v, base + VTD_SIP_CTRL);
+
+	if (readq_poll_timeout_atomic(base + VTD_SIP_CNTR, v,
+				      FIELD_GET(VTD_SIP_CNTR_EVNT, v) == event,
+				      1, PERF_TIMEOUT)) {
+		dev_err(priv->dev, "timeout, unmatched vtd sip event code in counter register\n");
+		return 0;
+	}
+
+	v = readq(base + VTD_SIP_CNTR);
+	return FIELD_GET(VTD_SIP_CNTR_EVNT_CNTR, v);
+}
+
+static struct fme_perf_event_ops fme_perf_vtd_sip_ops = {
+	.read_counter = vtd_sip_read_event_counter,
+};
+
+#define FME_EVENT_GROUP(_name) {			\
+	.ev_attrs = fme_perf_##_name##_events,		\
+	.num = ARRAY_SIZE(fme_perf_##_name##_events),	\
+	.ops = &fme_perf_##_name##_ops,			\
+}
+
+/* event group array is indexed by FME_EVTYPE_* */
+static struct fme_perf_event_group fme_perf_event_groups[] = {
+	FME_EVENT_GROUP(basic),
+	FME_EVENT_GROUP(cache),
+	FME_EVENT_GROUP(fab),
+	FME_EVENT_GROUP(vtd),
+	FME_EVENT_GROUP(vtd_sip),
+};
+
+static struct fme_perf_event_attr *
+get_event_attr(u32 event_id, u32 event_type, u32 port_id)
+{
+	bool is_port_event = (port_id != FME_PORTID_ROOT);
+	struct fme_perf_event_group *group;
+	unsigned int i;
+
+	if (event_type > FME_EVTYPE_MAX)
+		return NULL;
+
+	group = &fme_perf_event_groups[event_type];
+
+	for (i = 0; i < group->num; i++) {
+		if (event_id == group->ev_attrs[i].event_id &&
+		    is_port_event == group->ev_attrs[i].is_port_event)
+			return &group->ev_attrs[i];
+	}
+
+	return NULL;
+}
+
+static struct fme_perf_event_ops *get_event_ops(u32 event_type)
+{
+	return fme_perf_event_groups[event_type].ops;
+}
+
+static void fme_perf_event_destroy(struct perf_event *event)
+{
+	struct fme_perf_event_ops *ops = get_event_ops(event->hw.event_base);
+	struct fme_perf_priv *priv = to_fme_perf_priv(event->pmu);
+
+	if (ops->event_destroy)
+		ops->event_destroy(priv, event->hw.idx,
+				   event->hw.config_base, event->hw.config);
+}
+
+static int fme_perf_event_init(struct perf_event *event)
+{
+	struct fme_perf_priv *priv = to_fme_perf_priv(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	struct fme_perf_event_attr *ev_attr;
+	u32 event_id, event_type, port_id;
+	struct fme_perf_event_ops *ops;
+
+	/* test the event attr type check for PMU enumeration */
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/*
+	 * fme counters are shared across all cores.
+	 * Therefore, it does not support per-process mode.
+	 * Also, it does not support event sampling mode.
+	 */
+	if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK)
+		return -EINVAL;
+
+	if (event->cpu < 0)
+		return -EINVAL;
+
+	event_id = FIELD_GET(FME_EVENT_MASK, event->attr.config);
+	event_type = FIELD_GET(FME_EVTYPE_MASK, event->attr.config);
+	port_id = FIELD_GET(FME_PORTID_MASK, event->attr.config);
+
+	ev_attr = get_event_attr(event_id, event_type, port_id);
+	if (!ev_attr)
+		return -EINVAL;
+
+	if ((ev_attr->is_port_event && port_id >= PERF_MAX_PORT_NUM) ||
+	    (!ev_attr->is_port_event && port_id != FME_PORTID_ROOT))
+		return -EINVAL;
+
+	hwc->event_base = event_type;
+	hwc->idx = (int)event_id;
+	hwc->config_base = port_id;
+	hwc->config = ev_attr->data;
+
+	event->destroy = fme_perf_event_destroy;
+
+	dev_dbg(priv->dev,
+		"%s eventid=0x%x, evtype=0x%x, portid=0x%x, data=0x%llx\n",
+		__func__, event_id, event_type, port_id, ev_attr->data);
+
+	ops = get_event_ops(event->hw.event_base);
+
+	if (ops->event_init)
+		return ops->event_init(priv, hwc->idx,
+				       hwc->config_base, hwc->config);
+
+	return 0;
+}
+
+static void fme_perf_event_update(struct perf_event *event)
+{
+	struct fme_perf_event_ops *ops = get_event_ops(event->hw.event_base);
+	struct fme_perf_priv *priv = to_fme_perf_priv(event->pmu);
+	u64 now, prev, delta;
+
+	now = ops->read_counter(priv, (u32)event->hw.idx,
+				event->hw.config_base, event->hw.config);
+	prev = local64_read(&event->hw.prev_count);
+	delta = now - prev;
+
+	local64_add(delta, &event->count);
+}
+
+static void fme_perf_event_start(struct perf_event *event, int flags)
+{
+	struct fme_perf_event_ops *ops = get_event_ops(event->hw.event_base);
+	struct fme_perf_priv *priv = to_fme_perf_priv(event->pmu);
+	u64 count;
+
+	count = ops->read_counter(priv, (u32)event->hw.idx,
+				  event->hw.config_base, event->hw.config);
+	local64_set(&event->hw.prev_count, count);
+}
+
+static void fme_perf_event_stop(struct perf_event *event, int flags)
+{
+	fme_perf_event_update(event);
+}
+
+static int fme_perf_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		fme_perf_event_start(event, flags);
+
+	return 0;
+}
+
+static void fme_perf_event_del(struct perf_event *event, int flags)
+{
+	fme_perf_event_stop(event, PERF_EF_UPDATE);
+}
+
+static void fme_perf_event_read(struct perf_event *event)
+{
+	fme_perf_event_update(event);
+}
+
+static int fme_perf_setup_attrs(struct fme_perf_priv *priv)
+{
+	struct fme_perf_event_group *group;
+	unsigned int i, idx = 0, num = 0;
+	struct attribute **attrs;
+
+	/*
+	 * if feature id is FME_FEATURE_ID_GLOBAL_IPERF, hardware supports
+	 * all performance counters, otherwise only basic and fabric counters.
+	 */
+	num += fme_perf_event_groups[FME_EVTYPE_BASIC].num;
+	num += fme_perf_event_groups[FME_EVTYPE_FABRIC].num;
+
+	if (priv->id == FME_FEATURE_ID_GLOBAL_IPERF) {
+		num += fme_perf_event_groups[FME_EVTYPE_CACHE].num;
+		num += fme_perf_event_groups[FME_EVTYPE_VTD].num;
+		num += fme_perf_event_groups[FME_EVTYPE_VTD_SIP].num;
+	}
+
+	attrs = devm_kcalloc(priv->dev, num + 1, sizeof(*attrs), GFP_KERNEL);
+	if (!attrs)
+		return -ENOMEM;
+
+	group = &fme_perf_event_groups[FME_EVTYPE_BASIC];
+	for (i = 0; i < group->num; i++)
+		attrs[idx++] = &group->ev_attrs[i].attr.attr;
+
+	group = &fme_perf_event_groups[FME_EVTYPE_FABRIC];
+	for (i = 0; i < group->num; i++)
+		attrs[idx++] = &group->ev_attrs[i].attr.attr;
+
+	if (priv->id == FME_FEATURE_ID_GLOBAL_IPERF) {
+		group = &fme_perf_event_groups[FME_EVTYPE_CACHE];
+		for (i = 0; i < group->num; i++)
+			attrs[idx++] = &group->ev_attrs[i].attr.attr;
+
+		group = &fme_perf_event_groups[FME_EVTYPE_VTD];
+		for (i = 0; i < group->num; i++)
+			attrs[idx++] = &group->ev_attrs[i].attr.attr;
+
+		group = &fme_perf_event_groups[FME_EVTYPE_VTD_SIP];
+		for (i = 0; i < group->num; i++)
+			attrs[idx++] = &group->ev_attrs[i].attr.attr;
+	}
+
+	priv->events_group.name = "events";
+	priv->events_group.attrs = attrs;
+
+	priv->attr_groups[0] = &fme_perf_format_group;
+	priv->attr_groups[1] = &fme_perf_cpumask_group;
+	priv->attr_groups[2] = &priv->events_group;
+
+	return 0;
+}
+
+static void fme_perf_setup_hardware(struct fme_perf_priv *priv)
+{
+	void __iomem *base = priv->ioaddr;
+	u64 v;
+
+	/* read and save current working mode for fabric counters */
+	v = readq(base + FAB_CTRL);
+
+	if (FIELD_GET(FAB_PORT_FILTER, v) == FAB_PORT_FILTER_DISABLE)
+		priv->fab_port_id = FME_PORTID_ROOT;
+	else
+		priv->fab_port_id = FIELD_GET(FAB_PORT_ID, v);
+}
+
+static int fme_perf_pmu_register(struct platform_device *pdev,
+				 struct fme_perf_priv *priv)
+{
+	struct pmu *pmu = &priv->pmu;
+	char *name;
+	int ret;
+
+	spin_lock_init(&priv->fab_lock);
+
+	ret = fme_perf_setup_attrs(priv);
+	if (ret)
+		return ret;
+
+	fme_perf_setup_hardware(priv);
+
+	pmu->task_ctx_nr =	perf_invalid_context;
+	pmu->attr_groups =	priv->attr_groups;
+	pmu->event_init =	fme_perf_event_init;
+	pmu->add =		fme_perf_event_add;
+	pmu->del =		fme_perf_event_del;
+	pmu->start =		fme_perf_event_start;
+	pmu->stop =		fme_perf_event_stop;
+	pmu->read =		fme_perf_event_read;
+	pmu->capabilities =	PERF_PMU_CAP_NO_INTERRUPT |
+				PERF_PMU_CAP_NO_EXCLUDE;
+
+	name = devm_kasprintf(priv->dev, GFP_KERNEL, "fme%d", pdev->id);
+
+	ret = perf_pmu_register(pmu, name, -1);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static void fme_perf_pmu_unregister(struct fme_perf_priv *priv)
+{
+	perf_pmu_unregister(&priv->pmu);
+}
+
+static int fme_perf_init(struct platform_device *pdev,
+			 struct dfl_feature *feature)
+{
+	struct fme_perf_priv *priv;
+	int ret;
+
+	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->dev = &pdev->dev;
+	priv->ioaddr = feature->ioaddr;
+	priv->id = feature->id;
+
+	ret = fme_perf_pmu_register(pdev, priv);
+	if (ret)
+		return ret;
+
+	feature->priv = priv;
+	return 0;
+}
+
+static void fme_perf_uinit(struct platform_device *pdev,
+			   struct dfl_feature *feature)
+{
+	struct fme_perf_priv *priv = feature->priv;
+
+	fme_perf_pmu_unregister(priv);
+}
+
+const struct dfl_feature_id fme_perf_id_table[] = {
+	{.id = FME_FEATURE_ID_GLOBAL_IPERF,},
+	{.id = FME_FEATURE_ID_GLOBAL_DPERF,},
+	{0,}
+};
+
+const struct dfl_feature_ops fme_perf_ops = {
+	.init = fme_perf_init,
+	.uinit = fme_perf_uinit,
+};
diff --git a/drivers/fpga/dfl-fme.h b/drivers/fpga/dfl-fme.h
index 6685c8e..4195dd6 100644
--- a/drivers/fpga/dfl-fme.h
+++ b/drivers/fpga/dfl-fme.h
@@ -38,5 +38,7 @@ struct dfl_fme {
 extern const struct dfl_feature_ops fme_global_err_ops;
 extern const struct dfl_feature_id fme_global_err_id_table[];
 extern const struct attribute_group fme_global_err_group;
+extern const struct dfl_feature_ops fme_perf_ops;
+extern const struct dfl_feature_id fme_perf_id_table[];
 
 #endif /* __DFL_FME_H */
diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
index 9f0e656..d312678 100644
--- a/drivers/fpga/dfl.h
+++ b/drivers/fpga/dfl.h
@@ -197,12 +197,14 @@ struct dfl_feature_driver {
  *		    feature dev (platform device)'s reources.
  * @ioaddr: mapped mmio resource address.
  * @ops: ops of this sub feature.
+ * @priv: priv data of this feature.
  */
 struct dfl_feature {
 	u64 id;
 	int resource_index;
 	void __iomem *ioaddr;
 	const struct dfl_feature_ops *ops;
+	void *priv;
 };
 
 #define DEV_STATUS_IN_USE	0
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v5 1/2] Documentation: fpga: dfl: add description for performance reporting support
From: Wu Hao @ 2019-10-28  3:31 UTC (permalink / raw)
  To: mdf, linux-fpga, linux-kernel
  Cc: linux-api, will, mark.rutland, atull, gregkh, Xu Yilun, Wu Hao
In-Reply-To: <1572233488-26533-1-git-send-email-hao.wu@intel.com>

From: Xu Yilun <yilun.xu@intel.com>

This patch adds description for performance reporting support for
Device Feature List (DFL) based FPGA.

Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
---
v5: rebase due to format change (txt->rst), and fix some typos.
---
 Documentation/fpga/dfl.rst | 83 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
index 094fc8a..93e51de 100644
--- a/Documentation/fpga/dfl.rst
+++ b/Documentation/fpga/dfl.rst
@@ -118,6 +118,11 @@ More functions are exposed through sysfs
      management information (current temperature, thresholds, threshold status,
      etc.).
 
+ Performance reporting
+     performance counters are exposed through perf PMU APIs. Standard perf tool
+     can be used to monitor all available perf events. Please see performance
+     counter section below for more detailed information.
+
 
 FIU - PORT
 ==========
@@ -378,6 +383,84 @@ The device nodes used for ioctl() or mmap() can be referenced through::
 	/sys/class/fpga_region/<regionX>/<dfl-port.n>/dev
 
 
+Performance Counters
+====================
+Performance reporting is one private feature implemented in FME. It could
+support several independent, system-wide, device counter sets in hardware to
+monitor and count performance events, including "basic", "cache", "fabric",
+"vtd" and "vtd_sip" counters. Users could use standard perf tool to monitor
+FPGA cache hit/miss rate, transaction number, interface clock counter of AFU
+and other FPGA performance events.
+
+Different FPGA devices may have different counter sets, it depends on hardware
+implementation. e.g. some discrete FPGA cards don't have any cache. User could
+use "perf list" to check which perf events are supported by target hardware.
+
+In order to allow user to use standard perf API to access these performance
+counters, driver creates a perf PMU, and related sysfs interfaces in
+/sys/bus/event_source/devices/fme* to describe available perf events and
+configuration options.
+
+The "format" directory describes the format of the config field of struct
+perf_event_attr. There are 3 bitfields for config, "evtype" defines which type
+the perf event belongs to. "event" is the identity of the event within its
+category. "portid" is introduced to decide which counter set to monitor on FPGA
+overall data or a specific port.
+
+The "events" directory describes the configuration templates for all available
+events which can be used with perf tool directly. For example, fab_mmio_read
+has the configuration "event=0x06,evtype=0x02,portid=0xff", which shows this
+event belongs to fabric type (0x02), the local event id is 0x06 and it is for
+overall monitoring (portid=0xff).
+
+Example usage of perf::
+
+  $# perf list |grep fme
+
+  fme0/fab_mmio_read/                              [Kernel PMU event]
+  <...>
+  fme0/fab_port_mmio_read,portid=?/                [Kernel PMU event]
+  <...>
+
+  $# perf stat -a -e fme0/fab_mmio_read/ <command>
+  or
+  $# perf stat -a -e fme0/event=0x06,evtype=0x02,portid=0xff/ <command>
+  or
+  $# perf stat -a -e fme0/config=0xff2006/ <command>
+
+Another example, fab_port_mmio_read monitors mmio read of a specific port. So
+its configuration template is "event=0x06,evtype=0x01,portid=?". The portid
+should be explicitly set.
+
+Its usage of perf::
+
+  $# perf stat -a -e fme0/fab_port_mmio_read,portid=0x0/ <command>
+  or
+  $# perf stat -a -e fme0/event=0x06,evtype=0x02,portid=0x0/ <command>
+  or
+  $# perf stat -a -e fme0/config=0x2006/ <command>
+
+Please note for fabric counters, overall perf events (fab_*) and port perf
+events (fab_port_*) actually share one set of counters in hardware, so it can't
+monitor both at the same time. If this set of counters is configured to monitor
+overall data, then per port perf data is not supported. See below example::
+
+  $# perf stat -e fme0/fab_mmio_read/,fme0/fab_port_mmio_write,\
+                                                    portid=0/ sleep 1
+
+  Performance counter stats for 'system wide':
+
+                 3      fme0/fab_mmio_read/
+   <not supported>      fme0/fab_port_mmio_write,portid=0x0/
+
+       1.001750904 seconds time elapsed
+
+The driver also provides a "cpumask" sysfs attribute, which always shows fixed
+value cpu0 as all perf events are from system-wide counters on FPGA device.
+
+The current driver does not support sampling. So "perf record" is unsupported.
+
+
 Add new FIUs support
 ====================
 It's possible that developers made some new function blocks (FIUs) under this
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v5 0/2] add performance reporting support to FPGA DFL drivers
From: Wu Hao @ 2019-10-28  3:31 UTC (permalink / raw)
  To: mdf, linux-fpga, linux-kernel
  Cc: linux-api, will, mark.rutland, atull, gregkh, Wu Hao

Hi Moritz and all,

This patchset adds performance reporting support for FPGA DFL drivers. It
introduces one pmu to expose userspace interfaces via standard perf API.
User could use standard perf tool to access perf events exposed via pmu.

This patchset is generated based on fpga/for-next branch.

Main changes from v4:
 - rebase and clean up.
 - update Kconfig for PERF_EVENTS dependency.

Main changes from v3:
 - add more descriptions in doc, including how to use perf tool for these
   hardware counters. (patch #1)
 - use standard perf API instead of sysfs entries. (patch #2)

Wu Hao (1):
  fpga: dfl: fme: add performance reporting support

Xu Yilun (1):
  Documentation: fpga: dfl: add description for performance reporting
    support

 Documentation/fpga/dfl.rst  |  83 +++++
 drivers/fpga/Kconfig        |   2 +-
 drivers/fpga/Makefile       |   1 +
 drivers/fpga/dfl-fme-main.c |   4 +
 drivers/fpga/dfl-fme-perf.c | 869 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/fpga/dfl-fme.h      |   2 +
 drivers/fpga/dfl.h          |   2 +
 7 files changed, 962 insertions(+), 1 deletion(-)
 create mode 100644 drivers/fpga/dfl-fme-perf.c

-- 
1.8.3.1

^ permalink raw reply

* Re: [RFC PATCH 11/10] pipe: Add fsync() support [ver #2]
From: Konstantin Khlebnikov @ 2019-10-27 16:04 UTC (permalink / raw)
  To: David Howells, torvalds
  Cc: Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-block, linux-security-module, linux-fsdevel, linux-api,
	linux-kernel
In-Reply-To: <30394.1571936252@warthog.procyon.org.uk>

On 24/10/2019 19.57, David Howells wrote:
> pipe: Add fsync() support
> 
> The keyrings testsuite needs the ability to wait for all the outstanding
> notifications in the queue to have been processed so that it can then go
> through them to find out whether the notifications it expected have been
> emitted.

Similar synchronization is required for reusing memory after vmsplice()?
I don't see other way how sender could safely change these pages.

> 
> Implement fsync() support for pipes to provide this.  The tailmost buffer
> at the point of calling is marked and fsync adds itself to the list of
> waiters, noting the tail position to be waited for and marking the buffer
> as no longer mergeable.  Then when the buffer is consumed, if the flag is
> set, any matching waiters are woken up.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>   fs/fuse/dev.c             |    1
>   fs/pipe.c                 |   61 ++++++++++++++++++++++++++++++++++++++++++++++
>   fs/splice.c               |    3 ++
>   include/linux/pipe_fs_i.h |   22 ++++++++++++++++
>   lib/iov_iter.c            |    2 -
>   5 files changed, 88 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 5ef57a322cb8..9617a35579cb 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1983,6 +1983,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
>   		if (rem >= ibuf->len) {
>   			*obuf = *ibuf;
>   			ibuf->ops = NULL;
> +			pipe_wake_fsync(pipe, ibuf, tail);
>   			tail++;
>   			pipe_commit_read(pipe, tail);
>   		} else {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 6a982a88f658..8e5fd7314be1 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -30,6 +30,12 @@
>   
>   #include "internal.h"
>   
> +struct pipe_fsync {
> +	struct list_head	link;		/* Link in pipe->fsync */
> +	struct completion	done;
> +	unsigned int		tail;		/* The buffer being waited for */
> +};
> +
>   /*
>    * The max size that a non-root user is allowed to grow the pipe. Can
>    * be set by root in /proc/sys/fs/pipe-max-size
> @@ -269,6 +275,58 @@ static bool pipe_buf_can_merge(struct pipe_buffer *buf)
>   	return buf->ops == &anon_pipe_buf_ops;
>   }
>   
> +/*
> + * Wait for all the data currently in the pipe to be consumed.
> + */
> +static int pipe_fsync(struct file *file, loff_t a, loff_t b, int datasync)
> +{
> +	struct pipe_inode_info *pipe = file->private_data;
> +	struct pipe_buffer *buf;
> +	struct pipe_fsync fsync;
> +	unsigned int head, tail, mask;
> +
> +	pipe_lock(pipe);
> +
> +	head = pipe->head;
> +	tail = pipe->tail;
> +	mask = pipe->ring_size - 1;
> +
> +	if (pipe_empty(head, tail)) {
> +		pipe_unlock(pipe);
> +		return 0;
> +	}
> +
> +	init_completion(&fsync.done);
> +	fsync.tail = tail;
> +	buf = &pipe->bufs[tail & mask];
> +	buf->flags |= PIPE_BUF_FLAG_FSYNC;
> +	pipe_buf_mark_unmergeable(buf);
> +	list_add_tail(&fsync.link, &pipe->fsync);
> +	pipe_unlock(pipe);
> +
> +	if (wait_for_completion_interruptible(&fsync.done) < 0) {
> +		pipe_lock(pipe);
> +		list_del(&fsync.link);
> +		pipe_unlock(pipe);
> +		return -EINTR;
> +	}
> +
> +	return 0;
> +}
> +
> +void __pipe_wake_fsync(struct pipe_inode_info *pipe, unsigned int tail)
> +{
> +	struct pipe_fsync *fsync, *p;
> +
> +	list_for_each_entry_safe(fsync, p, &pipe->fsync, link) {
> +		if (fsync->tail == tail) {
> +			list_del_init(&fsync->link);
> +			complete(&fsync->done);
> +		}
> +	}
> +}
> +EXPORT_SYMBOL(__pipe_wake_fsync);
> +
>   static ssize_t
>   pipe_read(struct kiocb *iocb, struct iov_iter *to)
>   {
> @@ -325,6 +383,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>   			if (!buf->len) {
>   				pipe_buf_release(pipe, buf);
>   				spin_lock_irq(&pipe->wait.lock);
> +				pipe_wake_fsync(pipe, buf, tail);
>   				tail++;
>   				pipe_commit_read(pipe, tail);
>   				do_wakeup = 1;
> @@ -717,6 +776,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
>   		pipe->ring_size = pipe_bufs;
>   		pipe->user = user;
>   		mutex_init(&pipe->mutex);
> +		INIT_LIST_HEAD(&pipe->fsync);
>   		return pipe;
>   	}
>   
> @@ -1060,6 +1120,7 @@ const struct file_operations pipefifo_fops = {
>   	.llseek		= no_llseek,
>   	.read_iter	= pipe_read,
>   	.write_iter	= pipe_write,
> +	.fsync		= pipe_fsync,
>   	.poll		= pipe_poll,
>   	.unlocked_ioctl	= pipe_ioctl,
>   	.release	= pipe_release,
> diff --git a/fs/splice.c b/fs/splice.c
> index 3f72bc31b6ec..e106367e1be6 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -523,6 +523,7 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des
>   
>   		if (!buf->len) {
>   			pipe_buf_release(pipe, buf);
> +			pipe_wake_fsync(pipe, buf, tail);
>   			tail++;
>   			pipe_commit_read(pipe, tail);
>   			if (pipe->files)
> @@ -771,6 +772,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
>   				ret -= buf->len;
>   				buf->len = 0;
>   				pipe_buf_release(pipe, buf);
> +				pipe_wake_fsync(pipe, buf, tail);
>   				tail++;
>   				pipe_commit_read(pipe, tail);
>   				if (pipe->files)
> @@ -1613,6 +1615,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
>   			 */
>   			*obuf = *ibuf;
>   			ibuf->ops = NULL;
> +			pipe_wake_fsync(ipipe, ibuf, i_tail);
>   			i_tail++;
>   			pipe_commit_read(ipipe, i_tail);
>   			input_wakeup = true;
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 90055ff16550..1a3027089558 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -8,6 +8,7 @@
>   #define PIPE_BUF_FLAG_ATOMIC	0x02	/* was atomically mapped */
>   #define PIPE_BUF_FLAG_GIFT	0x04	/* page is a gift */
>   #define PIPE_BUF_FLAG_PACKET	0x08	/* read() as a packet */
> +#define PIPE_BUF_FLAG_FSYNC	0x10	/* fsync() is waiting for this buffer to die */
>   
>   /**
>    *	struct pipe_buffer - a linux kernel pipe buffer
> @@ -43,6 +44,7 @@ struct pipe_buffer {
>    *	@w_counter: writer counter
>    *	@fasync_readers: reader side fasync
>    *	@fasync_writers: writer side fasync
> + *	@fsync: Waiting fsyncs
>    *	@bufs: the circular array of pipe buffers
>    *	@user: the user who created this pipe
>    **/
> @@ -62,6 +64,7 @@ struct pipe_inode_info {
>   	struct page *tmp_page;
>   	struct fasync_struct *fasync_readers;
>   	struct fasync_struct *fasync_writers;
> +	struct list_head fsync;
>   	struct pipe_buffer *bufs;
>   	struct user_struct *user;
>   };
> @@ -268,6 +271,25 @@ extern const struct pipe_buf_operations nosteal_pipe_buf_ops;
>   long pipe_fcntl(struct file *, unsigned int, unsigned long arg);
>   struct pipe_inode_info *get_pipe_info(struct file *file);
>   
> +void __pipe_wake_fsync(struct pipe_inode_info *pipe, unsigned int tail);
> +
> +/**
> + * pipe_wake_fsync - Wake up anyone waiting with fsync for this point
> + * @pipe: The pipe that owns the buffer
> + * @buf: The pipe buffer in question
> + * @tail: The index in the ring of the buffer
> + *
> + * Check to see if anyone is waiting for the pipe ring to clear up to and
> + * including this buffer, and, if they are, wake them up.
> + */
> +static inline void pipe_wake_fsync(struct pipe_inode_info *pipe,
> +				   struct pipe_buffer *buf,
> +				   unsigned int tail)
> +{
> +	if (unlikely(buf->flags & PIPE_BUF_FLAG_FSYNC))
> +		__pipe_wake_fsync(pipe, tail);
> +}
> +
>   int create_pipe_files(struct file **, int);
>   unsigned int round_pipe_size(unsigned long size);
>   
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index e22f4e283f6d..38d52524cd21 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -404,7 +404,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
>   	buf->offset = offset;
>   	buf->len = bytes;
>   
> -	pipe_commit_read(pipe, i_head);
> +	pipe_commit_write(pipe, i_head);
>   	i->iov_offset = offset + bytes;
>   	i->head = i_head;
>   out:
> 
> 

^ permalink raw reply

* Re: [RFC PATCH 07/10] pipe: Conditionalise wakeup in pipe_read() [ver #2]
From: Konstantin Khlebnikov @ 2019-10-27 15:57 UTC (permalink / raw)
  To: David Howells, torvalds
  Cc: Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-block, linux-security-module, linux-fsdevel, linux-api,
	linux-kernel
In-Reply-To: <157186189069.3995.10292601951655075484.stgit@warthog.procyon.org.uk>

On 23/10/2019 23.18, David Howells wrote:
> Only do a wakeup in pipe_read() if we made space in a completely full
> buffer.  The producer shouldn't be waiting on pipe->wait otherwise.

We could go further and wakeup writer only when at least half of buffer is empty.
This gives better batching and reduces rate of context switches.

https://lore.kernel.org/lkml/157219118016.7078.16223055699799396042.stgit@buzz/T/#u

> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>   fs/pipe.c |   15 ++++++---------
>   1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 1274305772fb..e3a8f10750c9 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -327,11 +327,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>   				spin_lock_irq(&pipe->wait.lock);
>   				tail++;
>   				pipe_commit_read(pipe, tail);
> -				do_wakeup = 0;
> -				wake_up_interruptible_sync_poll_locked(
> -					&pipe->wait, EPOLLOUT | EPOLLWRNORM);
> +				do_wakeup = 1;
> +				if (head - (tail - 1) == pipe->max_usage)
> +					wake_up_interruptible_sync_poll_locked(
> +						&pipe->wait, EPOLLOUT | EPOLLWRNORM);
>   				spin_unlock_irq(&pipe->wait.lock);
> -				kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> +				if (head - (tail - 1) == pipe->max_usage)
> +					kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>   			}
>   			total_len -= chars;
>   			if (!total_len)
> @@ -360,11 +362,6 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>   				ret = -ERESTARTSYS;
>   			break;
>   		}
> -		if (do_wakeup) {
> -			wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
> - 			kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> -			do_wakeup = 0;
> -		}
>   		pipe_wait(pipe);
>   	}
>   	__pipe_unlock(pipe);
> 
> 

^ permalink raw reply

* Re: [PATCH RESEND v14 2/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution
From: Aleksa Sarai @ 2019-10-27 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes
In-Reply-To: <CAHk-=wjPPWvm5_eR4uaHJaU1isTUk-4iXQV3Z2Px9A+w6j2nHg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]

On 2019-10-27, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, Oct 26, 2019 at 2:58 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > +       /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
> > +       if (flags & LOOKUP_IN_ROOT)
> > +               while (*s == '/')
> > +                       s++;
> > +
> >         /* Figure out the starting path and root (if needed). */
> >         if (*s == '/') {
> >                 error = nd_jump_root(nd);
> 
> So I'm still hung up on this.
> 
> I guess I can't help it, but I look at the above, and it makes me go
> "whoever wrote those tests wasn't thinking".
> 
> It just annoys me how it tests for '/' completely unnecessarily.
> 
> If LOOKUP_IN_ROOT is true, we know the subsequent test for '/' is not
> going to match, because we just removed it. So I look at that code and
> go "that code is doing stupid things".

Okay, fair enough.

> That's why I suggested moving the LOOKUP_IN_ROOT check inside the '/' test.
> 
> Alternatively, just make the logic be
> 
>         if (flags & LOOKUP_IN_ROOT) {
>                .. remove '/'s ...
>         } else if (*s == '/') {
>                 .. handl;e root ..
> 
> and remove the next "else" clause

I've gone with the latter since I think it reads better.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [RFC PATCH 11/10] pipe: Add fsync() support [ver #2]
From: Christoph Hellwig @ 2019-10-27 15:22 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-block, linux-security-module, linux-fsdevel, linux-api,
	linux-kernel
In-Reply-To: <30394.1571936252@warthog.procyon.org.uk>

On Thu, Oct 24, 2019 at 05:57:32PM +0100, David Howells wrote:
> pipe: Add fsync() support
> 
> The keyrings testsuite needs the ability to wait for all the outstanding
> notifications in the queue to have been processed so that it can then go
> through them to find out whether the notifications it expected have been
> emitted.
> 
> Implement fsync() support for pipes to provide this.  The tailmost buffer
> at the point of calling is marked and fsync adds itself to the list of
> waiters, noting the tail position to be waited for and marking the buffer
> as no longer mergeable.  Then when the buffer is consumed, if the flag is
> set, any matching waiters are woken up.

I am _really_ worried about overloading fsync for this behavior.  fsync
hasn't done anything for 50 years, and suddenly adding any action
is not helpful.  If you can't use FIONREAD please add a new ioctls
instead, and document it properly.

^ permalink raw reply

* Re: [RFC PATCH 04/10] pipe: Use head and tail pointers for the ring, not cursor and length [ver #2]
From: Linus Torvalds @ 2019-10-27 14:03 UTC (permalink / raw)
  To: David Howells
  Cc: Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	Nicolas Dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-block, LSM List, linux-fsdevel, Linux API,
	Linux Kernel Mailing List
In-Reply-To: <157186186167.3995.7568100174393739543.stgit@warthog.procyon.org.uk>

This still has signs of that earlier series:

On Wed, Oct 23, 2019 at 4:17 PM David Howells <dhowells@redhat.com> wrote:
>
>                 if (rem >= ibuf->len) {
>                         *obuf = *ibuf;
>                         ibuf->ops = NULL;
> -                       pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
> -                       pipe->nrbufs--;
> +                       tail++;
> +                       pipe_commit_read(pipe, tail);
>                 } else {
>                         if (!pipe_buf_get(pipe, ibuf))
>                                 goto out_free;

with those odd "pipe_commit_read/write()" helpers.

They make no sense, and they don't make things more legible.

It's shorter and more obvious to just write

   pipe->head = head;

than it is to write

   pipe_commit_write(pipe, head);

Even when the addition of the notifications,  it's all under the
pipe->wait.lock, so it's all just regular assignments.

Now, if at some point it starts doing fancy lockless things, at _that_
point the updates might become more complex, but that's a potential
future thing that wouldn't be relevant for a while, and isn't a reason
to make the code more obscure now.

Hmm?

             Linus

^ permalink raw reply

* Re: [PATCH RESEND v14 2/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution
From: Linus Torvalds @ 2019-10-27 12:05 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes
In-Reply-To: <20191026185700.10708-3-cyphar@cyphar.com>

On Sat, Oct 26, 2019 at 2:58 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> +       /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
> +       if (flags & LOOKUP_IN_ROOT)
> +               while (*s == '/')
> +                       s++;
> +
>         /* Figure out the starting path and root (if needed). */
>         if (*s == '/') {
>                 error = nd_jump_root(nd);

So I'm still hung up on this.

I guess I can't help it, but I look at the above, and it makes me go
"whoever wrote those tests wasn't thinking".

It just annoys me how it tests for '/' completely unnecessarily.

If LOOKUP_IN_ROOT is true, we know the subsequent test for '/' is not
going to match, because we just removed it. So I look at that code and
go "that code is doing stupid things".

That's why I suggested moving the LOOKUP_IN_ROOT check inside the '/' test.

Alternatively, just make the logic be

        if (flags & LOOKUP_IN_ROOT) {
               .. remove '/'s ...
        } else if (*s == '/') {
                .. handl;e root ..

and remove the next "else" clause

    Linus

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-27 11:00 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <87d0eieb0i.fsf@mid.deneb.enyo.de>

On October 27, 2019 12:30:21 PM GMT+02:00, Florian Weimer <fw@deneb.enyo.de> wrote:
>* Mike Rapoport:
>
>> The patch below aims to allow applications to create mappins that
>have
>> pages visible only to the owning process. Such mappings could be used
>to
>> store secrets so that these secrets are not visible neither to other
>> processes nor to the kernel.
>
>How is this expected to interact with CRIU?

CRIU dumps the memory contents using a parasite code from inside the dumpee address space, so it would work the same way as for the other mappings. Of course, at the restore time the exclusive mapping should be recreated with the appropriate flags.

>> I've only tested the basic functionality, the changes should be
>verified
>> against THP/migration/compaction. Yet, I'd appreciate early feedback.
>
>What are the expected semantics for VM migration?  Should it fail?

I don't quite follow. If qemu would use such mappings it would be able to transfer them during live migration.

-- 
Sincerely yours,
Mike

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Florian Weimer @ 2019-10-27 10:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <1572171452-7958-1-git-send-email-rppt@kernel.org>

* Mike Rapoport:

> The patch below aims to allow applications to create mappins that have
> pages visible only to the owning process. Such mappings could be used to
> store secrets so that these secrets are not visible neither to other
> processes nor to the kernel.

How is this expected to interact with CRIU?

> I've only tested the basic functionality, the changes should be verified
> against THP/migration/compaction. Yet, I'd appreciate early feedback.

What are the expected semantics for VM migration?  Should it fail?

^ permalink raw reply

* [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-27 10:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Mike Rapoport,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <1572171452-7958-1-git-send-email-rppt@kernel.org>

From: Mike Rapoport <rppt@linux.ibm.com>

The mappings created with MAP_EXCLUSIVE are visible only in the context of
the owning process and can be used by applications to store secret
information that will not be visible not only to other processes but to the
kernel as well.

The pages in these mappings are removed from the kernel direct map and
marked with PG_user_exclusive flag. When the exclusive area is unmapped,
the pages are mapped back into the direct map.

The MAP_EXCLUSIVE flag implies MAP_POPULATE and MAP_LOCKED.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/mm/fault.c                    | 14 ++++++++++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  9 +++++++
 include/linux/page-flags.h             |  7 +++++
 include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
 include/trace/events/mmflags.h         |  9 ++++++-
 include/uapi/asm-generic/mman-common.h |  1 +
 kernel/fork.c                          |  3 ++-
 mm/Kconfig                             |  3 +++
 mm/gup.c                               |  8 ++++++
 mm/memory.c                            |  3 +++
 mm/mmap.c                              | 16 +++++++++++
 mm/page_alloc.c                        |  5 ++++
 13 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/page_excl.h

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9ceacd1..8f73a75 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
+#include <linux/page_excl.h>		/* page_is_user_exclusive()	*/
 #include <linux/mm_types.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
@@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address)
 	return address >= TASK_SIZE_MAX;
 }
 
+static bool fault_in_user_exclusive_page(unsigned long address)
+{
+	struct page *page = virt_to_page(address);
+
+	return page_is_user_exclusive(page);
+}
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;
 
+	/* FIXME: warn and handle gracefully */
+	if (unlikely(fault_in_user_exclusive_page(address))) {
+		pr_err("page fault in user exclusive page at %lx", address);
+		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
+	}
+
 	/* kprobes don't want to hook the spurious faults: */
 	if (kprobe_page_fault(regs, X86_TRAP_PF))
 		return;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9442631..99e14d1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -655,6 +655,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_X86_INTEL_MPX
 		[ilog2(VM_MPX)]		= "mp",
 #endif
+		[ilog2(VM_EXCLUSIVE)]	= "xl",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc29227..9c43375 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+# define VM_EXCLUSIVE	VM_HIGH_ARCH_5
+#else
+# define VM_EXCLUSIVE	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
@@ -2594,6 +2602,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_EXCLUSIVE	0x40000	/* mapping is exclusive to owning mm */
 
 /*
  * NOTE on FOLL_LONGTERM:
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb88..32d0aee 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -131,6 +131,9 @@ enum pageflags {
 	PG_young,
 	PG_idle,
 #endif
+#if defined(CONFIG_EXCLUSIVE_USER_PAGES)
+	PG_user_exclusive,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -431,6 +434,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+__PAGEFLAG(UserExclusive, user_exclusive, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_excl.h b/include/linux/page_excl.h
new file mode 100644
index 0000000..b7ea3ce
--- /dev/null
+++ b/include/linux/page_excl.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MM_PAGE_EXCLUSIVE_H
+#define _LINUX_MM_PAGE_EXCLUSIVE_H
+
+#include <linux/bitops.h>
+#include <linux/page-flags.h>
+#include <linux/set_memory.h>
+#include <asm/tlbflush.h>
+
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+
+static inline bool page_is_user_exclusive(struct page *page)
+{
+	return PageUserExclusive(page);
+}
+
+static inline void __set_page_user_exclusive(struct page *page)
+{
+	unsigned long addr = (unsigned long)page_address(page);
+
+	__SetPageUserExclusive(page);
+	set_direct_map_invalid_noflush(page);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+}
+
+static inline void __clear_page_user_exclusive(struct page *page)
+{
+	__ClearPageUserExclusive(page);
+	set_direct_map_default_noflush(page);
+}
+
+#else /* !CONFIG_EXCLUSIVE_USER_PAGES */
+
+static inline bool page_is_user_exclusive(struct page *page)
+{
+	return false;
+}
+
+static inline void __set_page_user_exclusive(struct page *page)
+{
+}
+
+static inline void __clear_page_user_exclusive(struct page *page)
+{
+}
+
+#endif /* CONFIG_EXCLUSIVE_USER_PAGES */
+
+#endif /* _LINUX_MM_PAGE_EXCLUSIVE_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d4..2d3c14a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+#define IF_HAVE_PG_USER_EXCLUSIVE(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_USER_EXCLUSIVE(flag,string)
+#endif
+
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
 	{1UL << PG_waiters,		"waiters"	},		\
@@ -105,7 +111,8 @@ IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
+IF_HAVE_PG_USER_EXCLUSIVE(PG_user_exclusive,	"user_exclusive" )
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index c160a53..bf8f23e 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -27,6 +27,7 @@
 #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
 #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_EXCLUSIVE		0x200000	/* mapping is exclusive to the owning task; the pages in it are dropped from the direct map */
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/kernel/fork.c b/kernel/fork.c
index bcdf531..d63adec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -518,7 +518,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
 
-		if (mpnt->vm_flags & VM_DONTCOPY) {
+		if (mpnt->vm_flags & VM_DONTCOPY ||
+		    mpnt->vm_flags & VM_EXCLUSIVE) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
 		}
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a..9d60141 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config EXCLUSIVE_USER_PAGES
+        def_bool ARCH_USES_HIGH_VMA_FLAGS && ARCH_HAS_SET_DIRECT_MAP
+
 endmenu
diff --git a/mm/gup.c b/mm/gup.c
index 8f236a3..a98c0ca0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -17,6 +17,7 @@
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/page_excl.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>
@@ -868,6 +869,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			ret = PTR_ERR(page);
 			goto out;
 		}
+
+		if (gup_flags & FOLL_EXCLUSIVE)
+			__set_page_user_exclusive(page);
+
 		if (pages) {
 			pages[i] = page;
 			flush_anon_page(vma, page, start);
@@ -1216,6 +1221,9 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
 		gup_flags |= FOLL_FORCE;
 
+	if (vma->vm_flags & VM_EXCLUSIVE)
+		gup_flags |= FOLL_EXCLUSIVE;
+
 	/*
 	 * We made sure addr is within a VMA, so the following will
 	 * not result in a stack expansion that recurses back here.
diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a..a4b4cff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -71,6 +71,7 @@
 #include <linux/dax.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/page_excl.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -1062,6 +1063,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
+			if (page_is_user_exclusive(page))
+				__clear_page_user_exclusive(page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
diff --git a/mm/mmap.c b/mm/mmap.c
index a7d8c84..d8cc82d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1574,6 +1574,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (flags & MAP_EXCLUSIVE)
+		vm_flags |= VM_EXCLUSIVE;
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
@@ -1591,6 +1594,19 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 
 	addr = untagged_addr(addr);
 
+	if (flags & MAP_EXCLUSIVE) {
+		/*
+		 * MAP_EXCLUSIVE is only supported for private
+		 * anonymous memory not backed by hugetlbfs
+		 */
+		if (!(flags & MAP_ANONYMOUS) || !(flags & MAP_PRIVATE) ||
+		    (flags & MAP_HUGETLB))
+			return -EINVAL;
+
+		/* and impies MAP_LOCKED and MAP_POPULATE */
+		flags |= (MAP_LOCKED | MAP_POPULATE);
+	}
+
 	if (!(flags & MAP_ANONYMOUS)) {
 		audit_mmap_fd(fd, flags);
 		file = fget(fd);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecc3dba..2f1de9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/page_excl.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -4779,6 +4780,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 		page = NULL;
 	}
 
+	/* FIXME: should not happen! */
+	if (WARN_ON(page_is_user_exclusive(page)))
+		__clear_page_user_exclusive(page);
+
 	trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
 
 	return page;
-- 
2.7.4

^ permalink raw reply related

* [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-27 10:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Mike Rapoport,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

The patch below aims to allow applications to create mappins that have
pages visible only to the owning process. Such mappings could be used to
store secrets so that these secrets are not visible neither to other
processes nor to the kernel.

I've only tested the basic functionality, the changes should be verified
against THP/migration/compaction. Yet, I'd appreciate early feedback.

Mike Rapoport (1):
  mm: add MAP_EXCLUSIVE to create exclusive user mappings

 arch/x86/mm/fault.c                    | 14 ++++++++++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  9 +++++++
 include/linux/page-flags.h             |  7 +++++
 include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
 include/trace/events/mmflags.h         |  9 ++++++-
 include/uapi/asm-generic/mman-common.h |  1 +
 kernel/fork.c                          |  3 ++-
 mm/Kconfig                             |  3 +++
 mm/gup.c                               |  8 ++++++
 mm/memory.c                            |  3 +++
 mm/mmap.c                              | 16 +++++++++++
 mm/page_alloc.c                        |  5 ++++
 13 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/page_excl.h

-- 
2.7.4

^ permalink raw reply

* [PATCH RESEND v14 6/6] Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED
From: Aleksa Sarai @ 2019-10-26 18:57 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha
In-Reply-To: <20191026185700.10708-1-cyphar@cyphar.com>

Now that we have a special flag to signify magic-link jumps, mention it
within the path-lookup docs. And now that "magic link" is the correct
term for nd_jump_link()-style symlinks, clean up references to this type
of "symlink".

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/path-lookup.rst | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..2c32795389bd 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -405,6 +405,10 @@ is requested.  Keeping a reference in the ``nameidata`` ensures that
 only one root is in effect for the entire path walk, even if it races
 with a ``chroot()`` system call.
 
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
 The root is needed when either of two conditions holds: (1) either the
 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
 component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1153,7 @@ so ``NULL`` is returned to indicate that the symlink can be released and
 the stack frame discarded.
 
 The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
 
      $ ls -l /proc/self/fd/1
      lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1310,12 +1314,14 @@ longer needed.
 ``LOOKUP_JUMPED`` means that the current dentry was chosen not because
 it had the right name but for some other reason.  This happens when
 following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink.  In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``).  In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``).  In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
 ``LOOKUP_JUMPED`` is set when the look completes - which may be at the
-final component or, when creating, unlinking, or renaming, at the penultimate component.
+final component or, when creating, unlinking, or renaming, at the
+penultimate component. ``LOOKUP_MAGICLINK_JUMPED`` is set alongside
+``LOOKUP_JUMPED`` if a magic-link was traversed.
 
 Final-component flags
 ~~~~~~~~~~~~~~~~~~~~~
-- 
2.23.0

^ permalink raw reply related

* [PATCH RESEND v14 5/6] selftests: add openat2(2) selftests
From: Aleksa Sarai @ 2019-10-26 18:56 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha
In-Reply-To: <20191026185700.10708-1-cyphar@cyphar.com>

Test all of the various openat2(2) flags. A small stress-test of a
symlink-rename attack is included to show that the protections against
".."-based attacks are sufficient.

The main things these self-tests are enforcing are:

  * The struct+usize ABI for openat2(2) and copy_struct_from_user() to
    ensure that upgrades will be handled gracefully (in addition,
    ensuring that misaligned structures are also handled correctly).

  * The -EINVAL checks for openat2(2) are all correctly handled to avoid
    userspace passing unknown or conflicting flag sets (most
    importantly, ensuring that invalid flag combinations are checked).

  * All of the RESOLVE_* semantics (including errno values) are
    correctly handled with various combinations of paths and flags.

  * RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
    attack that has been responsible for several CVEs (and likely will
    be responsible for several more).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/openat2/.gitignore    |   1 +
 tools/testing/selftests/openat2/Makefile      |   8 +
 tools/testing/selftests/openat2/helpers.c     | 109 ++++
 tools/testing/selftests/openat2/helpers.h     | 107 ++++
 .../testing/selftests/openat2/openat2_test.c  | 297 ++++++++++
 .../selftests/openat2/rename_attack_test.c    | 160 ++++++
 .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
 8 files changed, 1206 insertions(+)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4cdbae6f4e61..28996856ed5e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += openat2
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index 000000000000..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index 000000000000..4b93b1417b86
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined
+TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index 000000000000..e9a6557ab16f
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+
+#include "helpers.h"
+
+bool needs_openat2(const struct open_how *how)
+{
+	return how->resolve != 0;
+}
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size)
+{
+	int ret = syscall(__NR_openat2, dfd, path, how, size);
+	return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat2(int dfd, const char *path, struct open_how *how)
+{
+	return raw_openat2(dfd, path, how, sizeof(*how));
+}
+
+int sys_openat(int dfd, const char *path, struct open_how *how)
+{
+	int ret = openat(dfd, path, how->flags, how->mode);
+	return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+		  int newdirfd, const char *newpath, unsigned int flags)
+{
+	int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+					  newdirfd, newpath, flags);
+	return ret >= 0 ? ret : -errno;
+}
+
+int touchat(int dfd, const char *path)
+{
+	int fd = openat(dfd, path, O_CREAT);
+	if (fd >= 0)
+		close(fd);
+	return fd;
+}
+
+char *fdreadlink(int fd)
+{
+	char *target, *tmp;
+
+	E_asprintf(&tmp, "/proc/self/fd/%d", fd);
+
+	target = malloc(PATH_MAX);
+	if (!target)
+		ksft_exit_fail_msg("fdreadlink: malloc failed\n");
+	memset(target, 0, PATH_MAX);
+
+	E_readlink(tmp, target, PATH_MAX);
+	free(tmp);
+	return target;
+}
+
+bool fdequal(int fd, int dfd, const char *path)
+{
+	char *fdpath, *dfdpath, *other;
+	bool cmp;
+
+	fdpath = fdreadlink(fd);
+	dfdpath = fdreadlink(dfd);
+
+	if (!path)
+		E_asprintf(&other, "%s", dfdpath);
+	else if (*path == '/')
+		E_asprintf(&other, "%s", path);
+	else
+		E_asprintf(&other, "%s/%s", dfdpath, path);
+
+	cmp = !strcmp(fdpath, other);
+
+	free(fdpath);
+	free(dfdpath);
+	free(other);
+	return cmp;
+}
+
+bool openat2_supported = false;
+
+void __attribute__((constructor)) init(void)
+{
+	struct open_how how = {};
+	int fd;
+
+	BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0);
+
+	/* Check openat2(2) support. */
+	fd = sys_openat2(AT_FDCWD, ".", &how);
+	openat2_supported = (fd >= 0);
+
+	if (fd >= 0)
+		close(fd);
+}
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
new file mode 100644
index 000000000000..43ca5ceab6e3
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#ifndef __RESOLVEAT_H__
+#define __RESOLVEAT_H__
+
+#define _GNU_SOURCE
+#include <stdint.h>
+#include <errno.h>
+#include <linux/types.h>
+#include "../kselftest.h"
+
+#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X)))
+#define BUILD_BUG_ON(e) ((void)(sizeof(struct { int:(-!!(e)); })))
+
+#ifndef SYS_openat2
+#ifndef __NR_openat2
+#define __NR_openat2 437
+#endif /* __NR_openat2 */
+#define SYS_openat2 __NR_openat2
+#endif /* SYS_openat2 */
+
+/*
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. @mode must be zero unless one of {O_CREAT,
+ * O_TMPFILE} are set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+	__aligned_u64 flags;
+	__u16 mode;
+	__u16 __padding[3]; /* must be zeroed */
+	__aligned_u64 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER0
+
+bool needs_openat2(const struct open_how *how);
+
+#ifndef RESOLVE_IN_ROOT
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
+					(includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS	0x02 /* Block traversal through procfs-style
+					"magic-links". */
+#define RESOLVE_NO_SYMLINKS	0x04 /* Block traversal through all symlinks
+					(implies OEXT_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH		0x08 /* Block "lexical" trickery like
+					"..", symlinks, and absolute
+					paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
+					be scoped inside the dirfd
+					(similar to chroot(2)). */
+#endif /* RESOLVE_IN_ROOT */
+
+#define E_func(func, ...)						\
+	do {								\
+		if (func(__VA_ARGS__) < 0)				\
+			ksft_exit_fail_msg("%s:%d %s failed\n", \
+					   __FILE__, __LINE__, #func);\
+	} while (0)
+
+#define E_asprintf(...)		E_func(asprintf,	__VA_ARGS__)
+#define E_chmod(...)		E_func(chmod,		__VA_ARGS__)
+#define E_dup2(...)		E_func(dup2,		__VA_ARGS__)
+#define E_fchdir(...)		E_func(fchdir,		__VA_ARGS__)
+#define E_fstatat(...)		E_func(fstatat,		__VA_ARGS__)
+#define E_kill(...)		E_func(kill,		__VA_ARGS__)
+#define E_mkdirat(...)		E_func(mkdirat,		__VA_ARGS__)
+#define E_mount(...)		E_func(mount,		__VA_ARGS__)
+#define E_prctl(...)		E_func(prctl,		__VA_ARGS__)
+#define E_readlink(...)		E_func(readlink,	__VA_ARGS__)
+#define E_setresuid(...)	E_func(setresuid,	__VA_ARGS__)
+#define E_symlinkat(...)	E_func(symlinkat,	__VA_ARGS__)
+#define E_touchat(...)		E_func(touchat,		__VA_ARGS__)
+#define E_unshare(...)		E_func(unshare,		__VA_ARGS__)
+
+#define E_assert(expr, msg, ...)					\
+	do {								\
+		if (!(expr))						\
+			ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \
+					   __FILE__, __LINE__, #expr, ##__VA_ARGS__); \
+	} while (0)
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size);
+int sys_openat2(int dfd, const char *path, struct open_how *how);
+int sys_openat(int dfd, const char *path, struct open_how *how);
+int sys_renameat2(int olddirfd, const char *oldpath,
+		  int newdirfd, const char *newpath, unsigned int flags);
+
+int touchat(int dfd, const char *path);
+char *fdreadlink(int fd);
+bool fdequal(int fd, int dfd, const char *path);
+
+extern bool openat2_supported;
+
+#endif /* __RESOLVEAT_H__ */
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
new file mode 100644
index 000000000000..c5e98f4b9ef1
--- /dev/null
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -0,0 +1,297 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * O_LARGEFILE is set to 0 by glibc.
+ * XXX: This is wrong on {mips, parisc, powerpc, sparc}.
+ */
+#undef	O_LARGEFILE
+#define	O_LARGEFILE 0x8000
+
+struct open_how_ext {
+	struct open_how inner;
+	uint32_t extra1;
+	char pad1[128];
+	uint32_t extra2;
+	char pad2[128];
+	uint32_t extra3;
+};
+
+struct struct_test {
+	const char *name;
+	struct open_how_ext arg;
+	size_t size;
+	int err;
+};
+
+#define NUM_OPENAT2_STRUCT_TESTS 9
+#define NUM_OPENAT2_STRUCT_VARIATIONS 13
+
+void test_openat2_struct(void)
+{
+	int misalignments[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 17, 87 };
+
+	struct struct_test tests[] = {
+		/* Normal struct. */
+		{ .name = "normal struct",
+		  .arg.inner.flags = O_RDONLY,
+		  .size = sizeof(struct open_how) },
+		/* Bigger struct, with zeroed out end. */
+		{ .name = "bigger struct (zeroed out)",
+		  .arg.inner.flags = O_RDONLY,
+		  .size = sizeof(struct open_how_ext) },
+
+		/* Normal struct with broken padding. */
+		{ .name = "normal struct (non-zero padding[0])",
+		  .arg.inner.flags = O_RDONLY,
+		  .arg.inner.__padding = {0xa0, 0x00},
+		  .size = sizeof(struct open_how_ext), .err = -EINVAL },
+		{ .name = "normal struct (non-zero padding[1])",
+		  .arg.inner.flags = O_RDONLY,
+		  .arg.inner.__padding = {0x00, 0x1a},
+		  .size = sizeof(struct open_how_ext), .err = -EINVAL },
+
+		/* TODO: Once expanded, check zero-padding. */
+
+		/* Smaller than version-0 struct. */
+		{ .name = "zero-sized 'struct'",
+		  .arg.inner.flags = O_RDONLY, .size = 0, .err = -EINVAL },
+		{ .name = "smaller-than-v0 struct",
+		  .arg.inner.flags = O_RDONLY,
+		  .size = OPEN_HOW_SIZE_VER0 - 1, .err = -EINVAL },
+
+		/* Bigger struct, with non-zero trailing bytes. */
+		{ .name = "bigger struct (non-zero data in first 'future field')",
+		  .arg.inner.flags = O_RDONLY, .arg.extra1 = 0xdeadbeef,
+		  .size = sizeof(struct open_how_ext), .err = -E2BIG },
+		{ .name = "bigger struct (non-zero data in middle of 'future fields')",
+		  .arg.inner.flags = O_RDONLY, .arg.extra2 = 0xfeedcafe,
+		  .size = sizeof(struct open_how_ext), .err = -E2BIG },
+		{ .name = "bigger struct (non-zero data at end of 'future fields')",
+		  .arg.inner.flags = O_RDONLY, .arg.extra3 = 0xabad1dea,
+		  .size = sizeof(struct open_how_ext), .err = -E2BIG },
+	};
+
+	BUILD_BUG_ON(ARRAY_LEN(misalignments) != NUM_OPENAT2_STRUCT_VARIATIONS);
+	BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_STRUCT_TESTS);
+
+	for (int i = 0; i < ARRAY_LEN(tests); i++) {
+		struct struct_test *test = &tests[i];
+		struct open_how_ext how_ext = test->arg;
+
+		for (int j = 0; j < ARRAY_LEN(misalignments); j++) {
+			int fd, misalign = misalignments[j];
+			char *fdpath = NULL;
+			bool failed;
+			void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+			void *copy = NULL, *how_copy = &how_ext;
+
+			if (!openat2_supported) {
+				ksft_print_msg("openat2(2) unsupported\n");
+				resultfn = ksft_test_result_skip;
+				goto skip;
+			}
+
+			if (misalign) {
+				/*
+				 * Explicitly misalign the structure copying it with the given
+				 * (mis)alignment offset. The other data is set to be non-zero to
+				 * make sure that non-zero bytes outside the struct aren't checked
+				 *
+				 * This is effectively to check that is_zeroed_user() works.
+				 */
+				copy = malloc(misalign + sizeof(how_ext));
+				how_copy = copy + misalign;
+				memset(copy, 0xff, misalign);
+				memcpy(how_copy, &how_ext, sizeof(how_ext));
+			}
+
+			fd = raw_openat2(AT_FDCWD, ".", how_copy, test->size);
+			if (test->err >= 0)
+				failed = (fd < 0);
+			else
+				failed = (fd != test->err);
+			if (fd >= 0) {
+				fdpath = fdreadlink(fd);
+				close(fd);
+			}
+
+			if (failed) {
+				resultfn = ksft_test_result_fail;
+
+				ksft_print_msg("openat2 unexpectedly returned ");
+				if (fdpath)
+					ksft_print_msg("%d['%s']\n", fd, fdpath);
+				else
+					ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+			}
+
+skip:
+			if (test->err >= 0)
+				resultfn("openat2 with %s argument [misalign=%d] succeeds\n",
+					 test->name, misalign);
+			else
+				resultfn("openat2 with %s argument [misalign=%d] fails with %d (%s)\n",
+					 test->name, misalign, test->err,
+					 strerror(-test->err));
+
+			free(copy);
+			free(fdpath);
+			fflush(stdout);
+		}
+	}
+}
+
+struct flag_test {
+	const char *name;
+	struct open_how how;
+	int err;
+};
+
+#define NUM_OPENAT2_FLAG_TESTS 21
+
+void test_openat2_flags(void)
+{
+	struct flag_test tests[] = {
+		/* O_TMPFILE is incompatible with O_PATH and O_CREAT. */
+		{ .name = "incompatible flags (O_TMPFILE | O_PATH)",
+		  .how.flags = O_TMPFILE | O_PATH | O_RDWR, .err = -EINVAL },
+		{ .name = "incompatible flags (O_TMPFILE | O_CREAT)",
+		  .how.flags = O_TMPFILE | O_CREAT | O_RDWR, .err = -EINVAL },
+
+		/* O_PATH only permits certain other flags to be set ... */
+		{ .name = "compatible flags (O_PATH | O_CLOEXEC)",
+		  .how.flags = O_PATH | O_CLOEXEC },
+		{ .name = "compatible flags (O_PATH | O_DIRECTORY)",
+		  .how.flags = O_PATH | O_DIRECTORY },
+		{ .name = "compatible flags (O_PATH | O_NOFOLLOW)",
+		  .how.flags = O_PATH | O_NOFOLLOW },
+		/* ... and others are absolutely not permitted. */
+		{ .name = "incompatible flags (O_PATH | O_RDWR)",
+		  .how.flags = O_PATH | O_RDWR, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_CREAT)",
+		  .how.flags = O_PATH | O_CREAT, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_EXCL)",
+		  .how.flags = O_PATH | O_EXCL, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_NOCTTY)",
+		  .how.flags = O_PATH | O_NOCTTY, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_DIRECT)",
+		  .how.flags = O_PATH | O_DIRECT, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_LARGEFILE)",
+		  .how.flags = O_PATH | O_LARGEFILE, .err = -EINVAL },
+
+		/* ->mode must only be set with O_{CREAT,TMPFILE}. */
+		{ .name = "non-zero how.mode and O_RDONLY",
+		  .how.flags = O_RDONLY, .how.mode = 0600, .err = -EINVAL },
+		{ .name = "non-zero how.mode and O_PATH",
+		  .how.flags = O_PATH,   .how.mode = 0600, .err = -EINVAL },
+		{ .name = "valid how.mode and O_CREAT",
+		  .how.flags = O_CREAT,  .how.mode = 0600 },
+		{ .name = "valid how.mode and O_TMPFILE",
+		  .how.flags = O_TMPFILE | O_RDWR, .how.mode = 0600 },
+		/* ->mode must only contain 0777 bits. */
+		{ .name = "invalid how.mode and O_CREAT",
+		  .how.flags = O_CREAT,
+		  .how.mode = 0xFFFF, .err = -EINVAL },
+		{ .name = "invalid how.mode and O_TMPFILE",
+		  .how.flags = O_TMPFILE | O_RDWR,
+		  .how.mode = 0x1337, .err = -EINVAL },
+
+		/* ->resolve must only contain RESOLVE_* flags. */
+		{ .name = "invalid how.resolve and O_RDONLY",
+		  .how.flags = O_RDONLY,
+		  .how.resolve = 0x1337, .err = -EINVAL },
+		{ .name = "invalid how.resolve and O_CREAT",
+		  .how.flags = O_CREAT,
+		  .how.resolve = 0x1337, .err = -EINVAL },
+		{ .name = "invalid how.resolve and O_TMPFILE",
+		  .how.flags = O_TMPFILE | O_RDWR,
+		  .how.resolve = 0x1337, .err = -EINVAL },
+		{ .name = "invalid how.resolve and O_PATH",
+		  .how.flags = O_PATH,
+		  .how.resolve = 0x1337, .err = -EINVAL },
+	};
+
+	BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS);
+
+	for (int i = 0; i < ARRAY_LEN(tests); i++) {
+		int fd;
+		char *path, *fdpath = NULL;
+		bool failed = false;
+		struct flag_test *test = &tests[i];
+		void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+		if (!openat2_supported) {
+			ksft_print_msg("openat2(2) unsupported\n");
+			resultfn = ksft_test_result_skip;
+			goto skip;
+		}
+
+		path = (test->how.flags & O_CREAT) ? "/tmp/ksft.openat2_tmpfile" : ".";
+		unlink(path);
+
+		fd = sys_openat2(AT_FDCWD, path, &test->how);
+		if (test->err >= 0)
+			failed = (fd < 0);
+		else
+			failed = (fd != test->err);
+		if (fd >= 0) {
+			fdpath = fdreadlink(fd);
+			close(fd);
+		}
+
+		if (failed) {
+			resultfn = ksft_test_result_fail;
+
+			ksft_print_msg("openat2 unexpectedly returned ");
+			if (fdpath)
+				ksft_print_msg("%d['%s']\n", fd, fdpath);
+			else
+				ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+		}
+
+skip:
+		if (test->err >= 0)
+			resultfn("openat2 with %s succeeds\n", test->name);
+		else
+			resultfn("openat2 with %s fails with %d (%s)\n",
+				 test->name, test->err, strerror(-test->err));
+
+		free(fdpath);
+		fflush(stdout);
+	}
+}
+
+#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \
+		   NUM_OPENAT2_FLAG_TESTS)
+
+int main(int argc, char **argv)
+{
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	test_openat2_struct();
+	test_openat2_flags();
+
+	if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+		ksft_exit_fail();
+	else
+		ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c
new file mode 100644
index 000000000000..0a770728b436
--- /dev/null
+++ b/tools/testing/selftests/openat2/rename_attack_test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/* Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- a/
+ * |   `-- c/
+ * `-- b/
+ */
+int setup_testdir(void)
+{
+	int dfd;
+	char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX";
+
+	/* Make the top-level directory. */
+	if (!mkdtemp(dirname))
+		ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+	dfd = open(dirname, O_PATH | O_DIRECTORY);
+	if (dfd < 0)
+		ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+	E_mkdirat(dfd, "a", 0755);
+	E_mkdirat(dfd, "b", 0755);
+	E_mkdirat(dfd, "a/c", 0755);
+
+	return dfd;
+}
+
+/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */
+pid_t spawn_attack(int dirfd, char *a, char *b)
+{
+	pid_t child = fork();
+	if (child != 0)
+		return child;
+
+	/* If the parent (the test process) dies, kill ourselves too. */
+	E_prctl(PR_SET_PDEATHSIG, SIGKILL);
+
+	/* Swap @a and @b. */
+	for (;;)
+		renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE);
+	exit(1);
+}
+
+#define NUM_RENAME_TESTS 2
+#define ROUNDS 400000
+
+const char *flagname(int resolve)
+{
+	switch (resolve) {
+	case RESOLVE_IN_ROOT:
+		return "RESOLVE_IN_ROOT";
+	case RESOLVE_BENEATH:
+		return "RESOLVE_BENEATH";
+	}
+	return "(unknown)";
+}
+
+void test_rename_attack(int resolve)
+{
+	int dfd, afd;
+	pid_t child;
+	void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+	int escapes = 0, other_errs = 0, exdevs = 0, eagains = 0, successes = 0;
+
+	struct open_how how = {
+		.flags = O_PATH,
+		.resolve = resolve,
+	};
+
+	if (!openat2_supported) {
+		how.resolve = 0;
+		ksft_print_msg("openat2(2) unsupported -- using openat(2) instead\n");
+	}
+
+	dfd = setup_testdir();
+	afd = openat(dfd, "a", O_PATH);
+	if (afd < 0)
+		ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n");
+
+	child = spawn_attack(dfd, "a/c", "b");
+
+	for (int i = 0; i < ROUNDS; i++) {
+		int fd;
+		char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../..";
+
+		if (openat2_supported)
+			fd = sys_openat2(afd, victim_path, &how);
+		else
+			fd = sys_openat(afd, victim_path, &how);
+
+		if (fd < 0) {
+			if (fd == -EAGAIN)
+				eagains++;
+			else if (fd == -EXDEV)
+				exdevs++;
+			else if (fd == -ENOENT)
+				escapes++; /* escaped outside and got ENOENT... */
+			else
+				other_errs++; /* unexpected error */
+		} else {
+			if (fdequal(fd, afd, NULL))
+				successes++;
+			else
+				escapes++; /* we got an unexpected fd */
+		}
+		close(fd);
+	}
+
+	if (escapes > 0)
+		resultfn = ksft_test_result_fail;
+	ksft_print_msg("non-escapes: EAGAIN=%d EXDEV=%d E<other>=%d success=%d\n",
+		       eagains, exdevs, other_errs, successes);
+	resultfn("rename attack with %s (%d runs, got %d escapes)\n",
+		 flagname(resolve), ROUNDS, escapes);
+
+	/* Should be killed anyway, but might as well make sure. */
+	E_kill(child, SIGKILL);
+}
+
+#define NUM_TESTS NUM_RENAME_TESTS
+
+int main(int argc, char **argv)
+{
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	test_rename_attack(RESOLVE_BENEATH);
+	test_rename_attack(RESOLVE_IN_ROOT);
+
+	if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+		ksft_exit_fail();
+	else
+		ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c
new file mode 100644
index 000000000000..7a94b1da8e7b
--- /dev/null
+++ b/tools/testing/selftests/openat2/resolve_test.c
@@ -0,0 +1,523 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- procexe -> /proc/self/exe
+ * |-- procroot -> /proc/self/root
+ * |-- root/
+ * |-- mnt/ [mountpoint]
+ * |   |-- self -> ../mnt/
+ * |   `-- absself -> /mnt/
+ * |-- etc/
+ * |   `-- passwd
+ * |-- creatlink -> /newfile3
+ * |-- reletc -> etc/
+ * |-- relsym -> etc/passwd
+ * |-- absetc -> /etc/
+ * |-- abssym -> /etc/passwd
+ * |-- abscheeky -> /cheeky
+ * `-- cheeky/
+ *     |-- absself -> /
+ *     |-- self -> ../../root/
+ *     |-- garbageself -> /../../root/
+ *     |-- passwd -> ../cheeky/../cheeky/../etc/../etc/passwd
+ *     |-- abspasswd -> /../cheeky/../cheeky/../etc/../etc/passwd
+ *     |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd
+ *     `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd
+ */
+int setup_testdir(void)
+{
+	int dfd, tmpfd;
+	char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX";
+
+	/* Unshare and make /tmp a new directory. */
+	E_unshare(CLONE_NEWNS);
+	E_mount("", "/tmp", "", MS_PRIVATE, "");
+
+	/* Make the top-level directory. */
+	if (!mkdtemp(dirname))
+		ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+	dfd = open(dirname, O_PATH | O_DIRECTORY);
+	if (dfd < 0)
+		ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+	/* A sub-directory which is actually used for tests. */
+	E_mkdirat(dfd, "root", 0755);
+	tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY);
+	if (tmpfd < 0)
+		ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+	close(dfd);
+	dfd = tmpfd;
+
+	E_symlinkat("/proc/self/exe", dfd, "procexe");
+	E_symlinkat("/proc/self/root", dfd, "procroot");
+	E_mkdirat(dfd, "root", 0755);
+
+	/* There is no mountat(2), so use chdir. */
+	E_mkdirat(dfd, "mnt", 0755);
+	E_fchdir(dfd);
+	E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, "");
+	E_symlinkat("../mnt/", dfd, "mnt/self");
+	E_symlinkat("/mnt/", dfd, "mnt/absself");
+
+	E_mkdirat(dfd, "etc", 0755);
+	E_touchat(dfd, "etc/passwd");
+
+	E_symlinkat("/newfile3", dfd, "creatlink");
+	E_symlinkat("etc/", dfd, "reletc");
+	E_symlinkat("etc/passwd", dfd, "relsym");
+	E_symlinkat("/etc/", dfd, "absetc");
+	E_symlinkat("/etc/passwd", dfd, "abssym");
+	E_symlinkat("/cheeky", dfd, "abscheeky");
+
+	E_mkdirat(dfd, "cheeky", 0755);
+
+	E_symlinkat("/", dfd, "cheeky/absself");
+	E_symlinkat("../../root/", dfd, "cheeky/self");
+	E_symlinkat("/../../root/", dfd, "cheeky/garbageself");
+
+	E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd");
+	E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd");
+
+	E_symlinkat("../../../../../../../../../../../../../../etc/passwd",
+		    dfd, "cheeky/dotdotlink");
+	E_symlinkat("/../../../../../../../../../../../../../../etc/passwd",
+		    dfd, "cheeky/garbagelink");
+
+	return dfd;
+}
+
+struct basic_test {
+	const char *name;
+	const char *dir;
+	const char *path;
+	struct open_how how;
+	bool pass;
+	union {
+		int err;
+		const char *path;
+	} out;
+};
+
+#define NUM_OPENAT2_OPATH_TESTS 88
+
+void test_openat2_opath_tests(void)
+{
+	int rootfd, hardcoded_fd;
+	char *procselfexe, *hardcoded_fdpath;
+
+	E_asprintf(&procselfexe, "/proc/%d/exe", getpid());
+	rootfd = setup_testdir();
+
+	hardcoded_fd = open("/dev/null", O_RDONLY);
+	E_assert(hardcoded_fd >= 0, "open fd to hardcode");
+	E_asprintf(&hardcoded_fdpath, "self/fd/%d", hardcoded_fd);
+
+	struct basic_test tests[] = {
+		/** RESOLVE_BENEATH **/
+		/* Attempts to cross dirfd should be blocked. */
+		{ .name = "[beneath] jump to /",
+		  .path = "/",			.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] absolute link to $root",
+		  .path = "cheeky/absself",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] chained absolute links to $root",
+		  .path = "abscheeky/absself",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] jump outside $root",
+		  .path = "..",			.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] temporary jump outside $root",
+		  .path = "../root/",		.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] symlink temporary jump outside $root",
+		  .path = "cheeky/self",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] chained symlink temporary jump outside $root",
+		  .path = "abscheeky/self",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] garbage links to $root",
+		  .path = "cheeky/garbageself",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] chained garbage links to $root",
+		  .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Only relative paths that stay inside dirfd should work. */
+		{ .name = "[beneath] ordinary path to 'root'",
+		  .path = "root",		.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[beneath] ordinary path to 'etc'",
+		  .path = "etc",		.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "etc",		.pass = true },
+		{ .name = "[beneath] ordinary path to 'etc/passwd'",
+		  .path = "etc/passwd",		.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[beneath] relative symlink inside $root",
+		  .path = "relsym",		.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[beneath] chained-'..' relative symlink inside $root",
+		  .path = "cheeky/passwd",	.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[beneath] absolute symlink component outside $root",
+		  .path = "abscheeky/passwd",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] absolute symlink target outside $root",
+		  .path = "abssym",		.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] absolute path outside $root",
+		  .path = "/etc/passwd",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] cheeky absolute path outside $root",
+		  .path = "cheeky/abspasswd",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] chained cheeky absolute path outside $root",
+		  .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Tricky paths should fail. */
+		{ .name = "[beneath] tricky '..'-chained symlink outside $root",
+		  .path = "cheeky/dotdotlink",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] tricky absolute + '..'-chained symlink outside $root",
+		  .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] tricky garbage link outside $root",
+		  .path = "cheeky/garbagelink",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] tricky absolute + garbage link outside $root",
+		  .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+
+		/** RESOLVE_IN_ROOT **/
+		/* All attempts to cross the dirfd will be scoped-to-root. */
+		{ .name = "[in_root] jump to /",
+		  .path = "/",			.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[in_root] absolute symlink to /root",
+		  .path = "cheeky/absself",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[in_root] chained absolute symlinks to /root",
+		  .path = "abscheeky/absself",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[in_root] '..' at root",
+		  .path = "..",			.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[in_root] '../root' at root",
+		  .path = "../root/",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] relative symlink containing '..' above root",
+		  .path = "cheeky/self",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] garbage link to /root",
+		  .path = "cheeky/garbageself",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] chainged garbage links to /root",
+		  .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] relative path to 'root'",
+		  .path = "root",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] relative path to 'etc'",
+		  .path = "etc",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc",		.pass = true },
+		{ .name = "[in_root] relative path to 'etc/passwd'",
+		  .path = "etc/passwd",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] relative symlink to 'etc/passwd'",
+		  .path = "relsym",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] chained-'..' relative symlink to 'etc/passwd'",
+		  .path = "cheeky/passwd",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] chained-'..' absolute + relative symlink to 'etc/passwd'",
+		  .path = "abscheeky/passwd",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] absolute symlink to 'etc/passwd'",
+		  .path = "abssym",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] absolute path 'etc/passwd'",
+		  .path = "/etc/passwd",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] cheeky absolute path 'etc/passwd'",
+		  .path = "cheeky/abspasswd",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] chained cheeky absolute path 'etc/passwd'",
+		  .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky '..'-chained symlink outside $root",
+		  .path = "cheeky/dotdotlink",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky absolute + '..'-chained symlink outside $root",
+		  .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky absolute path + absolute + '..'-chained symlink outside $root",
+		  .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky garbage link outside $root",
+		  .path = "cheeky/garbagelink",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky absolute + garbage link outside $root",
+		  .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky absolute path + absolute + garbage link outside $root",
+		  .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		/* O_CREAT should handle trailing symlinks correctly. */
+		{ .name = "[in_root] O_CREAT of relative path inside $root",
+		  .path = "newfile1",		.how.flags = O_CREAT,
+						.how.mode = 0700,
+						.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "newfile1",	.pass = true },
+		{ .name = "[in_root] O_CREAT of absolute path",
+		  .path = "/newfile2",		.how.flags = O_CREAT,
+						.how.mode = 0700,
+						.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "newfile2",	.pass = true },
+		{ .name = "[in_root] O_CREAT of tricky symlink outside root",
+		  .path = "/creatlink",		.how.flags = O_CREAT,
+						.how.mode = 0700,
+						.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "newfile3",	.pass = true },
+
+		/** RESOLVE_NO_XDEV **/
+		/* Crossing *down* into a mountpoint is disallowed. */
+		{ .name = "[no_xdev] cross into $mnt",
+		  .path = "mnt",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] cross into $mnt/",
+		  .path = "mnt/",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] cross into $mnt/.",
+		  .path = "mnt/.",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Crossing *up* out of a mountpoint is disallowed. */
+		{ .name = "[no_xdev] goto mountpoint root",
+		  .dir = "mnt", .path = ".",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.path = "mnt",		.pass = true },
+		{ .name = "[no_xdev] cross up through '..'",
+		  .dir = "mnt", .path = "..",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] temporary cross up through '..'",
+		  .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] temporary relative symlink cross up",
+		  .dir = "mnt", .path = "self",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] temporary absolute symlink cross up",
+		  .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Jumping to "/" is ok, but later components cannot cross. */
+		{ .name = "[no_xdev] jump to / directly",
+		  .dir = "mnt", .path = "/",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.path = "/",		.pass = true },
+		{ .name = "[no_xdev] jump to / (from /) directly",
+		  .dir = "/", .path = "/",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.path = "/",		.pass = true },
+		{ .name = "[no_xdev] jump to / then proc",
+		  .path = "/proc/1",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] jump to / then tmp",
+		  .path = "/tmp",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Magic-links are blocked since they can switch vfsmounts. */
+		{ .name = "[no_xdev] cross through magic-link to self/root",
+		  .dir = "/proc", .path = "self/root", 	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,			.pass = false },
+		{ .name = "[no_xdev] cross through magic-link to self/cwd",
+		  .dir = "/proc", .path = "self/cwd",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,			.pass = false },
+		/* Except magic-link jumps inside the same vfsmount. */
+		{ .name = "[no_xdev] jump through magic-link to same procfs",
+		  .dir = "/proc", .path = hardcoded_fdpath, .how.resolve = RESOLVE_NO_XDEV,
+		  .out.path = "/proc",			    .pass = true, },
+
+		/** RESOLVE_NO_MAGICLINKS **/
+		/* Regular symlinks should work. */
+		{ .name = "[no_magiclinks] ordinary relative symlink",
+		  .path = "relsym",		.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.path = "etc/passwd",	.pass = true },
+		/* Magic-links should not work. */
+		{ .name = "[no_magiclinks] symlink to magic-link",
+		  .path = "procexe",		.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_magiclinks] normal path to magic-link",
+		  .path = "/proc/self/exe",	.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_magiclinks] normal path to magic-link with O_NOFOLLOW",
+		  .path = "/proc/self/exe",	.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.path = procselfexe,	.pass = true },
+		{ .name = "[no_magiclinks] symlink to magic-link path component",
+		  .path = "procroot/etc",	.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_magiclinks] magic-link path component",
+		  .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_magiclinks] magic-link path component with O_NOFOLLOW",
+		  .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW,
+						 .how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+
+		/** RESOLVE_NO_SYMLINKS **/
+		/* Normal paths should work. */
+		{ .name = "[no_symlinks] ordinary path to '.'",
+		  .path = ".",			.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[no_symlinks] ordinary path to 'root'",
+		  .path = "root",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[no_symlinks] ordinary path to 'etc'",
+		  .path = "etc",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "etc",		.pass = true },
+		{ .name = "[no_symlinks] ordinary path to 'etc/passwd'",
+		  .path = "etc/passwd",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "etc/passwd",	.pass = true },
+		/* Regular symlinks are blocked. */
+		{ .name = "[no_symlinks] relative symlink target",
+		  .path = "relsym",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] relative symlink component",
+		  .path = "reletc/passwd",	.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] absolute symlink target",
+		  .path = "abssym",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] absolute symlink component",
+		  .path = "absetc/passwd",	.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] cheeky garbage link",
+		  .path = "cheeky/garbagelink",	.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] cheeky absolute + garbage link",
+		  .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] cheeky absolute + absolute symlink",
+		  .path = "abscheeky/absself",	.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		/* Trailing symlinks with NO_FOLLOW. */
+		{ .name = "[no_symlinks] relative symlink with O_NOFOLLOW",
+		  .path = "relsym",		.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "relsym",		.pass = true },
+		{ .name = "[no_symlinks] absolute symlink with O_NOFOLLOW",
+		  .path = "abssym",		.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "abssym",		.pass = true },
+		{ .name = "[no_symlinks] trailing symlink with O_NOFOLLOW",
+		  .path = "cheeky/garbagelink",	.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "cheeky/garbagelink", .pass = true },
+		{ .name = "[no_symlinks] multiple symlink components with O_NOFOLLOW",
+		  .path = "abscheeky/absself",	.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] multiple symlink (and garbage link) components with O_NOFOLLOW",
+		  .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW,
+						   .how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+	};
+
+	BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_OPATH_TESTS);
+
+	for (int i = 0; i < ARRAY_LEN(tests); i++) {
+		int dfd, fd;
+		char *fdpath = NULL;
+		bool failed;
+		void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+		struct basic_test *test = &tests[i];
+
+		if (!openat2_supported) {
+			ksft_print_msg("openat2(2) unsupported\n");
+			resultfn = ksft_test_result_skip;
+			goto skip;
+		}
+
+		/* Auto-set O_PATH. */
+		if (!(test->how.flags & O_CREAT))
+			test->how.flags |= O_PATH;
+
+		if (test->dir)
+			dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY);
+		else
+			dfd = dup(rootfd);
+		E_assert(dfd, "failed to openat root '%s': %m", test->dir);
+
+		E_dup2(dfd, hardcoded_fd);
+
+		fd = sys_openat2(dfd, test->path, &test->how);
+		if (test->pass)
+			failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path));
+		else
+			failed = (fd != test->out.err);
+		if (fd >= 0) {
+			fdpath = fdreadlink(fd);
+			close(fd);
+		}
+		close(dfd);
+
+		if (failed) {
+			resultfn = ksft_test_result_fail;
+
+			ksft_print_msg("openat2 unexpectedly returned ");
+			if (fdpath)
+				ksft_print_msg("%d['%s']\n", fd, fdpath);
+			else
+				ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+		}
+
+skip:
+		if (test->pass)
+			resultfn("%s gives path '%s'\n", test->name,
+				 test->out.path ?: ".");
+		else
+			resultfn("%s fails with %d (%s)\n", test->name,
+				 test->out.err, strerror(-test->out.err));
+
+		fflush(stdout);
+		free(fdpath);
+	}
+
+	free(procselfexe);
+	close(rootfd);
+
+	free(hardcoded_fdpath);
+	close(hardcoded_fd);
+}
+
+#define NUM_TESTS NUM_OPENAT2_OPATH_TESTS
+
+int main(int argc, char **argv)
+{
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	/* NOTE: We should be checking for CAP_SYS_ADMIN here... */
+	if (geteuid() != 0)
+		ksft_exit_skip("all tests require euid == 0\n");
+
+	test_openat2_opath_tests();
+
+	if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+		ksft_exit_fail();
+	else
+		ksft_exit_pass();
+}
-- 
2.23.0

^ permalink raw reply related

* [PATCH RESEND v14 4/6] open: introduce openat2(2) syscall
From: Aleksa Sarai @ 2019-10-26 18:56 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha
In-Reply-To: <20191026185700.10708-1-cyphar@cyphar.com>

/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
Thus it's necessary to (at the very least) add an additional flag.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
              struct open_how *how, size_t size);

/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
    Used to specify openat(2)-style flags. However, any unknown flag
    bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
    will result in -EINVAL. In addition, this field is 64-bits wide to
    allow for more O_ flags than currently permitted with openat(2).

  mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  __padding
    Must be set to all zeroes.

  resolve
    Restrict path resolution (in contrast to O_* flags they affect all
    path components). The current set of flags are as follows (at the
    moment, all of the RESOLVE_ flags are implemented as just passing
    the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH       => LOOKUP_BENEATH
    RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it).

Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking automount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[4]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[4]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/

Suggested-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 CREDITS                                     |   4 +-
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/open.c                                   | 154 +++++++++++++++-----
 include/linux/fcntl.h                       |  12 +-
 include/linux/syscalls.h                    |   3 +
 include/uapi/asm-generic/unistd.h           |   5 +-
 include/uapi/linux/fcntl.h                  |  41 ++++++
 24 files changed, 201 insertions(+), 38 deletions(-)

diff --git a/CREDITS b/CREDITS
index 031605d46b4d..a048e001d726 100644
--- a/CREDITS
+++ b/CREDITS
@@ -3301,7 +3301,9 @@ S: France
 N: Aleksa Sarai
 E: cyphar@cyphar.com
 W: https://www.cyphar.com/
-D: `pids` cgroup subsystem
+D: /sys/fs/cgroup/pids
+D: openat2(2)
+S: Sydney, Australia
 
 N: Dipankar Sarma
 E: dipankar@in.ibm.com
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..9f374f7d9514 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 543	common	fspick				sys_fspick
 544	common	pidfd_open			sys_pidfd_open
 # 545 reserved for clone3
+547	common	openat2				sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..4ba54bc7e19a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+437	common	openat2				sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..8aa00ccb0b96 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		436
+#define __NR_compat_syscalls		438
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..57f6f592d460 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..8d36f2e2dc89 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+437	common	openat2				sys_openat2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..2559925f1924 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+437	common	openat2				sys_openat2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..c04385e60833 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+437	common	openat2				sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index e7c5ab38e403..68c9ec06851f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
 433	n32	fspick				sys_fspick
 434	n32	pidfd_open			sys_pidfd_open
 435	n32	clone3				__sys_clone3
+437	n32	openat2				sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 13cd66581f3b..42a72d010050 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
 433	n64	fspick				sys_fspick
 434	n64	pidfd_open			sys_pidfd_open
 435	n64	clone3				__sys_clone3
+437	n64	openat2				sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 353539ea4140..f114c4aed0ed 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
 433	o32	fspick				sys_fspick
 434	o32	pidfd_open			sys_pidfd_open
 435	o32	clone3				__sys_clone3
+437	o32	openat2				sys_openat2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 285ff516150c..b550ae9a7fea 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3_wrapper
+437	common	openat2				sys_openat2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 43f736ed47f2..a8b5ecb5b602 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	nospu	clone3				ppc_clone3
+437	common	openat2				sys_openat2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 3054e9c035a3..16b571c06161 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 433  common	fspick			sys_fspick			sys_fspick
 434  common	pidfd_open		sys_pidfd_open			sys_pidfd_open
 435  common	clone3			sys_clone3			sys_clone3
+437  common	openat2			sys_openat2			sys_openat2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..a7185cc18626 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+437	common	openat2				sys_openat2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..b11c19552022 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+437	common	openat2			sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3fe02546aed3..e5c022e9a5c4 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
 434	i386	pidfd_open		sys_pidfd_open			__ia32_sys_pidfd_open
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
+437	i386	openat2			sys_openat2			__ia32_sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..9035647ef236 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
 433	common	fspick			__x64_sys_fspick
 434	common	pidfd_open		__x64_sys_pidfd_open
 435	common	clone3			__x64_sys_clone3/ptregs
+437	common	openat2			__x64_sys_openat2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..f0a68013c038 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+437	common	openat2				sys_openat2
diff --git a/fs/open.c b/fs/open.c
index b62f5c0923a8..7cc6570201ca 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -955,25 +955,82 @@ struct file *open_with_fake_path(const struct path *path, int flags,
 }
 EXPORT_SYMBOL(open_with_fake_path);
 
-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
+#define WILL_CREATE(flags)	(flags & (O_CREAT | __O_TMPFILE))
+#define O_PATH_FLAGS		(O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
+
+static inline struct open_how build_open_how(int flags, umode_t mode)
+{
+	struct open_how how = {
+		.flags = flags & VALID_OPEN_FLAGS,
+		.mode = mode & S_IALLUGO,
+	};
+
+	/* O_TMPFILE beats O_PATH. */
+	if (how.flags & __O_TMPFILE)
+		how.flags &= ~O_PATH;
+	/* O_PATH beats everything else. */
+	if (how.flags & O_PATH)
+		how.flags &= O_PATH_FLAGS;
+	/* Modes should only be set for create-like flags. */
+	if (!WILL_CREATE(how.flags))
+		how.mode = 0;
+	return how;
+}
+
+static inline int build_open_flags(const struct open_how *how,
+				   struct open_flags *op)
 {
+	int flags = how->flags;
 	int lookup_flags = 0;
 	int acc_mode = ACC_MODE(flags);
 
+	/* Must never be set by userspace */
+	flags &= ~(FMODE_NONOTIFY | O_CLOEXEC);
+
 	/*
-	 * Clear out all open flags we don't know about so that we don't report
-	 * them in fcntl(F_GETFD) or similar interfaces.
+	 * Older syscalls implicitly clear all of the invalid flags or argument
+	 * values before calling build_open_flags(), but openat2(2) checks all
+	 * of its arguments.
 	 */
-	flags &= VALID_OPEN_FLAGS;
+	if (flags & ~VALID_OPEN_FLAGS)
+		return -EINVAL;
+	if (how->resolve & ~VALID_RESOLVE_FLAGS)
+		return -EINVAL;
+	if (memchr_inv(how->__padding, 0, sizeof(how->__padding)))
+		return -EINVAL;
+
+	/* Deal with the mode. */
+	if (WILL_CREATE(how->flags)) {
+		if (how->mode & ~S_IALLUGO)
+			return -EINVAL;
+	} else if (how->mode != 0) {
+		return -EINVAL;
+	}
 
-	if (flags & (O_CREAT | __O_TMPFILE))
-		op->mode = (mode & S_IALLUGO) | S_IFREG;
+	/*
+	 * In order to ensure programs get explicit errors when trying to use
+	 * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it
+	 * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we
+	 * have to require userspace to explicitly set it.
+	 */
+	if (flags & __O_TMPFILE) {
+		if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
+			return -EINVAL;
+		if (!(acc_mode & MAY_WRITE))
+			return -EINVAL;
+	}
+	if (flags & O_PATH) {
+		/* O_PATH only permits certain other flags to be set. */
+		if (flags & ~O_PATH_FLAGS)
+			return -EINVAL;
+		acc_mode = 0;
+	}
+
+	if (WILL_CREATE(flags))
+		op->mode = how->mode | S_IFREG;
 	else
 		op->mode = 0;
 
-	/* Must never be set by userspace */
-	flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;
-
 	/*
 	 * O_SYNC is implemented as __O_SYNC|O_DSYNC.  As many places only
 	 * check for O_DSYNC if the need any syncing at all we enforce it's
@@ -983,20 +1040,6 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 	if (flags & __O_SYNC)
 		flags |= O_DSYNC;
 
-	if (flags & __O_TMPFILE) {
-		if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
-			return -EINVAL;
-		if (!(acc_mode & MAY_WRITE))
-			return -EINVAL;
-	} else if (flags & O_PATH) {
-		/*
-		 * If we have O_PATH in the open flag. Then we
-		 * cannot have anything other than the below set of flags
-		 */
-		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
-		acc_mode = 0;
-	}
-
 	op->open_flag = flags;
 
 	/* O_TRUNC implies we need access checks for write permissions */
@@ -1022,6 +1065,18 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+
+	if (how->resolve & RESOLVE_NO_XDEV)
+		lookup_flags |= LOOKUP_NO_XDEV;
+	if (how->resolve & RESOLVE_NO_MAGICLINKS)
+		lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (how->resolve & RESOLVE_NO_SYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (how->resolve & RESOLVE_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (how->resolve & RESOLVE_IN_ROOT)
+		lookup_flags |= LOOKUP_IN_ROOT;
+
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
@@ -1040,8 +1095,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 struct file *file_open_name(struct filename *name, int flags, umode_t mode)
 {
 	struct open_flags op;
-	int err = build_open_flags(flags, mode, &op);
-	return err ? ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op);
+	struct open_how how = build_open_how(flags, mode);
+	int err = build_open_flags(&how, &op);
+	if (err)
+		return ERR_PTR(err);
+	return do_filp_open(AT_FDCWD, name, &op);
 }
 
 /**
@@ -1072,17 +1130,19 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 			    const char *filename, int flags, umode_t mode)
 {
 	struct open_flags op;
-	int err = build_open_flags(flags, mode, &op);
+	struct open_how how = build_open_how(flags, mode);
+	int err = build_open_flags(&how, &op);
 	if (err)
 		return ERR_PTR(err);
 	return do_file_open_root(dentry, mnt, filename, &op);
 }
 EXPORT_SYMBOL(file_open_root);
 
-long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
+static long do_sys_openat2(int dfd, const char __user *filename,
+			   struct open_how *how)
 {
 	struct open_flags op;
-	int fd = build_open_flags(flags, mode, &op);
+	int fd = build_open_flags(how, &op);
 	struct filename *tmp;
 
 	if (fd)
@@ -1092,7 +1152,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 	if (IS_ERR(tmp))
 		return PTR_ERR(tmp);
 
-	fd = get_unused_fd_flags(flags);
+	fd = get_unused_fd_flags(how->flags);
 	if (fd >= 0) {
 		struct file *f = do_filp_open(dfd, tmp, &op);
 		if (IS_ERR(f)) {
@@ -1107,12 +1167,16 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 	return fd;
 }
 
-SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
+long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 {
-	if (force_o_largefile())
-		flags |= O_LARGEFILE;
+	struct open_how how = build_open_how(flags, mode);
+	return do_sys_openat2(dfd, filename, &how);
+}
 
-	return do_sys_open(AT_FDCWD, filename, flags, mode);
+
+SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
+{
+	return ksys_open(filename, flags, mode);
 }
 
 SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
@@ -1120,10 +1184,32 @@ SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
 {
 	if (force_o_largefile())
 		flags |= O_LARGEFILE;
-
 	return do_sys_open(dfd, filename, flags, mode);
 }
 
+SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
+		struct open_how __user *, how, size_t, usize)
+{
+	int err;
+	struct open_how tmp;
+
+	BUILD_BUG_ON(sizeof(struct open_how) < OPEN_HOW_SIZE_VER0);
+	BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_LATEST);
+
+	if (unlikely(usize < OPEN_HOW_SIZE_VER0))
+		return -EINVAL;
+
+	err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
+	if (err)
+		return err;
+
+	/* O_LARGEFILE is only allowed for non-O_PATH. */
+	if (!(tmp.flags & O_PATH) && force_o_largefile())
+		tmp.flags |= O_LARGEFILE;
+
+	return do_sys_openat2(dfd, filename, &tmp);
+}
+
 #ifdef CONFIG_COMPAT
 /*
  * Exactly like sys_open(), except that it doesn't set the
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..f2eb05bd3af3 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -2,15 +2,25 @@
 #ifndef _LINUX_FCNTL_H
 #define _LINUX_FCNTL_H
 
+#include <linux/stat.h>
 #include <uapi/linux/fcntl.h>
 
-/* list of all valid flags for the open/openat flags argument: */
+/* List of all valid flags for the open/openat flags argument: */
 #define VALID_OPEN_FLAGS \
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
 	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
 
+/* List of all valid flags for the how->upgrade_mask argument: */
+#define VALID_UPGRADE_FLAGS \
+	(UPGRADE_NOWRITE | UPGRADE_NOREAD)
+
+/* List of all valid flags for the how->resolve argument: */
+#define VALID_RESOLVE_FLAGS \
+	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
+	 RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+
 #ifndef force_o_largefile
 #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
 #endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f7c561c4dcdd..808f103b7a62 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct rseq;
 union bpf_attr;
 struct io_uring_params;
 struct clone_args;
+struct open_how;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -439,6 +440,8 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
 asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
 asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 			   umode_t mode);
+asmlinkage long sys_openat2(int dfd, const char __user *filename,
+			    struct open_how *how, size_t size);
 asmlinkage long sys_close(unsigned int fd);
 asmlinkage long sys_vhangup(void);
 
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1fc8faa6e973..d4122c091472 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
 
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
+
 #undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 438
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..5de8b0006a95 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,46 @@
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+/*
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. @mode must be zero unless one of {O_CREAT,
+ * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+	__aligned_u64 flags;
+	__u16 mode;
+	__u16 __padding[3]; /* must be zeroed */
+	__aligned_u64 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER0
+
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
+					(includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS	0x02 /* Block traversal through procfs-style
+					"magic-links". */
+#define RESOLVE_NO_SYMLINKS	0x04 /* Block traversal through all symlinks
+					(implies OEXT_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH		0x08 /* Block "lexical" trickery like
+					"..", symlinks, and absolute
+					paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
+					be scoped inside the dirfd
+					(similar to chroot(2)). */
+
+/* how->upgrade flags for openat2(2). */
+/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
+#define UPGRADE_NOREAD		0x02 /* Block re-opening with MAY_READ. */
+#define UPGRADE_NOWRITE		0x04 /* Block re-opening with MAY_WRITE. */
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.23.0

^ permalink raw reply related

* [PATCH RESEND v14 3/6] namei: permit ".." resolution with LOOKUP_{IN_ROOT,BENEATH}
From: Aleksa Sarai @ 2019-10-26 18:56 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha
In-Reply-To: <20191026185700.10708-1-cyphar@cyphar.com>

This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
are still disallowed entirely[*].

The need for this patch (and the original no-".." restriction) is
explained by observing there is a fairly easy-to-exploit race condition
with chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH
if ".." is allowed) where a rename(2) of a path can be used to "skip
over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat2(dirb, "b/c/../../etc/shadow",
              { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.

This is a first-pass implementation, where -EAGAIN will be returned if
any rename or mount occurs anywhere on the host (in any namespace). This
will result in spurious errors, but there isn't a satisfactory
alternative (other than denying ".." altogether).

One other possible alternative (which previous versions of this patch
used) would be to check with path_is_under() if there was a racing
rename or mount (after re-taking the relevant seqlocks). While this does
work, it results in possible O(n*m) behaviour if there are many renames
or mounts occuring *anywhere on the system*.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check (as
    with the alternative solution for "..") for magic-links after they
    are resolved. However this seems unlikely to be a feature that
    people *really* need -- it can be added later if it turns out a lot
    of people want it.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 9d00b138f54c..0d6857ac4e5b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -1769,22 +1769,35 @@ static inline int handle_dots(struct nameidata *nd, int type)
 	if (type == LAST_DOTDOT) {
 		int error = 0;
 
-		/*
-		 * Scoped-lookup flags resolving ".." is not currently safe --
-		 * races can cause our parent to have moved outside of the root
-		 * and us to skip over it.
-		 */
-		if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
-			return -EXDEV;
 		if (!nd->root.mnt) {
 			error = set_root(nd);
 			if (error)
 				return error;
 		}
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/*
+			 * If there was a racing rename or mount along our
+			 * path, then we can't be sure that ".." hasn't jumped
+			 * above nd->root (and so userspace should retry or use
+			 * some fallback).
+			 *
+			 * In future we could do a path_is_under() check here
+			 * instead, but there are O(n*m) performance
+			 * considerations with such a setup.
+			 */
+			if (unlikely(m_retry || r_retry))
+				return -EAGAIN;
+		}
 	}
 	return 0;
 }
@@ -2254,6 +2267,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2275,8 +2292,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
-
 	/* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
 	if (flags & LOOKUP_IN_ROOT)
 		while (*s == '/')
-- 
2.23.0

^ permalink raw reply related

* [PATCH RESEND v14 2/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution
From: Aleksa Sarai @ 2019-10-26 18:56 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha
In-Reply-To: <20191026185700.10708-1-cyphar@cyphar.com>

/* Background. */
Container runtimes or other administrative management processes will
often interact with root filesystems while in the host mount namespace,
because the cost of doing a chroot(2) on every operation is too
prohibitive (especially in Go, which cannot safely use vfork). However,
a malicious program can trick the management process into doing
operations on files outside of the root filesystem through careful
crafting of symlinks.

Most programs that need this feature have attempted to make this process
safe, by doing all of the path resolution in userspace (with symlinks
being scoped to the root of the malicious root filesystem).
Unfortunately, this method is prone to foot-guns and usually such
implementations have subtle security bugs.

Thus, what userspace needs is a way to resolve a path as though it were
in a chroot(2) -- with all absolute symlinks being resolved relative to
the dirfd root (and ".." components being stuck under the dirfd root[1])
It is much simpler and more straight-forward to provide this
functionality in-kernel (because it can be done far more cheaply and
correctly).

More classical applications that also have this problem (which have
their own potentially buggy userspace path sanitisation code) include
web servers, archive extraction tools, network file servers, and so on.

[1]: At the moment, ".." and magic-link jumping are disallowed for the
     same reason it is disabled for LOOKUP_BENEATH -- currently it is
     not safe to allow it. Future patches may enable it unconditionally
     once we have resolved the possible races (for "..") and semantics
     (for magic-link jumping).

/* Userspace API. */
LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).

There is a slight change in behaviour regarding pathnames -- if the
pathname is absolute then the dirfd is still used as the root of
resolution of LOOKUP_IN_ROOT is specified (this is to avoid obvious
foot-guns, at the cost of a minor API inconsistency).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 5 +++++
 include/linux/namei.h | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 54fdbdfbeb94..9d00b138f54c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2277,6 +2277,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)

 	nd->m_seq = read_seqbegin(&mount_lock);

+	/* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
+	if (flags & LOOKUP_IN_ROOT)
+		while (*s == '/')
+			s++;
+
 	/* Figure out the starting path and root (if needed). */
 	if (*s == '/') {
 		error = nd_jump_root(nd);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 35a1bf074ff1..c7a010570d05 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,8 +47,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_MAGICLINKS	0x080000 /* No /proc/$pid/fd/ "symlink" crossing. */
 #define LOOKUP_NO_SYMLINKS	0x100000 /* No symlink crossing *at all*.
 					    Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_IN_ROOT		0x200000 /* Treat dirfd as %current->fs->root. */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
-#define LOOKUP_IS_SCOPED LOOKUP_BENEATH
+#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)

 extern int path_pts(struct path *path);

-- 
2.23.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox