Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: Detecting the availability of VSYSCALL
From: Florian Weimer @ 2019-06-26 15:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Thomas Gleixner, Linux API, Kernel Hardening,
	linux-x86_64, linux-arch, Kees Cook, Carlos O'Donell, X86 ML
In-Reply-To: <CA96B819-30A9-43D3-9FE3-2D551D35369E@amacapital.net>

* Andy Lutomirski:

> I didn’t add a flag because the vsyscall page was thoroughly obsolete
> when all this happened, and I wanted to encourage all new code to just
> parse the vDSO instead of piling on the hacks.

It turned out that the thorny cases just switched to system calls
instead.  I think we finally completed the transition in glibc upstream
in 2018 (for x86).

> Anyway, you may be the right person to ask: is there some credible way
> that the kernel could detect new binaries that don’t need vsyscalls?
> Maybe a new ELF note on a static binary or on the ELF interpreter? We
> can dynamically switch it in principle.

For this kind of change, markup similar to PT_GNU_STACK would have been
appropriate, I think: Old kernels and loaders would have ignored the
program header and loaded the program anyway, but the vsyscall page
still existed, so that would have been fine. The kernel would have
needed to check the program interpreter or the main executable (without
a program interpreter, i.e., the statically linked case).  Due the way
the vsyscalls are concentrated in glibc, a dynamically linked executable
would not have needed checking (or re-linking).  I don't think we would
have implemented the full late enablement after dlopen we did for
executable stacks.  In theory, any code could have jumped to the
vsyscall area, but in practice, it's just dynamically linked glibc and
static binaries.

But nowadays, unmarked glibcs which do not depend on vsyscall vastly
outnumber unmarked glibcs which requrie it.  Therefore, markup of
binaries does not seem to be reasonable to day.  I could imagine a
personality flag you can set (if yoy have CAP_SYS_ADMIN) that re-enables
vsyscall support for new subprocesses.  And a container runtime would do
this based on metadata found in the image.  This way, the container host
itself could be protected, and you could still run legacy images which
require vsyscall.

For the non-container case, if you know that you'll run legacy
workloads, you'd still have the boot parameter.  But I think it could
default to vsyscall=none in many more cases.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 00/25] VFS: Introduce filesystem information query syscall [ver #14]
From: Christian Brauner @ 2019-06-26 14:50 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <9360.1561559497@warthog.procyon.org.uk>

On Wed, Jun 26, 2019 at 03:31:37PM +0100, David Howells wrote:
> Christian Brauner <christian@brauner.io> wrote:
> 
> > And I also very much recommend to remove any potential cross-dependency
> > between the fsinfo() and the notification patchset.
> 
> The problem with that is that to make the notification patchset useful for
> mount notifications, you need some information that you would obtain through
> fsinfo().

But would it really be that bad if you'd just land fsinfo() and then
focus on the notification stuff. This very much rather looks like a
timing issue than a conceptual one, i.e. you could very much just push
fsinfo() and leave the notification stuff alone until that is done.

Once fsinfo() has landed you can then go on to put additional bits you
need from or for fsinfo() for the notification patchset in there. Seems
you have at least sketched both APIs sufficiently that you know what you
need to look out for to not cause any regressions later on when you need
to expand them.

Christian

^ permalink raw reply

* Re: [PATCH 00/25] VFS: Introduce filesystem information query syscall [ver #14]
From: David Howells @ 2019-06-26 14:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, viro, raven, mszeredi, linux-api, linux-fsdevel,
	linux-kernel
In-Reply-To: <20190626131902.6xat2ab65arc62td@brauner.io>

Christian Brauner <christian@brauner.io> wrote:

> And I also very much recommend to remove any potential cross-dependency
> between the fsinfo() and the notification patchset.

The problem with that is that to make the notification patchset useful for
mount notifications, you need some information that you would obtain through
fsinfo().

David

^ permalink raw reply

* Re: Detecting the availability of VSYSCALL
From: Andy Lutomirski @ 2019-06-26 14:15 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, Thomas Gleixner, Linux API, Kernel Hardening,
	linux-x86_64, linux-arch, Kees Cook, Carlos O'Donell, X86 ML
In-Reply-To: <87o92kmtp5.fsf@oldenburg2.str.redhat.com>



> On Jun 26, 2019, at 5:12 AM, Florian Weimer <fweimer@redhat.com> wrote:
> 
> * Andy Lutomirski:
> 
>>> On Tue, Jun 25, 2019 at 1:47 PM Florian Weimer <fweimer@redhat.com> wrote:
>>> 
>>> * Andy Lutomirski:
>>> 
>>>>> We want binaries that run fast on VSYSCALL kernels, but can fall back to
>>>>> full system calls on kernels that do not have them (instead of
>>>>> crashing).
>>>> 
>>>> Define "VSYSCALL kernels."  On any remotely recent kernel (*all* new
>>>> kernels and all kernels for the last several years that haven't
>>>> specifically requested vsyscall=native), using vsyscalls is much, much
>>>> slower than just doing syscalls.  I know a way you can tell whether
>>>> vsyscalls are fast, but it's unreliable, and I'm disinclined to
>>>> suggest it.  There are also at least two pending patch series that
>>>> will interfere.
>>> 
>>> The fast path is for the benefit of the 2.6.32-based kernel in Red Hat
>>> Enterprise Linux 6.  It doesn't have the vsyscall emulation code yet, I
>>> think.
>>> 
>>> My hope is to produce (statically linked) binaries that run as fast on
>>> that kernel as they run today, but can gracefully fall back to something
>>> else on kernels without vsyscall support.
>>> 
>>>>> We could parse the vDSO and prefer the functions found there, but this
>>>>> is for the statically linked case.  We currently do not have a (minimal)
>>>>> dynamic loader there in that version of the code base, so that doesn't
>>>>> really work for us.
>>>> 
>>>> Is anything preventing you from adding a vDSO parser?  I wrote one
>>>> just for this type of use:
>>>> 
>>>> $ wc -l tools/testing/selftests/vDSO/parse_vdso.c
>>>> 269 tools/testing/selftests/vDSO/parse_vdso.c
>>>> 
>>>> (289 lines includes quite a bit of comment.)
>>> 
>>> I'm worried that if I use a custom parser and the binaries start
>>> crashing again because something changed in the kernel (within the scope
>>> permitted by the ELF specification), the kernel won't be fixed.
>>> 
>>> That is, we'd be in exactly the same situation as today.
>> 
>> With my maintainer hat on, the kernel won't do that.  Obviously a
>> review of my parser would be appreciated, but I consider it to be
>> fully supported, just like glibc and musl's parsers are fully
>> supported.  Sadly, I *also* consider the version Go forked for a while
>> (now fixed) to be supported.  Sigh.
> 
> We've been burnt once, otherwise we wouldn't be having this
> conversation.  It's not just what the kernel does by default; if it's
> configurable, it will be disabled by some, and if it's label as
> “security hardening”, the userspace ABI promise is suddenly forgotten
> and it's all userspace's fault for not supporting the new way.
> 
> It looks like parsing the vDSO is the only way forward, and we have to
> move in that direction if we move at all.
> 
> It's tempting to read the machine code on the vsyscall page and analyze
> that, but vsyscall=none behavior changed at one point, and you no longer
> any mapping there at all.  So that doesn't work, either.

It’s worse than that. I have patches to make the vsyscall be execute-only. And the slowly forthcoming CET patches will change the machine code.

> 
> I do hope the next userspace ABI break will have an option to undo it on
> a per-container basis.  Or at least a flag to detect it.
> 

I didn’t add a flag because the vsyscall page was thoroughly obsolete when all this happened, and I wanted to encourage all new code to just parse the vDSO instead of piling on the hacks.

Anyway, you may be the right person to ask: is there some credible way that the kernel could detect new binaries that don’t need vsyscalls?  Maybe a new ELF note on a static binary or on the ELF interpreter? We can dynamically switch it in principle.

^ permalink raw reply

* Re: [PATCH 00/25] VFS: Introduce filesystem information query syscall [ver #14]
From: Christian Brauner @ 2019-06-26 13:19 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190626100525.irdehd24jowz5f75@brauner.io>

On Wed, Jun 26, 2019 at 12:05:25PM +0200, Christian Brauner wrote:
> On Mon, Jun 24, 2019 at 03:08:45PM +0100, David Howells wrote:
> > 
> > Hi Al,
> > 
> > Here are a set of patches that adds a syscall, fsinfo(), that allows
> > attributes of a filesystem/superblock to be queried.  Attribute values are
> > of four basic types:
> > 
> >  (1) Version dependent-length structure (size defined by type).
> > 
> >  (2) Variable-length string (up to PAGE_SIZE).
> > 
> >  (3) Array of fixed-length structures (up to INT_MAX size).
> > 
> >  (4) Opaque blob (up to INT_MAX size).
> > 
> > Attributes can have multiple values in up to two dimensions and all the
> > values of a particular attribute must have the same type.
> > 
> > Note that the attribute values *are* allowed to vary between dentries
> > within a single superblock, depending on the specific dentry that you're
> > looking at.
> > 
> > I've tried to make the interface as light as possible, so integer/enum
> > attribute selector rather than string and the core does all the allocation
> > and extensibility support work rather than leaving that to the filesystems.
> > That means that for the first two attribute types, sb->s_op->fsinfo() may
> > assume that the provided buffer is always present and always big enough.
> > 
> > Further, this removes the possibility of the filesystem gaining access to the
> > userspace buffer.
> > 
> > 
> > fsinfo() allows a variety of information to be retrieved about a filesystem
> > and the mount topology:
> > 
> >  (1) General superblock attributes:
> > 
> >       - The amount of space/free space in a filesystem (as statfs()).
> >       - Filesystem identifiers (UUID, volume label, device numbers, ...)
> >       - The limits on a filesystem's capabilities
> >       - Information on supported statx fields and attributes and IOC flags.
> >       - A variety single-bit flags indicating supported capabilities.
> >       - Timestamp resolution and range.
> >       - Sources (as per mount(2), but fsconfig() allows multiple sources).
> >       - In-filesystem filename format information.
> >       - Filesystem parameters ("mount -o xxx"-type things).
> >       - LSM parameters (again "mount -o xxx"-type things).
> > 
> >  (2) Filesystem-specific superblock attributes:
> > 
> >       - Server names and addresses.
> >       - Cell name.
> > 
> >  (3) Filesystem configuration metadata attributes:
> > 
> >       - Filesystem parameter type descriptions.
> >       - Name -> parameter mappings.
> >       - Simple enumeration name -> value mappings.
> > 
> >  (4) Mount topology:
> > 
> >       - General information about a mount object.
> >       - Mount device name(s).
> >       - Children of a mount object and their relative paths.
> > 
> >  (5) Information about what the fsinfo() syscall itself supports, including
> >      the number of attibutes supported and the number of capability bits
> >      supported.
> 
> Phew, this patchset is a lot. It's good of course but can we please cut
> some of the more advanced features such as querying by mount id,
> submounts etc. pp. for now?
> I feel this would help with review and since your interface is
> extensible it's really not a big deal if we defer fancy features to
> later cycles after people had more time to review and the interface has
> seen some exposure.
> 
> The mount api changes over the last months have honestly been so huge
> that any chance to make the changes smaller and easier to digest we
> should take. (I'm really not complaining. Good that the work is done and
> it's entirely ok that it's a lot of code.)
> 
> It would also be great if after you have dropped some stuff from this
> patchset and gotten an Ack we could stuff it into linux-next for some
> time because it hasn't been so far...

And I also very much recommend to remove any potential cross-dependency
between the fsinfo() and the notification patchset.
Ideally, I'd like to see fsinfo() to be completely independent to not
block it on something way more controversial.
Furthermore, I can't possibly keep the context of another huge patchset
not yet merged in the back of my mind while reviewing this patchset. :)

Christian

^ permalink raw reply

* Re: [PATCH V33 29/30] tracefs: Restrict tracefs when the kernel is locked down
From: Steven Rostedt @ 2019-06-26 13:07 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security, linux-kernel, linux-api, Matthew Garrett
In-Reply-To: <20190621011941.186255-30-matthewgarrett@google.com>

On Thu, 20 Jun 2019 18:19:40 -0700
Matthew Garrett <matthewgarrett@google.com> wrote:

> Tracefs may release more information about the kernel than desirable, so
> restrict it when the kernel is locked down in confidentiality mode by
> preventing open().
> 
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> ---
>  fs/tracefs/inode.c           | 41 +++++++++++++++++++++++++++++++++++-
>  include/linux/security.h     |  1 +
>  security/lockdown/lockdown.c |  1 +
>  3 files changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index 7098c49f3693..f6c04fa8e415 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -24,6 +24,7 @@
>  #include <linux/parser.h>
>  #include <linux/magic.h>
>  #include <linux/slab.h>
> +#include <linux/security.h>
>  
>  #define TRACEFS_DEFAULT_MODE	0700
>  
> @@ -31,6 +32,21 @@ static struct vfsmount *tracefs_mount;
>  static int tracefs_mount_count;
>  static bool tracefs_registered;
>  
> +static int default_open_file(struct inode *inode, struct file *filp)
> +{
> +	struct dentry *dentry = filp->f_path.dentry;
> +	struct file_operations *real_fops;
> +
> +	if (!dentry)
> +		return -EINVAL;
> +
> +	if (security_is_locked_down(LOCKDOWN_TRACEFS))
> +		return -EPERM;
> +
> +	real_fops = dentry->d_fsdata;
> +	return real_fops->open(inode, filp);
> +}
> +
>  static ssize_t default_read_file(struct file *file, char __user *buf,
>  				 size_t count, loff_t *ppos)
>  {
> @@ -50,6 +66,13 @@ static const struct file_operations tracefs_file_operations = {
>  	.llseek =	noop_llseek,
>  };
>  
> +static const struct file_operations tracefs_proxy_file_operations = {
> +	.read =		default_read_file,
> +	.write =	default_write_file,
> +	.open =		default_open_file,
> +	.llseek =	noop_llseek,
> +};

This appears to be unused.

> +
>  static struct tracefs_dir_ops {
>  	int (*mkdir)(const char *name);
>  	int (*rmdir)(const char *name);
> @@ -225,6 +248,12 @@ static int tracefs_apply_options(struct super_block *sb)
>  	return 0;
>  }
>  
> +static void tracefs_destroy_inode(struct inode *inode)
> +{
> +	if (S_ISREG(inode->i_mode))
> +		kfree(inode->i_fop);
> +}
> +
>  static int tracefs_remount(struct super_block *sb, int *flags, char *data)
>  {
>  	int err;
> @@ -260,6 +289,7 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
>  
>  static const struct super_operations tracefs_super_operations = {
>  	.statfs		= simple_statfs,
> +	.destroy_inode  = tracefs_destroy_inode,
>  	.remount_fs	= tracefs_remount,
>  	.show_options	= tracefs_show_options,
>  };
> @@ -393,6 +423,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
>  {
>  	struct dentry *dentry;
>  	struct inode *inode;
> +	struct file_operations *proxy_fops;
>  
>  	if (!(mode & S_IFMT))
>  		mode |= S_IFREG;
> @@ -406,8 +437,16 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
>  	if (unlikely(!inode))
>  		return failed_creating(dentry);
>  
> +	proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
> +	if (!proxy_fops)
> +		return failed_creating(dentry);
> +
> +	dentry->d_fsdata = fops ? (void *)fops :
> +		(void *)&tracefs_file_operations;
> +	memcpy(proxy_fops, dentry->d_fsdata, sizeof(struct file_operations));
> +	proxy_fops->open = default_open_file;
>  	inode->i_mode = mode;
> -	inode->i_fop = fops ? fops : &tracefs_file_operations;
> +	inode->i_fop = proxy_fops;


I think the above would look cleaner as:


	if (!fops)
		fops = &tracefs_file_operations;

	dentry->d_fsdata = (void *)fops;
	memcpy(proxy_fops, fops, sizeof(*proxy_fops);
	proxy_fops->open = default_open_file;

-- Steve

>  	inode->i_private = data;
>  	d_instantiate(dentry, inode);
>  	fsnotify_create(dentry->d_parent->d_inode, dentry);
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 2563a9e3b415..040e7fc33397 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -100,6 +100,7 @@ enum lockdown_reason {
>  	LOCKDOWN_KPROBES,
>  	LOCKDOWN_BPF,
>  	LOCKDOWN_PERF,
> +	LOCKDOWN_TRACEFS,
>  	LOCKDOWN_CONFIDENTIALITY_MAX,
>  };
>  
> diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
> index a6f7b0770e78..7dc601f06cd3 100644
> --- a/security/lockdown/lockdown.c
> +++ b/security/lockdown/lockdown.c
> @@ -36,6 +36,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
>  	[LOCKDOWN_KPROBES] = "use of kprobes",
>  	[LOCKDOWN_BPF] = "use of bpf",
>  	[LOCKDOWN_PERF] = "unsafe use of perf",
> +	[LOCKDOWN_TRACEFS] = "use of tracefs",
>  	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
>  };
>  

^ permalink raw reply

* Re: [PATCH V33 21/30] x86/mmiotrace: Lock down the testmmiotrace module
From: Steven Rostedt @ 2019-06-26 12:46 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security, linux-kernel, linux-api, David Howells,
	Thomas Gleixner, Matthew Garrett, Ingo Molnar, H. Peter Anvin,
	x86
In-Reply-To: <20190621011941.186255-22-matthewgarrett@google.com>

On Thu, 20 Jun 2019 18:19:32 -0700
Matthew Garrett <matthewgarrett@google.com> wrote:

> From: David Howells <dhowells@redhat.com>
> 
> The testmmiotrace module shouldn't be permitted when the kernel is locked
> down as it can be used to arbitrarily read and write MMIO space. This is
> a runtime check rather than buildtime in order to allow configurations
> where the same kernel may be run in both locked down or permissive modes
> depending on local policy.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Howells <dhowells@redhat.com
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> cc: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

> cc: Ingo Molnar <mingo@kernel.org>
> cc: "H. Peter Anvin" <hpa@zytor.com>
> cc: x86@kernel.org
> ---

^ permalink raw reply

* Re: Detecting the availability of VSYSCALL
From: Florian Weimer @ 2019-06-26 12:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Linux API, Kernel Hardening, linux-x86_64,
	linux-arch, Kees Cook, Carlos O'Donell, X86 ML
In-Reply-To: <CALCETrUDt4v3=FqD+vseGTKTuG=qY+1LwRPrOrU8C7vCVbo=uA@mail.gmail.com>

* Andy Lutomirski:

> On Tue, Jun 25, 2019 at 1:47 PM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Andy Lutomirski:
>>
>> >> We want binaries that run fast on VSYSCALL kernels, but can fall back to
>> >> full system calls on kernels that do not have them (instead of
>> >> crashing).
>> >
>> > Define "VSYSCALL kernels."  On any remotely recent kernel (*all* new
>> > kernels and all kernels for the last several years that haven't
>> > specifically requested vsyscall=native), using vsyscalls is much, much
>> > slower than just doing syscalls.  I know a way you can tell whether
>> > vsyscalls are fast, but it's unreliable, and I'm disinclined to
>> > suggest it.  There are also at least two pending patch series that
>> > will interfere.
>>
>> The fast path is for the benefit of the 2.6.32-based kernel in Red Hat
>> Enterprise Linux 6.  It doesn't have the vsyscall emulation code yet, I
>> think.
>>
>> My hope is to produce (statically linked) binaries that run as fast on
>> that kernel as they run today, but can gracefully fall back to something
>> else on kernels without vsyscall support.
>>
>> >> We could parse the vDSO and prefer the functions found there, but this
>> >> is for the statically linked case.  We currently do not have a (minimal)
>> >> dynamic loader there in that version of the code base, so that doesn't
>> >> really work for us.
>> >
>> > Is anything preventing you from adding a vDSO parser?  I wrote one
>> > just for this type of use:
>> >
>> > $ wc -l tools/testing/selftests/vDSO/parse_vdso.c
>> > 269 tools/testing/selftests/vDSO/parse_vdso.c
>> >
>> > (289 lines includes quite a bit of comment.)
>>
>> I'm worried that if I use a custom parser and the binaries start
>> crashing again because something changed in the kernel (within the scope
>> permitted by the ELF specification), the kernel won't be fixed.
>>
>> That is, we'd be in exactly the same situation as today.
>
> With my maintainer hat on, the kernel won't do that.  Obviously a
> review of my parser would be appreciated, but I consider it to be
> fully supported, just like glibc and musl's parsers are fully
> supported.  Sadly, I *also* consider the version Go forked for a while
> (now fixed) to be supported.  Sigh.

We've been burnt once, otherwise we wouldn't be having this
conversation.  It's not just what the kernel does by default; if it's
configurable, it will be disabled by some, and if it's label as
“security hardening”, the userspace ABI promise is suddenly forgotten
and it's all userspace's fault for not supporting the new way.

It looks like parsing the vDSO is the only way forward, and we have to
move in that direction if we move at all.

It's tempting to read the machine code on the vsyscall page and analyze
that, but vsyscall=none behavior changed at one point, and you no longer
any mapping there at all.  So that doesn't work, either.

I do hope the next userspace ABI break will have an option to undo it on
a per-container basis.  Or at least a flag to detect it.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 00/25] VFS: Introduce filesystem information query syscall [ver #14]
From: Christian Brauner @ 2019-06-26 10:47 UTC (permalink / raw)
  To: Ian Kent
  Cc: David Howells, viro, mszeredi, linux-api, linux-fsdevel,
	linux-kernel
In-Reply-To: <cf0361c2d1fc09ad0097f0da1e981b97ad39ab07.camel@themaw.net>

On Wed, Jun 26, 2019 at 06:42:51PM +0800, Ian Kent wrote:
> On Wed, 2019-06-26 at 12:05 +0200, Christian Brauner wrote:
> > On Mon, Jun 24, 2019 at 03:08:45PM +0100, David Howells wrote:
> > > Hi Al,
> > > 
> > > Here are a set of patches that adds a syscall, fsinfo(), that allows
> > > attributes of a filesystem/superblock to be queried.  Attribute values are
> > > of four basic types:
> > > 
> > >  (1) Version dependent-length structure (size defined by type).
> > > 
> > >  (2) Variable-length string (up to PAGE_SIZE).
> > > 
> > >  (3) Array of fixed-length structures (up to INT_MAX size).
> > > 
> > >  (4) Opaque blob (up to INT_MAX size).
> > > 
> > > Attributes can have multiple values in up to two dimensions and all the
> > > values of a particular attribute must have the same type.
> > > 
> > > Note that the attribute values *are* allowed to vary between dentries
> > > within a single superblock, depending on the specific dentry that you're
> > > looking at.
> > > 
> > > I've tried to make the interface as light as possible, so integer/enum
> > > attribute selector rather than string and the core does all the allocation
> > > and extensibility support work rather than leaving that to the filesystems.
> > > That means that for the first two attribute types, sb->s_op->fsinfo() may
> > > assume that the provided buffer is always present and always big enough.
> > > 
> > > Further, this removes the possibility of the filesystem gaining access to
> > > the
> > > userspace buffer.
> > > 
> > > 
> > > fsinfo() allows a variety of information to be retrieved about a filesystem
> > > and the mount topology:
> > > 
> > >  (1) General superblock attributes:
> > > 
> > >       - The amount of space/free space in a filesystem (as statfs()).
> > >       - Filesystem identifiers (UUID, volume label, device numbers, ...)
> > >       - The limits on a filesystem's capabilities
> > >       - Information on supported statx fields and attributes and IOC flags.
> > >       - A variety single-bit flags indicating supported capabilities.
> > >       - Timestamp resolution and range.
> > >       - Sources (as per mount(2), but fsconfig() allows multiple sources).
> > >       - In-filesystem filename format information.
> > >       - Filesystem parameters ("mount -o xxx"-type things).
> > >       - LSM parameters (again "mount -o xxx"-type things).
> > > 
> > >  (2) Filesystem-specific superblock attributes:
> > > 
> > >       - Server names and addresses.
> > >       - Cell name.
> > > 
> > >  (3) Filesystem configuration metadata attributes:
> > > 
> > >       - Filesystem parameter type descriptions.
> > >       - Name -> parameter mappings.
> > >       - Simple enumeration name -> value mappings.
> > > 
> > >  (4) Mount topology:
> > > 
> > >       - General information about a mount object.
> > >       - Mount device name(s).
> > >       - Children of a mount object and their relative paths.
> > > 
> > >  (5) Information about what the fsinfo() syscall itself supports, including
> > >      the number of attibutes supported and the number of capability bits
> > >      supported.
> > 
> > Phew, this patchset is a lot. It's good of course but can we please cut
> > some of the more advanced features such as querying by mount id,
> > submounts etc. pp. for now?
> 
> Did you mean the "vfs: Allow fsinfo() to look up a mount object by ID"
> patch?
> 
> We would need to be very careful what was dropped.

Not dropped as in never implement but rather defer it by one merge
window to give us a) more time to review and settle the interface while
b) not stalling the overall patch.

> 
> For example, I've found that the patch above is pretty much essential
> for fsinfo() to be useful from user space.

Yeah, but that interface is not clearly defined yet as can be seen from
the commit message and that's what's bothering me most.

^ permalink raw reply

* Re: [PATCH 00/25] VFS: Introduce filesystem information query syscall [ver #14]
From: Ian Kent @ 2019-06-26 10:42 UTC (permalink / raw)
  To: Christian Brauner, David Howells
  Cc: viro, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190626100525.irdehd24jowz5f75@brauner.io>

On Wed, 2019-06-26 at 12:05 +0200, Christian Brauner wrote:
> On Mon, Jun 24, 2019 at 03:08:45PM +0100, David Howells wrote:
> > Hi Al,
> > 
> > Here are a set of patches that adds a syscall, fsinfo(), that allows
> > attributes of a filesystem/superblock to be queried.  Attribute values are
> > of four basic types:
> > 
> >  (1) Version dependent-length structure (size defined by type).
> > 
> >  (2) Variable-length string (up to PAGE_SIZE).
> > 
> >  (3) Array of fixed-length structures (up to INT_MAX size).
> > 
> >  (4) Opaque blob (up to INT_MAX size).
> > 
> > Attributes can have multiple values in up to two dimensions and all the
> > values of a particular attribute must have the same type.
> > 
> > Note that the attribute values *are* allowed to vary between dentries
> > within a single superblock, depending on the specific dentry that you're
> > looking at.
> > 
> > I've tried to make the interface as light as possible, so integer/enum
> > attribute selector rather than string and the core does all the allocation
> > and extensibility support work rather than leaving that to the filesystems.
> > That means that for the first two attribute types, sb->s_op->fsinfo() may
> > assume that the provided buffer is always present and always big enough.
> > 
> > Further, this removes the possibility of the filesystem gaining access to
> > the
> > userspace buffer.
> > 
> > 
> > fsinfo() allows a variety of information to be retrieved about a filesystem
> > and the mount topology:
> > 
> >  (1) General superblock attributes:
> > 
> >       - The amount of space/free space in a filesystem (as statfs()).
> >       - Filesystem identifiers (UUID, volume label, device numbers, ...)
> >       - The limits on a filesystem's capabilities
> >       - Information on supported statx fields and attributes and IOC flags.
> >       - A variety single-bit flags indicating supported capabilities.
> >       - Timestamp resolution and range.
> >       - Sources (as per mount(2), but fsconfig() allows multiple sources).
> >       - In-filesystem filename format information.
> >       - Filesystem parameters ("mount -o xxx"-type things).
> >       - LSM parameters (again "mount -o xxx"-type things).
> > 
> >  (2) Filesystem-specific superblock attributes:
> > 
> >       - Server names and addresses.
> >       - Cell name.
> > 
> >  (3) Filesystem configuration metadata attributes:
> > 
> >       - Filesystem parameter type descriptions.
> >       - Name -> parameter mappings.
> >       - Simple enumeration name -> value mappings.
> > 
> >  (4) Mount topology:
> > 
> >       - General information about a mount object.
> >       - Mount device name(s).
> >       - Children of a mount object and their relative paths.
> > 
> >  (5) Information about what the fsinfo() syscall itself supports, including
> >      the number of attibutes supported and the number of capability bits
> >      supported.
> 
> Phew, this patchset is a lot. It's good of course but can we please cut
> some of the more advanced features such as querying by mount id,
> submounts etc. pp. for now?

Did you mean the "vfs: Allow fsinfo() to look up a mount object by ID"
patch?

We would need to be very careful what was dropped.

For example, I've found that the patch above is pretty much essential
for fsinfo() to be useful from user space.

> I feel this would help with review and since your interface is
> extensible it's really not a big deal if we defer fancy features to
> later cycles after people had more time to review and the interface has
> seen some exposure.
> 
> The mount api changes over the last months have honestly been so huge
> that any chance to make the changes smaller and easier to digest we
> should take. (I'm really not complaining. Good that the work is done and
> it's entirely ok that it's a lot of code.)
> 
> It would also be great if after you have dropped some stuff from this
> patchset and gotten an Ack we could stuff it into linux-next for some
> time because it hasn't been so far...
> 
> Christian

^ permalink raw reply

* Re: [PATCH 03/25] vfs: Allow fsinfo() to query what's in an fs_context [ver #14]
From: Christian Brauner @ 2019-06-26 10:06 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <6575.1561543379@warthog.procyon.org.uk>

On Wed, Jun 26, 2019 at 11:02:59AM +0100, David Howells wrote:
> Christian Brauner <christian@brauner.io> wrote:
> 
> > > +	ret = mutex_lock_interruptible(&fc->uapi_mutex);
> > > +	if (ret == 0) {
> > > +		ret = -EIO;
> > 
> > Why EIO when there's no root dentry?
> 
> Because I don't want to use ENODATA/EBADF and preferably not EINVAL and
> because the context you're accessing isn't in the correct state for you to be
> able to do that.  How about EBADFD ("File descriptor in bad state")?

Do we have that? If so that sounds good.

Christian

^ permalink raw reply

* Re: [PATCH 00/25] VFS: Introduce filesystem information query syscall [ver #14]
From: Christian Brauner @ 2019-06-26 10:05 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <156138532485.25627.7459410522109581052.stgit@warthog.procyon.org.uk>

On Mon, Jun 24, 2019 at 03:08:45PM +0100, David Howells wrote:
> 
> Hi Al,
> 
> Here are a set of patches that adds a syscall, fsinfo(), that allows
> attributes of a filesystem/superblock to be queried.  Attribute values are
> of four basic types:
> 
>  (1) Version dependent-length structure (size defined by type).
> 
>  (2) Variable-length string (up to PAGE_SIZE).
> 
>  (3) Array of fixed-length structures (up to INT_MAX size).
> 
>  (4) Opaque blob (up to INT_MAX size).
> 
> Attributes can have multiple values in up to two dimensions and all the
> values of a particular attribute must have the same type.
> 
> Note that the attribute values *are* allowed to vary between dentries
> within a single superblock, depending on the specific dentry that you're
> looking at.
> 
> I've tried to make the interface as light as possible, so integer/enum
> attribute selector rather than string and the core does all the allocation
> and extensibility support work rather than leaving that to the filesystems.
> That means that for the first two attribute types, sb->s_op->fsinfo() may
> assume that the provided buffer is always present and always big enough.
> 
> Further, this removes the possibility of the filesystem gaining access to the
> userspace buffer.
> 
> 
> fsinfo() allows a variety of information to be retrieved about a filesystem
> and the mount topology:
> 
>  (1) General superblock attributes:
> 
>       - The amount of space/free space in a filesystem (as statfs()).
>       - Filesystem identifiers (UUID, volume label, device numbers, ...)
>       - The limits on a filesystem's capabilities
>       - Information on supported statx fields and attributes and IOC flags.
>       - A variety single-bit flags indicating supported capabilities.
>       - Timestamp resolution and range.
>       - Sources (as per mount(2), but fsconfig() allows multiple sources).
>       - In-filesystem filename format information.
>       - Filesystem parameters ("mount -o xxx"-type things).
>       - LSM parameters (again "mount -o xxx"-type things).
> 
>  (2) Filesystem-specific superblock attributes:
> 
>       - Server names and addresses.
>       - Cell name.
> 
>  (3) Filesystem configuration metadata attributes:
> 
>       - Filesystem parameter type descriptions.
>       - Name -> parameter mappings.
>       - Simple enumeration name -> value mappings.
> 
>  (4) Mount topology:
> 
>       - General information about a mount object.
>       - Mount device name(s).
>       - Children of a mount object and their relative paths.
> 
>  (5) Information about what the fsinfo() syscall itself supports, including
>      the number of attibutes supported and the number of capability bits
>      supported.

Phew, this patchset is a lot. It's good of course but can we please cut
some of the more advanced features such as querying by mount id,
submounts etc. pp. for now?
I feel this would help with review and since your interface is
extensible it's really not a big deal if we defer fancy features to
later cycles after people had more time to review and the interface has
seen some exposure.

The mount api changes over the last months have honestly been so huge
that any chance to make the changes smaller and easier to digest we
should take. (I'm really not complaining. Good that the work is done and
it's entirely ok that it's a lot of code.)

It would also be great if after you have dropped some stuff from this
patchset and gotten an Ack we could stuff it into linux-next for some
time because it hasn't been so far...

Christian

^ permalink raw reply

* Re: [PATCH 03/25] vfs: Allow fsinfo() to query what's in an fs_context [ver #14]
From: David Howells @ 2019-06-26 10:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, viro, raven, mszeredi, linux-api, linux-fsdevel,
	linux-kernel
In-Reply-To: <20190625092728.z3jn3gbyopzcg2it@brauner.io>

Christian Brauner <christian@brauner.io> wrote:

> > +	ret = mutex_lock_interruptible(&fc->uapi_mutex);
> > +	if (ret == 0) {
> > +		ret = -EIO;
> 
> Why EIO when there's no root dentry?

Because I don't want to use ENODATA/EBADF and preferably not EINVAL and
because the context you're accessing isn't in the correct state for you to be
able to do that.  How about EBADFD ("File descriptor in bad state")?

David

^ permalink raw reply

* Re: [PATCH 01/25] vfs: syscall: Add fsinfo() to query filesystem information [ver #14]
From: Christian Brauner @ 2019-06-26  9:58 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <7560.1561542552@warthog.procyon.org.uk>

On Wed, Jun 26, 2019 at 10:49:12AM +0100, David Howells wrote:
> Christian Brauner <christian@brauner.io> wrote:
> 
> > > +	return sizeof(*p);
> > 
> > Hm, the discrepancy between the function signature returning int and
> > the sizeof operator most likely being size_t is bothering me. It
> > probably doesn't matter but maybe we can avoid that.
> 
> If sizeof(*p) exceeds 4096, the buffer is going to have been overrun by this
> point anyway.

Ok.

> 
> The function can't return size_t, though it could return ssize_t.  I could
> switch it to return long or even store the result in fsinfo_kparams::usage and
> return 0.
> 
> > > +	strlcpy(p->f_fs_name, path->dentry->d_sb->s_type->name,
> > > +		sizeof(p->f_fs_name));
> > 
> > Truncation is acceptable or impossible I assume?
> 
> I'm hoping that file_system_type::name isn't going to exceed 15 chars plus
> NUL.  If it does, it will be truncated.  I don't really want to add an
> individual attribute just for the filesystem driver name.
> 
> > > +#define _gen(X, Y) FSINFO_ATTR_##X: return fsinfo_generic_##Y(path, params->buffer)
> > 
> > I'm really not sure that this helps readability in the switch below... :)
> > 
> > > +
> > > +	switch (params->request) {
> > > +	case _gen(STATFS,		statfs);
> > > +	case _gen(IDS,			ids);
> > > +	case _gen(LIMITS,		limits);
> > > +	case _gen(SUPPORTS,		supports);
> > > +	case _gen(CAPABILITIES,		capabilities);
> > > +	case _gen(TIMESTAMP_INFO,	timestamp_info);
> > > ...
> 
> I'm trying to avoid having to spend multiple lines per case and tabulation
> makes things easier to read.  So
> 
> 	case FSINFO_ATTR_SUPPORTS:		return fsinfo_generic_supports(path, params->buffer);
> 	case FSINFO_ATTR_CAPABILITIES:		return fsinfo_generic_capabilities(path, params->buffer);
> 	case FSINFO_ATTR_TIMESTAMP_INFO:	return fsinfo_generic_timestamp_info(path, params->buffer);
> 
> is a bit on the long side per line, whereas:
> 
> 	case FSINFO_ATTR_SUPPORTS:
> 		return fsinfo_generic_supports(path, params->buffer);
> 	case FSINFO_ATTR_CAPABILITIES:
> 		return fsinfo_generic_capabilities(path, params->buffer);
> 	case FSINFO_ATTR_TIMESTAMP_INFO:
> 		return fsinfo_generic_timestamp_info(path, params->buffer);
> 
> is less readable by interleaving two of the three columns.  (Note that _gen is
> a actually third column as I introduce alternatives later).
> 
> > > +		if (ret <= (int)params->buf_size)
> > 
> > He, and this is where the return value discrepancy hits again. Just
> > doesn't look nice tbh. :)
> 
> No.  That's dealing with signed/unsigned comparison.  It might be better if I
> change this to:
> 
> 		if (IS_ERR_VALUE(ret))
> 			return ret; /* Error */
> 		if ((unsigned int)ret <= params->buf_size)
> 			return ret; /* It fitted */
> 
> In any case, buf_size isn't permitted to be larger than INT_MAX due to a check
> later in the loop.
> 
> > > +		kvfree(params->buffer);
> > 
> > That means callers should always memset fsinfo_kparams or this is an
> > invalid free...
> 
> vfs_info() isn't a public function.  And, in any case, the caller *must*
> provide a buffer here.
> 
> > > + * Return buffer information by requestable attribute.
> > > + *
> > > + * STRUCT indicates a fixed-size structure with only one instance.
> > > ...
> > I honestly have a hard time following the documentation here
> 
> How about:
> 
>  * STRUCT	- a fixed-size structure with only one instance.
>  * STRUCT_N	- a sequence of STRUCTs, indexed by Nth
>  * STRUCT_NM	- a sequence of sequences of STRUCTs, indexed by Nth, Mth
>  * STRING	- a string with only one instance.
>  * STRING_N	- a sequence of STRING, indexed by Nth
>  * STRING_NM	- a sequence of sequences of STRING, indexed by Nth, Mth
>  * OPAQUE	- a blob that can be larger than 4K.
>  * STRUCT_ARRAY - an array of structs that can be larger than 4K
> 
> > and that monster table/macro thing below.  For example, STRUCT_NM
> > corresponds to __FSINFO_NM or what?
> 
> STRUCT_NM -> .type = __FSINFO_STRUCT, .flags = __FSINFO_NM, .size = ...
> 
> If you think this is bad, you should try looking at the device ID tables used
> by the drivers and the attribute tables;-)
> 
> I could spell out the flag and type in the macro defs (such as the body of
> FSINFO_STRING(X,Y) for instance).  It would make it harder to compare macros
> as it wouldn't then tabulate, though.
> 
> > And is this uapi as you're using this in your samples/test below?
> 
> Not exactly.  Each attribute is defined as being a certain type in the
> documentation in the UAPI header, but this is not coded there.  The assumption
> being that if you're using a particular attribute, you'll know what the type
> of the attribute is and you'll structure your code appropriately.
> 
> The reason the sample code has this replicated is that it doesn't really
> attempt to interpret the type per se.  It has a dumper for an individual
> attribute value, but the table tells it whether there should be one of those,
> N of those or N of M(0), M(1), M(2), ... of those so that it can report an
> error if it doesn't see what it expects.
> 
> I could even cheaply provide a meta attribute that dumps the contents of the
> table (just the type info, not the names).
> 
> > > ...
> > > +	FSINFO_STRING		(NAME_ENCODING,		-),
> > > +	FSINFO_STRING		(NAME_CODEPAGE,		-),
> > > +};
> > 
> > Can I complain again that this is really annoying to parse.
> 
> Apparently you can;-)  What would you prefer?  This:
> 
> static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
> 	[FSINFO_STATFS] = {
> 		.type	= __FSINFO_STRUCT,
> 		.size	= sizeof(struct fsinfo_statfs),
> 	},
> 	[FSINFO_SERVERS] = {
> 		.type	= __FSINFO_STRUCT,
> 		.flags	= __FSINFO_NM,
> 		.size	= sizeof(struct fsinfo_server),
> 	},
> 	...
> };	
> 
> That has 3-5 lines for each 1 in the current code and isn't a great deal more
> readable.

Really, I find this more readable because parsing structs and arrays of
structs is probably still even more common for C programmers then
deciphering nested macros. :) But I won't enforce my own pov. :)

> 
> > if (copy_to_user()) and if (clear_user()) and not if (clear_user() != 0)
> 
> Better "if (copy_to_user() != 0)" since it's not a boolean return value in
> either case.
> 
> > Nit: There's a bunch of name inconsistency for the arguments between the
> > stub and the definition:
> > 
> > SYSCALL_DEFINE5(fsinfo,
> > 		int, dfd, const char __user *, filename,
> > 		struct fsinfo_params __user *, _params,
> > 		void __user *, user_buffer, size_t, user_buf_size)
> 
> Yeah.  C just doesn't care.
> 
> I'll change filename to pathname throughout.  That's at least consistent with
> various glibc manpages for other vfs syscalls.
> 
> _params I can change to params and params as-was to kparams.
> 
> But user_buffer and user_buf_size, I'll keep as I've named them such to avoid
> confusion with kparams->buffer and kparams->scratch_buffer.  However, I
> wouldn't want to call them that in the UAPI.

Yep, it's really just that I prefer the naming to be consistent. :)

> 
> > Do we do SPDX that way? Or isn't this just supposed to be:
> > // <spdxy stuff>
> 
> Look in, say, include/uapi/linux/stat.h or .../fs.h.
> 
> > > +	FSINFO_ATTR__NR
> > 
> > Nit/Bikeshed: FSINFO_ATTR_MAX? Seems more intuitive.
> 
> No.  That would imply a limit that it will never exceed.
> 
> > > +struct fsinfo_u128 {
> > > +#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> > > +	__u64	hi;
> > > +	__u64	lo;
> > > +#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
> > > +	__u64	lo;
> > > +	__u64	hi;
> > > +#endif
> > > +};
> > 
> > Hm, I know why you do this custom fsinfo_u128 thingy but for userspace
> > that is going to be annoying to operate with, e.g. comparing the
> > size/space of two filesystems etc.
> 
> We don't have a __u128 in the UAPI, and I'm reluctant to use __uint128_t.
> 
> Do you have a better suggestion?
> 
> > > +struct fsinfo_ids {
> > > +	char	f_fs_name[15 + 1];	/* Filesystem name */
> > 
> > You should probably make this a macro so userspace can use it in fs-name
> > length checks too.
> 
> The name length, you mean?  Well, you can use sizeof...
> 
> > > +	FSINFO_CAP__NR
> > 
> > Hm, again, maybe better to use FSINFO_CAP_MAX?
> 
> It's not a limit.

Well, in both cases it's giving the limit of currently supported
attributes. Other places in the kernel do the same (netlink for
example). Anyway, it probably doesn't matter that much.

Christian

^ permalink raw reply

* Re: [PATCH 10/25] vfs: Allow mount information to be queried by fsinfo() [ver #14]
From: Christian Brauner @ 2019-06-26  9:53 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <156138542011.25627.17245518783942568567.stgit@warthog.procyon.org.uk>

On Mon, Jun 24, 2019 at 03:10:20PM +0100, David Howells wrote:
> Allow mount information, including information about the topology tree to
> be queried with the fsinfo() system call.  Usage of AT_FSINFO_MOUNTID_PATH
> allows overlapping mounts to be queried.

Again, I really think that this doesn't need to land at the same time as
the basic infrastructure for fsinfo()...

> 
> To this end, four fsinfo() attributes are provided:
> 
>  (1) FSINFO_ATTR_MOUNT_INFO.
> 
>      This is a structure providing information about a mount, including:
> 
> 	- Mounted superblock ID.
> 	- Mount ID (as AT_FSINFO_MOUNTID_PATH).
> 	- Parent mount ID.
> 	- Mount attributes (eg. R/O, NOEXEC).
> 	- Number of change notifications generated.
> 
>      Note that the parent mount ID is overridden to the ID of the queried
>      mount if the parent lies outside of the chroot or dfd tree.
> 
>  (2) FSINFO_ATTR_MOUNT_DEVNAME.
> 
>      This a string providing the device name associated with the mount.
> 
>      Note that the device name may be a path that lies outside of the root.
> 
>  (3) FSINFO_ATTR_MOUNT_CHILDREN.
> 
>      This produces an array of structures, one for each child and capped
>      with one for the argument mount (checked after listing all the
>      children).  Each element contains the mount ID and the notification
>      counter of the respective mount object.
> 
>  (4) FSINFO_ATTR_MOUNT_SUBMOUNT.
> 
>      This is a 1D array of strings, indexed with struct fsinfo_params::Nth.
>      Each string is the relative pathname of the corresponding child
>      returned by FSINFO_ATTR_MOUNT_CHILDREN.
> 
>      Note that paths in the mount at the base of the tree (whether that be
>      dfd or chroot) are relative to the base of the tree, not the root
>      directory of that mount.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/d_path.c                 |    2 
>  fs/fsinfo.c                 |    8 ++
>  fs/internal.h               |    9 ++
>  fs/namespace.c              |  177 +++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fsinfo.h |   28 +++++++
>  samples/vfs/test-fsinfo.c   |   47 +++++++++++
>  6 files changed, 267 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/d_path.c b/fs/d_path.c
> index a7d0a96b35ce..89d77c264c5f 100644
> --- a/fs/d_path.c
> +++ b/fs/d_path.c
> @@ -227,7 +227,7 @@ static int prepend_unreachable(char **buffer, int *buflen)
>  	return prepend(buffer, buflen, "(unreachable)", 13);
>  }
>  
> -static void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
> +void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
>  {
>  	unsigned seq;
>  
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> index 3b218f9fedb7..61f00f375fd2 100644
> --- a/fs/fsinfo.c
> +++ b/fs/fsinfo.c
> @@ -348,6 +348,10 @@ int generic_fsinfo(struct path *path, struct fsinfo_kparams *params)
>  	case _genf(PARAM_SPECIFICATION,	param_specification);
>  	case _genf(PARAM_ENUM,		param_enum);
>  	case _genp(PARAMETERS,		parameters);
> +	case _genp(MOUNT_INFO,		mount_info);
> +	case _genp(MOUNT_DEVNAME,	mount_devname);
> +	case _genp(MOUNT_CHILDREN,	mount_children);
> +	case _genp(MOUNT_SUBMOUNT,	mount_submount);
>  	default:
>  		return -EOPNOTSUPP;
>  	}
> @@ -627,6 +631,10 @@ static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
>  	FSINFO_STRUCT_N		(PARAM_ENUM,		param_enum),
>  	FSINFO_OPAQUE		(PARAMETERS,		-),
>  	FSINFO_OPAQUE		(LSM_PARAMETERS,	-),
> +	FSINFO_STRUCT		(MOUNT_INFO,		mount_info),
> +	FSINFO_STRING		(MOUNT_DEVNAME,		mount_devname),
> +	FSINFO_STRUCT_ARRAY	(MOUNT_CHILDREN,	mount_child),
> +	FSINFO_STRING_N		(MOUNT_SUBMOUNT,	mount_submount),
>  };
>  
>  /**
> diff --git a/fs/internal.h b/fs/internal.h
> index 074b1c65e3bd..bb3d8efa7f49 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -53,6 +53,11 @@ void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
>   */
>  extern void __init chrdev_init(void);
>  
> +/*
> + * d_path.c
> + */
> +extern void get_fs_root_rcu(struct fs_struct *fs, struct path *root);
> +
>  /*
>   * fs_context.c
>   */
> @@ -98,6 +103,10 @@ extern void __mnt_drop_write_file(struct file *);
>  
>  extern void dissolve_on_fput(struct vfsmount *);
>  extern int lookup_mount_object(struct path *, int, struct path *);
> +extern int fsinfo_generic_mount_info(struct path *, struct fsinfo_kparams *);
> +extern int fsinfo_generic_mount_devname(struct path *, struct fsinfo_kparams *);
> +extern int fsinfo_generic_mount_children(struct path *, struct fsinfo_kparams *);
> +extern int fsinfo_generic_mount_submount(struct path *, struct fsinfo_kparams *);
>  
>  /*
>   * fs_struct.c
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 1450faab96b9..2ec7d9d1905a 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -29,6 +29,7 @@
>  #include <linux/sched/task.h>
>  #include <uapi/linux/mount.h>
>  #include <linux/fs_context.h>
> +#include <linux/fsinfo.h>
>  
>  #include "pnode.h"
>  #include "internal.h"
> @@ -4112,3 +4113,179 @@ int lookup_mount_object(struct path *root, int mnt_id, struct path *_mntpt)
>  	unlock_mount_hash();
>  	goto out_unlock;
>  }
> +
> +#ifdef CONFIG_FSINFO
> +int fsinfo_generic_mount_info(struct path *path, struct fsinfo_kparams *params)
> +{
> +	struct fsinfo_mount_info *p = params->buffer;
> +	struct super_block *sb;
> +	struct mount *m;
> +	struct path root;
> +	unsigned int flags;
> +
> +	if (!path->mnt)
> +		return -ENODATA;
> +
> +	m = real_mount(path->mnt);
> +	sb = m->mnt.mnt_sb;
> +
> +	p->f_sb_id		= sb->s_unique_id;
> +	p->mnt_id		= m->mnt_id;
> +	p->parent_id		= m->mnt_parent->mnt_id;
> +	p->notify_counter	= atomic_read(&m->mnt_notify_counter);
> +
> +	get_fs_root(current->fs, &root);
> +	if (path->mnt == root.mnt) {
> +		p->parent_id = p->mnt_id;
> +	} else {
> +		rcu_read_lock();
> +		if (!are_paths_connected(&root, path))
> +			p->parent_id = p->mnt_id;
> +		rcu_read_unlock();
> +	}
> +	if (IS_MNT_SHARED(m))
> +		p->group_id = m->mnt_group_id;
> +	if (IS_MNT_SLAVE(m)) {
> +		int master = m->mnt_master->mnt_group_id;
> +		int dom = get_dominating_id(m, &root);
> +		p->master_id = master;
> +		if (dom && dom != master)
> +			p->from_id = dom;
> +	}
> +	path_put(&root);
> +
> +	flags = READ_ONCE(m->mnt.mnt_flags);
> +	if (flags & MNT_READONLY)
> +		p->attr |= MOUNT_ATTR_RDONLY;
> +	if (flags & MNT_NOSUID)
> +		p->attr |= MOUNT_ATTR_NOSUID;
> +	if (flags & MNT_NODEV)
> +		p->attr |= MOUNT_ATTR_NODEV;
> +	if (flags & MNT_NOEXEC)
> +		p->attr |= MOUNT_ATTR_NOEXEC;
> +	if (flags & MNT_NODIRATIME)
> +		p->attr |= MOUNT_ATTR_NODIRATIME;
> +
> +	if (flags & MNT_NOATIME)
> +		p->attr |= MOUNT_ATTR_NOATIME;
> +	else if (flags & MNT_RELATIME)
> +		p->attr |= MOUNT_ATTR_RELATIME;
> +	else
> +		p->attr |= MOUNT_ATTR_STRICTATIME;
> +	return sizeof(*p);
> +}
> +
> +int fsinfo_generic_mount_devname(struct path *path, struct fsinfo_kparams *params)
> +{
> +	struct mount *m;
> +	size_t len;
> +
> +	if (!path->mnt)
> +		return -ENODATA;
> +
> +	m = real_mount(path->mnt);
> +	len = strlen(m->mnt_devname);
> +	memcpy(params->buffer, m->mnt_devname, len);
> +	return len;
> +}
> +
> +/*
> + * Store a mount record into the fsinfo buffer.
> + */
> +static void store_mount_fsinfo(struct fsinfo_kparams *params,
> +			       struct fsinfo_mount_child *child)
> +{
> +	unsigned int usage = params->usage;
> +	unsigned int total = sizeof(*child);
> +
> +	if (params->usage >= INT_MAX)
> +		return;
> +	params->usage = usage + total;
> +	if (params->buffer && params->usage <= params->buf_size)
> +		memcpy(params->buffer + usage, child, total);
> +}
> +
> +/*
> + * Return information about the submounts relative to path.
> + */
> +int fsinfo_generic_mount_children(struct path *path, struct fsinfo_kparams *params)
> +{
> +	struct fsinfo_mount_child record;
> +	struct mount *m, *child;
> +
> +	if (!path->mnt)
> +		return -ENODATA;
> +
> +	m = real_mount(path->mnt);
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
> +		if (child->mnt_parent != m)
> +			continue;
> +		record.mnt_id = child->mnt_id;
> +		record.notify_counter = atomic_read(&child->mnt_notify_counter);
> +		store_mount_fsinfo(params, &record);
> +	}
> +	rcu_read_unlock();
> +
> +	/* End the list with a copy of the parameter mount's details so that
> +	 * userspace can quickly check for changes.
> +	 */
> +	record.mnt_id = m->mnt_id;
> +	record.notify_counter = atomic_read(&m->mnt_notify_counter);
> +	store_mount_fsinfo(params, &record);
> +	return params->usage;
> +}
> +
> +/*
> + * Return the path of the Nth submount relative to path.  This is derived from
> + * d_path(), but the root determination is more complicated.
> + */
> +int fsinfo_generic_mount_submount(struct path *path, struct fsinfo_kparams *params)
> +{
> +	struct mountpoint *mp;
> +	struct mount *m, *child;
> +	struct path mountpoint, root;
> +	unsigned int n = params->Nth;
> +	size_t len;
> +	void *p;
> +
> +	if (!path->mnt)
> +		return -ENODATA;
> +
> +	rcu_read_lock();
> +
> +	m = real_mount(path->mnt);
> +	list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
> +		mp = READ_ONCE(child->mnt_mp);
> +		if (child->mnt_parent != m || !mp)
> +			continue;
> +		if (n-- == 0)
> +			goto found;
> +	}
> +	rcu_read_unlock();
> +	return -ENODATA;
> +
> +found:
> +	mountpoint.mnt = path->mnt;
> +	mountpoint.dentry = READ_ONCE(mp->m_dentry);
> +
> +	get_fs_root_rcu(current->fs, &root);
> +	if (root.mnt != path->mnt) {
> +		root.mnt = path->mnt;
> +		root.dentry = path->mnt->mnt_root;
> +	}
> +
> +	p = __d_path(&mountpoint, &root, params->buffer, params->buf_size);
> +	rcu_read_unlock();
> +
> +	if (IS_ERR(p))
> +		return PTR_ERR(p);
> +	if (!p)
> +		return -EPERM;
> +
> +	len = (params->buffer + params->buf_size) - p;
> +	memmove(params->buffer, p, len);
> +	return len;
> +}
> +#endif /* CONFIG_FSINFO */
> diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
> index bae0bdc9ace9..88e1d004ac6c 100644
> --- a/include/uapi/linux/fsinfo.h
> +++ b/include/uapi/linux/fsinfo.h
> @@ -32,6 +32,10 @@ enum fsinfo_attribute {
>  	FSINFO_ATTR_PARAM_ENUM		= 14,	/* Nth enum-to-val */
>  	FSINFO_ATTR_PARAMETERS		= 15,	/* Mount parameters (large string) */
>  	FSINFO_ATTR_LSM_PARAMETERS	= 16,	/* LSM Mount parameters (large string) */
> +	FSINFO_ATTR_MOUNT_INFO		= 17,	/* Mount object information */
> +	FSINFO_ATTR_MOUNT_DEVNAME	= 18,	/* Mount object device name (string) */
> +	FSINFO_ATTR_MOUNT_CHILDREN	= 19,	/* Submount list (array) */
> +	FSINFO_ATTR_MOUNT_SUBMOUNT	= 20,	/* Relative path of Nth submount (string) */
>  	FSINFO_ATTR__NR
>  };
>  
> @@ -276,4 +280,28 @@ struct fsinfo_param_enum {
>  	char		name[252];	/* Name of the enum value */
>  };
>  
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_MOUNT_INFO).
> + */
> +struct fsinfo_mount_info {
> +	__u64		f_sb_id;	/* Superblock ID */
> +	__u32		mnt_id;		/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
> +	__u32		parent_id;	/* Parent mount identifier */
> +	__u32		group_id;	/* Mount group ID */
> +	__u32		master_id;	/* Slave master group ID */
> +	__u32		from_id;	/* Slave propagated from ID */
> +	__u32		attr;		/* MOUNT_ATTR_* flags */
> +	__u32		notify_counter;	/* Number of notifications generated. */
> +	__u32		__reserved[1];
> +};
> +
> +/*
> + * Information struct element for fsinfo(FSINFO_ATTR_MOUNT_CHILDREN).
> + * - An extra element is placed on the end representing the parent mount.
> + */
> +struct fsinfo_mount_child {
> +	__u32		mnt_id;		/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
> +	__u32		notify_counter;	/* Number of notifications generated on mount. */
> +};
> +
>  #endif /* _UAPI_LINUX_FSINFO_H */
> diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
> index fadc5e1384fc..ab32a15d4c5b 100644
> --- a/samples/vfs/test-fsinfo.c
> +++ b/samples/vfs/test-fsinfo.c
> @@ -21,10 +21,10 @@
>  #include <errno.h>
>  #include <time.h>
>  #include <math.h>
> -#include <fcntl.h>
>  #include <sys/syscall.h>
>  #include <linux/fsinfo.h>
>  #include <linux/socket.h>
> +#include <linux/fcntl.h>
>  #include <sys/stat.h>
>  #include <arpa/inet.h>
>  
> @@ -83,6 +83,10 @@ static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
>  	FSINFO_STRUCT_N		(PARAM_ENUM,		param_enum),
>  	FSINFO_OVERLARGE	(PARAMETERS,		-),
>  	FSINFO_OVERLARGE	(LSM_PARAMETERS,	-),
> +	FSINFO_STRUCT		(MOUNT_INFO,		mount_info),
> +	FSINFO_STRING		(MOUNT_DEVNAME,		mount_devname),
> +	FSINFO_STRUCT_ARRAY	(MOUNT_CHILDREN,	mount_child),
> +	FSINFO_STRING_N		(MOUNT_SUBMOUNT,	mount_submount),
>  };
>  
>  #define FSINFO_NAME(X,Y) [FSINFO_ATTR_##X] = #Y
> @@ -104,6 +108,10 @@ static const char *fsinfo_attr_names[FSINFO_ATTR__NR] = {
>  	FSINFO_NAME		(PARAM_ENUM,		param_enum),
>  	FSINFO_NAME		(PARAMETERS,		parameters),
>  	FSINFO_NAME		(LSM_PARAMETERS,	lsm_parameters),
> +	FSINFO_NAME		(MOUNT_INFO,		mount_info),
> +	FSINFO_NAME		(MOUNT_DEVNAME,		mount_devname),
> +	FSINFO_NAME		(MOUNT_CHILDREN,	mount_children),
> +	FSINFO_NAME		(MOUNT_SUBMOUNT,	mount_submount),
>  };
>  
>  union reply {
> @@ -116,6 +124,8 @@ union reply {
>  	struct fsinfo_capabilities caps;
>  	struct fsinfo_timestamp_info timestamps;
>  	struct fsinfo_volume_uuid uuid;
> +	struct fsinfo_mount_info mount_info;
> +	struct fsinfo_mount_child mount_children[1];
>  };
>  
>  static void dump_hex(unsigned int *data, int from, int to)
> @@ -319,6 +329,29 @@ static void dump_attr_VOLUME_UUID(union reply *r, int size)
>  	       f->uuid[14], f->uuid[15]);
>  }
>  
> +static void dump_attr_MOUNT_INFO(union reply *r, int size)
> +{
> +	struct fsinfo_mount_info *f = &r->mount_info;
> +
> +	printf("\n");
> +	printf("\tsb_id   : %llx\n", (unsigned long long)f->f_sb_id);
> +	printf("\tmnt_id  : %x\n", f->mnt_id);
> +	printf("\tparent  : %x\n", f->parent_id);
> +	printf("\tgroup   : %x\n", f->group_id);
> +	printf("\tattr    : %x\n", f->attr);
> +	printf("\tnotifs  : %x\n", f->notify_counter);
> +}
> +
> +static void dump_attr_MOUNT_CHILDREN(union reply *r, int size)
> +{
> +	struct fsinfo_mount_child *f = r->mount_children;
> +	int i = 0;
> +
> +	printf("\n");
> +	for (; size >= sizeof(*f); size -= sizeof(*f), f++)
> +		printf("\t[%u] %8x %8x\n", i++, f->mnt_id, f->notify_counter);
> +}
> +
>  /*
>   *
>   */
> @@ -334,6 +367,8 @@ static const dumper_t fsinfo_attr_dumper[FSINFO_ATTR__NR] = {
>  	FSINFO_DUMPER(CAPABILITIES),
>  	FSINFO_DUMPER(TIMESTAMP_INFO),
>  	FSINFO_DUMPER(VOLUME_UUID),
> +	FSINFO_DUMPER(MOUNT_INFO),
> +	FSINFO_DUMPER(MOUNT_CHILDREN),
>  };
>  
>  static void dump_fsinfo(enum fsinfo_attribute attr,
> @@ -536,16 +571,21 @@ int main(int argc, char **argv)
>  	unsigned int attr;
>  	int raw = 0, opt, Nth, Mth;
>  
> -	while ((opt = getopt(argc, argv, "adlr"))) {
> +	while ((opt = getopt(argc, argv, "Madlr"))) {
>  		switch (opt) {
> +		case 'M':
> +			params.at_flags = AT_FSINFO_MOUNTID_PATH;
> +			continue;
>  		case 'a':
>  			params.at_flags |= AT_NO_AUTOMOUNT;
> +			params.at_flags &= ~AT_FSINFO_MOUNTID_PATH;
>  			continue;
>  		case 'd':
>  			debug = true;
>  			continue;
>  		case 'l':
>  			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
> +			params.at_flags &= ~AT_FSINFO_MOUNTID_PATH;
>  			continue;
>  		case 'r':
>  			raw = 1;
> @@ -558,7 +598,8 @@ int main(int argc, char **argv)
>  	argv += optind;
>  
>  	if (argc != 1) {
> -		printf("Format: test-fsinfo [-alr] <file>\n");
> +		printf("Format: test-fsinfo [-adlr] <file>\n");
> +		printf("Format: test-fsinfo [-dr] -M <mnt_id>\n");
>  		exit(2);
>  	}
>  
> 

^ permalink raw reply

* Re: [PATCH 09/25] vfs: Add mount notification count [ver #14]
From: Christian Brauner @ 2019-06-26  9:52 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <156138540611.25627.13279299200070977180.stgit@warthog.procyon.org.uk>

On Mon, Jun 24, 2019 at 03:10:06PM +0100, David Howells wrote:
> Add a notification count on mount objects so that the user can easily check
> to see if a mount has changed its attributes or its children.
> 
> Future patches will:
> 
>  (1) Provide this value through fsinfo() attributes.
> 
>  (2) Hook into the notify_mount() function to provide a notification
>      interface for userspace to monitor.

Sorry, I might have missed this. Is this based on your notification
patchset and thus presupposing that your notification patchset has been
merged?

(One further question below.)

> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/mount.h     |   22 ++++++++++++++++++++++
>  fs/namespace.c |   13 +++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/fs/mount.h b/fs/mount.h
> index 6250de544760..47795802f78e 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -70,6 +70,7 @@ struct mount {
>  	struct hlist_head mnt_pins;
>  	struct fs_pin mnt_umount;
>  	struct dentry *mnt_ex_mountpoint;
> +	atomic_t mnt_notify_counter;	/* Number of notifications generated */
>  } __randomize_layout;
>  
>  #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
> @@ -151,3 +152,24 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
>  {
>  	return ns->seq == 0;
>  }
> +
> +/*
> + * Type of mount topology change notification.
> + */
> +enum mount_notification_subtype {
> +	NOTIFY_MOUNT_NEW_MOUNT	= 0, /* New mount added */
> +	NOTIFY_MOUNT_UNMOUNT	= 1, /* Mount removed manually */
> +	NOTIFY_MOUNT_EXPIRY	= 2, /* Automount expired */
> +	NOTIFY_MOUNT_READONLY	= 3, /* Mount R/O state changed */
> +	NOTIFY_MOUNT_SETATTR	= 4, /* Mount attributes changed */
> +	NOTIFY_MOUNT_MOVE_FROM	= 5, /* Mount moved from here */
> +	NOTIFY_MOUNT_MOVE_TO	= 6, /* Mount moved to here (compare op_id) */
> +};
> +
> +static inline void notify_mount(struct mount *changed,
> +				struct mount *aux,
> +				enum mount_notification_subtype subtype,
> +				u32 info_flags)
> +{
> +	atomic_inc(&changed->mnt_notify_counter);
> +}
> diff --git a/fs/namespace.c b/fs/namespace.c
> index a49a7d9ed482..1450faab96b9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -513,6 +513,8 @@ static int mnt_make_readonly(struct mount *mnt)
>  	smp_wmb();
>  	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
>  	unlock_mount_hash();
> +	if (ret == 0)
> +		notify_mount(mnt, NULL, NOTIFY_MOUNT_READONLY, 0x10000);
>  	return ret;
>  }
>  
> @@ -521,6 +523,7 @@ static int __mnt_unmake_readonly(struct mount *mnt)
>  	lock_mount_hash();
>  	mnt->mnt.mnt_flags &= ~MNT_READONLY;
>  	unlock_mount_hash();
> +	notify_mount(mnt, NULL, NOTIFY_MOUNT_READONLY, 0);
>  	return 0;
>  }
>  
> @@ -833,6 +836,7 @@ static void umount_mnt(struct mount *mnt)
>  {
>  	/* old mountpoint will be dropped when we can do that */
>  	mnt->mnt_ex_mountpoint = mnt->mnt_mountpoint;
> +	notify_mount(mnt->mnt_parent, mnt, NOTIFY_MOUNT_UNMOUNT, 0);
>  	unhash_mnt(mnt);
>  }
>  
> @@ -1472,6 +1476,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
>  		p = list_first_entry(&tmp_list, struct mount, mnt_list);
>  		list_del_init(&p->mnt_expire);
>  		list_del_init(&p->mnt_list);
> +
>  		ns = p->mnt_ns;
>  		if (ns) {
>  			ns->mounts--;
> @@ -2095,7 +2100,10 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  		lock_mount_hash();
>  	}
>  	if (parent_path) {
> +		notify_mount(source_mnt->mnt_parent, source_mnt,
> +			     NOTIFY_MOUNT_MOVE_FROM, 0);
>  		detach_mnt(source_mnt, parent_path);
> +		notify_mount(dest_mnt, source_mnt, NOTIFY_MOUNT_MOVE_TO, 0);
>  		attach_mnt(source_mnt, dest_mnt, dest_mp);
>  		touch_mnt_namespace(source_mnt->mnt_ns);
>  	} else {
> @@ -2104,6 +2112,9 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  			list_del_init(&source_mnt->mnt_ns->list);
>  		}
>  		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
> +		notify_mount(dest_mnt, source_mnt, NOTIFY_MOUNT_NEW_MOUNT,
> +			     source_mnt->mnt.mnt_sb->s_flags & SB_SUBMOUNT ?
> +			     0x10000 : 0);

What's that magic 0x10000 mean?

>  		commit_tree(source_mnt);
>  	}
>  
> @@ -2480,6 +2491,7 @@ static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
>  	mnt->mnt.mnt_flags = mnt_flags;
>  	touch_mnt_namespace(mnt->mnt_ns);
>  	unlock_mount_hash();
> +	notify_mount(mnt, NULL, NOTIFY_MOUNT_SETATTR, 0);
>  }
>  
>  /*
> @@ -2880,6 +2892,7 @@ void mark_mounts_for_expiry(struct list_head *mounts)
>  		if (!xchg(&mnt->mnt_expiry_mark, 1) ||
>  			propagate_mount_busy(mnt, 1))
>  			continue;
> +		notify_mount(mnt, NULL, NOTIFY_MOUNT_EXPIRY, 0);
>  		list_move(&mnt->mnt_expire, &graveyard);
>  	}
>  	while (!list_empty(&graveyard)) {
> 

^ permalink raw reply

* Re: [PATCH 08/25] vfs: Allow fsinfo() to look up a mount object by ID [ver #14]
From: Christian Brauner @ 2019-06-26  9:49 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <156138539803.25627.4855967830123016750.stgit@warthog.procyon.org.uk>

On Mon, Jun 24, 2019 at 03:09:58PM +0100, David Howells wrote:
> Allow the fsinfo() syscall to look up a mount object by ID rather than by
> pathname.  This is necessary as there can be multiple mounts stacked up at
> the same pathname and there's no way to look through them otherwise.
> 
> This is done by passing AT_FSINFO_MOUNTID_PATH to fsinfo() in the
> parameters and then passing the mount ID as a string to fsinfo() in place
> of the filename:
> 
> 	struct fsinfo_params params = {
> 		.at_flags = AT_FSINFO_MOUNTID_PATH,
> 		.request = FSINFO_ATTR_IDS,
> 	};
> 
> 	ret = fsinfo(AT_FDCWD, "21", &params, buffer, sizeof(buffer));
> 
> The caller is only permitted to query a mount object if the root directory
> of that mount connects directly to the current chroot if dfd == AT_FDCWD[*]
> or the directory specified by dfd otherwise.  Note that this is not
> available to the pathwalk of any other syscall.
> 
> [*] This needs to be something other than AT_FDCWD, perhaps AT_FDROOT.
> 
> [!] This probably needs an LSM hook.
> 
> [!] This might want to check the permissions on all the intervening dirs -
>     but it would have to do that under RCU conditions.
> 
> [!] This might want to check a CAP_* flag.

I would recommend you drop that patch for now.

First, because that patchset itself is already so huge that it's hard to
get enough reviewers to look at it in detail.
Second, from your "[!]" comments above it's obvious that the details of
this feature are not clear enough and need more discussion.
Third, (relatest to the first point) this patchset could use a little
shrinking. :)

> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/fsinfo.c                |   56 +++++++++++++++++++++
>  fs/internal.h              |    2 +
>  fs/namespace.c             |  117 +++++++++++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/fcntl.h |    1 
>  4 files changed, 173 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> index 127538193b77..3b218f9fedb7 100644
> --- a/fs/fsinfo.c
> +++ b/fs/fsinfo.c
> @@ -516,6 +516,57 @@ static int vfs_fsinfo_fscontext(int fd, struct fsinfo_kparams *params)
>  	return ret;
>  }
>  
> +/*
> + * Look up the root of a mount object.  This allows access to mount objects
> + * (and their attached superblocks) that can't be retrieved by path because
> + * they're entirely covered.
> + *
> + * We only permit access to a mount that has a direct path between either the
> + * dentry pointed to by dfd or to our chroot (if dfd is AT_FDCWD).
> + */
> +static int vfs_fsinfo_mount(int dfd, const char __user *filename,
> +			    struct fsinfo_kparams *params)
> +{
> +	struct path path;
> +	struct fd f = {};
> +	char *name;
> +	long mnt_id;
> +	int ret;
> +
> +	if ((params->at_flags & ~AT_FSINFO_MOUNTID_PATH) ||
> +	    !filename)
> +		return -EINVAL;
> +
> +	name = strndup_user(filename, 32);
> +	if (IS_ERR(name))
> +		return PTR_ERR(name);
> +	ret = kstrtoul(name, 0, &mnt_id);
> +	if (ret < 0)
> +		goto out_name;
> +	if (mnt_id > INT_MAX)
> +		goto out_name;
> +
> +	if (dfd != AT_FDCWD) {
> +		ret = -EBADF;
> +		f = fdget_raw(dfd);
> +		if (!f.file)
> +			goto out_name;
> +	}
> +
> +	ret = lookup_mount_object(f.file ? &f.file->f_path : NULL,
> +				  mnt_id, &path);
> +	if (ret < 0)
> +		goto out_fd;
> +
> +	ret = vfs_fsinfo(&path, params);
> +	path_put(&path);
> +out_fd:
> +	fdput(f);
> +out_name:
> +	kfree(name);
> +	return ret;
> +}
> +
>  /*
>   * Return buffer information by requestable attribute.
>   *
> @@ -629,6 +680,9 @@ SYSCALL_DEFINE5(fsinfo,
>  
>  		if ((params.at_flags & AT_FSINFO_FROM_FSOPEN) && filename)
>  			return -EINVAL;
> +		if ((params.at_flags & (AT_FSINFO_FROM_FSOPEN | AT_FSINFO_MOUNTID_PATH)) ==
> +		    (AT_FSINFO_FROM_FSOPEN | AT_FSINFO_MOUNTID_PATH))
> +			return -EINVAL;
>  	} else {
>  		params.request = FSINFO_ATTR_STATFS;
>  	}
> @@ -688,6 +742,8 @@ SYSCALL_DEFINE5(fsinfo,
>  
>  	if (params.at_flags & AT_FSINFO_FROM_FSOPEN)
>  		ret = vfs_fsinfo_fscontext(dfd, &params);
> +	else if (params.at_flags & AT_FSINFO_MOUNTID_PATH)
> +		ret = vfs_fsinfo_mount(dfd, filename, &params);
>  	else if (filename)
>  		ret = vfs_fsinfo_path(dfd, filename, &params);
>  	else
> diff --git a/fs/internal.h b/fs/internal.h
> index b089a489da1f..074b1c65e3bd 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -97,6 +97,8 @@ extern int __mnt_want_write_file(struct file *);
>  extern void __mnt_drop_write_file(struct file *);
>  
>  extern void dissolve_on_fput(struct vfsmount *);
> +extern int lookup_mount_object(struct path *, int, struct path *);
> +
>  /*
>   * fs_struct.c
>   */
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 1141641dff96..a49a7d9ed482 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -62,7 +62,7 @@ static int __init set_mphash_entries(char *str)
>  __setup("mphash_entries=", set_mphash_entries);
>  
>  static u64 event;
> -static DEFINE_IDA(mnt_id_ida);
> +static DEFINE_IDR(mnt_id_ida);
>  static DEFINE_IDA(mnt_group_ida);
>  
>  static struct hlist_head *mount_hashtable __read_mostly;
> @@ -101,17 +101,27 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
>  
>  static int mnt_alloc_id(struct mount *mnt)
>  {
> -	int res = ida_alloc(&mnt_id_ida, GFP_KERNEL);
> +	int res;
>  
> +	/* Allocate an ID, but don't set the pointer back to the mount until
> +	 * later, as once we do that, we have to follow RCU protocols to get
> +	 * rid of the mount struct.
> +	 */
> +	res = idr_alloc(&mnt_id_ida, NULL, 0, INT_MAX, GFP_KERNEL);
>  	if (res < 0)
>  		return res;
>  	mnt->mnt_id = res;
>  	return 0;
>  }
>  
> +static void mnt_publish_id(struct mount *mnt)
> +{
> +	idr_replace(&mnt_id_ida, mnt, mnt->mnt_id);
> +}
> +
>  static void mnt_free_id(struct mount *mnt)
>  {
> -	ida_free(&mnt_id_ida, mnt->mnt_id);
> +	idr_remove(&mnt_id_ida, mnt->mnt_id);
>  }
>  
>  /*
> @@ -974,6 +984,7 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc)
>  	lock_mount_hash();
>  	list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
>  	unlock_mount_hash();
> +	mnt_publish_id(mnt);
>  	return &mnt->mnt;
>  }
>  EXPORT_SYMBOL(vfs_create_mount);
> @@ -1067,6 +1078,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
>  	lock_mount_hash();
>  	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
>  	unlock_mount_hash();
> +	mnt_publish_id(mnt);
>  
>  	if ((flag & CL_SLAVE) ||
>  	    ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
> @@ -3988,3 +4000,102 @@ const struct proc_ns_operations mntns_operations = {
>  	.install	= mntns_install,
>  	.owner		= mntns_owner,
>  };
> +
> +/*
> + * See if one path point connects directly to another by ancestral relationship
> + * across mountpoints.  Must call with the RCU read lock held.
> + */
> +static bool are_paths_connected(struct path *ancestor, struct path *to_check)
> +{
> +	struct mount *mnt, *parent;
> +	struct path cursor;
> +	unsigned seq;
> +	bool connected;
> +
> +	seq = 0;
> +restart:
> +	cursor = *to_check;
> +
> +	read_seqbegin_or_lock(&rename_lock, &seq);
> +	while (cursor.mnt != ancestor->mnt) {
> +		mnt = real_mount(cursor.mnt);
> +		parent = READ_ONCE(mnt->mnt_parent);
> +		if (mnt == parent)
> +			goto failed;
> +		cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
> +		cursor.mnt = &parent->mnt;
> +	}
> +
> +	while (cursor.dentry != ancestor->dentry) {
> +		if (cursor.dentry == cursor.mnt->mnt_root ||
> +		    IS_ROOT(cursor.dentry))
> +			goto failed;
> +		cursor.dentry = READ_ONCE(cursor.dentry->d_parent);
> +	}
> +
> +	connected = true;
> +out:
> +	done_seqretry(&rename_lock, seq);
> +	return connected;
> +
> +failed:
> +	if (need_seqretry(&rename_lock, seq)) {
> +		seq = 1;
> +		goto restart;
> +	}
> +	connected = false;
> +	goto out;
> +}
> +
> +/**
> + * lookup_mount_object - Look up a vfsmount object by ID
> + * @root: The mount root must connect backwards to this point (or chroot if NULL).
> + * @id: The ID of the mountpoint.
> + * @_mntpt: Where to return the resulting mountpoint path.
> + *
> + * Look up the root of the mount with the corresponding ID.  This is only
> + * permitted if that mount connects directly to the specified root/chroot.
> + */
> +int lookup_mount_object(struct path *root, int mnt_id, struct path *_mntpt)
> +{
> +	struct mount *mnt;
> +	struct path stop, mntpt = {};
> +	int ret = -EPERM;
> +
> +	if (!root)
> +		get_fs_root(current->fs, &stop);
> +	else
> +		stop = *root;
> +
> +	rcu_read_lock();
> +	lock_mount_hash();
> +	mnt = idr_find(&mnt_id_ida, mnt_id);
> +	if (!mnt)
> +		goto out_unlock_mh;
> +	if (mnt->mnt.mnt_flags & (MNT_SYNC_UMOUNT | MNT_UMOUNT | MNT_DOOMED))
> +		goto out_unlock_mh;
> +	if (mnt_get_count(mnt) == 0)
> +		goto out_unlock_mh;
> +	mnt_add_count(mnt, 1);
> +	mntpt.mnt = &mnt->mnt;
> +	mntpt.dentry = dget(mnt->mnt.mnt_root);
> +	unlock_mount_hash();
> +
> +	if (are_paths_connected(&stop, &mntpt)) {
> +		*_mntpt = mntpt;
> +		mntpt.mnt = NULL;
> +		mntpt.dentry = NULL;
> +		ret = 0;
> +	}
> +
> +out_unlock:
> +	rcu_read_unlock();
> +	if (!root)
> +		path_put(&stop);
> +	path_put(&mntpt);
> +	return ret;
> +
> +out_unlock_mh:
> +	unlock_mount_hash();
> +	goto out_unlock;
> +}
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 6a2402a8fa30..5fda91cfca8a 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -92,6 +92,7 @@
>  #define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
>  
>  #define AT_FSINFO_FROM_FSOPEN	0x2000	/* Examine the fs_context attached to dfd by fsopen() */
> +#define AT_FSINFO_MOUNTID_PATH	0x4000	/* The path is a mount object ID, not an actual path */
>  
>  #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
>  
> 

^ permalink raw reply

* Re: [PATCH 01/25] vfs: syscall: Add fsinfo() to query filesystem information [ver #14]
From: David Howells @ 2019-06-26  9:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, viro, raven, mszeredi, linux-api, linux-fsdevel,
	linux-kernel
In-Reply-To: <20190625082822.l4pz33dwzvotboe4@brauner.io>

Christian Brauner <christian@brauner.io> wrote:

> > +	return sizeof(*p);
> 
> Hm, the discrepancy between the function signature returning int and
> the sizeof operator most likely being size_t is bothering me. It
> probably doesn't matter but maybe we can avoid that.

If sizeof(*p) exceeds 4096, the buffer is going to have been overrun by this
point anyway.

The function can't return size_t, though it could return ssize_t.  I could
switch it to return long or even store the result in fsinfo_kparams::usage and
return 0.

> > +	strlcpy(p->f_fs_name, path->dentry->d_sb->s_type->name,
> > +		sizeof(p->f_fs_name));
> 
> Truncation is acceptable or impossible I assume?

I'm hoping that file_system_type::name isn't going to exceed 15 chars plus
NUL.  If it does, it will be truncated.  I don't really want to add an
individual attribute just for the filesystem driver name.

> > +#define _gen(X, Y) FSINFO_ATTR_##X: return fsinfo_generic_##Y(path, params->buffer)
> 
> I'm really not sure that this helps readability in the switch below... :)
> 
> > +
> > +	switch (params->request) {
> > +	case _gen(STATFS,		statfs);
> > +	case _gen(IDS,			ids);
> > +	case _gen(LIMITS,		limits);
> > +	case _gen(SUPPORTS,		supports);
> > +	case _gen(CAPABILITIES,		capabilities);
> > +	case _gen(TIMESTAMP_INFO,	timestamp_info);
> > ...

I'm trying to avoid having to spend multiple lines per case and tabulation
makes things easier to read.  So

	case FSINFO_ATTR_SUPPORTS:		return fsinfo_generic_supports(path, params->buffer);
	case FSINFO_ATTR_CAPABILITIES:		return fsinfo_generic_capabilities(path, params->buffer);
	case FSINFO_ATTR_TIMESTAMP_INFO:	return fsinfo_generic_timestamp_info(path, params->buffer);

is a bit on the long side per line, whereas:

	case FSINFO_ATTR_SUPPORTS:
		return fsinfo_generic_supports(path, params->buffer);
	case FSINFO_ATTR_CAPABILITIES:
		return fsinfo_generic_capabilities(path, params->buffer);
	case FSINFO_ATTR_TIMESTAMP_INFO:
		return fsinfo_generic_timestamp_info(path, params->buffer);

is less readable by interleaving two of the three columns.  (Note that _gen is
a actually third column as I introduce alternatives later).

> > +		if (ret <= (int)params->buf_size)
> 
> He, and this is where the return value discrepancy hits again. Just
> doesn't look nice tbh. :)

No.  That's dealing with signed/unsigned comparison.  It might be better if I
change this to:

		if (IS_ERR_VALUE(ret))
			return ret; /* Error */
		if ((unsigned int)ret <= params->buf_size)
			return ret; /* It fitted */

In any case, buf_size isn't permitted to be larger than INT_MAX due to a check
later in the loop.

> > +		kvfree(params->buffer);
> 
> That means callers should always memset fsinfo_kparams or this is an
> invalid free...

vfs_info() isn't a public function.  And, in any case, the caller *must*
provide a buffer here.

> > + * Return buffer information by requestable attribute.
> > + *
> > + * STRUCT indicates a fixed-size structure with only one instance.
> > ...
> I honestly have a hard time following the documentation here

How about:

 * STRUCT	- a fixed-size structure with only one instance.
 * STRUCT_N	- a sequence of STRUCTs, indexed by Nth
 * STRUCT_NM	- a sequence of sequences of STRUCTs, indexed by Nth, Mth
 * STRING	- a string with only one instance.
 * STRING_N	- a sequence of STRING, indexed by Nth
 * STRING_NM	- a sequence of sequences of STRING, indexed by Nth, Mth
 * OPAQUE	- a blob that can be larger than 4K.
 * STRUCT_ARRAY - an array of structs that can be larger than 4K

> and that monster table/macro thing below.  For example, STRUCT_NM
> corresponds to __FSINFO_NM or what?

STRUCT_NM -> .type = __FSINFO_STRUCT, .flags = __FSINFO_NM, .size = ...

If you think this is bad, you should try looking at the device ID tables used
by the drivers and the attribute tables;-)

I could spell out the flag and type in the macro defs (such as the body of
FSINFO_STRING(X,Y) for instance).  It would make it harder to compare macros
as it wouldn't then tabulate, though.

> And is this uapi as you're using this in your samples/test below?

Not exactly.  Each attribute is defined as being a certain type in the
documentation in the UAPI header, but this is not coded there.  The assumption
being that if you're using a particular attribute, you'll know what the type
of the attribute is and you'll structure your code appropriately.

The reason the sample code has this replicated is that it doesn't really
attempt to interpret the type per se.  It has a dumper for an individual
attribute value, but the table tells it whether there should be one of those,
N of those or N of M(0), M(1), M(2), ... of those so that it can report an
error if it doesn't see what it expects.

I could even cheaply provide a meta attribute that dumps the contents of the
table (just the type info, not the names).

> > ...
> > +	FSINFO_STRING		(NAME_ENCODING,		-),
> > +	FSINFO_STRING		(NAME_CODEPAGE,		-),
> > +};
> 
> Can I complain again that this is really annoying to parse.

Apparently you can;-)  What would you prefer?  This:

static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
	[FSINFO_STATFS] = {
		.type	= __FSINFO_STRUCT,
		.size	= sizeof(struct fsinfo_statfs),
	},
	[FSINFO_SERVERS] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_NM,
		.size	= sizeof(struct fsinfo_server),
	},
	...
};	

That has 3-5 lines for each 1 in the current code and isn't a great deal more
readable.

> if (copy_to_user()) and if (clear_user()) and not if (clear_user() != 0)

Better "if (copy_to_user() != 0)" since it's not a boolean return value in
either case.

> Nit: There's a bunch of name inconsistency for the arguments between the
> stub and the definition:
> 
> SYSCALL_DEFINE5(fsinfo,
> 		int, dfd, const char __user *, filename,
> 		struct fsinfo_params __user *, _params,
> 		void __user *, user_buffer, size_t, user_buf_size)

Yeah.  C just doesn't care.

I'll change filename to pathname throughout.  That's at least consistent with
various glibc manpages for other vfs syscalls.

_params I can change to params and params as-was to kparams.

But user_buffer and user_buf_size, I'll keep as I've named them such to avoid
confusion with kparams->buffer and kparams->scratch_buffer.  However, I
wouldn't want to call them that in the UAPI.

> Do we do SPDX that way? Or isn't this just supposed to be:
> // <spdxy stuff>

Look in, say, include/uapi/linux/stat.h or .../fs.h.

> > +	FSINFO_ATTR__NR
> 
> Nit/Bikeshed: FSINFO_ATTR_MAX? Seems more intuitive.

No.  That would imply a limit that it will never exceed.

> > +struct fsinfo_u128 {
> > +#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> > +	__u64	hi;
> > +	__u64	lo;
> > +#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
> > +	__u64	lo;
> > +	__u64	hi;
> > +#endif
> > +};
> 
> Hm, I know why you do this custom fsinfo_u128 thingy but for userspace
> that is going to be annoying to operate with, e.g. comparing the
> size/space of two filesystems etc.

We don't have a __u128 in the UAPI, and I'm reluctant to use __uint128_t.

Do you have a better suggestion?

> > +struct fsinfo_ids {
> > +	char	f_fs_name[15 + 1];	/* Filesystem name */
> 
> You should probably make this a macro so userspace can use it in fs-name
> length checks too.

The name length, you mean?  Well, you can use sizeof...

> > +	FSINFO_CAP__NR
> 
> Hm, again, maybe better to use FSINFO_CAP_MAX?

It's not a limit.

David

^ permalink raw reply

* Re: [PATCH v4 0/3] initramfs: add support for xattrs in the initial ram disk
From: Roberto Sassu @ 2019-06-26  8:15 UTC (permalink / raw)
  To: Rob Landley, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, james.w.mcmechan,
	niveditas98
In-Reply-To: <541e9ea1-024f-5c22-0b58-f8692e6c1eb1@landley.net>

On 6/3/2019 8:32 PM, Rob Landley wrote:
> On 6/3/19 4:31 AM, Roberto Sassu wrote:
>>> This patch set aims at solving the following use case: appraise files from
>>> the initial ram disk. To do that, IMA checks the signature/hash from the
>>> security.ima xattr. Unfortunately, this use case cannot be implemented
>>> currently, as the CPIO format does not support xattrs.
>>>
>>> This proposal consists in including file metadata as additional files named
>>> METADATA!!!, for each file added to the ram disk. The CPIO parser in the
>>> kernel recognizes these special files from the file name, and calls the
>>> appropriate parser to add metadata to the previously extracted file. It has
>>> been proposed to use bit 17:16 of the file mode as a way to recognize files
>>> with metadata, but both the kernel and the cpio tool declare the file mode
>>> as unsigned short.
>>
>> Any opinion on this patch set?
>>
>> Thanks
>>
>> Roberto
> 
> Sorry, I've had the window open since you posted it but haven't gotten around to
> it. I'll try to build it later today.
> 
> It does look interesting, and I have no objections to the basic approach. I
> should be able to add support to toybox cpio over a weekend once I've got the
> kernel doing it to test against.

Ok.

Let me give some instructions so that people can test this patch set.

To add xattrs to the ram disk embedded in the kernel it is sufficient
to set CONFIG_INITRAMFS_FILE_METADATA="xattr" and
CONFIG_INITRAMFS_SOURCE="<file with xattr>" in the kernel configuration.

To add xattrs to the external ram disk, it is necessary to patch cpio:

https://github.com/euleros/cpio/commit/531cabc88e9ecdc3231fad6e4856869baa9a91ef 
(xattr-v1 branch)

and dracut:

https://github.com/euleros/dracut/commit/a2dee56ea80495c2c1871bc73186f7b00dc8bf3b 
(digest-lists branch)

The same modification can be done for mkinitramfs (add '-e xattr' to the
cpio command line).

To simplify the test, it would be sufficient to replace only the cpio
binary and the dracut script with the modified versions. For dracut, the
patch should be applied to the local dracut (after it has been renamed
to dracut.sh).

Then, run:

dracut -e xattr -I <file with xattr> (add -f to overwrite the ram disk)

Xattrs can be seen by stopping the boot process for example by adding
rd.break to the kernel command line.

Roberto

-- 
HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
Managing Director: Bo PENG, Jian LI, Yanli SHI

^ permalink raw reply

* Re: [PATCH v5 16/16] f2fs: add fs-verity support
From: Chao Yu @ 2019-06-26  7:34 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fscrypt, linux-fsdevel, Jaegeuk Kim,
	linux-integrity, linux-ext4, Linus Torvalds, Christoph Hellwig,
	Victor Hsieh
In-Reply-To: <20190625175225.GC81914@gmail.com>

Hi Eric,

On 2019/6/26 1:52, Eric Biggers wrote:
> Hi Chao, thanks for the review.
> 
> On Tue, Jun 25, 2019 at 03:55:57PM +0800, Chao Yu wrote:
>> Hi Eric,
>>
>> On 2019/6/21 4:50, Eric Biggers wrote:
>>> +static int f2fs_begin_enable_verity(struct file *filp)
>>> +{
>>> +	struct inode *inode = file_inode(filp);
>>> +	int err;
>>> +
>>
>> I think we'd better add condition here (under inode lock) to disallow enabling
>> verity on atomic/volatile inode, as we may fail to write merkle tree data due to
>> atomic/volatile inode's special writeback method.
>>
> 
> Yes, I'll add the following:
> 
> 	if (f2fs_is_atomic_file(inode) || f2fs_is_volatile_file(inode))
> 		return -EOPNOTSUPP;
> 
>>> +	err = f2fs_convert_inline_inode(inode);
>>> +	if (err)
>>> +		return err;
>>> +
>>> +	err = dquot_initialize(inode);
>>> +	if (err)
>>> +		return err;
>>
>> We can get rid of dquot_initialize() here, since f2fs_file_open() ->
>> dquot_file_open() should has initialized quota entry previously, right?
> 
> We still need it because dquot_file_open() only calls dquot_initialize() if the
> file is being opened for writing.  But here the file descriptor is readonly.
> I'll add a comment explaining this here and in the ext4 equivalent.

Ah, you're right.

f2fs_convert_inline_inode() may grab one more block during conversion, so we
need to call dquot_initialize() before inline conversion?

Thanks,

> 
> - Eric
> .
> 

^ permalink raw reply

* Re: [PATCH bpf-next v9 02/10] bpf: Add eBPF program subtype and is_valid_subtype() verifier
From: Mickaël Salaün @ 2019-06-26  7:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: LKML, Aleksa Sarai, Alexander Viro, Alexei Starovoitov,
	Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
	Casey Schaufler, Daniel Borkmann, David Drysdale,
	David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
	John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
	Mickaël Salaün
In-Reply-To: <CAADnVQ+Twio22VSi21RR5TY1Zm-1xRTGmREcXLSs5Jv-KWGTiw@mail.gmail.com>



On 26/06/2019 01:02, Alexei Starovoitov wrote:
> On Tue, Jun 25, 2019 at 3:04 PM Mickaël Salaün <mic@digikod.net> wrote:
>>
>> The goal of the program subtype is to be able to have different static
>> fine-grained verifications for a unique program type.
>>
>> The struct bpf_verifier_ops gets a new optional function:
>> is_valid_subtype(). This new verifier is called at the beginning of the
>> eBPF program verification to check if the (optional) program subtype is
>> valid.
>>
>> The new helper bpf_load_program_xattr() enables to verify a program with
>> subtypes.
>>
>> For now, only Landlock eBPF programs are using a program subtype (see
>> next commits) but this could be used by other program types in the
>> future.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: David S. Miller <davem@davemloft.net>
>> Link: https://lkml.kernel.org/r/20160827205559.GA43880@ast-mbp.thefacebook.com
>> ---
>>
>> Changes since v8:
>> * use bpf_load_program_xattr() instead of bpf_load_program() and add
>>   bpf_verify_program_xattr() to deal with subtypes
>> * remove put_extra() since there is no more "previous" field (for now)
>>
>> Changes since v7:
>> * rename LANDLOCK_SUBTYPE_* to LANDLOCK_*
>> * move subtype in bpf_prog_aux and use only one bit for has_subtype
>>   (suggested by Alexei Starovoitov)
> 
> sorry to say, but I don't think the landlock will ever land,
> since posting huge patches once a year is missing a lot of development
> that is happening during that time.

You're right that it's been a while since the last patch set, but the
main reasons behind this was a lack of feedback (probably because of the
size of the patch set, which is now reduce to a consistent minimum), the
rework needed to address everyone's concern (Landlock modify kernel
components from different maintainers), and above all, the LSM stacking
infrastructure which was quite beefy and then took some time to land:
https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/
This stacking infrastructure was required to have a useful version of
Landlock (which is used as a use case example), and it was released with
Linux v5.1 (last month). Now, I think everything is finally ready to
move forward.

> This 2/10 patch is an example.
> subtype concept was useful 2 years ago when v6 was posted.
> Since then bpf developers faced very similar problem in other parts
> and it was solved with 'expected_attach_type' field.
> See commit 5e43f899b03a ("bpf: Check attach type at prog load time")
> dated March 2018.

I saw this nice feature but I wasn't sure if it was the right field to
use. Indeed, I need more than a "type", but also some values (triggers)
as shown by this patch. What do you suggest?

^ permalink raw reply

* Re: [PATCH bpf-next v9 02/10] bpf: Add eBPF program subtype and is_valid_subtype() verifier
From: Alexei Starovoitov @ 2019-06-25 23:02 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: LKML, Aleksa Sarai, Alexander Viro, Alexei Starovoitov,
	Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
	Casey Schaufler, Daniel Borkmann, David Drysdale,
	David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
	John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
	Mickaël Salaün
In-Reply-To: <20190625215239.11136-3-mic@digikod.net>

On Tue, Jun 25, 2019 at 3:04 PM Mickaël Salaün <mic@digikod.net> wrote:
>
> The goal of the program subtype is to be able to have different static
> fine-grained verifications for a unique program type.
>
> The struct bpf_verifier_ops gets a new optional function:
> is_valid_subtype(). This new verifier is called at the beginning of the
> eBPF program verification to check if the (optional) program subtype is
> valid.
>
> The new helper bpf_load_program_xattr() enables to verify a program with
> subtypes.
>
> For now, only Landlock eBPF programs are using a program subtype (see
> next commits) but this could be used by other program types in the
> future.
>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David S. Miller <davem@davemloft.net>
> Link: https://lkml.kernel.org/r/20160827205559.GA43880@ast-mbp.thefacebook.com
> ---
>
> Changes since v8:
> * use bpf_load_program_xattr() instead of bpf_load_program() and add
>   bpf_verify_program_xattr() to deal with subtypes
> * remove put_extra() since there is no more "previous" field (for now)
>
> Changes since v7:
> * rename LANDLOCK_SUBTYPE_* to LANDLOCK_*
> * move subtype in bpf_prog_aux and use only one bit for has_subtype
>   (suggested by Alexei Starovoitov)

sorry to say, but I don't think the landlock will ever land,
since posting huge patches once a year is missing a lot of development
that is happening during that time.
This 2/10 patch is an example.
subtype concept was useful 2 years ago when v6 was posted.
Since then bpf developers faced very similar problem in other parts
and it was solved with 'expected_attach_type' field.
See commit 5e43f899b03a ("bpf: Check attach type at prog load time")
dated March 2018.

^ permalink raw reply

* Re: [PATCH bpf-next v9 05/10] bpf,landlock: Add a new map type: inode
From: Al Viro @ 2019-06-25 22:52 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Andrew Morton,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, John Johansen,
	Jonathan Corbet, Kees Cook, Michael Kerrisk,
	Mickaël Salaün, Paul Moore
In-Reply-To: <20190625215239.11136-6-mic@digikod.net>

On Tue, Jun 25, 2019 at 11:52:34PM +0200, Mickaël Salaün wrote:
> +/* must call iput(inode) after this call */
> +static struct inode *inode_from_fd(int ufd, bool check_access)
> +{
> +	struct inode *ret;
> +	struct fd f;
> +	int deny;
> +
> +	f = fdget(ufd);
> +	if (unlikely(!f.file || !file_inode(f.file))) {
> +		ret = ERR_PTR(-EBADF);
> +		goto put_fd;
> +	}

Just when does one get a NULL file_inode()?  The reason I'm asking is
that arseloads of code would break if one managed to create such
a beast...

Incidentally, that should be return ERR_PTR(-EBADF); fdput() is wrong there.

> +	}
> +	/* check if the FD is tied to a mount point */
> +	/* TODO: add this check when called from an eBPF program too */
> +	if (unlikely(!f.file->f_path.mnt

Again, the same question - when the hell can that happen?  If you are
sitting on an exploitable roothole, do share it...

 || f.file->f_path.mnt->mnt_flags &
> +				MNT_INTERNAL)) {
> +		ret = ERR_PTR(-EINVAL);
> +		goto put_fd;

What does it have to do with mountpoints, anyway?

> +/* called from syscall */
> +static int sys_inode_map_delete_elem(struct bpf_map *map, struct inode *key)
> +{
> +	struct inode_array *array = container_of(map, struct inode_array, map);
> +	struct inode *inode;
> +	int i;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +	for (i = 0; i < array->map.max_entries; i++) {
> +		if (array->elems[i].inode == key) {
> +			inode = xchg(&array->elems[i].inode, NULL);
> +			array->nb_entries--;

Umm...  Is that intended to be atomic in any sense?

> +			iput(inode);
> +			return 0;
> +		}
> +	}
> +	return -ENOENT;
> +}
> +
> +/* called from syscall */
> +int bpf_inode_map_delete_elem(struct bpf_map *map, int *key)
> +{
> +	struct inode *inode;
> +	int err;
> +
> +	inode = inode_from_fd(*key, false);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +	err = sys_inode_map_delete_elem(map, inode);
> +	iput(inode);
> +	return err;
> +}

Wait a sec...  So we have those beasties that can have long-term
references to arbitrary inodes stuck in them?  What will happen
if you get umount(2) called while such a thing exists?

^ permalink raw reply

* [PATCH bpf-next v9 10/10] landlock: Add user and kernel documentation for Landlock
From: Mickaël Salaün @ 2019-06-25 21:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk
In-Reply-To: <20190625215239.11136-1-mic@digikod.net>

This documentation can be built with the Sphinx framework.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
---

Changes since v8:
* remove documentation related to chaining and tagging according to this
  patch series

Changes since v7:
* update documentation according to the Landlock revamp

Changes since v6:
* add a check for ctx->event
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
* rename Landlock version to ABI to better reflect its purpose and add a
  dedicated changelog section
* update tables
* relax no_new_privs recommendations
* remove ABILITY_WRITE related functions
* reword rule "appending" to "prepending" and explain it
* cosmetic fixes

Changes since v5:
* update the rule hierarchy inheritance explanation
* briefly explain ctx->arg2
* add ptrace restrictions
* explain EPERM
* update example (subtype)
* use ":manpage:"
---
 Documentation/security/index.rst           |   1 +
 Documentation/security/landlock/index.rst  |  20 +++
 Documentation/security/landlock/kernel.rst |  99 ++++++++++++++
 Documentation/security/landlock/user.rst   | 148 +++++++++++++++++++++
 4 files changed, 268 insertions(+)
 create mode 100644 Documentation/security/landlock/index.rst
 create mode 100644 Documentation/security/landlock/kernel.rst
 create mode 100644 Documentation/security/landlock/user.rst

diff --git a/Documentation/security/index.rst b/Documentation/security/index.rst
index aad6d92ffe31..32b4c1db2325 100644
--- a/Documentation/security/index.rst
+++ b/Documentation/security/index.rst
@@ -12,3 +12,4 @@ Security Documentation
    SCTP
    self-protection
    tpm/index
+   landlock/index
diff --git a/Documentation/security/landlock/index.rst b/Documentation/security/landlock/index.rst
new file mode 100644
index 000000000000..d0af868d1582
--- /dev/null
+++ b/Documentation/security/landlock/index.rst
@@ -0,0 +1,20 @@
+=========================================
+Landlock LSM: programmatic access control
+=========================================
+
+Landlock is a stackable Linux Security Module (LSM) that makes it possible to
+create security sandboxes, programmable access-controls or safe endpoint
+security agents.  This kind of sandbox is expected to help mitigate the
+security impact of bugs or unexpected/malicious behaviors in user-space
+applications.  The current version allows only a process with the global
+CAP_SYS_ADMIN capability to create such sandboxes but the ultimate goal of
+Landlock is to empower any process, including unprivileged ones, to securely
+restrict themselves.  Landlock is inspired by seccomp-bpf but instead of
+filtering syscalls and their raw arguments, a Landlock rule can inspect the use
+of kernel objects like files and hence make a decision according to the kernel
+semantic.
+
+.. toctree::
+
+    user
+    kernel
diff --git a/Documentation/security/landlock/kernel.rst b/Documentation/security/landlock/kernel.rst
new file mode 100644
index 000000000000..7d1e06d544bf
--- /dev/null
+++ b/Documentation/security/landlock/kernel.rst
@@ -0,0 +1,99 @@
+==============================
+Landlock: kernel documentation
+==============================
+
+eBPF properties
+===============
+
+To get an expressive language while still being safe and small, Landlock is
+based on eBPF. Landlock should be usable by untrusted processes and must
+therefore expose a minimal attack surface. The eBPF bytecode is minimal,
+powerful, widely used and designed to be used by untrusted applications. Thus,
+reusing the eBPF support in the kernel enables a generic approach while
+minimizing new code.
+
+An eBPF program has access to an eBPF context containing some fields used to
+inspect the current object. These arguments can be used directly (e.g. cookie)
+or passed to helper functions according to their types (e.g. inode pointer). It
+is then possible to do complex access checks without race conditions or
+inconsistent evaluation (i.e.  `incorrect mirroring of the OS code and state
+<https://www.ndss-symposium.org/ndss2003/traps-and-pitfalls-practical-problems-system-call-interposition-based-security-tools/>`_).
+
+A Landlock hook describes a particular access type.  For now, there is two
+hooks dedicated to filesystem related operations: LANDLOCK_HOOK_FS_PICK and
+LANDLOCK_HOOK_FS_WALK.  A Landlock program is tied to one hook.  This makes it
+possible to statically check context accesses, potentially performed by such
+program, and hence prevents kernel address leaks and ensure the right use of
+hook arguments with eBPF functions.  Any user can add multiple Landlock
+programs per Landlock hook.  They are stacked and evaluated one after the
+other, starting from the most recent program, as seccomp-bpf does with its
+filters.  Underneath, a hook is an abstraction over a set of LSM hooks.
+
+
+Guiding principles
+==================
+
+Unprivileged use
+----------------
+
+* Landlock helpers and context should be usable by any unprivileged and
+  untrusted program while following the system security policy enforced by
+  other access control mechanisms (e.g. DAC, LSM).
+
+
+Landlock hook and context
+-------------------------
+
+* A Landlock hook shall be focused on access control on kernel objects instead
+  of syscall filtering (i.e. syscall arguments), which is the purpose of
+  seccomp-bpf.
+* A Landlock context provided by a hook shall express the minimal and more
+  generic interface to control an access for a kernel object.
+* A hook shall guaranty that all the BPF function calls from a program are
+  safe.  Thus, the related Landlock context arguments shall always be of the
+  same type for a particular hook.  For example, a network hook could share
+  helpers with a file hook because of UNIX socket.  However, the same helpers
+  may not be compatible for a file system handle and a net handle.
+* Multiple hooks may use the same context interface.
+
+
+Landlock helpers
+----------------
+
+* Landlock helpers shall be as generic as possible while at the same time being
+  as simple as possible and following the syscall creation principles (cf.
+  *Documentation/adding-syscalls.txt*).
+* The only behavior change allowed on a helper is to fix a (logical) bug to
+  match the initial semantic.
+* Helpers shall be reentrant, i.e. only take inputs from arguments (e.g. from
+  the BPF context), to enable a hook to use a cache.  Future program options
+  might change this cache behavior.
+* It is quite easy to add new helpers to extend Landlock.  The main concern
+  should be about the possibility to leak information from the kernel that may
+  not be accessible otherwise (i.e. side-channel attack).
+
+
+Questions and answers
+=====================
+
+Why not create a custom hook for each kind of action?
+-----------------------------------------------------
+
+Landlock programs can handle these checks.  Adding more exceptions to the
+kernel code would lead to more code complexity.  A decision to ignore a kind of
+action can and should be done at the beginning of a Landlock program.
+
+
+Why a program does not return an errno or a kill code?
+------------------------------------------------------
+
+seccomp filters can return multiple kind of code, including an errno value or a
+kill signal, which may be convenient for access control.  Those return codes
+are hardwired in the userland ABI.  Instead, Landlock's approach is to return a
+boolean to allow or deny an action, which is much simpler and more generic.
+Moreover, we do not really have a choice because, unlike to seccomp, Landlock
+programs are not enforced at the syscall entry point but may be executed at any
+point in the kernel (through LSM hooks) where an errno return code may not make
+sense.  However, with this simple ABI and with the ability to call helpers,
+Landlock may gain features similar to seccomp-bpf in the future while being
+compatible with previous programs.
diff --git a/Documentation/security/landlock/user.rst b/Documentation/security/landlock/user.rst
new file mode 100644
index 000000000000..e4bb7dba4aa7
--- /dev/null
+++ b/Documentation/security/landlock/user.rst
@@ -0,0 +1,148 @@
+================================
+Landlock: userland documentation
+================================
+
+Landlock programs
+=================
+
+eBPF programs are used to create security programs.  They are contained and can
+call only a whitelist of dedicated functions. Moreover, they can only loop
+under strict conditions, which protects from denial of service.  More
+information on BPF can be found in *Documentation/networking/filter.txt*.
+
+
+Writing a program
+-----------------
+
+To enforce a security policy, a thread first needs to create a Landlock program.
+The easiest way to write an eBPF program depicting a security program is to write
+it in the C language.  As described in *samples/bpf/README.rst*, LLVM can
+compile such programs.  Files *samples/bpf/landlock1_kern.c* and those in
+*tools/testing/selftests/landlock/* can be used as examples.
+
+Once the eBPF program is created, the next step is to create the metadata
+describing the Landlock program.  This metadata includes a subtype which
+contains the hook type to which the program is tied and some options.
+
+.. code-block:: c
+
+    union bpf_prog_subtype subtype = {
+        .landlock_hook = {
+            .type = LANDLOCK_HOOK_FS_PICK,
+            .triggers = LANDLOCK_TRIGGER_FS_PICK_OPEN,
+        }
+    };
+
+A Landlock hook describes the kind of kernel object for which a program will be
+triggered to allow or deny an action.  For example, the hook
+LANDLOCK_HOOK_FS_PICK can be triggered every time a landlocked thread performs
+a set of action related to the filesystem (e.g. open, read, write, mount...).
+This actions are identified by the `triggers` bitfield.
+
+The next step is to fill a :c:type:`union bpf_attr <bpf_attr>` with
+BPF_PROG_TYPE_LANDLOCK_HOOK, the previously created subtype and other BPF
+program metadata.  This bpf_attr must then be passed to the :manpage:`bpf(2)`
+syscall alongside the BPF_PROG_LOAD command.  If everything is deemed correct
+by the kernel, the thread gets a file descriptor referring to this program.
+
+In the following code, the *insn* variable is an array of BPF instructions
+which can be extracted from an ELF file as is done in bpf_load_file() from
+*samples/bpf/bpf_load.c*.
+
+.. code-block:: c
+
+    union bpf_attr attr = {
+        .prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+        .insn_cnt = sizeof(insn) / sizeof(struct bpf_insn),
+        .insns = (__u64) (unsigned long) insn,
+        .license = (__u64) (unsigned long) "GPL",
+        .prog_subtype = &subtype,
+        .prog_subtype_size = sizeof(subtype),
+    };
+    int fd = bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+    if (fd == -1)
+        exit(1);
+
+
+Enforcing a program
+-------------------
+
+Once the Landlock program has been created or received (e.g. through a UNIX
+socket), the thread willing to sandbox itself (and its future children) should
+perform the following two steps.
+
+The thread should first request to never be allowed to get new privileges with a
+call to :manpage:`prctl(2)` and the PR_SET_NO_NEW_PRIVS option.  More
+information can be found in *Documentation/prctl/no_new_privs.txt*.
+
+.. code-block:: c
+
+    if (prctl(PR_SET_NO_NEW_PRIVS, 1, NULL, 0, 0))
+        exit(1);
+
+A thread can apply a program to itself by using the :manpage:`seccomp(2)` syscall.
+The operation is SECCOMP_PREPEND_LANDLOCK_PROG, the flags must be empty and the
+*args* argument must point to a valid Landlock program file descriptor.
+
+.. code-block:: c
+
+    if (seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, &fd))
+        exit(1);
+
+If the syscall succeeds, the program is now enforced on the calling thread and
+will be enforced on all its subsequently created children of the thread as
+well.  Once a thread is landlocked, there is no way to remove this security
+policy, only stacking more restrictions is allowed.  The program evaluation is
+performed from the newest to the oldest.
+
+When a syscall ask for an action on a kernel object, if this action is denied,
+then an EACCES errno code is returned through the syscall.
+
+
+.. _inherited_programs:
+
+Inherited programs
+------------------
+
+Every new thread resulting from a :manpage:`clone(2)` inherits Landlock program
+restrictions from its parent.  This is similar to the seccomp inheritance as
+described in *Documentation/prctl/seccomp_filter.txt*.
+
+
+Ptrace restrictions
+-------------------
+
+A landlocked process has less privileges than a non-landlocked process and must
+then be subject to additional restrictions when manipulating another process.
+To be allowed to use :manpage:`ptrace(2)` and related syscalls on a target
+process, a landlocked process must have a subset of the target process programs.
+
+
+Landlock structures and constants
+=================================
+
+Hook types
+----------
+
+.. kernel-doc:: include/uapi/linux/landlock.h
+    :functions: landlock_hook_type
+
+
+Contexts
+--------
+
+.. kernel-doc:: include/uapi/linux/landlock.h
+    :functions: landlock_ctx_fs_pick landlock_ctx_fs_walk landlock_ctx_fs_get
+
+
+Triggers for fs_pick
+--------------------
+
+.. kernel-doc:: include/uapi/linux/landlock.h
+    :functions: landlock_triggers
+
+
+Additional documentation
+========================
+
+See https://landlock.io
-- 
2.20.1

^ permalink raw reply related

* [PATCH bpf-next v9 09/10] bpf,landlock: Add tests for Landlock
From: Mickaël Salaün @ 2019-06-25 21:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk
In-Reply-To: <20190625215239.11136-1-mic@digikod.net>

Test basic context access, ptrace protection and filesystem hooks.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
---

Changes since v8:
* update eBPF include path for macros
* use TEST_GEN_PROGS and use the generic "clean" target
* add more verbose errors
* update the bpf/verifier files
* remove chain tests (from landlock and bpf/verifier)
* replace the whitelist tests with blacklist tests (because of stateless
  Landlock programs): remove "dotdot" tests and other depth tests
* sync the landlock Makefile with its bpf sibling directory and use
  bpf_load_program_xattr()

Changes since v7:
* update tests and add new ones for filesystem hierarchy and Landlock
  chains.

Changes since v6:
* use the new kselftest_harness.h
* use const variables
* replace ASSERT_STEP with ASSERT_*
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
* force sample library rebuild
* fix install target

Changes since v5:
* add subtype test
* add ptrace tests
* split and rename files
* cleanup and rebase
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/bpf/bpf_helpers.h     |   2 +
 tools/testing/selftests/bpf/test_verifier.c   |   1 +
 .../testing/selftests/bpf/verifier/landlock.c |  35 +++
 .../testing/selftests/bpf/verifier/subtype.c  |  10 +
 tools/testing/selftests/landlock/.gitignore   |   4 +
 tools/testing/selftests/landlock/Makefile     |  39 +++
 tools/testing/selftests/landlock/test.h       |  48 ++++
 tools/testing/selftests/landlock/test_base.c  |  24 ++
 tools/testing/selftests/landlock/test_fs.c    | 257 ++++++++++++++++++
 .../testing/selftests/landlock/test_ptrace.c  | 154 +++++++++++
 11 files changed, 575 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/verifier/landlock.c
 create mode 100644 tools/testing/selftests/landlock/.gitignore
 create mode 100644 tools/testing/selftests/landlock/Makefile
 create mode 100644 tools/testing/selftests/landlock/test.h
 create mode 100644 tools/testing/selftests/landlock/test_base.c
 create mode 100644 tools/testing/selftests/landlock/test_fs.c
 create mode 100644 tools/testing/selftests/landlock/test_ptrace.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..342a7d714fb9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -21,6 +21,7 @@ TARGETS += ir
 TARGETS += kcmp
 TARGETS += kexec
 TARGETS += kvm
+TARGETS += landlock
 TARGETS += lib
 TARGETS += livepatch
 TARGETS += membarrier
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 1a5b1accf091..0b15c49fac3f 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -225,6 +225,8 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
 static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
 	(void *)BPF_FUNC_sk_storage_delete;
 static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
+static unsigned long long (*bpf_inode_map_lookup)(void *map, void *key) =
+	(void *) BPF_FUNC_inode_map_lookup;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 93faffd31fc3..c67218ffebf9 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -30,6 +30,7 @@
 #include <linux/bpf.h>
 #include <linux/if_ether.h>
 #include <linux/btf.h>
+#include <linux/landlock.h>
 
 #include <bpf/bpf.h>
 #include <bpf/libbpf.h>
diff --git a/tools/testing/selftests/bpf/verifier/landlock.c b/tools/testing/selftests/bpf/verifier/landlock.c
new file mode 100644
index 000000000000..7ed4e24c0a88
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/landlock.c
@@ -0,0 +1,35 @@
+{
+	"landlock/fs_pick: always accept",
+	.insns = {
+		BPF_MOV32_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+	.has_prog_subtype = true,
+	.prog_subtype = {
+		.landlock_hook = {
+			.type = LANDLOCK_HOOK_FS_PICK,
+			.triggers = LANDLOCK_TRIGGER_FS_PICK_READ,
+		}
+	},
+},
+{
+	"landlock/fs_pick: read context",
+	.insns = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_6,
+			offsetof(struct landlock_ctx_fs_pick, inode)),
+		BPF_MOV32_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+	.has_prog_subtype = true,
+	.prog_subtype = {
+		.landlock_hook = {
+			.type = LANDLOCK_HOOK_FS_PICK,
+			.triggers = LANDLOCK_TRIGGER_FS_PICK_READ,
+		}
+	},
+},
diff --git a/tools/testing/selftests/bpf/verifier/subtype.c b/tools/testing/selftests/bpf/verifier/subtype.c
index cf614223d53f..6bb7ef4b39b5 100644
--- a/tools/testing/selftests/bpf/verifier/subtype.c
+++ b/tools/testing/selftests/bpf/verifier/subtype.c
@@ -8,3 +8,13 @@
 	.result = REJECT,
 	.has_prog_subtype = true,
 },
+{
+	"missing subtype",
+	.insns = {
+		BPF_MOV32_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+},
diff --git a/tools/testing/selftests/landlock/.gitignore b/tools/testing/selftests/landlock/.gitignore
new file mode 100644
index 000000000000..25b9cd834c3c
--- /dev/null
+++ b/tools/testing/selftests/landlock/.gitignore
@@ -0,0 +1,4 @@
+/test_base
+/test_fs
+/test_ptrace
+/tmp_*
diff --git a/tools/testing/selftests/landlock/Makefile b/tools/testing/selftests/landlock/Makefile
new file mode 100644
index 000000000000..7a253bf6d580
--- /dev/null
+++ b/tools/testing/selftests/landlock/Makefile
@@ -0,0 +1,39 @@
+LIBDIR := ../../../lib
+BPFDIR := $(LIBDIR)/bpf
+APIDIR := ../../../include/uapi
+GENDIR := ../../../../include/generated
+GENHDR := $(GENDIR)/autoconf.h
+
+ifneq ($(wildcard $(GENHDR)),)
+  GENFLAGS := -DHAVE_GENHDR
+endif
+
+BPFOBJS := $(BPFDIR)/bpf.o $(BPFDIR)/nlattr.o
+LOADOBJ := ../../../../samples/bpf/bpf_load.o
+
+CFLAGS += -Wl,-no-as-needed -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include
+LDFLAGS += -lelf
+
+test_src = $(wildcard test_*.c)
+
+test_objs := $(test_src:.c=)
+
+TEST_GEN_PROGS := $(test_objs)
+
+.PHONY: all clean force
+
+all: $(test_objs)
+
+# force a rebuild of BPFOBJS when its dependencies are updated
+force:
+
+# rebuild bpf.o as a workaround for the samples/bpf bug
+$(BPFOBJS): $(LOADOBJ) force
+	$(MAKE) -C $(BPFDIR)
+
+$(LOADOBJ): force
+	$(MAKE) -C $(dir $(LOADOBJ))
+
+$(test_objs): $(BPFOBJS) $(LOADOBJ) ../kselftest_harness.h
+
+include ../lib.mk
diff --git a/tools/testing/selftests/landlock/test.h b/tools/testing/selftests/landlock/test.h
new file mode 100644
index 000000000000..7d412d94148c
--- /dev/null
+++ b/tools/testing/selftests/landlock/test.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Landlock helpers
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2019 ANSSI
+ */
+
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <linux/filter.h>
+#include <linux/landlock.h>
+#include <linux/seccomp.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+
+#include "../kselftest_harness.h"
+#include "../../../../samples/bpf/bpf_load.h"
+
+#ifndef SECCOMP_PREPEND_LANDLOCK_PROG
+#define SECCOMP_PREPEND_LANDLOCK_PROG	4
+#endif
+
+#ifndef seccomp
+static int __attribute__((unused)) seccomp(unsigned int op, unsigned int flags,
+		void *args)
+{
+	errno = 0;
+	return syscall(__NR_seccomp, op, flags, args);
+}
+#endif
+
+/* bpf_load_program() with subtype */
+static int __attribute__((unused)) ll_bpf_load_program(
+		const struct bpf_insn *insns, size_t insns_cnt, char *log_buf,
+		size_t log_buf_sz, const union bpf_prog_subtype *subtype)
+{
+	struct bpf_load_program_attr load_attr;
+
+	memset(&load_attr, 0, sizeof(struct bpf_load_program_attr));
+	load_attr.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK;
+	load_attr.prog_subtype = subtype;
+	load_attr.insns = insns;
+	load_attr.insns_cnt = insns_cnt;
+	load_attr.license = "GPL";
+
+	return bpf_load_program_xattr(&load_attr, log_buf, log_buf_sz);
+}
diff --git a/tools/testing/selftests/landlock/test_base.c b/tools/testing/selftests/landlock/test_base.c
new file mode 100644
index 000000000000..db46f39048cb
--- /dev/null
+++ b/tools/testing/selftests/landlock/test_base.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - base
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+
+#include "test.h"
+
+TEST(seccomp_landlock)
+{
+	int ret;
+
+	ret = seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, NULL);
+	EXPECT_EQ(-1, ret);
+	EXPECT_EQ(EFAULT, errno) {
+		TH_LOG("Kernel does not support CONFIG_SECURITY_LANDLOCK");
+	}
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/landlock/test_fs.c b/tools/testing/selftests/landlock/test_fs.c
new file mode 100644
index 000000000000..dba726ea4994
--- /dev/null
+++ b/tools/testing/selftests/landlock/test_fs.c
@@ -0,0 +1,257 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - file system
+ *
+ * Copyright © 2018-2019 Mickaël Salaün <mic@digikod.net>
+ */
+
+#include <fcntl.h> /* O_DIRECTORY */
+#include <sys/stat.h> /* statbuf */
+#include <unistd.h> /* faccessat() */
+
+#include "test.h"
+
+#define TEST_PATH_TRIGGERS ( \
+		LANDLOCK_TRIGGER_FS_PICK_OPEN | \
+		LANDLOCK_TRIGGER_FS_PICK_READDIR | \
+		LANDLOCK_TRIGGER_FS_PICK_EXECUTE | \
+		LANDLOCK_TRIGGER_FS_PICK_GETATTR)
+
+static void test_path_rel(struct __test_metadata *_metadata, int dirfd,
+		const char *path, int ret)
+{
+	int fd;
+	struct stat statbuf;
+
+	ASSERT_EQ(ret, faccessat(dirfd, path, R_OK | X_OK, 0));
+	ASSERT_EQ(ret, fstatat(dirfd, path, &statbuf, 0));
+	fd = openat(dirfd, path, O_DIRECTORY);
+	if (ret) {
+		ASSERT_EQ(-1, fd);
+	} else {
+		ASSERT_NE(-1, fd);
+		EXPECT_EQ(0, close(fd));
+	}
+}
+
+static void test_path(struct __test_metadata *_metadata, const char *path,
+		int ret)
+{
+	return test_path_rel(_metadata, AT_FDCWD, path, ret);
+}
+
+static const char d1[] = "/usr";
+static const char d2[] = "/usr/share";
+static const char d3[] = "/usr/share/doc";
+
+TEST(fs_base)
+{
+	test_path(_metadata, d1, 0);
+	test_path(_metadata, d2, 0);
+	test_path(_metadata, d3, 0);
+}
+
+#define MAP_VALUE_DENY 1
+
+static int create_denied_inode_map(struct __test_metadata *_metadata,
+		const char *const dirs[])
+{
+	int map, key, dirs_len, i;
+	__u64 value = MAP_VALUE_DENY;
+
+	ASSERT_NE(NULL, dirs) {
+		TH_LOG("No directory list\n");
+	}
+	ASSERT_NE(NULL, dirs[0]) {
+		TH_LOG("Empty directory list\n");
+	}
+
+	/* get the number of dir entries */
+	for (dirs_len = 0; dirs[dirs_len]; dirs_len++);
+	map = bpf_create_map(BPF_MAP_TYPE_INODE, sizeof(key), sizeof(value),
+			dirs_len, 0);
+	ASSERT_NE(-1, map) {
+		TH_LOG("Failed to create a map of %d elements: %s\n", dirs_len,
+				strerror(errno));
+	}
+
+	for (i = 0; dirs[i]; i++) {
+		key = open(dirs[i], O_RDONLY | O_CLOEXEC | O_DIRECTORY);
+		ASSERT_NE(-1, key) {
+			TH_LOG("Failed to open directory \"%s\": %s\n", dirs[i],
+					strerror(errno));
+		}
+		ASSERT_EQ(0, bpf_map_update_elem(map, &key, &value, BPF_ANY)) {
+			TH_LOG("Failed to update the map with \"%s\": %s\n",
+					dirs[i], strerror(errno));
+		}
+		close(key);
+	}
+	return map;
+}
+
+static void enforce_map(struct __test_metadata *_metadata, int map,
+		bool subpath)
+{
+	const struct bpf_insn prog_deny[] = {
+		BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_1),
+		/* look for the requested inode in the map */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6,
+			offsetof(struct landlock_ctx_fs_walk, inode)),
+		BPF_LD_MAP_FD(BPF_REG_1, map), /* 2 instructions */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				BPF_FUNC_inode_map_lookup),
+		/* if it is there, then deny access to the inode, otherwise
+		 * allow it */
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, MAP_VALUE_DENY, 2),
+		BPF_MOV32_IMM(BPF_REG_0, LANDLOCK_RET_ALLOW),
+		BPF_EXIT_INSN(),
+		BPF_MOV32_IMM(BPF_REG_0, LANDLOCK_RET_DENY),
+		BPF_EXIT_INSN(),
+	};
+	union bpf_prog_subtype subtype = {};
+	int fd_walk = -1, fd_pick;
+	char log[1024] = "";
+
+	if (subpath) {
+		subtype.landlock_hook.type = LANDLOCK_HOOK_FS_WALK;
+		fd_walk = ll_bpf_load_program((const struct bpf_insn *)&prog_deny,
+				sizeof(prog_deny) / sizeof(struct bpf_insn),
+				log, sizeof(log), &subtype);
+		ASSERT_NE(-1, fd_walk) {
+			TH_LOG("Failed to load fs_walk program: %s\n%s",
+					strerror(errno), log);
+		}
+		ASSERT_EQ(0, seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, &fd_walk)) {
+			TH_LOG("Failed to apply Landlock program: %s", strerror(errno));
+		}
+		EXPECT_EQ(0, close(fd_walk));
+	}
+
+	subtype.landlock_hook.type = LANDLOCK_HOOK_FS_PICK;
+	subtype.landlock_hook.triggers = TEST_PATH_TRIGGERS;
+	fd_pick = ll_bpf_load_program((const struct bpf_insn *)&prog_deny,
+			sizeof(prog_deny) / sizeof(struct bpf_insn), log,
+			sizeof(log), &subtype);
+	ASSERT_NE(-1, fd_pick) {
+		TH_LOG("Failed to load fs_pick program: %s\n%s",
+				strerror(errno), log);
+	}
+	ASSERT_EQ(0, seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, &fd_pick)) {
+		TH_LOG("Failed to apply Landlock program: %s", strerror(errno));
+	}
+	EXPECT_EQ(0, close(fd_pick));
+}
+
+static void check_map_blacklist(struct __test_metadata *_metadata,
+		bool subpath)
+{
+	int map = create_denied_inode_map(_metadata, (const char *const [])
+			{ d2, NULL });
+	ASSERT_NE(-1, map);
+	enforce_map(_metadata, map, subpath);
+	test_path(_metadata, d1, 0);
+	test_path(_metadata, d2, -1);
+	test_path(_metadata, d3, subpath ? -1 : 0);
+	EXPECT_EQ(0, close(map));
+}
+
+TEST(fs_map_blacklist_literal)
+{
+	check_map_blacklist(_metadata, false);
+}
+
+TEST(fs_map_blacklist_subpath)
+{
+	check_map_blacklist(_metadata, true);
+}
+
+static const char r2[] = ".";
+static const char r3[] = "./doc";
+
+enum relative_access {
+	REL_OPEN,
+	REL_CHDIR,
+	REL_CHROOT,
+};
+
+static void check_access(struct __test_metadata *_metadata,
+		bool enforce, enum relative_access rel)
+{
+	int dirfd;
+	int map = -1;
+
+	if (rel == REL_CHROOT)
+		ASSERT_NE(-1, chdir(d2));
+	if (enforce) {
+		map = create_denied_inode_map(_metadata, (const char *const [])
+				{ d3, NULL });
+		ASSERT_NE(-1, map);
+		enforce_map(_metadata, map, true);
+	}
+	switch (rel) {
+	case REL_OPEN:
+		dirfd = open(d2, O_DIRECTORY);
+		ASSERT_NE(-1, dirfd);
+		break;
+	case REL_CHDIR:
+		ASSERT_NE(-1, chdir(d2));
+		dirfd = AT_FDCWD;
+		break;
+	case REL_CHROOT:
+		ASSERT_NE(-1, chroot(d2)) {
+			TH_LOG("Failed to chroot: %s\n", strerror(errno));
+		}
+		dirfd = AT_FDCWD;
+		break;
+	default:
+		ASSERT_TRUE(false);
+		return;
+	}
+
+	test_path_rel(_metadata, dirfd, r2, 0);
+	test_path_rel(_metadata, dirfd, r3, enforce ? -1 : 0);
+
+	if (rel == REL_OPEN)
+		EXPECT_EQ(0, close(dirfd));
+	if (enforce)
+		EXPECT_EQ(0, close(map));
+}
+
+TEST(fs_allow_open)
+{
+	/* no enforcement, via open */
+	check_access(_metadata, false, REL_OPEN);
+}
+
+TEST(fs_allow_chdir)
+{
+	/* no enforcement, via chdir */
+	check_access(_metadata, false, REL_CHDIR);
+}
+
+TEST(fs_allow_chroot)
+{
+	/* no enforcement, via chroot */
+	check_access(_metadata, false, REL_CHROOT);
+}
+
+TEST(fs_deny_open)
+{
+	/* enforcement without tag, via open */
+	check_access(_metadata, true, REL_OPEN);
+}
+
+TEST(fs_deny_chdir)
+{
+	/* enforcement without tag, via chdir */
+	check_access(_metadata, true, REL_CHDIR);
+}
+
+TEST(fs_deny_chroot)
+{
+	/* enforcement without tag, via chroot */
+	check_access(_metadata, true, REL_CHROOT);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/landlock/test_ptrace.c b/tools/testing/selftests/landlock/test_ptrace.c
new file mode 100644
index 000000000000..2f3e346288bc
--- /dev/null
+++ b/tools/testing/selftests/landlock/test_ptrace.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - ptrace
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ */
+
+#define _GNU_SOURCE
+#include <signal.h> /* raise */
+#include <sys/ptrace.h>
+#include <sys/types.h> /* waitpid */
+#include <sys/wait.h> /* waitpid */
+#include <unistd.h> /* fork, pipe */
+
+#include "test.h"
+
+static void apply_null_sandbox(struct __test_metadata *_metadata)
+{
+	const struct bpf_insn prog_accept[] = {
+		BPF_MOV32_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+	const union bpf_prog_subtype subtype = {
+		.landlock_hook = {
+			.type = LANDLOCK_HOOK_FS_PICK,
+			.triggers = LANDLOCK_TRIGGER_FS_PICK_OPEN,
+		}
+	};
+	int prog;
+	char log[256] = "";
+
+	prog = ll_bpf_load_program((const struct bpf_insn *)&prog_accept,
+			sizeof(prog_accept) / sizeof(struct bpf_insn), log,
+			sizeof(log), &subtype);
+	ASSERT_NE(-1, prog) {
+		TH_LOG("Failed to load minimal rule: %s\n%s",
+				strerror(errno), log);
+	}
+	ASSERT_EQ(0, seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, &prog)) {
+		TH_LOG("Failed to apply minimal rule: %s", strerror(errno));
+	}
+	EXPECT_EQ(0, close(prog));
+}
+
+/* PTRACE_TRACEME and PTRACE_ATTACH without Landlock rules effect */
+static void check_ptrace(struct __test_metadata *_metadata,
+		int sandbox_both, int sandbox_parent, int sandbox_child,
+		int expect_ptrace)
+{
+	pid_t child;
+	int status;
+	int pipefd[2];
+
+	ASSERT_EQ(0, pipe(pipefd));
+	if (sandbox_both)
+		apply_null_sandbox(_metadata);
+
+	child = fork();
+	ASSERT_LE(0, child);
+	if (child == 0) {
+		char buf;
+
+		EXPECT_EQ(0, close(pipefd[1]));
+		if (sandbox_child)
+			apply_null_sandbox(_metadata);
+
+		/* test traceme */
+		ASSERT_EQ(expect_ptrace, ptrace(PTRACE_TRACEME));
+		if (expect_ptrace) {
+			ASSERT_EQ(EPERM, errno);
+		} else {
+			ASSERT_EQ(0, raise(SIGSTOP));
+		}
+
+		/* sync */
+		ASSERT_EQ(1, read(pipefd[0], &buf, 1)) {
+			TH_LOG("Failed to read() sync from parent");
+		}
+		ASSERT_EQ('.', buf);
+		_exit(_metadata->passed ? EXIT_SUCCESS : EXIT_FAILURE);
+	}
+
+	EXPECT_EQ(0, close(pipefd[0]));
+	if (sandbox_parent)
+		apply_null_sandbox(_metadata);
+
+	/* test traceme */
+	if (!expect_ptrace) {
+		ASSERT_EQ(child, waitpid(child, &status, 0));
+		ASSERT_EQ(1, WIFSTOPPED(status));
+		ASSERT_EQ(0, ptrace(PTRACE_DETACH, child, NULL, 0));
+	}
+	/* test attach */
+	ASSERT_EQ(expect_ptrace, ptrace(PTRACE_ATTACH, child, NULL, 0));
+	if (expect_ptrace) {
+		ASSERT_EQ(EPERM, errno);
+	} else {
+		ASSERT_EQ(child, waitpid(child, &status, 0));
+		ASSERT_EQ(1, WIFSTOPPED(status));
+		ASSERT_EQ(0, ptrace(PTRACE_CONT, child, NULL, 0));
+	}
+
+	/* sync */
+	ASSERT_EQ(1, write(pipefd[1], ".", 1)) {
+		TH_LOG("Failed to write() sync to child");
+	}
+	ASSERT_EQ(child, waitpid(child, &status, 0));
+	if (WIFSIGNALED(status) || WEXITSTATUS(status))
+		_metadata->passed = 0;
+}
+
+TEST(ptrace_allow_without_sandbox)
+{
+	/* no sandbox */
+	check_ptrace(_metadata, 0, 0, 0, 0);
+}
+
+TEST(ptrace_allow_with_one_sandbox)
+{
+	/* child sandbox */
+	check_ptrace(_metadata, 0, 0, 1, 0);
+}
+
+TEST(ptrace_allow_with_nested_sandbox)
+{
+	/* inherited and child sandbox */
+	check_ptrace(_metadata, 1, 0, 1, 0);
+}
+
+TEST(ptrace_deny_with_parent_sandbox)
+{
+	/* parent sandbox */
+	check_ptrace(_metadata, 0, 1, 0, -1);
+}
+
+TEST(ptrace_deny_with_nested_and_parent_sandbox)
+{
+	/* inherited and parent sandbox */
+	check_ptrace(_metadata, 1, 1, 0, -1);
+}
+
+TEST(ptrace_deny_with_forked_sandbox)
+{
+	/* inherited, parent and child sandbox */
+	check_ptrace(_metadata, 1, 1, 1, -1);
+}
+
+TEST(ptrace_deny_with_sibling_sandbox)
+{
+	/* parent and child sandbox */
+	check_ptrace(_metadata, 0, 1, 1, -1);
+}
+
+TEST_HARNESS_MAIN
-- 
2.20.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox