Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Lorenz Bauer @ 2019-07-23 10:45 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Jann Horn, Greg KH, Linux API
In-Reply-To: <4A7A225A-6C23-4C0F-9A95-7C6C56B281ED@fb.com>

On Mon, 22 Jul 2019 at 21:54, Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy, Lorenz, and all,
>
> With 5.3-rc1 out, I am back on this. :)
>
> How about we modify the set as:
>   1. Introduce sys_bpf_with_cap() that takes fd of /dev/bpf.
>   2. Better handling of capable() calls through bpf code. I guess the
>      biggest problem here is is_priv in verifier.c:bpf_check().
>
> With this approach, we will be able to pass the fd around, so it should
> also solve problem for Go.

Thanks for picking this up again. I need to figure out what the API
for this would
look like on the Go side, but I think it's a nice solution!

Lorenz

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
From: Joel Fernandes @ 2019-07-23 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, minchan, namhyung, sspatil@
In-Reply-To: <20190723060525.GA4552@dhcp22.suse.cz>

On Tue, Jul 23, 2019 at 08:05:25AM +0200, Michal Hocko wrote:
> [Cc linux-api - please always do CC this list when introducing a user
>  visible API]

Sorry, will do.

> On Mon 22-07-19 17:32:04, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > It ended up simplifying userspace code, solving the security issue
> > mentioned and works quite well. SELinux does not need to be turned off
> > since no pagemap look up is needed.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time.
> > 
> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> 
> I didn't get to read the actual code but the overall idea makes sense to
> me. I can see this being useful for userspace memory management (along
> with remote MADV_PAGEOUT, MADV_COLD).

Thanks.

> Normally I would object that a cumbersome nature of the existing
> interface can be hidden in a userspace but I do agree that rowhammer has
> made this one close to unusable for anything but a privileged process.

Agreed, this is one of the primary motivations for the patch as you said.

> I do not think you can make any argument about accuracy because
> the information will never be accurate. Sure the race window is smaller
> in principle but you can hardly say anything about how much or whether
> at all.

Sure, fair enough. That is why I wasn't beating the drum too much on the
accuracy point. However, this surprisingly does work quite well.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-23 15:11 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API
In-Reply-To: <4A7A225A-6C23-4C0F-9A95-7C6C56B281ED@fb.com>

On Mon, Jul 22, 2019 at 1:54 PM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy, Lorenz, and all,
>
> > On Jul 2, 2019, at 2:32 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Tue, Jul 2, 2019 at 2:04 PM Kees Cook <keescook@chromium.org> wrote:
> >>
> >> On Mon, Jul 01, 2019 at 06:59:13PM -0700, Andy Lutomirski wrote:
> >>> I think I'm understanding your motivation.  You're not trying to make
> >>> bpf() generically usable without privilege -- you're trying to create
> >>> a way to allow certain users to access dangerous bpf functionality
> >>> within some limits.
> >>>
> >>> That's a perfectly fine goal, but I think you're reinventing the
> >>> wheel, and the wheel you're reinventing is quite complicated and
> >>> already exists.  I think you should teach bpftool to be secure when
> >>> installed setuid root or with fscaps enabled and put your policy in
> >>> bpftool.  If you want to harden this a little bit, it would seem
> >>> entirely reasonable to add a new CAP_BPF_ADMIN and change some, but
> >>> not all, of the capable() checks to check CAP_BPF_ADMIN instead of the
> >>> capabilities that they currently check.
> >>
> >> If finer grained controls are wanted, it does seem like the /dev/bpf
> >> path makes the most sense. open, request abilities, use fd. The open can
> >> be mediated by DAC and LSM. The request can be mediated by LSM. This
> >> provides a way to add policy at the LSM level and at the tool level.
> >> (i.e. For tool-level controls: leave LSM wide open, make /dev/bpf owned
> >> by "bpfadmin" and bpftool becomes setuid "bpfadmin". For fine-grained
> >> controls, leave /dev/bpf wide open and add policy to SELinux, etc.)
> >>
> >> With only a new CAP, you don't get the fine-grained controls. (The
> >> "request abilities" part is the key there.)
> >
> > Sure you do: the effective set.  It has somewhat bizarre defaults, but
> > I don't think that's a real problem.  Also, this wouldn't be like
> > CAP_DAC_READ_SEARCH -- you can't accidentally use your BPF caps.
> >
> > I think that a /dev capability-like object isn't totally nuts, but I
> > think we should do it well, and this patch doesn't really achieve
> > that.  But I don't think bpf wants fine-grained controls like this at
> > all -- as I pointed upthread, a fine-grained solution really wants
> > different treatment for the different capable() checks, and a bunch of
> > them won't resemble capabilities or /dev/bpf at all.
>
> With 5.3-rc1 out, I am back on this. :)
>
> How about we modify the set as:
>   1. Introduce sys_bpf_with_cap() that takes fd of /dev/bpf.

I'm fine with this in principle, but:

>   2. Better handling of capable() calls through bpf code. I guess the
>      biggest problem here is is_priv in verifier.c:bpf_check().

I think it would be good to understand exactly what /dev/bpf will
enable one to do.  Without some care, it would just become the next
CAP_SYS_ADMIN: if you can open it, sure, you're not root, but you can
intercept network traffic, modify cgroup behavior, and do plenty of
other things, any of which can probably be used to completely take
over the system.

It would also be nice to understand why you can't do what you need to
do entirely in user code using setuid or fscaps.

Finally, at risk of rehashing some old arguments, I'll point out that
the bpf() syscall is an unusual design to begin with.  As an example,
consider bpf_prog_attach().  Outside of bpf(), if I want to change the
behavior of a cgroup, I would write to a file in
/sys/kernel/cgroup/unified/whatever/, and normal DAC and MAC rules
apply.  With bpf(), however, I just call bpf() to attach a program to
the cgroup.  bpf() says "oh, you are capable(CAP_NET_ADMIN) -- go for
it!".  Unless I missed something major, and I just re-read the code,
there is no check that the caller has write or LSM permission to
anything at all in cgroupfs, and the existing API would make it very
awkward to impose any kind of DAC rules here.

So I think it might actually be time to repay some techincal debt and
come up with a real fix.  As a less intrusive approach, you could see
about requiring ownership of the cgroup directory instead of
CAP_NET_ADMIN.  As a more intrusive but perhaps better approach, you
could invert the logic to to make it work like everything outside of
cgroup: add pseudo-files like bpf.inet_ingress to the cgroup
directories, and require a writable fd to *that* to a new improved
attach API.  If a user could do:

int fd = open("/sys/fs/cgroup/.../bpf.inet_attach", O_RDWR);  /* usual
DAC and MAC policy applies */
int bpf_fd = setup the bpf stuff;  /* no privilege required, unless
the program is huge or needs is_priv */
bpf(BPF_IMPROVED_ATTACH, target = fd, program = bpf_fd);

there would be no capabilities or global privilege at all required for
this.  It would just work with cgroup delegation, containers, etc.

I think you could even pull off this type of API change with only
libbpf changes.  In particular, there's this code:

int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
                    unsigned int flags)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.target_fd     = target_fd;
        attr.attach_bpf_fd = prog_fd;
        attr.attach_type   = type;
        attr.attach_flags  = flags;

        return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
}

This would instead do something like:

int specific_target_fd = openat(target_fd, bpf_type_to_target[type], O_RDWR);
attr.target_fd = specific_target_fd;
...

return sys_bpf(BPF_PROG_IMPROVED_ATTACH, &attr, sizeof(attr));

Would this solve your problem without needing /dev/bpf at all?

--Andy

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-07-23 22:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, linux-security@vger.kernel.org, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API
In-Reply-To: <CALCETrX2bMnwC6_t4b_G-hzJSfMPrkK4YKs5ebcecv2LJ0rt3w@mail.gmail.com>



> On Jul 23, 2019, at 8:11 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Jul 22, 2019 at 1:54 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> Hi Andy, Lorenz, and all,
>> 
>>> On Jul 2, 2019, at 2:32 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Tue, Jul 2, 2019 at 2:04 PM Kees Cook <keescook@chromium.org> wrote:
>>>> 
>>>> On Mon, Jul 01, 2019 at 06:59:13PM -0700, Andy Lutomirski wrote:
>>>>> I think I'm understanding your motivation.  You're not trying to make
>>>>> bpf() generically usable without privilege -- you're trying to create
>>>>> a way to allow certain users to access dangerous bpf functionality
>>>>> within some limits.
>>>>> 
>>>>> That's a perfectly fine goal, but I think you're reinventing the
>>>>> wheel, and the wheel you're reinventing is quite complicated and
>>>>> already exists.  I think you should teach bpftool to be secure when
>>>>> installed setuid root or with fscaps enabled and put your policy in
>>>>> bpftool.  If you want to harden this a little bit, it would seem
>>>>> entirely reasonable to add a new CAP_BPF_ADMIN and change some, but
>>>>> not all, of the capable() checks to check CAP_BPF_ADMIN instead of the
>>>>> capabilities that they currently check.
>>>> 
>>>> If finer grained controls are wanted, it does seem like the /dev/bpf
>>>> path makes the most sense. open, request abilities, use fd. The open can
>>>> be mediated by DAC and LSM. The request can be mediated by LSM. This
>>>> provides a way to add policy at the LSM level and at the tool level.
>>>> (i.e. For tool-level controls: leave LSM wide open, make /dev/bpf owned
>>>> by "bpfadmin" and bpftool becomes setuid "bpfadmin". For fine-grained
>>>> controls, leave /dev/bpf wide open and add policy to SELinux, etc.)
>>>> 
>>>> With only a new CAP, you don't get the fine-grained controls. (The
>>>> "request abilities" part is the key there.)
>>> 
>>> Sure you do: the effective set.  It has somewhat bizarre defaults, but
>>> I don't think that's a real problem.  Also, this wouldn't be like
>>> CAP_DAC_READ_SEARCH -- you can't accidentally use your BPF caps.
>>> 
>>> I think that a /dev capability-like object isn't totally nuts, but I
>>> think we should do it well, and this patch doesn't really achieve
>>> that.  But I don't think bpf wants fine-grained controls like this at
>>> all -- as I pointed upthread, a fine-grained solution really wants
>>> different treatment for the different capable() checks, and a bunch of
>>> them won't resemble capabilities or /dev/bpf at all.
>> 
>> With 5.3-rc1 out, I am back on this. :)
>> 
>> How about we modify the set as:
>>  1. Introduce sys_bpf_with_cap() that takes fd of /dev/bpf.
> 
> I'm fine with this in principle, but:
> 
>>  2. Better handling of capable() calls through bpf code. I guess the
>>     biggest problem here is is_priv in verifier.c:bpf_check().
> 
> I think it would be good to understand exactly what /dev/bpf will
> enable one to do.  Without some care, it would just become the next
> CAP_SYS_ADMIN: if you can open it, sure, you're not root, but you can
> intercept network traffic, modify cgroup behavior, and do plenty of
> other things, any of which can probably be used to completely take
> over the system.

Well, yes. sys_bpf() is pretty powerful. 

The goal of /dev/bpf is to enable special users to call sys_bpf(). In 
the meanwhile, such users should not take down the whole system easily
by accident, e.g., with rm -rf /.

It is similar to CAP_BPF_ADMIN, without really adding the CAP_.  

I think adding new CAP_ requires much more effort. 

> 
> It would also be nice to understand why you can't do what you need to
> do entirely in user code using setuid or fscaps.

It is not very easy to achieve the same control: only certain users can
run certain tools (bpftool, etc.). 

The closest approach I can find is:
  1. use libcap (pam_cap) to give CAP_SETUID to certain users;
  2. add setuid(0) to bpftool.

The difference between this approach and /dev/bpf is that certain users
would be able to run other tools that call setuid(). Though I am not 
sure how many tools call setuid(), and how risky they are. 

> 
> Finally, at risk of rehashing some old arguments, I'll point out that
> the bpf() syscall is an unusual design to begin with.  As an example,
> consider bpf_prog_attach().  Outside of bpf(), if I want to change the
> behavior of a cgroup, I would write to a file in
> /sys/kernel/cgroup/unified/whatever/, and normal DAC and MAC rules
> apply.  With bpf(), however, I just call bpf() to attach a program to
> the cgroup.  bpf() says "oh, you are capable(CAP_NET_ADMIN) -- go for
> it!".  Unless I missed something major, and I just re-read the code,
> there is no check that the caller has write or LSM permission to
> anything at all in cgroupfs, and the existing API would make it very
> awkward to impose any kind of DAC rules here.
> 
> So I think it might actually be time to repay some techincal debt and
> come up with a real fix.  As a less intrusive approach, you could see
> about requiring ownership of the cgroup directory instead of
> CAP_NET_ADMIN.  As a more intrusive but perhaps better approach, you
> could invert the logic to to make it work like everything outside of
> cgroup: add pseudo-files like bpf.inet_ingress to the cgroup
> directories, and require a writable fd to *that* to a new improved
> attach API.  If a user could do:
> 
> int fd = open("/sys/fs/cgroup/.../bpf.inet_attach", O_RDWR);  /* usual
> DAC and MAC policy applies */
> int bpf_fd = setup the bpf stuff;  /* no privilege required, unless
> the program is huge or needs is_priv */
> bpf(BPF_IMPROVED_ATTACH, target = fd, program = bpf_fd);
> 
> there would be no capabilities or global privilege at all required for
> this.  It would just work with cgroup delegation, containers, etc.
> 
> I think you could even pull off this type of API change with only
> libbpf changes.  In particular, there's this code:
> 
> int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
>                    unsigned int flags)
> {
>        union bpf_attr attr;
> 
>        memset(&attr, 0, sizeof(attr));
>        attr.target_fd     = target_fd;
>        attr.attach_bpf_fd = prog_fd;
>        attr.attach_type   = type;
>        attr.attach_flags  = flags;
> 
>        return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
> }
> 
> This would instead do something like:
> 
> int specific_target_fd = openat(target_fd, bpf_type_to_target[type], O_RDWR);
> attr.target_fd = specific_target_fd;
> ...
> 
> return sys_bpf(BPF_PROG_IMPROVED_ATTACH, &attr, sizeof(attr));
> 
> Would this solve your problem without needing /dev/bpf at all?

This gives fine grain access control. I think it solves the problem. 
But it also requires a lot of rework to sys_bpf(). And it may also 
break backward/forward compatibility?

Personally, I think it is an overkill for the original motivation: 
call sys_bpf() with special user instead of root. 

Alexei, Daniel: what do you think about this? 

Thanks,
Song

^ permalink raw reply

* [PATCH v4 0/3] Casefolding in F2FS
From: Daniel Rosenberg @ 2019-07-23 23:05 UTC (permalink / raw)
  To: Jaegeuk Kim, Chao Yu, Jonathan Corbet, linux-f2fs-devel
  Cc: linux-kernel, linux-doc, linux-fsdevel, linux-api, kernel-team,
	Daniel Rosenberg

These patches are largely based on the casefolding patches for ext4

v4: Added FS_CASEFOLD_FL flag, added documentation that escaped the last
    format-patch, moved setting dentry ops to f2fs_setup_casefold
v3: Addressed feedback, apart from F2FS_CASEFOLD_FL/FS_CASEFOLD_FL
    Added sysfs file "encoding" to see the encoding set on a filesystem
v2: Rebased patches again master, changed f2fs_msg to f2fs_info/f2fs_err

Daniel Rosenberg (3):
  fs: Reserve flag for casefolding
  f2fs: include charset encoding information in the superblock
  f2fs: Support case-insensitive file name lookups

 Documentation/ABI/testing/sysfs-fs-f2fs |   7 ++
 Documentation/filesystems/f2fs.txt      |   3 +
 fs/f2fs/dir.c                           | 126 ++++++++++++++++++++++--
 fs/f2fs/f2fs.h                          |  21 +++-
 fs/f2fs/file.c                          |  16 ++-
 fs/f2fs/hash.c                          |  35 ++++++-
 fs/f2fs/inline.c                        |   4 +-
 fs/f2fs/inode.c                         |   4 +-
 fs/f2fs/namei.c                         |  21 ++++
 fs/f2fs/super.c                         |  96 ++++++++++++++++++
 fs/f2fs/sysfs.c                         |  23 +++++
 include/linux/f2fs_fs.h                 |   9 +-
 include/uapi/linux/fs.h                 |   1 +
 tools/include/uapi/linux/fs.h           |   1 +
 14 files changed, 347 insertions(+), 20 deletions(-)

-- 
2.22.0.657.g960e92d24f-goog

^ permalink raw reply

* [PATCH v4 1/3] fs: Reserve flag for casefolding
From: Daniel Rosenberg @ 2019-07-23 23:05 UTC (permalink / raw)
  To: Jaegeuk Kim, Chao Yu, Jonathan Corbet, linux-f2fs-devel
  Cc: linux-kernel, linux-doc, linux-fsdevel, linux-api, kernel-team,
	Daniel Rosenberg
In-Reply-To: <20190723230529.251659-1-drosen@google.com>

In preparation for including the casefold feature within f2fs, elevate
the EXT4_CASEFOLD_FL flag to FS_CASEFOLD_FL.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
---
 include/uapi/linux/fs.h       | 1 +
 tools/include/uapi/linux/fs.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 59c71fa8c553a..2a616aa3f6862 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -311,6 +311,7 @@ struct fscrypt_key {
 #define FS_NOCOW_FL			0x00800000 /* Do not cow file */
 #define FS_INLINE_DATA_FL		0x10000000 /* Reserved for ext4 */
 #define FS_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
+#define FS_CASEFOLD_FL			0x40000000 /* Folder is case insensitive */
 #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */
 
 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index 59c71fa8c553a..2a616aa3f6862 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -311,6 +311,7 @@ struct fscrypt_key {
 #define FS_NOCOW_FL			0x00800000 /* Do not cow file */
 #define FS_INLINE_DATA_FL		0x10000000 /* Reserved for ext4 */
 #define FS_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
+#define FS_CASEFOLD_FL			0x40000000 /* Folder is case insensitive */
 #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */
 
 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
-- 
2.22.0.657.g960e92d24f-goog

^ permalink raw reply related

* [PATCH v4 2/3] f2fs: include charset encoding information in the superblock
From: Daniel Rosenberg @ 2019-07-23 23:05 UTC (permalink / raw)
  To: Jaegeuk Kim, Chao Yu, Jonathan Corbet, linux-f2fs-devel
  Cc: linux-kernel, linux-doc, linux-fsdevel, linux-api, kernel-team,
	Daniel Rosenberg
In-Reply-To: <20190723230529.251659-1-drosen@google.com>

Add charset encoding to f2fs to support casefolding. It is modeled after
the same feature introduced in commit c83ad55eaa91 ("ext4: include charset
encoding information in the superblock")

Currently this is not compatible with encryption, similar to the current
ext4 imlpementation. This will change in the future.

>From the ext4 patch:
"""
The s_encoding field stores a magic number indicating the encoding
format and version used globally by file and directory names in the
filesystem.  The s_encoding_flags defines policies for using the charset
encoding, like how to handle invalid sequences.  The magic number is
mapped to the exact charset table, but the mapping is specific to ext4.
Since we don't have any commitment to support old encodings, the only
encoding I am supporting right now is utf8-12.1.0.

The current implementation prevents the user from enabling encoding and
per-directory encryption on the same filesystem at the same time.  The
incompatibility between these features lies in how we do efficient
directory searches when we cannot be sure the encryption of the user
provided fname will match the actual hash stored in the disk without
decrypting every directory entry, because of normalization cases.  My
quickest solution is to simply block the concurrent use of these
features for now, and enable it later, once we have a better solution.
"""

Signed-off-by: Daniel Rosenberg <drosen@google.com>
---
 Documentation/ABI/testing/sysfs-fs-f2fs |  7 ++
 Documentation/filesystems/f2fs.txt      |  3 +
 fs/f2fs/f2fs.h                          |  6 ++
 fs/f2fs/super.c                         | 95 +++++++++++++++++++++++++
 fs/f2fs/sysfs.c                         | 23 ++++++
 include/linux/f2fs_fs.h                 |  9 ++-
 6 files changed, 142 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
index dca326e0ee3e1..7ab2b1b5e255f 100644
--- a/Documentation/ABI/testing/sysfs-fs-f2fs
+++ b/Documentation/ABI/testing/sysfs-fs-f2fs
@@ -251,3 +251,10 @@ Description:
 		If checkpoint=disable, it displays the number of blocks that are unusable.
                 If checkpoint=enable it displays the enumber of blocks that would be unusable
                 if checkpoint=disable were to be set.
+
+What:		/sys/fs/f2fs/<disk>/encoding
+Date		July 2019
+Contact:	"Daniel Rosenberg" <drosen@google.com>
+Description:
+		Displays name and version of the encoding set for the filesystem.
+                If no encoding is set, displays (none)
diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
index 496fa28b24925..5fa38ab373cab 100644
--- a/Documentation/filesystems/f2fs.txt
+++ b/Documentation/filesystems/f2fs.txt
@@ -413,6 +413,9 @@ Files in /sys/fs/f2fs/<devname>
                               that would be unusable if checkpoint=disable were
                               to be set.
 
+encoding 	              This shows the encoding used for casefolding.
+                              If casefolding is not enabled, returns (none)
+
 ================================================================================
 USAGE
 ================================================================================
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 17382da7f0bd9..c6c7904572d0d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -153,6 +153,7 @@ struct f2fs_mount_info {
 #define F2FS_FEATURE_LOST_FOUND		0x0200
 #define F2FS_FEATURE_VERITY		0x0400	/* reserved */
 #define F2FS_FEATURE_SB_CHKSUM		0x0800
+#define F2FS_FEATURE_CASEFOLD		0x1000
 
 #define __F2FS_HAS_FEATURE(raw_super, mask)				\
 	((raw_super->feature & cpu_to_le32(mask)) != 0)
@@ -1169,6 +1170,10 @@ struct f2fs_sb_info {
 	int valid_super_block;			/* valid super block no */
 	unsigned long s_flag;				/* flags for sbi */
 	struct mutex writepages;		/* mutex for writepages() */
+#ifdef CONFIG_UNICODE
+	struct unicode_map *s_encoding;
+	__u16 s_encoding_flags;
+#endif
 
 #ifdef CONFIG_BLK_DEV_ZONED
 	unsigned int blocks_per_blkz;		/* F2FS blocks per zone */
@@ -3562,6 +3567,7 @@ F2FS_FEATURE_FUNCS(quota_ino, QUOTA_INO);
 F2FS_FEATURE_FUNCS(inode_crtime, INODE_CRTIME);
 F2FS_FEATURE_FUNCS(lost_found, LOST_FOUND);
 F2FS_FEATURE_FUNCS(sb_chksum, SB_CHKSUM);
+F2FS_FEATURE_FUNCS(casefold, CASEFOLD);
 
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline bool f2fs_blkz_is_seq(struct f2fs_sb_info *sbi, int devi,
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 6de6cda440315..068db78a1e03e 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -23,6 +23,7 @@
 #include <linux/f2fs_fs.h>
 #include <linux/sysfs.h>
 #include <linux/quota.h>
+#include <linux/unicode.h>
 
 #include "f2fs.h"
 #include "node.h"
@@ -222,6 +223,36 @@ void f2fs_printk(struct f2fs_sb_info *sbi, const char *fmt, ...)
 	va_end(args);
 }
 
+#ifdef CONFIG_UNICODE
+static const struct f2fs_sb_encodings {
+	__u16 magic;
+	char *name;
+	char *version;
+} f2fs_sb_encoding_map[] = {
+	{F2FS_ENC_UTF8_12_1, "utf8", "12.1.0"},
+};
+
+static int f2fs_sb_read_encoding(const struct f2fs_super_block *sb,
+				 const struct f2fs_sb_encodings **encoding,
+				 __u16 *flags)
+{
+	__u16 magic = le16_to_cpu(sb->s_encoding);
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(f2fs_sb_encoding_map); i++)
+		if (magic == f2fs_sb_encoding_map[i].magic)
+			break;
+
+	if (i >= ARRAY_SIZE(f2fs_sb_encoding_map))
+		return -EINVAL;
+
+	*encoding = &f2fs_sb_encoding_map[i];
+	*flags = le16_to_cpu(sb->s_encoding_flags);
+
+	return 0;
+}
+#endif
+
 static inline void limit_reserve_root(struct f2fs_sb_info *sbi)
 {
 	block_t limit = min((sbi->user_block_count << 1) / 1000,
@@ -798,6 +829,13 @@ static int parse_options(struct super_block *sb, char *options)
 		return -EINVAL;
 	}
 #endif
+#ifndef CONFIG_UNICODE
+	if (f2fs_sb_has_casefold(sbi)) {
+		f2fs_err(sbi,
+			"Filesystem with casefold feature cannot be mounted without CONFIG_UNICODE");
+		return -EINVAL;
+	}
+#endif
 
 	if (F2FS_IO_SIZE_BITS(sbi) && !test_opt(sbi, LFS)) {
 		f2fs_err(sbi, "Should set mode=lfs with %uKB-sized IO",
@@ -1089,6 +1127,9 @@ static void f2fs_put_super(struct super_block *sb)
 	destroy_percpu_info(sbi);
 	for (i = 0; i < NR_PAGE_TYPE; i++)
 		kvfree(sbi->write_io[i]);
+#ifdef CONFIG_UNICODE
+	utf8_unload(sbi->s_encoding);
+#endif
 	kvfree(sbi);
 }
 
@@ -3031,6 +3072,52 @@ static int f2fs_scan_devices(struct f2fs_sb_info *sbi)
 	return 0;
 }
 
+static int f2fs_setup_casefold(struct f2fs_sb_info *sbi)
+{
+#ifdef CONFIG_UNICODE
+	if (f2fs_sb_has_casefold(sbi) && !sbi->s_encoding) {
+		const struct f2fs_sb_encodings *encoding_info;
+		struct unicode_map *encoding;
+		__u16 encoding_flags;
+
+		if (f2fs_sb_has_encrypt(sbi)) {
+			f2fs_err(sbi,
+				"Can't mount with encoding and encryption");
+			return -EINVAL;
+		}
+
+		if (f2fs_sb_read_encoding(sbi->raw_super, &encoding_info,
+					  &encoding_flags)) {
+			f2fs_err(sbi,
+				 "Encoding requested by superblock is unknown");
+			return -EINVAL;
+		}
+
+		encoding = utf8_load(encoding_info->version);
+		if (IS_ERR(encoding)) {
+			f2fs_err(sbi,
+				 "can't mount with superblock charset: %s-%s "
+				 "not supported by the kernel. flags: 0x%x.",
+				 encoding_info->name, encoding_info->version,
+				 encoding_flags);
+			return PTR_ERR(encoding);
+		}
+		f2fs_info(sbi, "Using encoding defined by superblock: "
+			 "%s-%s with flags 0x%hx", encoding_info->name,
+			 encoding_info->version?:"\b", encoding_flags);
+
+		sbi->s_encoding = encoding;
+		sbi->s_encoding_flags = encoding_flags;
+	}
+#else
+	if (f2fs_sb_has_casefold(sbi)) {
+		f2fs_err(sbi, "Filesystem with casefold feature cannot be mounted without CONFIG_UNICODE");
+		return -EINVAL;
+	}
+#endif
+	return 0;
+}
+
 static void f2fs_tuning_parameters(struct f2fs_sb_info *sbi)
 {
 	struct f2fs_sm_info *sm_i = SM_I(sbi);
@@ -3127,6 +3214,10 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
 				le32_to_cpu(raw_super->log_blocksize);
 	sb->s_max_links = F2FS_LINK_MAX;
 
+	err = f2fs_setup_casefold(sbi);
+	if (err)
+		goto free_options;
+
 #ifdef CONFIG_QUOTA
 	sb->dq_op = &f2fs_quota_operations;
 	sb->s_qcop = &f2fs_quotactl_ops;
@@ -3477,6 +3568,10 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
 free_bio_info:
 	for (i = 0; i < NR_PAGE_TYPE; i++)
 		kvfree(sbi->write_io[i]);
+
+#ifdef CONFIG_UNICODE
+	utf8_unload(sbi->s_encoding);
+#endif
 free_options:
 #ifdef CONFIG_QUOTA
 	for (i = 0; i < MAXQUOTAS; i++)
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 3aeacd0aacfd2..f9fcca695db9f 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/proc_fs.h>
 #include <linux/f2fs_fs.h>
 #include <linux/seq_file.h>
+#include <linux/unicode.h>
 
 #include "f2fs.h"
 #include "segment.h"
@@ -81,6 +82,19 @@ static ssize_t unusable_show(struct f2fs_attr *a,
 		(unsigned long long)unusable);
 }
 
+static ssize_t encoding_show(struct f2fs_attr *a,
+		struct f2fs_sb_info *sbi, char *buf)
+{
+#ifdef CONFIG_UNICODE
+	if (f2fs_sb_has_casefold(sbi))
+		return snprintf(buf, PAGE_SIZE, "%s (%d.%d.%d)\n",
+			sbi->s_encoding->charset,
+			(sbi->s_encoding->version >> 16) & 0xff,
+			(sbi->s_encoding->version >> 8) & 0xff,
+			sbi->s_encoding->version & 0xff);
+#endif
+	return snprintf(buf, PAGE_SIZE, "(none)");
+}
 
 static ssize_t lifetime_write_kbytes_show(struct f2fs_attr *a,
 		struct f2fs_sb_info *sbi, char *buf)
@@ -134,6 +148,9 @@ static ssize_t features_show(struct f2fs_attr *a,
 	if (f2fs_sb_has_sb_chksum(sbi))
 		len += snprintf(buf + len, PAGE_SIZE - len, "%s%s",
 				len ? ", " : "", "sb_checksum");
+	if (f2fs_sb_has_casefold(sbi))
+		len += snprintf(buf + len, PAGE_SIZE - len, "%s%s",
+				len ? ", " : "", "casefold");
 	len += snprintf(buf + len, PAGE_SIZE - len, "\n");
 	return len;
 }
@@ -365,6 +382,7 @@ enum feat_id {
 	FEAT_INODE_CRTIME,
 	FEAT_LOST_FOUND,
 	FEAT_SB_CHECKSUM,
+	FEAT_CASEFOLD,
 };
 
 static ssize_t f2fs_feature_show(struct f2fs_attr *a,
@@ -382,6 +400,7 @@ static ssize_t f2fs_feature_show(struct f2fs_attr *a,
 	case FEAT_INODE_CRTIME:
 	case FEAT_LOST_FOUND:
 	case FEAT_SB_CHECKSUM:
+	case FEAT_CASEFOLD:
 		return snprintf(buf, PAGE_SIZE, "supported\n");
 	}
 	return 0;
@@ -455,6 +474,7 @@ F2FS_GENERAL_RO_ATTR(lifetime_write_kbytes);
 F2FS_GENERAL_RO_ATTR(features);
 F2FS_GENERAL_RO_ATTR(current_reserved_blocks);
 F2FS_GENERAL_RO_ATTR(unusable);
+F2FS_GENERAL_RO_ATTR(encoding);
 
 #ifdef CONFIG_FS_ENCRYPTION
 F2FS_FEATURE_RO_ATTR(encryption, FEAT_CRYPTO);
@@ -471,6 +491,7 @@ F2FS_FEATURE_RO_ATTR(quota_ino, FEAT_QUOTA_INO);
 F2FS_FEATURE_RO_ATTR(inode_crtime, FEAT_INODE_CRTIME);
 F2FS_FEATURE_RO_ATTR(lost_found, FEAT_LOST_FOUND);
 F2FS_FEATURE_RO_ATTR(sb_checksum, FEAT_SB_CHECKSUM);
+F2FS_FEATURE_RO_ATTR(casefold, FEAT_CASEFOLD);
 
 #define ATTR_LIST(name) (&f2fs_attr_##name.attr)
 static struct attribute *f2fs_attrs[] = {
@@ -515,6 +536,7 @@ static struct attribute *f2fs_attrs[] = {
 	ATTR_LIST(features),
 	ATTR_LIST(reserved_blocks),
 	ATTR_LIST(current_reserved_blocks),
+	ATTR_LIST(encoding),
 	NULL,
 };
 ATTRIBUTE_GROUPS(f2fs);
@@ -535,6 +557,7 @@ static struct attribute *f2fs_feat_attrs[] = {
 	ATTR_LIST(inode_crtime),
 	ATTR_LIST(lost_found),
 	ATTR_LIST(sb_checksum),
+	ATTR_LIST(casefold),
 	NULL,
 };
 ATTRIBUTE_GROUPS(f2fs_feat);
diff --git a/include/linux/f2fs_fs.h b/include/linux/f2fs_fs.h
index 65559900d4d76..b7c9c7f721339 100644
--- a/include/linux/f2fs_fs.h
+++ b/include/linux/f2fs_fs.h
@@ -36,6 +36,11 @@
 
 #define F2FS_MAX_QUOTAS		3
 
+#define F2FS_ENC_UTF8_12_1	1
+#define F2FS_ENC_STRICT_MODE_FL	(1 << 0)
+#define f2fs_has_strict_mode(sbi) \
+	(sbi->s_encoding_flags & F2FS_ENC_STRICT_MODE_FL)
+
 #define F2FS_IO_SIZE(sbi)	(1 << F2FS_OPTION(sbi).write_io_size_bits) /* Blocks */
 #define F2FS_IO_SIZE_KB(sbi)	(1 << (F2FS_OPTION(sbi).write_io_size_bits + 2)) /* KB */
 #define F2FS_IO_SIZE_BYTES(sbi)	(1 << (F2FS_OPTION(sbi).write_io_size_bits + 12)) /* B */
@@ -109,7 +114,9 @@ struct f2fs_super_block {
 	struct f2fs_device devs[MAX_DEVICES];	/* device list */
 	__le32 qf_ino[F2FS_MAX_QUOTAS];	/* quota inode numbers */
 	__u8 hot_ext_count;		/* # of hot file extension */
-	__u8 reserved[310];		/* valid reserved region */
+	__le16  s_encoding;		/* Filename charset encoding */
+	__le16  s_encoding_flags;	/* Filename charset encoding flags */
+	__u8 reserved[306];		/* valid reserved region */
 	__le32 crc;			/* checksum of superblock */
 } __packed;
 
-- 
2.22.0.657.g960e92d24f-goog

^ permalink raw reply related

* [PATCH v4 3/3] f2fs: Support case-insensitive file name lookups
From: Daniel Rosenberg @ 2019-07-23 23:05 UTC (permalink / raw)
  To: Jaegeuk Kim, Chao Yu, Jonathan Corbet, linux-f2fs-devel
  Cc: linux-kernel, linux-doc, linux-fsdevel, linux-api, kernel-team,
	Daniel Rosenberg
In-Reply-To: <20190723230529.251659-1-drosen@google.com>

Modeled after commit b886ee3e778e ("ext4: Support case-insensitive file
name lookups")

"""
This patch implements the actual support for case-insensitive file name
lookups in f2fs, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the casefold feature set is able to configure
directories with the +F (F2FS_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string.  This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children.  This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, F2Fs only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries.  This is bad for performance but requires some leveraging of
the vfs layer to fix.  We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware.  The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file.  This means that case-insensitive
file name lookup will not work only for that file.  An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding.  When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in f2fs is implemented
in fs/unicode, and is based on a previous version developed by
SGI.  It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for F2FS because:

  - It has a lower cost than NFC/NFKC (which requires
    decomposing to NFD as an intermediary step)
  - It doesn't eliminate important semantic meaning like
    compatibility decompositions.

Although:

- This implementation is not completely linguistic accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.
"""

Signed-off-by: Daniel Rosenberg <drosen@google.com>
---
 fs/f2fs/dir.c    | 126 +++++++++++++++++++++++++++++++++++++++++++----
 fs/f2fs/f2fs.h   |  15 ++++--
 fs/f2fs/file.c   |  16 +++++-
 fs/f2fs/hash.c   |  35 ++++++++++++-
 fs/f2fs/inline.c |   4 +-
 fs/f2fs/inode.c  |   4 +-
 fs/f2fs/namei.c  |  21 ++++++++
 fs/f2fs/super.c  |   1 +
 8 files changed, 203 insertions(+), 19 deletions(-)

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 85a1528f319f2..2913483473f30 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -8,6 +8,7 @@
 #include <linux/fs.h>
 #include <linux/f2fs_fs.h>
 #include <linux/sched/signal.h>
+#include <linux/unicode.h>
 #include "f2fs.h"
 #include "node.h"
 #include "acl.h"
@@ -81,7 +82,8 @@ static unsigned long dir_block_index(unsigned int level,
 	return bidx;
 }
 
-static struct f2fs_dir_entry *find_in_block(struct page *dentry_page,
+static struct f2fs_dir_entry *find_in_block(struct inode *dir,
+				struct page *dentry_page,
 				struct fscrypt_name *fname,
 				f2fs_hash_t namehash,
 				int *max_slots,
@@ -93,7 +95,7 @@ static struct f2fs_dir_entry *find_in_block(struct page *dentry_page,
 
 	dentry_blk = (struct f2fs_dentry_block *)page_address(dentry_page);
 
-	make_dentry_ptr_block(NULL, &d, dentry_blk);
+	make_dentry_ptr_block(dir, &d, dentry_blk);
 	de = f2fs_find_target_dentry(fname, namehash, max_slots, &d);
 	if (de)
 		*res_page = dentry_page;
@@ -101,6 +103,39 @@ static struct f2fs_dir_entry *find_in_block(struct page *dentry_page,
 	return de;
 }
 
+#ifdef CONFIG_UNICODE
+/*
+ * Test whether a case-insensitive directory entry matches the filename
+ * being searched for.
+ *
+ * Returns: 0 if the directory entry matches, more than 0 if it
+ * doesn't match or less than zero on error.
+ */
+int f2fs_ci_compare(const struct inode *parent, const struct qstr *name,
+		    const struct qstr *entry)
+{
+	const struct f2fs_sb_info *sbi = F2FS_SB(parent->i_sb);
+	const struct unicode_map *um = sbi->s_encoding;
+	int ret;
+
+	ret = utf8_strncasecmp(um, name, entry);
+	if (ret < 0) {
+		/* Handle invalid character sequence as either an error
+		 * or as an opaque byte sequence.
+		 */
+		if (f2fs_has_strict_mode(sbi))
+			return -EINVAL;
+
+		if (name->len != entry->len)
+			return 1;
+
+		return !!memcmp(name->name, entry->name, name->len);
+	}
+
+	return ret;
+}
+#endif
+
 struct f2fs_dir_entry *f2fs_find_target_dentry(struct fscrypt_name *fname,
 			f2fs_hash_t namehash, int *max_slots,
 			struct f2fs_dentry_ptr *d)
@@ -108,6 +143,9 @@ struct f2fs_dir_entry *f2fs_find_target_dentry(struct fscrypt_name *fname,
 	struct f2fs_dir_entry *de;
 	unsigned long bit_pos = 0;
 	int max_len = 0;
+#ifdef CONFIG_UNICODE
+	struct qstr entry;
+#endif
 
 	if (max_slots)
 		*max_slots = 0;
@@ -119,16 +157,28 @@ struct f2fs_dir_entry *f2fs_find_target_dentry(struct fscrypt_name *fname,
 		}
 
 		de = &d->dentry[bit_pos];
+#ifdef CONFIG_UNICODE
+		entry.name = d->filename[bit_pos];
+		entry.len = de->name_len;
+#endif
 
 		if (unlikely(!de->name_len)) {
 			bit_pos++;
 			continue;
 		}
+		if (de->hash_code == namehash) {
+#ifdef CONFIG_UNICODE
+			if (F2FS_SB(d->inode->i_sb)->s_encoding &&
+					IS_CASEFOLDED(d->inode) &&
+					!f2fs_ci_compare(d->inode,
+						fname->usr_fname, &entry))
+				goto found;
 
-		if (de->hash_code == namehash &&
-		    fscrypt_match_name(fname, d->filename[bit_pos],
-				       le16_to_cpu(de->name_len)))
-			goto found;
+#endif
+			if (fscrypt_match_name(fname, d->filename[bit_pos],
+						le16_to_cpu(de->name_len)))
+				goto found;
+		}
 
 		if (max_slots && max_len > *max_slots)
 			*max_slots = max_len;
@@ -157,7 +207,7 @@ static struct f2fs_dir_entry *find_in_level(struct inode *dir,
 	struct f2fs_dir_entry *de = NULL;
 	bool room = false;
 	int max_slots;
-	f2fs_hash_t namehash = f2fs_dentry_hash(&name, fname);
+	f2fs_hash_t namehash = f2fs_dentry_hash(dir, &name, fname);
 
 	nbucket = dir_buckets(level, F2FS_I(dir)->i_dir_level);
 	nblock = bucket_blocks(level);
@@ -179,8 +229,8 @@ static struct f2fs_dir_entry *find_in_level(struct inode *dir,
 			}
 		}
 
-		de = find_in_block(dentry_page, fname, namehash, &max_slots,
-								res_page);
+		de = find_in_block(dir, dentry_page, fname, namehash,
+							&max_slots, res_page);
 		if (de)
 			break;
 
@@ -250,6 +300,14 @@ struct f2fs_dir_entry *f2fs_find_entry(struct inode *dir,
 	struct fscrypt_name fname;
 	int err;
 
+#ifdef CONFIG_UNICODE
+	if (f2fs_has_strict_mode(F2FS_I_SB(dir)) && IS_CASEFOLDED(dir) &&
+			utf8_validate(F2FS_I_SB(dir)->s_encoding, child)) {
+		*res_page = ERR_PTR(-EINVAL);
+		return NULL;
+	}
+#endif
+
 	err = fscrypt_setup_filename(dir, child, 1, &fname);
 	if (err) {
 		if (err == -ENOENT)
@@ -504,7 +562,7 @@ int f2fs_add_regular_entry(struct inode *dir, const struct qstr *new_name,
 
 	level = 0;
 	slots = GET_DENTRY_SLOTS(new_name->len);
-	dentry_hash = f2fs_dentry_hash(new_name, NULL);
+	dentry_hash = f2fs_dentry_hash(dir, new_name, NULL);
 
 	current_depth = F2FS_I(dir)->i_current_depth;
 	if (F2FS_I(dir)->chash == dentry_hash) {
@@ -943,3 +1001,51 @@ const struct file_operations f2fs_dir_operations = {
 	.compat_ioctl   = f2fs_compat_ioctl,
 #endif
 };
+
+#ifdef CONFIG_UNICODE
+static int f2fs_d_compare(const struct dentry *dentry, unsigned int len,
+			  const char *str, const struct qstr *name)
+{
+	struct qstr qstr = {.name = str, .len = len };
+
+	if (!IS_CASEFOLDED(dentry->d_parent->d_inode)) {
+		if (len != name->len)
+			return -1;
+		return memcmp(str, name, len);
+	}
+
+	return f2fs_ci_compare(dentry->d_parent->d_inode, name, &qstr);
+}
+
+static int f2fs_d_hash(const struct dentry *dentry, struct qstr *str)
+{
+	struct f2fs_sb_info *sbi = F2FS_SB(dentry->d_sb);
+	const struct unicode_map *um = sbi->s_encoding;
+	unsigned char *norm;
+	int len, ret = 0;
+
+	if (!IS_CASEFOLDED(dentry->d_inode))
+		return 0;
+
+	norm = f2fs_kmalloc(sbi, PATH_MAX, GFP_ATOMIC);
+	if (!norm)
+		return -ENOMEM;
+
+	len = utf8_casefold(um, str, norm, PATH_MAX);
+	if (len < 0) {
+		if (f2fs_has_strict_mode(sbi))
+			ret = -EINVAL;
+		goto out;
+	}
+	str->hash = full_name_hash(dentry, norm, len);
+out:
+	kvfree(norm);
+	return ret;
+}
+
+const struct dentry_operations f2fs_dentry_ops = {
+	.d_hash = f2fs_d_hash,
+	.d_compare = f2fs_d_compare,
+};
+#endif
+
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index c6c7904572d0d..31fd2a268ba14 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2364,10 +2364,12 @@ static inline void f2fs_change_bit(unsigned int nr, char *addr)
 #define F2FS_INDEX_FL			0x00001000 /* hash-indexed directory */
 #define F2FS_DIRSYNC_FL			0x00010000 /* dirsync behaviour (directories only) */
 #define F2FS_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
+#define F2FS_CASEFOLD_FL		0x40000000 /* Casefolded file */
 
 /* Flags that should be inherited by new inodes from their parent. */
 #define F2FS_FL_INHERITED (F2FS_SYNC_FL | F2FS_NODUMP_FL | F2FS_NOATIME_FL | \
-			   F2FS_DIRSYNC_FL | F2FS_PROJINHERIT_FL)
+			   F2FS_DIRSYNC_FL | F2FS_PROJINHERIT_FL | \
+			   F2FS_CASEFOLD_FL)
 
 /* Flags that are appropriate for regular files (all but dir-specific ones). */
 #define F2FS_REG_FLMASK		(~(F2FS_DIRSYNC_FL | F2FS_PROJINHERIT_FL))
@@ -2930,6 +2932,10 @@ int f2fs_update_extension_list(struct f2fs_sb_info *sbi, const char *name,
 							bool hot, bool set);
 struct dentry *f2fs_get_parent(struct dentry *child);
 
+extern int f2fs_ci_compare(const struct inode *parent,
+			   const struct qstr *name,
+			   const struct qstr *entry);
+
 /*
  * dir.c
  */
@@ -2993,8 +2999,8 @@ int f2fs_sanity_check_ckpt(struct f2fs_sb_info *sbi);
 /*
  * hash.c
  */
-f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
-				struct fscrypt_name *fname);
+f2fs_hash_t f2fs_dentry_hash(const struct inode *dir,
+		const struct qstr *name_info, struct fscrypt_name *fname);
 
 /*
  * node.c
@@ -3437,6 +3443,9 @@ static inline void f2fs_destroy_root_stats(void) { }
 #endif
 
 extern const struct file_operations f2fs_dir_operations;
+#ifdef CONFIG_UNICODE
+extern const struct dentry_operations f2fs_dentry_ops;
+#endif
 extern const struct file_operations f2fs_file_operations;
 extern const struct inode_operations f2fs_file_inode_operations;
 extern const struct address_space_operations f2fs_dblock_aops;
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index f8d46df8fa9ee..7ab2e6fea5b82 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1660,7 +1660,16 @@ static int f2fs_setflags_common(struct inode *inode, u32 iflags, u32 mask)
 		return -EPERM;
 
 	oldflags = fi->i_flags;
+	if ((iflags ^ oldflags) & F2FS_CASEFOLD_FL) {
+		if (!f2fs_sb_has_casefold(F2FS_I_SB(inode)))
+			return -EOPNOTSUPP;
+
+		if (!S_ISDIR(inode->i_mode))
+			return -ENOTDIR;
 
+		if (!f2fs_empty_dir(inode))
+			return -ENOTEMPTY;
+	}
 	if ((iflags ^ oldflags) & (F2FS_APPEND_FL | F2FS_IMMUTABLE_FL))
 		if (!capable(CAP_LINUX_IMMUTABLE))
 			return -EPERM;
@@ -1699,6 +1708,7 @@ static const struct {
 	{ F2FS_INDEX_FL,	FS_INDEX_FL },
 	{ F2FS_DIRSYNC_FL,	FS_DIRSYNC_FL },
 	{ F2FS_PROJINHERIT_FL,	FS_PROJINHERIT_FL },
+	{ F2FS_CASEFOLD_FL,	FS_CASEFOLD_FL },
 };
 
 #define F2FS_GETTABLE_FS_FL (		\
@@ -1712,7 +1722,8 @@ static const struct {
 		FS_PROJINHERIT_FL |	\
 		FS_ENCRYPT_FL |		\
 		FS_INLINE_DATA_FL |	\
-		FS_NOCOW_FL)
+		FS_NOCOW_FL |		\
+		FS_CASEFOLD_FL)
 
 #define F2FS_SETTABLE_FS_FL (		\
 		FS_SYNC_FL |		\
@@ -1721,7 +1732,8 @@ static const struct {
 		FS_NODUMP_FL |		\
 		FS_NOATIME_FL |		\
 		FS_DIRSYNC_FL |		\
-		FS_PROJINHERIT_FL)
+		FS_PROJINHERIT_FL |	\
+		FS_CASEFOLD_FL)
 
 /* Convert f2fs on-disk i_flags to FS_IOC_{GET,SET}FLAGS flags */
 static inline u32 f2fs_iflags_to_fsflags(u32 iflags)
diff --git a/fs/f2fs/hash.c b/fs/f2fs/hash.c
index cc82f142f811f..99e79934f5088 100644
--- a/fs/f2fs/hash.c
+++ b/fs/f2fs/hash.c
@@ -14,6 +14,7 @@
 #include <linux/f2fs_fs.h>
 #include <linux/cryptohash.h>
 #include <linux/pagemap.h>
+#include <linux/unicode.h>
 
 #include "f2fs.h"
 
@@ -67,7 +68,7 @@ static void str2hashbuf(const unsigned char *msg, size_t len,
 		*buf++ = pad;
 }
 
-f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
+static f2fs_hash_t __f2fs_dentry_hash(const struct qstr *name_info,
 				struct fscrypt_name *fname)
 {
 	__u32 hash;
@@ -103,3 +104,35 @@ f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
 	f2fs_hash = cpu_to_le32(hash & ~F2FS_HASH_COL_BIT);
 	return f2fs_hash;
 }
+
+f2fs_hash_t f2fs_dentry_hash(const struct inode *dir,
+		const struct qstr *name_info, struct fscrypt_name *fname)
+{
+#ifdef CONFIG_UNICODE
+	struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
+	const struct unicode_map *um = sbi->s_encoding;
+	int r, dlen;
+	unsigned char *buff;
+	struct qstr *folded;
+
+	if (name_info->len && IS_CASEFOLDED(dir)) {
+		buff = f2fs_kzalloc(sbi, sizeof(char) * PATH_MAX, GFP_KERNEL);
+		if (!buff)
+			return -ENOMEM;
+
+		dlen = utf8_casefold(um, name_info, buff, PATH_MAX);
+		if (dlen < 0) {
+			kvfree(buff);
+			goto opaque_seq;
+		}
+		folded->name = buff;
+		folded->len = dlen;
+		r = __f2fs_dentry_hash(folded, fname);
+
+		kvfree(buff);
+		return r;
+	}
+opaque_seq:
+#endif
+	return __f2fs_dentry_hash(name_info, fname);
+}
diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
index 3613efca8c00c..354f71cf9e6ba 100644
--- a/fs/f2fs/inline.c
+++ b/fs/f2fs/inline.c
@@ -320,7 +320,7 @@ struct f2fs_dir_entry *f2fs_find_in_inline_dir(struct inode *dir,
 		return NULL;
 	}
 
-	namehash = f2fs_dentry_hash(&name, fname);
+	namehash = f2fs_dentry_hash(dir, &name, fname);
 
 	inline_dentry = inline_data_addr(dir, ipage);
 
@@ -580,7 +580,7 @@ int f2fs_add_inline_entry(struct inode *dir, const struct qstr *new_name,
 
 	f2fs_wait_on_page_writeback(ipage, NODE, true, true);
 
-	name_hash = f2fs_dentry_hash(new_name, NULL);
+	name_hash = f2fs_dentry_hash(dir, new_name, NULL);
 	f2fs_update_dentry(ino, mode, &d, new_name, name_hash, bit_pos);
 
 	set_page_dirty(ipage);
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index a33d7a849b2df..9a1f0d6616577 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -46,9 +46,11 @@ void f2fs_set_inode_flags(struct inode *inode)
 		new_fl |= S_DIRSYNC;
 	if (file_is_encrypt(inode))
 		new_fl |= S_ENCRYPTED;
+	if (flags & F2FS_CASEFOLD_FL)
+		new_fl |= S_CASEFOLD;
 	inode_set_flags(inode, new_fl,
 			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|
-			S_ENCRYPTED);
+			S_ENCRYPTED|S_CASEFOLD);
 }
 
 static void __get_inode_rdev(struct inode *inode, struct f2fs_inode *ri)
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index c5b99042e6f2b..727de2f8620f2 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -489,6 +489,17 @@ static struct dentry *f2fs_lookup(struct inode *dir, struct dentry *dentry,
 		goto out_iput;
 	}
 out_splice:
+#ifdef CONFIG_UNICODE
+	if (!inode && IS_CASEFOLDED(dir)) {
+		/* Eventually we want to call d_add_ci(dentry, NULL)
+		 * for negative dentries in the encoding case as
+		 * well.  For now, prevent the negative dentry
+		 * from being cached.
+		 */
+		trace_f2fs_lookup_end(dir, dentry, ino, err);
+		return NULL;
+	}
+#endif
 	new = d_splice_alias(inode, dentry);
 	err = PTR_ERR_OR_ZERO(new);
 	trace_f2fs_lookup_end(dir, dentry, ino, err);
@@ -537,6 +548,16 @@ static int f2fs_unlink(struct inode *dir, struct dentry *dentry)
 		goto fail;
 	}
 	f2fs_delete_entry(de, page, dir, inode);
+#ifdef CONFIG_UNICODE
+	/* VFS negative dentries are incompatible with Encoding and
+	 * Case-insensitiveness. Eventually we'll want avoid
+	 * invalidating the dentries here, alongside with returning the
+	 * negative dentries at f2fs_lookup(), when it is  better
+	 * supported by the VFS for the CI case.
+	 */
+	if (IS_CASEFOLDED(dir))
+		d_invalidate(dentry);
+#endif
 	f2fs_unlock_op(sbi);
 
 	if (IS_DIRSYNC(dir))
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 068db78a1e03e..20bf5fc3c5333 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -3108,6 +3108,7 @@ static int f2fs_setup_casefold(struct f2fs_sb_info *sbi)
 
 		sbi->s_encoding = encoding;
 		sbi->s_encoding_flags = encoding_flags;
+		sbi->sb->s_d_op = &f2fs_dentry_ops;
 	}
 #else
 	if (f2fs_sb_has_casefold(sbi)) {
-- 
2.22.0.657.g960e92d24f-goog

^ permalink raw reply related

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-24  1:40 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API
In-Reply-To: <514D5453-0AEE-420F-AEB6-3F4F58C62E7E@fb.com>



> On Jul 23, 2019, at 3:56 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Jul 23, 2019, at 8:11 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> On Mon, Jul 22, 2019 at 1:54 PM Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> Hi Andy, Lorenz, and all,
>>> 
>>>> On Jul 2, 2019, at 2:32 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>> 
>>>> On Tue, Jul 2, 2019 at 2:04 PM Kees Cook <keescook@chromium.org> wrote:
>>>>> 
>>>>>> On Mon, Jul 01, 2019 at 06:59:13PM -0700, Andy Lutomirski wrote:
>>>>>> I think I'm understanding your motivation.  You're not trying to make
>>>>>> bpf() generically usable without privilege -- you're trying to create
>>>>>> a way to allow certain users to access dangerous bpf functionality
>>>>>> within some limits.
>>>>>> 
>>>>>> That's a perfectly fine goal, but I think you're reinventing the
>>>>>> wheel, and the wheel you're reinventing is quite complicated and
>>>>>> already exists.  I think you should teach bpftool to be secure when
>>>>>> installed setuid root or with fscaps enabled and put your policy in
>>>>>> bpftool.  If you want to harden this a little bit, it would seem
>>>>>> entirely reasonable to add a new CAP_BPF_ADMIN and change some, but
>>>>>> not all, of the capable() checks to check CAP_BPF_ADMIN instead of the
>>>>>> capabilities that they currently check.
>>>>> 
>>>>> If finer grained controls are wanted, it does seem like the /dev/bpf
>>>>> path makes the most sense. open, request abilities, use fd. The open can
>>>>> be mediated by DAC and LSM. The request can be mediated by LSM. This
>>>>> provides a way to add policy at the LSM level and at the tool level.
>>>>> (i.e. For tool-level controls: leave LSM wide open, make /dev/bpf owned
>>>>> by "bpfadmin" and bpftool becomes setuid "bpfadmin". For fine-grained
>>>>> controls, leave /dev/bpf wide open and add policy to SELinux, etc.)
>>>>> 
>>>>> With only a new CAP, you don't get the fine-grained controls. (The
>>>>> "request abilities" part is the key there.)
>>>> 
>>>> Sure you do: the effective set.  It has somewhat bizarre defaults, but
>>>> I don't think that's a real problem.  Also, this wouldn't be like
>>>> CAP_DAC_READ_SEARCH -- you can't accidentally use your BPF caps.
>>>> 
>>>> I think that a /dev capability-like object isn't totally nuts, but I
>>>> think we should do it well, and this patch doesn't really achieve
>>>> that.  But I don't think bpf wants fine-grained controls like this at
>>>> all -- as I pointed upthread, a fine-grained solution really wants
>>>> different treatment for the different capable() checks, and a bunch of
>>>> them won't resemble capabilities or /dev/bpf at all.
>>> 
>>> With 5.3-rc1 out, I am back on this. :)
>>> 
>>> How about we modify the set as:
>>> 1. Introduce sys_bpf_with_cap() that takes fd of /dev/bpf.
>> 
>> I'm fine with this in principle, but:
>> 
>>> 2. Better handling of capable() calls through bpf code. I guess the
>>>    biggest problem here is is_priv in verifier.c:bpf_check().
>> 
>> I think it would be good to understand exactly what /dev/bpf will
>> enable one to do.  Without some care, it would just become the next
>> CAP_SYS_ADMIN: if you can open it, sure, you're not root, but you can
>> intercept network traffic, modify cgroup behavior, and do plenty of
>> other things, any of which can probably be used to completely take
>> over the system.
> 
> Well, yes. sys_bpf() is pretty powerful. 
> 
> The goal of /dev/bpf is to enable special users to call sys_bpf(). In 
> the meanwhile, such users should not take down the whole system easily
> by accident, e.g., with rm -rf /.

That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.

> 
> It is similar to CAP_BPF_ADMIN, without really adding the CAP_.  
> 
> I think adding new CAP_ requires much more effort. 
> 

A new CAP_ is straightforward — add the definition and change the max cap.

>> 
>> It would also be nice to understand why you can't do what you need to
>> do entirely in user code using setuid or fscaps.
> 
> It is not very easy to achieve the same control: only certain users can
> run certain tools (bpftool, etc.). 
> 
> The closest approach I can find is:
>  1. use libcap (pam_cap) to give CAP_SETUID to certain users;
>  2. add setuid(0) to bpftool.
> 
> The difference between this approach and /dev/bpf is that certain users
> would be able to run other tools that call setuid(). Though I am not 
> sure how many tools call setuid(), and how risky they are. 

I think you’re misunderstanding me. Install bpftool with either the setuid (S_ISUID) mode or with an appropriate fscap bit — see the setcap(8) manpage.

The downside of this approach is that it won’t work well in a container, and containers are cool these days :)

> 
>> 
>> Finally, at risk of rehashing some old arguments, I'll point out that
>> the bpf() syscall is an unusual design to begin with.  As an example,
>> consider bpf_prog_attach().  Outside of bpf(), if I want to change the
>> behavior of a cgroup, I would write to a file in
>> /sys/kernel/cgroup/unified/whatever/, and normal DAC and MAC rules
>> apply.  With bpf(), however, I just call bpf() to attach a program to
>> the cgroup.  bpf() says "oh, you are capable(CAP_NET_ADMIN) -- go for
>> it!".  Unless I missed something major, and I just re-read the code,
>> there is no check that the caller has write or LSM permission to
>> anything at all in cgroupfs, and the existing API would make it very
>> awkward to impose any kind of DAC rules here.
>> 
>> So I think it might actually be time to repay some techincal debt and
>> come up with a real fix.  As a less intrusive approach, you could see
>> about requiring ownership of the cgroup directory instead of
>> CAP_NET_ADMIN.  As a more intrusive but perhaps better approach, you
>> could invert the logic to to make it work like everything outside of
>> cgroup: add pseudo-files like bpf.inet_ingress to the cgroup
>> directories, and require a writable fd to *that* to a new improved
>> attach API.  If a user could do:
>> 
>> int fd = open("/sys/fs/cgroup/.../bpf.inet_attach", O_RDWR);  /* usual
>> DAC and MAC policy applies */
>> int bpf_fd = setup the bpf stuff;  /* no privilege required, unless
>> the program is huge or needs is_priv */
>> bpf(BPF_IMPROVED_ATTACH, target = fd, program = bpf_fd);
>> 
>> there would be no capabilities or global privilege at all required for
>> this.  It would just work with cgroup delegation, containers, etc.
>> 
>> I think you could even pull off this type of API change with only
>> libbpf changes.  In particular, there's this code:
>> 
>> int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
>>                   unsigned int flags)
>> {
>>       union bpf_attr attr;
>> 
>>       memset(&attr, 0, sizeof(attr));
>>       attr.target_fd     = target_fd;
>>       attr.attach_bpf_fd = prog_fd;
>>       attr.attach_type   = type;
>>       attr.attach_flags  = flags;
>> 
>>       return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
>> }
>> 
>> This would instead do something like:
>> 
>> int specific_target_fd = openat(target_fd, bpf_type_to_target[type], O_RDWR);
>> attr.target_fd = specific_target_fd;
>> ...
>> 
>> return sys_bpf(BPF_PROG_IMPROVED_ATTACH, &attr, sizeof(attr));
>> 
>> Would this solve your problem without needing /dev/bpf at all?
> 
> This gives fine grain access control. I think it solves the problem. 
> But it also requires a lot of rework to sys_bpf(). And it may also 
> break backward/forward compatibility?
> 

I think the compatibility issue is manageable. The current bpf() interface would be supported for at least several years, and libbpf could detect that the new interface isn’t supported and fall back the old interface

> Personally, I think it is an overkill for the original motivation: 
> call sys_bpf() with special user instead of root. 

It’s overkill for your specific use case, but I’m trying to encourage you to either solve your problem entirely in userspace or to solve a more general problem in the kernel :)

In furtherance of bpf’s goal of world domination, I think it would be great if it Just Worked in a container. My proposal does this.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-07-24  6:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API
In-Reply-To: <1DE886F3-3982-45DE-B545-67AD6A4871AB@amacapital.net>



> On Jul 23, 2019, at 6:40 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>> On Jul 23, 2019, at 3:56 PM, Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 23, 2019, at 8:11 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Mon, Jul 22, 2019 at 1:54 PM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Hi Andy, Lorenz, and all,
>>>> 
>>>>> On Jul 2, 2019, at 2:32 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>> 
>>>>> On Tue, Jul 2, 2019 at 2:04 PM Kees Cook <keescook@chromium.org> wrote:
>>>>>> 
>>>>>>> On Mon, Jul 01, 2019 at 06:59:13PM -0700, Andy Lutomirski wrote:
>>>>>>> I think I'm understanding your motivation.  You're not trying to make
>>>>>>> bpf() generically usable without privilege -- you're trying to create
>>>>>>> a way to allow certain users to access dangerous bpf functionality
>>>>>>> within some limits.
>>>>>>> 
>>>>>>> That's a perfectly fine goal, but I think you're reinventing the
>>>>>>> wheel, and the wheel you're reinventing is quite complicated and
>>>>>>> already exists.  I think you should teach bpftool to be secure when
>>>>>>> installed setuid root or with fscaps enabled and put your policy in
>>>>>>> bpftool.  If you want to harden this a little bit, it would seem
>>>>>>> entirely reasonable to add a new CAP_BPF_ADMIN and change some, but
>>>>>>> not all, of the capable() checks to check CAP_BPF_ADMIN instead of the
>>>>>>> capabilities that they currently check.
>>>>>> 
>>>>>> If finer grained controls are wanted, it does seem like the /dev/bpf
>>>>>> path makes the most sense. open, request abilities, use fd. The open can
>>>>>> be mediated by DAC and LSM. The request can be mediated by LSM. This
>>>>>> provides a way to add policy at the LSM level and at the tool level.
>>>>>> (i.e. For tool-level controls: leave LSM wide open, make /dev/bpf owned
>>>>>> by "bpfadmin" and bpftool becomes setuid "bpfadmin". For fine-grained
>>>>>> controls, leave /dev/bpf wide open and add policy to SELinux, etc.)
>>>>>> 
>>>>>> With only a new CAP, you don't get the fine-grained controls. (The
>>>>>> "request abilities" part is the key there.)
>>>>> 
>>>>> Sure you do: the effective set.  It has somewhat bizarre defaults, but
>>>>> I don't think that's a real problem.  Also, this wouldn't be like
>>>>> CAP_DAC_READ_SEARCH -- you can't accidentally use your BPF caps.
>>>>> 
>>>>> I think that a /dev capability-like object isn't totally nuts, but I
>>>>> think we should do it well, and this patch doesn't really achieve
>>>>> that.  But I don't think bpf wants fine-grained controls like this at
>>>>> all -- as I pointed upthread, a fine-grained solution really wants
>>>>> different treatment for the different capable() checks, and a bunch of
>>>>> them won't resemble capabilities or /dev/bpf at all.
>>>> 
>>>> With 5.3-rc1 out, I am back on this. :)
>>>> 
>>>> How about we modify the set as:
>>>> 1. Introduce sys_bpf_with_cap() that takes fd of /dev/bpf.
>>> 
>>> I'm fine with this in principle, but:
>>> 
>>>> 2. Better handling of capable() calls through bpf code. I guess the
>>>>   biggest problem here is is_priv in verifier.c:bpf_check().
>>> 
>>> I think it would be good to understand exactly what /dev/bpf will
>>> enable one to do.  Without some care, it would just become the next
>>> CAP_SYS_ADMIN: if you can open it, sure, you're not root, but you can
>>> intercept network traffic, modify cgroup behavior, and do plenty of
>>> other things, any of which can probably be used to completely take
>>> over the system.
>> 
>> Well, yes. sys_bpf() is pretty powerful. 
>> 
>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In 
>> the meanwhile, such users should not take down the whole system easily
>> by accident, e.g., with rm -rf /.
> 
> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.

This is a great idea! fscaps + /etc/bpfusers should do the trick. 

> 
>> 
>> It is similar to CAP_BPF_ADMIN, without really adding the CAP_.  
>> 
>> I think adding new CAP_ requires much more effort. 
>> 
> 
> A new CAP_ is straightforward — add the definition and change the max cap.
> 
>>> 
>>> It would also be nice to understand why you can't do what you need to
>>> do entirely in user code using setuid or fscaps.
>> 
>> It is not very easy to achieve the same control: only certain users can
>> run certain tools (bpftool, etc.). 
>> 
>> The closest approach I can find is:
>> 1. use libcap (pam_cap) to give CAP_SETUID to certain users;
>> 2. add setuid(0) to bpftool.
>> 
>> The difference between this approach and /dev/bpf is that certain users
>> would be able to run other tools that call setuid(). Though I am not 
>> sure how many tools call setuid(), and how risky they are. 
> 
> I think you’re misunderstanding me. Install bpftool with either the setuid (S_ISUID) mode or with an appropriate fscap bit — see the setcap(8) manpage.
> 
> The downside of this approach is that it won’t work well in a container, and containers are cool these days :)
> 
>> 
>>> 
>>> Finally, at risk of rehashing some old arguments, I'll point out that
>>> the bpf() syscall is an unusual design to begin with.  As an example,
>>> consider bpf_prog_attach().  Outside of bpf(), if I want to change the
>>> behavior of a cgroup, I would write to a file in
>>> /sys/kernel/cgroup/unified/whatever/, and normal DAC and MAC rules
>>> apply.  With bpf(), however, I just call bpf() to attach a program to
>>> the cgroup.  bpf() says "oh, you are capable(CAP_NET_ADMIN) -- go for
>>> it!".  Unless I missed something major, and I just re-read the code,
>>> there is no check that the caller has write or LSM permission to
>>> anything at all in cgroupfs, and the existing API would make it very
>>> awkward to impose any kind of DAC rules here.
>>> 
>>> So I think it might actually be time to repay some techincal debt and
>>> come up with a real fix.  As a less intrusive approach, you could see
>>> about requiring ownership of the cgroup directory instead of
>>> CAP_NET_ADMIN.  As a more intrusive but perhaps better approach, you
>>> could invert the logic to to make it work like everything outside of
>>> cgroup: add pseudo-files like bpf.inet_ingress to the cgroup
>>> directories, and require a writable fd to *that* to a new improved
>>> attach API.  If a user could do:
>>> 
>>> int fd = open("/sys/fs/cgroup/.../bpf.inet_attach", O_RDWR);  /* usual
>>> DAC and MAC policy applies */
>>> int bpf_fd = setup the bpf stuff;  /* no privilege required, unless
>>> the program is huge or needs is_priv */
>>> bpf(BPF_IMPROVED_ATTACH, target = fd, program = bpf_fd);
>>> 
>>> there would be no capabilities or global privilege at all required for
>>> this.  It would just work with cgroup delegation, containers, etc.
>>> 
>>> I think you could even pull off this type of API change with only
>>> libbpf changes.  In particular, there's this code:
>>> 
>>> int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
>>>                  unsigned int flags)
>>> {
>>>      union bpf_attr attr;
>>> 
>>>      memset(&attr, 0, sizeof(attr));
>>>      attr.target_fd     = target_fd;
>>>      attr.attach_bpf_fd = prog_fd;
>>>      attr.attach_type   = type;
>>>      attr.attach_flags  = flags;
>>> 
>>>      return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
>>> }
>>> 
>>> This would instead do something like:
>>> 
>>> int specific_target_fd = openat(target_fd, bpf_type_to_target[type], O_RDWR);
>>> attr.target_fd = specific_target_fd;
>>> ...
>>> 
>>> return sys_bpf(BPF_PROG_IMPROVED_ATTACH, &attr, sizeof(attr));
>>> 
>>> Would this solve your problem without needing /dev/bpf at all?
>> 
>> This gives fine grain access control. I think it solves the problem. 
>> But it also requires a lot of rework to sys_bpf(). And it may also 
>> break backward/forward compatibility?
>> 
> 
> I think the compatibility issue is manageable. The current bpf() interface would be supported for at least several years, and libbpf could detect that the new interface isn’t supported and fall back the old interface

You are right. New BPF_PROG_IMPROVED_ATTACH helps compatibility. 
I missed that part. 

> 
>> Personally, I think it is an overkill for the original motivation: 
>> call sys_bpf() with special user instead of root. 
> 
> It’s overkill for your specific use case, but I’m trying to encourage you to either solve your problem entirely in userspace or to solve a more general problem in the kernel :)

I do like both proposals. Thanks for these invaluable suggestions. 

> 
> In furtherance of bpf’s goal of world domination, I think it would be great if it Just Worked in a container. My proposal does this.

Let me think more about this and discuss with the team. 

Thanks again, 
Song

^ permalink raw reply

* Re: [v4 PATCH 2/2] mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
From: Vlastimil Babka @ 2019-07-24  8:19 UTC (permalink / raw)
  To: Yang Shi, Andrew Morton
  Cc: mhocko, mgorman, linux-mm, linux-kernel, linux-api
In-Reply-To: <a9b8cae7-4bca-3c98-99f9-6b92de7e5909@linux.alibaba.com>

On 7/23/19 7:35 AM, Yang Shi wrote:
> 
> 
> On 7/22/19 6:02 PM, Andrew Morton wrote:
>> On Mon, 22 Jul 2019 09:25:09 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>>>> since there may be pages off LRU temporarily.  We should migrate other
>>>> pages if MPOL_MF_MOVE* is specified.  Set has_unmovable flag if some
>>>> paged could not be not moved, then return -EIO for mbind() eventually.
>>>>
>>>> With this change the above test would return -EIO as expected.
>>>>
>>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>>> Cc: Michal Hocko <mhocko@suse.com>
>>>> Cc: Mel Gorman <mgorman@techsingularity.net>
>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>> Thanks.
>>
>> I'm a bit surprised that this doesn't have a cc:stable.  Did we
>> consider that?
> 
> The VM_BUG just happens on 4.9, and it is enabled only by CONFIG_VM. For 
> post-4.9 kernel, this fixes the semantics of mbind which should be not a 
> regression IMHO.

4.9 is a LTS kernel, so perhaps worth trying?

>>
>> Also, is this patch dependent upon "mm: mempolicy: make the behavior
>> consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified"?
>> Doesn't look that way..
> 
> No, it depends on patch #1.
> 
>>
>> Also, I have a note that you had concerns with "mm: mempolicy: make the
>> behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were
>> specified".  What is the status now?
> 
> Vlastimil had given his Reviewed-by.

Yes, the concerns were resolved.

^ permalink raw reply

* Re: [PATCH v3 02/12] fpga: dfl: fme: add DFL_FPGA_FME_PORT_RELEASE/ASSIGN ioctl support.
From: Greg KH @ 2019-07-24  9:33 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Zhang Yi Z, Xu Yilun
In-Reply-To: <1563857495-26692-3-git-send-email-hao.wu@intel.com>

On Tue, Jul 23, 2019 at 12:51:25PM +0800, Wu Hao wrote:
> +/**
> + * dfl_fpga_cdev_config_port - configure a port feature dev
> + * @cdev: parent container device.
> + * @port_id: id of the port feature device.
> + * @release: release port or assign port back.
> + *
> + * This function allows user to release port platform device or assign it back.
> + * e.g. to safely turn one port from PF into VF for PCI device SRIOV support,
> + * release port platform device is one necessary step.
> + */
> +int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev, int port_id,
> +			      bool release)
> +{
> +	return release ? detach_port_dev(cdev, port_id) :
> +			 attach_port_dev(cdev, port_id);
> +}
> +EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);

That's a horrible api.  Every time you see this call in code, you have
to go and look up what "bool" means here.  There's no reason for it.

Just have 2 different functions, one that attaches a port, and one that
detaches it.  That way when you read the code that calls this function,
you know what it does instantly without having to go look up some api
function somewhere else.

Write code for people to read first.  And you are saving nothing here by
trying to do two different things in the same exact function.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3 01/12] fpga: dfl: fme: support 512bit data width PR
From: Greg KH @ 2019-07-24  9:35 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Ananda Ravuri, Xu Yilun
In-Reply-To: <1563857495-26692-2-git-send-email-hao.wu@intel.com>

On Tue, Jul 23, 2019 at 12:51:24PM +0800, Wu Hao wrote:
> In early partial reconfiguration private feature, it only
> supports 32bit data width when writing data to hardware for
> PR. 512bit data width PR support is an important optimization
> for some specific solutions (e.g. XEON with FPGA integrated),
> it allows driver to use AVX512 instruction to improve the
> performance of partial reconfiguration. e.g. programming one
> 100MB bitstream image via this 512bit data width PR hardware
> only takes ~300ms, but 32bit revision requires ~3s per test
> result.
> 
> Please note now this optimization is only done on revision 2
> of this PR private feature which is only used in integrated
> solution that AVX512 is always supported. This revision 2
> hardware doesn't support 32bit PR.
> 
> Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Alan Tull <atull@kernel.org>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: remove DRV/MODULE_VERSION modifications
> ---
>  drivers/fpga/dfl-fme-mgr.c | 110 ++++++++++++++++++++++++++++++++++++++-------
>  drivers/fpga/dfl-fme-pr.c  |  43 +++++++++++-------
>  drivers/fpga/dfl-fme.h     |   2 +
>  drivers/fpga/dfl.h         |   5 +++
>  4 files changed, 129 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/fpga/dfl-fme-mgr.c b/drivers/fpga/dfl-fme-mgr.c
> index b3f7eee..46e17f0 100644
> --- a/drivers/fpga/dfl-fme-mgr.c
> +++ b/drivers/fpga/dfl-fme-mgr.c
> @@ -22,6 +22,7 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/fpga/fpga-mgr.h>
>  
> +#include "dfl.h"
>  #include "dfl-fme-pr.h"
>  
>  /* FME Partial Reconfiguration Sub Feature Register Set */
> @@ -30,6 +31,7 @@
>  #define FME_PR_STS		0x10
>  #define FME_PR_DATA		0x18
>  #define FME_PR_ERR		0x20
> +#define FME_PR_512_DATA		0x40 /* Data Register for 512bit datawidth PR */
>  #define FME_PR_INTFC_ID_L	0xA8
>  #define FME_PR_INTFC_ID_H	0xB0
>  
> @@ -67,8 +69,43 @@
>  #define PR_WAIT_TIMEOUT   8000000
>  #define PR_HOST_STATUS_IDLE	0
>  
> +#if defined(CONFIG_X86) && defined(CONFIG_AS_AVX512)
> +
> +#include <linux/cpufeature.h>
> +#include <asm/fpu/api.h>
> +
> +static inline int is_cpu_avx512_enabled(void)
> +{
> +	return cpu_feature_enabled(X86_FEATURE_AVX512F);
> +}

That's a very arch specific function, why would a driver ever care about
this?

> +
> +static inline void copy512(const void *src, void __iomem *dst)
> +{
> +	kernel_fpu_begin();
> +
> +	asm volatile("vmovdqu64 (%0), %%zmm0;"
> +		     "vmovntdq %%zmm0, (%1);"
> +		     :
> +		     : "r"(src), "r"(dst)
> +		     : "memory");
> +
> +	kernel_fpu_end();
> +}

Shouldn't this be an arch-specific function somewhere?  Burying this in
a random driver is not ok.  Please make this generic for all systems.

> +#else
> +static inline int is_cpu_avx512_enabled(void)
> +{
> +	return 0;
> +}
> +
> +static inline void copy512(const void *src, void __iomem *dst)
> +{
> +	WARN_ON_ONCE(1);

Are you trying to get reports from syzbot?  :)

Please fix this all up.

greg k-h

^ permalink raw reply

* Re: [PATCH v3 03/12] fpga: dfl: pci: enable SRIOV support.
From: Greg KH @ 2019-07-24  9:37 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Zhang Yi Z, Xu Yilun
In-Reply-To: <1563857495-26692-4-git-send-email-hao.wu@intel.com>

On Tue, Jul 23, 2019 at 12:51:26PM +0800, Wu Hao wrote:
> This patch enables the standard sriov support. It allows user to
> enable SRIOV (and VFs), then user could pass through accelerators
> (VFs) into virtual machine or use VFs directly in host.
> 
> Signed-off-by: Zhang Yi Z <yi.z.zhang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Alan Tull <atull@kernel.org>
> Acked-by: Moritz Fischer <mdf@kernel.org>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: remove DRV/MODULE_VERSION modifications.
> ---
>  drivers/fpga/dfl-pci.c | 39 +++++++++++++++++++++++++++++++++++++++
>  drivers/fpga/dfl.c     | 41 +++++++++++++++++++++++++++++++++++++++++
>  drivers/fpga/dfl.h     |  1 +
>  3 files changed, 81 insertions(+)
> 
> diff --git a/drivers/fpga/dfl-pci.c b/drivers/fpga/dfl-pci.c
> index 66b5720..0e65d81 100644
> --- a/drivers/fpga/dfl-pci.c
> +++ b/drivers/fpga/dfl-pci.c
> @@ -223,8 +223,46 @@ int cci_pci_probe(struct pci_dev *pcidev, const struct pci_device_id *pcidevid)
>  	return ret;
>  }
>  
> +static int cci_pci_sriov_configure(struct pci_dev *pcidev, int num_vfs)
> +{
> +	struct cci_drvdata *drvdata = pci_get_drvdata(pcidev);
> +	struct dfl_fpga_cdev *cdev = drvdata->cdev;
> +	int ret = 0;
> +
> +	mutex_lock(&cdev->lock);
> +
> +	if (!num_vfs) {
> +		/*
> +		 * disable SRIOV and then put released ports back to default
> +		 * PF access mode.
> +		 */
> +		pci_disable_sriov(pcidev);
> +
> +		__dfl_fpga_cdev_config_port_vf(cdev, false);
> +
> +	} else if (cdev->released_port_num == num_vfs) {
> +		/*
> +		 * only enable SRIOV if cdev has matched released ports, put
> +		 * released ports into VF access mode firstly.
> +		 */
> +		__dfl_fpga_cdev_config_port_vf(cdev, true);
> +
> +		ret = pci_enable_sriov(pcidev, num_vfs);
> +		if (ret)
> +			__dfl_fpga_cdev_config_port_vf(cdev, false);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	mutex_unlock(&cdev->lock);
> +	return ret;
> +}
> +
>  static void cci_pci_remove(struct pci_dev *pcidev)
>  {
> +	if (dev_is_pf(&pcidev->dev))
> +		cci_pci_sriov_configure(pcidev, 0);
> +
>  	cci_remove_feature_devs(pcidev);
>  	pci_disable_pcie_error_reporting(pcidev);
>  }
> @@ -234,6 +272,7 @@ static void cci_pci_remove(struct pci_dev *pcidev)
>  	.id_table = cci_pcie_id_tbl,
>  	.probe = cci_pci_probe,
>  	.remove = cci_pci_remove,
> +	.sriov_configure = cci_pci_sriov_configure,
>  };
>  
>  module_pci_driver(cci_pci_driver);
> diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> index e04ed45..c3a8e1d 100644
> --- a/drivers/fpga/dfl.c
> +++ b/drivers/fpga/dfl.c
> @@ -1112,6 +1112,47 @@ int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev, int port_id,
>  }
>  EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
>  
> +static void config_port_vf(struct device *fme_dev, int port_id, bool is_vf)
> +{
> +	void __iomem *base;
> +	u64 v;
> +
> +	base = dfl_get_feature_ioaddr_by_id(fme_dev, FME_FEATURE_ID_HEADER);
> +
> +	v = readq(base + FME_HDR_PORT_OFST(port_id));
> +
> +	v &= ~FME_PORT_OFST_ACC_CTRL;
> +	v |= FIELD_PREP(FME_PORT_OFST_ACC_CTRL,
> +			is_vf ? FME_PORT_OFST_ACC_VF : FME_PORT_OFST_ACC_PF);
> +
> +	writeq(v, base + FME_HDR_PORT_OFST(port_id));
> +}
> +
> +/**
> + * __dfl_fpga_cdev_config_port_vf - configure port to VF access mode
> + *
> + * @cdev: parent container device.
> + * @if_vf: true for VF access mode, and false for PF access mode
> + *
> + * Return: 0 on success, negative error code otherwise.
> + *
> + * This function is needed in sriov configuration routine. It could be used to
> + * configures the released ports access mode to VF or PF.
> + * The caller needs to hold lock for protection.
> + */
> +void __dfl_fpga_cdev_config_port_vf(struct dfl_fpga_cdev *cdev, bool is_vf)
> +{
> +	struct dfl_feature_platform_data *pdata;
> +
> +	list_for_each_entry(pdata, &cdev->port_dev_list, node) {
> +		if (device_is_registered(&pdata->dev->dev))
> +			continue;
> +
> +		config_port_vf(cdev->fme_dev, pdata->id, is_vf);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(__dfl_fpga_cdev_config_port_vf);

Why are you exporting a function with a leading __?

You are expecting someone else, in who knows what code, to do locking
correctly?  If so, and the caller always has to have a local lock, then
it's not a big deal, just drop the '__', otherwise if you have to have a
specific lock for a specific device, then you have a really complex and
probably broken api here :(

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3 04/12] fpga: dfl: afu: add AFU state related sysfs interfaces
From: Greg KH @ 2019-07-24  9:41 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Ananda Ravuri, Xu Yilun
In-Reply-To: <1563857495-26692-5-git-send-email-hao.wu@intel.com>

On Tue, Jul 23, 2019 at 12:51:27PM +0800, Wu Hao wrote:
> This patch introduces more sysfs interfaces for Accelerated
> Function Unit (AFU). These interfaces allow users to read
> current AFU Power State (APx), read / clear AFU Power (APx)
> events which are sticky to identify transient APx state,
> and manage AFU's LTR (latency tolerance reporting).
> 
> Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Alan Tull <atull@kernel.org>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: rebased, and remove DRV/MODULE_VERSION modifications
> v3: update kernel version and date in sysfs doc
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-port |  30 +++++
>  drivers/fpga/dfl-afu-main.c                       | 137 ++++++++++++++++++++++
>  drivers/fpga/dfl.h                                |  11 ++
>  3 files changed, 178 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> index 6a92dda..5961fb2 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> @@ -14,3 +14,33 @@ Description:	Read-only. User can program different PR bitstreams to FPGA
>  		Accelerator Function Unit (AFU) for different functions. It
>  		returns uuid which could be used to identify which PR bitstream
>  		is programmed in this AFU.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/power_state
> +Date:		July 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. It reports the APx (AFU Power) state, different APx
> +		means different throttling level. When reading this file, it
> +		returns "0" - Normal / "1" - AP1 / "2" - AP2 / "6" - AP6.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/ap1_event
> +Date:		July 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-write. Read or set 1 to clear AP1 (AFU Power State 1)
> +		event. It's used to indicate transient AP1 state.

So reading the value changes the state of the system?  That's almost
always never a good idea.

Force userspace to write the value to change something.  Otherwise all
libraries that use sysfs will be accidentally changing the state of your
system without you ever knowing it.



> +
> +What:		/sys/bus/platform/devices/dfl-port.0/ap2_event
> +Date:		July 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-write. Read or set 1 to clear AP2 (AFU Power State 2)
> +		event. It's used to indicate transient AP2 state.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/ltr
> +Date:		July 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-write. Read and set AFU latency tolerance reporting value.
> +		Set ltr to 1 if the AFU can tolerate latency >= 40us or set it
> +		to 0 if it is latency sensitive.
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index 68b4d08..cb3f73e 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -141,8 +141,145 @@ static int port_get_id(struct platform_device *pdev)
>  }
>  static DEVICE_ATTR_RO(id);
>  
> +static ssize_t
> +ltr_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +	void __iomem *base;
> +	u64 v;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> +	mutex_lock(&pdata->lock);
> +	v = readq(base + PORT_HDR_CTRL);
> +	mutex_unlock(&pdata->lock);

Why do you need a lock to call readq()?  What are you protecting here?


> +
> +	return sprintf(buf, "%x\n", (u8)FIELD_GET(PORT_CTRL_LATENCY, v));
> +}
> +
> +static ssize_t
> +ltr_store(struct device *dev, struct device_attribute *attr,
> +	  const char *buf, size_t count)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +	void __iomem *base;
> +	u8 ltr;
> +	u64 v;
> +
> +	if (kstrtou8(buf, 0, &ltr) || ltr > 1)
> +		return -EINVAL;

Are you doing anything with this value?  If not, how about just using
the sysfs boolean read function and if it is 1, then do the clearing?

Same for all other show/store functions in here.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3 09/12] fpga: dfl: afu: add STP (SignalTap) support
From: Greg KH @ 2019-07-24 10:11 UTC (permalink / raw)
  To: Wu Hao; +Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Xu Yilun
In-Reply-To: <1563857495-26692-10-git-send-email-hao.wu@intel.com>

On Tue, Jul 23, 2019 at 12:51:32PM +0800, Wu Hao wrote:
> STP (SignalTap) is one of the private features under the port for
> debugging. This patch adds private feature driver support for it
> to allow userspace applications to mmap related mmio region and
> provide STP service.
> 
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Moritz Fischer <mdf@kernel.org>
> Acked-by: Alan Tull <atull@kernel.org>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
>  drivers/fpga/dfl-afu-main.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index 15dd4cb..395f96e 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -514,6 +514,36 @@ static void port_afu_uinit(struct platform_device *pdev,
>  	.uinit = port_afu_uinit,
>  };
>  
> +static int port_stp_init(struct platform_device *pdev,
> +			 struct dfl_feature *feature)
> +{
> +	struct resource *res = &pdev->resource[feature->resource_index];
> +
> +	dev_dbg(&pdev->dev, "PORT STP Init.\n");

ftrace is your friend, no need to do a lot of "look I am here!"
messages.

> +
> +	return afu_mmio_region_add(dev_get_platdata(&pdev->dev),
> +				   DFL_PORT_REGION_INDEX_STP,
> +				   resource_size(res), res->start,
> +				   DFL_PORT_REGION_MMAP | DFL_PORT_REGION_READ |
> +				   DFL_PORT_REGION_WRITE);
> +}
> +
> +static void port_stp_uinit(struct platform_device *pdev,
> +			   struct dfl_feature *feature)
> +{
> +	dev_dbg(&pdev->dev, "PORT STP UInit.\n");

Same here.

Why have this function at all if it does not do anything?


thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3 09/12] fpga: dfl: afu: add STP (SignalTap) support
From: Wu Hao @ 2019-07-24 13:03 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Xu Yilun
In-Reply-To: <20190724101109.GE29532@kroah.com>

On Wed, Jul 24, 2019 at 12:11:09PM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:32PM +0800, Wu Hao wrote:
> > STP (SignalTap) is one of the private features under the port for
> > debugging. This patch adds private feature driver support for it
> > to allow userspace applications to mmap related mmio region and
> > provide STP service.
> > 
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > Acked-by: Moritz Fischer <mdf@kernel.org>
> > Acked-by: Alan Tull <atull@kernel.org>
> > Signed-off-by: Moritz Fischer <mdf@kernel.org>
> > ---
> >  drivers/fpga/dfl-afu-main.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> > 
> > diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> > index 15dd4cb..395f96e 100644
> > --- a/drivers/fpga/dfl-afu-main.c
> > +++ b/drivers/fpga/dfl-afu-main.c
> > @@ -514,6 +514,36 @@ static void port_afu_uinit(struct platform_device *pdev,
> >  	.uinit = port_afu_uinit,
> >  };
> >  
> > +static int port_stp_init(struct platform_device *pdev,
> > +			 struct dfl_feature *feature)
> > +{
> > +	struct resource *res = &pdev->resource[feature->resource_index];
> > +
> > +	dev_dbg(&pdev->dev, "PORT STP Init.\n");
> 
> ftrace is your friend, no need to do a lot of "look I am here!"
> messages.

Hi Greg,

Thanks for the code review!

Sure, let me remove them.

> 
> > +
> > +	return afu_mmio_region_add(dev_get_platdata(&pdev->dev),
> > +				   DFL_PORT_REGION_INDEX_STP,
> > +				   resource_size(res), res->start,
> > +				   DFL_PORT_REGION_MMAP | DFL_PORT_REGION_READ |
> > +				   DFL_PORT_REGION_WRITE);
> > +}
> > +
> > +static void port_stp_uinit(struct platform_device *pdev,
> > +			   struct dfl_feature *feature)
> > +{
> > +	dev_dbg(&pdev->dev, "PORT STP UInit.\n");
> 
> Same here.
> 
> Why have this function at all if it does not do anything?

Let me remove them in the next version. actually uinit callback is
always required in current code, i will add one more patch to change
it, and remove all uinit functions who do nothing, it does save code.

Thanks for the comments.
Hao

> 
> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v3 04/12] fpga: dfl: afu: add AFU state related sysfs interfaces
From: Wu Hao @ 2019-07-24 13:29 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Ananda Ravuri, Xu Yilun
In-Reply-To: <20190724094110.GD29532@kroah.com>

On Wed, Jul 24, 2019 at 11:41:10AM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:27PM +0800, Wu Hao wrote:
> > This patch introduces more sysfs interfaces for Accelerated
> > Function Unit (AFU). These interfaces allow users to read
> > current AFU Power State (APx), read / clear AFU Power (APx)
> > events which are sticky to identify transient APx state,
> > and manage AFU's LTR (latency tolerance reporting).
> > 
> > Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > Acked-by: Alan Tull <atull@kernel.org>
> > Signed-off-by: Moritz Fischer <mdf@kernel.org>
> > ---
> > v2: rebased, and remove DRV/MODULE_VERSION modifications
> > v3: update kernel version and date in sysfs doc
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-port |  30 +++++
> >  drivers/fpga/dfl-afu-main.c                       | 137 ++++++++++++++++++++++
> >  drivers/fpga/dfl.h                                |  11 ++
> >  3 files changed, 178 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > index 6a92dda..5961fb2 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > @@ -14,3 +14,33 @@ Description:	Read-only. User can program different PR bitstreams to FPGA
> >  		Accelerator Function Unit (AFU) for different functions. It
> >  		returns uuid which could be used to identify which PR bitstream
> >  		is programmed in this AFU.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/power_state
> > +Date:		July 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. It reports the APx (AFU Power) state, different APx
> > +		means different throttling level. When reading this file, it
> > +		returns "0" - Normal / "1" - AP1 / "2" - AP2 / "6" - AP6.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/ap1_event
> > +Date:		July 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-write. Read or set 1 to clear AP1 (AFU Power State 1)
> > +		event. It's used to indicate transient AP1 state.
> 
> So reading the value changes the state of the system?  That's almost
> always never a good idea.
> 
> Force userspace to write the value to change something.  Otherwise all
> libraries that use sysfs will be accidentally changing the state of your
> system without you ever knowing it.

Oh.. I think the description makes some misunderstanding here, will fix it
in the next version. This AP1/AP2 event will only be cleared by write 1 to
it, read will not change the state.

> 
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/ap2_event
> > +Date:		July 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-write. Read or set 1 to clear AP2 (AFU Power State 2)
> > +		event. It's used to indicate transient AP2 state.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/ltr
> > +Date:		July 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-write. Read and set AFU latency tolerance reporting value.
> > +		Set ltr to 1 if the AFU can tolerate latency >= 40us or set it
> > +		to 0 if it is latency sensitive.
> > diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> > index 68b4d08..cb3f73e 100644
> > --- a/drivers/fpga/dfl-afu-main.c
> > +++ b/drivers/fpga/dfl-afu-main.c
> > @@ -141,8 +141,145 @@ static int port_get_id(struct platform_device *pdev)
> >  }
> >  static DEVICE_ATTR_RO(id);
> >  
> > +static ssize_t
> > +ltr_show(struct device *dev, struct device_attribute *attr, char *buf)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +	void __iomem *base;
> > +	u64 v;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +	mutex_lock(&pdata->lock);
> > +	v = readq(base + PORT_HDR_CTRL);
> > +	mutex_unlock(&pdata->lock);
> 
> Why do you need a lock to call readq()?  What are you protecting here?

If this code is running on 32bit machine, readq will be replaced with 2
readl operation. If that is the case, should we protect the code against
it?

> 
> 
> > +
> > +	return sprintf(buf, "%x\n", (u8)FIELD_GET(PORT_CTRL_LATENCY, v));
> > +}
> > +
> > +static ssize_t
> > +ltr_store(struct device *dev, struct device_attribute *attr,
> > +	  const char *buf, size_t count)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +	void __iomem *base;
> > +	u8 ltr;
> > +	u64 v;
> > +
> > +	if (kstrtou8(buf, 0, &ltr) || ltr > 1)
> > +		return -EINVAL;
> 
> Are you doing anything with this value?  If not, how about just using
> the sysfs boolean read function and if it is 1, then do the clearing?
> 
> Same for all other show/store functions in here.

Sure, will fix this in the next version.

Thanks a lot for the comments.

Hao

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v3 03/12] fpga: dfl: pci: enable SRIOV support.
From: Wu Hao @ 2019-07-24 13:37 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Zhang Yi Z, Xu Yilun
In-Reply-To: <20190724093744.GC29532@kroah.com>

On Wed, Jul 24, 2019 at 11:37:44AM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:26PM +0800, Wu Hao wrote:
> > This patch enables the standard sriov support. It allows user to
> > enable SRIOV (and VFs), then user could pass through accelerators
> > (VFs) into virtual machine or use VFs directly in host.
> > 
> > Signed-off-by: Zhang Yi Z <yi.z.zhang@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > Acked-by: Alan Tull <atull@kernel.org>
> > Acked-by: Moritz Fischer <mdf@kernel.org>
> > Signed-off-by: Moritz Fischer <mdf@kernel.org>
> > ---
> > v2: remove DRV/MODULE_VERSION modifications.
> > ---
> >  drivers/fpga/dfl-pci.c | 39 +++++++++++++++++++++++++++++++++++++++
> >  drivers/fpga/dfl.c     | 41 +++++++++++++++++++++++++++++++++++++++++
> >  drivers/fpga/dfl.h     |  1 +
> >  3 files changed, 81 insertions(+)
> > 
> > diff --git a/drivers/fpga/dfl-pci.c b/drivers/fpga/dfl-pci.c
> > index 66b5720..0e65d81 100644
> > --- a/drivers/fpga/dfl-pci.c
> > +++ b/drivers/fpga/dfl-pci.c
> > @@ -223,8 +223,46 @@ int cci_pci_probe(struct pci_dev *pcidev, const struct pci_device_id *pcidevid)
> >  	return ret;
> >  }
> >  
> > +static int cci_pci_sriov_configure(struct pci_dev *pcidev, int num_vfs)
> > +{
> > +	struct cci_drvdata *drvdata = pci_get_drvdata(pcidev);
> > +	struct dfl_fpga_cdev *cdev = drvdata->cdev;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&cdev->lock);
> > +
> > +	if (!num_vfs) {
> > +		/*
> > +		 * disable SRIOV and then put released ports back to default
> > +		 * PF access mode.
> > +		 */
> > +		pci_disable_sriov(pcidev);
> > +
> > +		__dfl_fpga_cdev_config_port_vf(cdev, false);
> > +
> > +	} else if (cdev->released_port_num == num_vfs) {
> > +		/*
> > +		 * only enable SRIOV if cdev has matched released ports, put
> > +		 * released ports into VF access mode firstly.
> > +		 */
> > +		__dfl_fpga_cdev_config_port_vf(cdev, true);
> > +
> > +		ret = pci_enable_sriov(pcidev, num_vfs);
> > +		if (ret)
> > +			__dfl_fpga_cdev_config_port_vf(cdev, false);
> > +	} else {
> > +		ret = -EINVAL;
> > +	}
> > +
> > +	mutex_unlock(&cdev->lock);
> > +	return ret;
> > +}
> > +
> >  static void cci_pci_remove(struct pci_dev *pcidev)
> >  {
> > +	if (dev_is_pf(&pcidev->dev))
> > +		cci_pci_sriov_configure(pcidev, 0);
> > +
> >  	cci_remove_feature_devs(pcidev);
> >  	pci_disable_pcie_error_reporting(pcidev);
> >  }
> > @@ -234,6 +272,7 @@ static void cci_pci_remove(struct pci_dev *pcidev)
> >  	.id_table = cci_pcie_id_tbl,
> >  	.probe = cci_pci_probe,
> >  	.remove = cci_pci_remove,
> > +	.sriov_configure = cci_pci_sriov_configure,
> >  };
> >  
> >  module_pci_driver(cci_pci_driver);
> > diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> > index e04ed45..c3a8e1d 100644
> > --- a/drivers/fpga/dfl.c
> > +++ b/drivers/fpga/dfl.c
> > @@ -1112,6 +1112,47 @@ int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev, int port_id,
> >  }
> >  EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
> >  
> > +static void config_port_vf(struct device *fme_dev, int port_id, bool is_vf)
> > +{
> > +	void __iomem *base;
> > +	u64 v;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(fme_dev, FME_FEATURE_ID_HEADER);
> > +
> > +	v = readq(base + FME_HDR_PORT_OFST(port_id));
> > +
> > +	v &= ~FME_PORT_OFST_ACC_CTRL;
> > +	v |= FIELD_PREP(FME_PORT_OFST_ACC_CTRL,
> > +			is_vf ? FME_PORT_OFST_ACC_VF : FME_PORT_OFST_ACC_PF);
> > +
> > +	writeq(v, base + FME_HDR_PORT_OFST(port_id));
> > +}
> > +
> > +/**
> > + * __dfl_fpga_cdev_config_port_vf - configure port to VF access mode
> > + *
> > + * @cdev: parent container device.
> > + * @if_vf: true for VF access mode, and false for PF access mode
> > + *
> > + * Return: 0 on success, negative error code otherwise.
> > + *
> > + * This function is needed in sriov configuration routine. It could be used to
> > + * configures the released ports access mode to VF or PF.
> > + * The caller needs to hold lock for protection.
> > + */
> > +void __dfl_fpga_cdev_config_port_vf(struct dfl_fpga_cdev *cdev, bool is_vf)
> > +{
> > +	struct dfl_feature_platform_data *pdata;
> > +
> > +	list_for_each_entry(pdata, &cdev->port_dev_list, node) {
> > +		if (device_is_registered(&pdata->dev->dev))
> > +			continue;
> > +
> > +		config_port_vf(cdev->fme_dev, pdata->id, is_vf);
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(__dfl_fpga_cdev_config_port_vf);
> 
> Why are you exporting a function with a leading __?
> 
> You are expecting someone else, in who knows what code, to do locking
> correctly?  If so, and the caller always has to have a local lock, then
> it's not a big deal, just drop the '__', otherwise if you have to have a
> specific lock for a specific device, then you have a really complex and
> probably broken api here :(

Yes, I just want to remind the user of this API, caller needs to hold the
lock to protect the list. I fully agree, it does make sense to make the
APIs easy to use. I will try to improve this, maybe move the lock inside
this function, then API user doesn't need to know the details of locking.

Thanks a lot for the comments, it really helps.

Hao

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v3 02/12] fpga: dfl: fme: add DFL_FPGA_FME_PORT_RELEASE/ASSIGN ioctl support.
From: Wu Hao @ 2019-07-24 13:47 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Zhang Yi Z, Xu Yilun
In-Reply-To: <20190724093357.GA29532@kroah.com>

On Wed, Jul 24, 2019 at 11:33:57AM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:25PM +0800, Wu Hao wrote:
> > +/**
> > + * dfl_fpga_cdev_config_port - configure a port feature dev
> > + * @cdev: parent container device.
> > + * @port_id: id of the port feature device.
> > + * @release: release port or assign port back.
> > + *
> > + * This function allows user to release port platform device or assign it back.
> > + * e.g. to safely turn one port from PF into VF for PCI device SRIOV support,
> > + * release port platform device is one necessary step.
> > + */
> > +int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev, int port_id,
> > +			      bool release)
> > +{
> > +	return release ? detach_port_dev(cdev, port_id) :
> > +			 attach_port_dev(cdev, port_id);
> > +}
> > +EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
> 
> That's a horrible api.  Every time you see this call in code, you have
> to go and look up what "bool" means here.  There's no reason for it.
> 
> Just have 2 different functions, one that attaches a port, and one that
> detaches it.  That way when you read the code that calls this function,
> you know what it does instantly without having to go look up some api
> function somewhere else.
> 
> Write code for people to read first.  And you are saving nothing here by
> trying to do two different things in the same exact function.

I see, you're right, it saves everybody's time on reading, very important.
I will fix this and keep it in mind. Thank you.

Hao

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v3 01/12] fpga: dfl: fme: support 512bit data width PR
From: Wu Hao @ 2019-07-24 14:22 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Ananda Ravuri, Xu Yilun
In-Reply-To: <20190724093532.GB29532@kroah.com>

On Wed, Jul 24, 2019 at 11:35:32AM +0200, Greg KH wrote:
> On Tue, Jul 23, 2019 at 12:51:24PM +0800, Wu Hao wrote:
> > In early partial reconfiguration private feature, it only
> > supports 32bit data width when writing data to hardware for
> > PR. 512bit data width PR support is an important optimization
> > for some specific solutions (e.g. XEON with FPGA integrated),
> > it allows driver to use AVX512 instruction to improve the
> > performance of partial reconfiguration. e.g. programming one
> > 100MB bitstream image via this 512bit data width PR hardware
> > only takes ~300ms, but 32bit revision requires ~3s per test
> > result.
> > 
> > Please note now this optimization is only done on revision 2
> > of this PR private feature which is only used in integrated
> > solution that AVX512 is always supported. This revision 2
> > hardware doesn't support 32bit PR.
> > 
> > Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > Acked-by: Alan Tull <atull@kernel.org>
> > Signed-off-by: Moritz Fischer <mdf@kernel.org>
> > ---
> > v2: remove DRV/MODULE_VERSION modifications
> > ---
> >  drivers/fpga/dfl-fme-mgr.c | 110 ++++++++++++++++++++++++++++++++++++++-------
> >  drivers/fpga/dfl-fme-pr.c  |  43 +++++++++++-------
> >  drivers/fpga/dfl-fme.h     |   2 +
> >  drivers/fpga/dfl.h         |   5 +++
> >  4 files changed, 129 insertions(+), 31 deletions(-)
> > 
> > diff --git a/drivers/fpga/dfl-fme-mgr.c b/drivers/fpga/dfl-fme-mgr.c
> > index b3f7eee..46e17f0 100644
> > --- a/drivers/fpga/dfl-fme-mgr.c
> > +++ b/drivers/fpga/dfl-fme-mgr.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/fpga/fpga-mgr.h>
> >  
> > +#include "dfl.h"
> >  #include "dfl-fme-pr.h"
> >  
> >  /* FME Partial Reconfiguration Sub Feature Register Set */
> > @@ -30,6 +31,7 @@
> >  #define FME_PR_STS		0x10
> >  #define FME_PR_DATA		0x18
> >  #define FME_PR_ERR		0x20
> > +#define FME_PR_512_DATA		0x40 /* Data Register for 512bit datawidth PR */
> >  #define FME_PR_INTFC_ID_L	0xA8
> >  #define FME_PR_INTFC_ID_H	0xB0
> >  
> > @@ -67,8 +69,43 @@
> >  #define PR_WAIT_TIMEOUT   8000000
> >  #define PR_HOST_STATUS_IDLE	0
> >  
> > +#if defined(CONFIG_X86) && defined(CONFIG_AS_AVX512)
> > +
> > +#include <linux/cpufeature.h>
> > +#include <asm/fpu/api.h>
> > +
> > +static inline int is_cpu_avx512_enabled(void)
> > +{
> > +	return cpu_feature_enabled(X86_FEATURE_AVX512F);
> > +}
> 
> That's a very arch specific function, why would a driver ever care about
> this?

Yes, this is only applied to a specific FPGA solution, which FPGA
has been integrated with XEON. Hardware indicates this using register
to software. As it's cpu integrated solution, so CPU always has this
AVX512 capability. The only check we do, is make sure this is not
manually disabled by kernel.

With this hardware, software could use AVX512 to accelerate the FPGA
partial reconfiguration as mentioned in the patch commit message.
It brings performance benifits to people who uses it. This is only one
optimization (512 vs 32bit data write to hw) for a specific hardware.

For other discrete solutions, e.g. FPGA PCIe Card, this is not used
at all as driver does check hardware register to avoid any AVX512 code.

> 
> > +
> > +static inline void copy512(const void *src, void __iomem *dst)
> > +{
> > +	kernel_fpu_begin();
> > +
> > +	asm volatile("vmovdqu64 (%0), %%zmm0;"
> > +		     "vmovntdq %%zmm0, (%1);"
> > +		     :
> > +		     : "r"(src), "r"(dst)
> > +		     : "memory");
> > +
> > +	kernel_fpu_end();
> > +}
> 
> Shouldn't this be an arch-specific function somewhere?  Burying this in
> a random driver is not ok.  Please make this generic for all systems.

If more people need the same avx operation like this in kernel, then maybe
this can be moved to some arch-specific lib code somewhere as some common
functions to everybody, but i am not very sure if this is the case. Let me
think about this more.

> 
> > +#else
> > +static inline int is_cpu_avx512_enabled(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline void copy512(const void *src, void __iomem *dst)
> > +{
> > +	WARN_ON_ONCE(1);
> 
> Are you trying to get reports from syzbot?  :)

Oh.. no.. I will remove it. :)

Thank you very much!

Hao

> 
> Please fix this all up.
> 
> greg k-h

^ permalink raw reply

* [PATCH 2/5] pidfd: add pidfd_wait()
From: Christian Brauner @ 2019-07-24 14:46 UTC (permalink / raw)
  To: linux-kernel, oleg
  Cc: arnd, ebiederm, keescook, joel, tglx, tj, dhowells, jannh, luto,
	akpm, cyphar, torvalds, viro, kernel-team, Christian Brauner,
	linux-api
In-Reply-To: <20190724144651.28272-1-christian@brauner.io>

This adds the pidfd_wait() syscall.

One of the last remaining bits for the pidfd api is to make it possible
to wait on pidfds. With this syscall implemented parts of userspace that
want to use this api can finally switch to managing processes completely
through pidfds if they so desire (cf. [1]).

The pidfd_wait() syscall does not allow scoping of the process
identified by the pidfd, i.e. it explicitly does not try to mirror the
behavior of: wait4(-1), wait4(0), waitid(P_ALL), waitid(P_PGID) etc. It
only allows for semantics equivalent to wait4(pid), waitid(P_PID). Users
that need scoping should rely on pid-based wait*() syscalls for now.

pidfd_wait() allows to specify which changes to wait for. The states to
wait for can be or-ed and are specified in the states argument:
	WEXITED		Wait for children that have terminated.
	WSTOPPED	Wait for children that have been stopped by
			delivery of a signal.
	WCONTINUED	Wait for (previously stopped) children that have
			been resumed by delivery of SIGCONT.
	WUNTRACED	Return if a child has stopped.

The behavior of pidfd_wait() can be further modified by specifying the
following or-able options in the flags argument:
	__WCLONE	Only wait for a process that delivers no signal
			or a different signal than SIGCHLD to the parent
			on termination.
	__WALL		Wait for all children indepedent of whether or
			not they deliver no signal or another signal
			than SIGCHLD to the parent on termination.
			parent
	__WNOTHREAD	Do not wait for children of other threads in the
			same thread-group.
	WNOHANG		Return immediately if no child has exited.
	WNOWAIT		Leave the child in a waitable state.

pidfd_wait() takes an additional siginfo_t argument. If it is non-NULL,
pidfd_wait() will fill in si_pid, si_uid, si_signo, si_status, and
si_code. The si_code field will be set to one of CLD_EXITED, CLD_KILLED,
CLD_DUMPED, CLD_STOPPED, CLD_TRAPPED, or CLD_CONTINUED.
Information about resource usage of the process in question is returned
in the struct rusage argument of pidfd_wait().

On success, pidfd_wait() will return the pid of the process the pidfd
referred to. On failure, a negative error code will be returned.

/* Prior approach */
The first implementation was based on a flag WPIDFD which got added to
the wait*() system calls. However, that involved passing the pidfd
through the pid_t pid argument and do in-kernel type switching based on
the flag which feels like a really unclean solution and overall like a
mishmash of two apis. This is something we luckily have avoided so far
and I think we're better off in the long run if we keep it that way.

/* References */
[1]: https://github.com/systemd/systemd/issues/13101

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
---
 include/linux/pid.h |  5 +++
 kernel/exit.c       | 87 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c       |  8 +++++
 kernel/signal.c     |  7 ++--
 4 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 2a83e434db9d..443cd4108943 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -72,6 +72,11 @@ extern struct pid init_struct_pid;
 
 extern const struct file_operations pidfd_fops;
 
+struct file;
+
+extern struct pid *pidfd_pid(const struct file *file);
+
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/kernel/exit.c b/kernel/exit.c
index 73392a455b72..8086c76e1959 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1738,3 +1738,90 @@ __weak void abort(void)
 	panic("Oops failed to kill thread");
 }
 EXPORT_SYMBOL(abort);
+
+static int copy_rusage_to_user_any(struct rusage *kru, struct rusage __user *ru)
+{
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall())
+		return put_compat_rusage(kru, (struct compat_rusage __user *)ru);
+#endif
+	return copy_to_user(ru, kru, sizeof(*kru));
+}
+
+static int copy_siginfo_to_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
+{
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall())
+		return copy_siginfo_to_user32(
+			(struct compat_siginfo __user *)info, kinfo);
+#endif
+	return copy_siginfo_to_user(info, kinfo);
+}
+
+SYSCALL_DEFINE6(pidfd_wait, int, pidfd, int __user *, stat_addr,
+		siginfo_t __user *, info, struct rusage __user *, ru,
+		unsigned int, states, unsigned int, flags)
+{
+	long ret;
+	struct fd f;
+	struct pid *pid;
+	struct wait_opts wo;
+	struct rusage kru = {};
+	kernel_siginfo_t kinfo = {
+		.si_signo = 0,
+	};
+
+	if (pidfd < 0)
+		return -EINVAL;
+
+	if (states & ~(WEXITED | WSTOPPED | WCONTINUED | WUNTRACED))
+		return -EINVAL;
+
+	if (!(states & (WEXITED | WSTOPPED | WCONTINUED | WUNTRACED)))
+		return -EINVAL;
+
+	if (flags & ~(__WNOTHREAD | __WCLONE | __WALL | WNOWAIT | WNOHANG))
+		return -EINVAL;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_pid(f.file);
+	if (IS_ERR(pid)) {
+		ret = PTR_ERR(pid);
+		goto out_fdput;
+	}
+
+	wo = (struct wait_opts){
+		.wo_type	= PIDTYPE_PID,
+		.wo_pid		= pid,
+		.wo_flags	= states | flags,
+		.wo_info	= info ? &kinfo : NULL,
+		.wo_rusage	= ru ? &kru : NULL,
+	};
+
+	ret = do_wait(&wo);
+	if (ret > 0) {
+		kinfo.si_signo = SIGCHLD;
+
+		if (stat_addr && put_user(wo.wo_stat, stat_addr)) {
+			ret = -EFAULT;
+			goto out_fdput;
+		}
+
+		if (ru && copy_rusage_to_user_any(&kru, ru)) {
+			ret = -EFAULT;
+			goto out_fdput;
+		}
+	} else {
+		kinfo.si_signo = 0;
+	}
+
+	if (info && copy_siginfo_to_user_any(&kinfo, info))
+		ret = -EFAULT;
+
+out_fdput:
+	fdput(f);
+	return ret;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index d8ae0f1b4148..baaff6570517 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1743,6 +1743,14 @@ const struct file_operations pidfd_fops = {
 #endif
 };
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op == &pidfd_fops)
+		return file->private_data;
+
+	return ERR_PTR(-EBADF);
+}
+
 static void __delayed_free_task(struct rcu_head *rhp)
 {
 	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
diff --git a/kernel/signal.c b/kernel/signal.c
index 91b789dd6e72..2e567f64812f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3672,8 +3672,11 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 
 static struct pid *pidfd_to_pid(const struct file *file)
 {
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
+	struct pid *pid;
+
+	pid = pidfd_pid(file);
+	if (!IS_ERR(pid))
+		return pid;
 
 	return tgid_pidfd_to_pid(file);
 }
-- 
2.22.0

^ permalink raw reply related

* [PATCH 3/5] arch: wire-up pidfd_wait()
From: Christian Brauner @ 2019-07-24 14:46 UTC (permalink / raw)
  To: linux-kernel, oleg
  Cc: arnd, ebiederm, keescook, joel, tglx, tj, dhowells, jannh, luto,
	akpm, cyphar, torvalds, viro, kernel-team, Christian Brauner,
	linux-api, linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-arch, x86
In-Reply-To: <20190724144651.28272-1-christian@brauner.io>

This wires up the pidfd_wait() syscall into all arches at once.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 4 +++-
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/linux/syscalls.h                    | 4 ++++
 include/uapi/asm-generic/unistd.h           | 4 +++-
 20 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..ca3e593f0c7a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 543	common	fspick				sys_fspick
 544	common	pidfd_open			sys_pidfd_open
 # 545 reserved for clone3
+548	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..5e448d915b2f 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..b722e47377a5 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		436
+#define __NR_compat_syscalls		439
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..ca77c9d4f7a1 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -877,7 +877,9 @@ __SYSCALL(__NR_fsmount, sys_fsmount)
 __SYSCALL(__NR_fspick, sys_fspick)
 #define __NR_pidfd_open 434
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
-#define __NR_clone3 435
+#define __NR_pidfd_wait 438
+__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
+#define __NR_clone3 439
 __SYSCALL(__NR_clone3, sys_clone3)
 
 /*
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..f038afaced9b 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..51f86f7b4cec 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..24f912ac5dfa 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index c9c879ec9b6d..edc144c4040c 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
 433	n32	fspick				sys_fspick
 434	n32	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	n32	pidfd_wait			sys_pidfd_wait
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index bbce9159caa1..da4486ea0f4f 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
 433	n64	fspick				sys_fspick
 434	n64	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	n64	pidfd_wait			sys_pidfd_wait
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 9653591428ec..d738688e50d8 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
 433	o32	fspick				sys_fspick
 434	o32	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	o32	pidfd_wait			sys_pidfd_wait
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 670d1371aca1..d60f44d8145c 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -432,3 +432,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3_wrapper
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 3331749aab20..3309bf5f5370 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index a90d3e945445..ef8ba9a9c3bb 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 433  common	fspick			sys_fspick			sys_fspick
 434  common	pidfd_open		sys_pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438  common	pidfd_wait		sys_pidfd_wait			sys_pidfd_wait
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..9e786a198bfd 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..ef4f13907894 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c00019abd076..76ec8c905745 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
 434	i386	pidfd_open		sys_pidfd_open			__ia32_sys_pidfd_open
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
+438	i386	pidfd_wait		sys_pidfd_wait			__ia32_sys_pidfd_wait
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..733c206130f8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
 433	common	fspick			__x64_sys_fspick
 434	common	pidfd_open		__x64_sys_pidfd_open
 435	common	clone3			__x64_sys_clone3/ptregs
+438	common	pidfd_wait		__x64_sys_pidfd_wait
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..417203971292 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+438	common	pidfd_wait			sys_pidfd_wait
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 88145da7d140..760e8eacb93c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -932,6 +932,10 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
 asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_setns(int fd, int nstype);
 asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
+asmlinkage long sys_pidfd_wait(int pidfd, int __user *stat_addr,
+			       struct siginfo __user *info,
+			       struct rusage __user *ru, unsigned int states,
+			       unsigned int flags);
 asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
 			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1be0e798e362..0dd5b9d4dba0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -850,9 +850,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
+#define __NR_pidfd_wait 438
+__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
 
 #undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 439
 
 /*
  * 32 bit systems traditionally used different
-- 
2.22.0

^ permalink raw reply related

* [PATCH 4/5] pidfd: add CLONE_WAIT_PID
From: Christian Brauner @ 2019-07-24 14:46 UTC (permalink / raw)
  To: linux-kernel, oleg
  Cc: arnd, ebiederm, keescook, joel, tglx, tj, dhowells, jannh, luto,
	akpm, cyphar, torvalds, viro, kernel-team, Christian Brauner,
	Ingo Molnar, Peter Zijlstra, linux-api
In-Reply-To: <20190724144651.28272-1-christian@brauner.io>

If CLONE_WAIT_PID is set the newly created process will not be
considered by process wait requests that wait generically on children
such as:

	syscall(__NR_wait4, -1, wstatus, options, rusage)
	syscall(__NR_waitpid, -1, wstatus, options)
	syscall(__NR_waitid, P_ALL, -1, siginfo, options, rusage)
	syscall(__NR_waitid, P_PGID, -1, siginfo, options, rusage)
	syscall(__NR_waitpid, -pid, wstatus, options)
	syscall(__NR_wait4, -pid, wstatus, options, rusage)

A process created with CLONE_WAIT_PID can only be waited upon with a
focussed wait call. This ensures that processes can be reaped even if
all file descriptors referring to it are closed.

/* Usecases */
This feature has been requested in discussions when I presented this
work multiple times. Here are concrete use cases people have:
1. Process managers that would like to use pidfd for all process
   watching needs require this feature.
   A process manager (e.g. PID 1) that needs to reap all children
   assigned to it needs to invoke some form of waitall request as
   outlined above.  This has to be done since the process manager might
   not know about processes that got re-parented to it. Without
   CLONE_WAIT_PID the process manager will end up reaping processes it
   uses pidfds to watch for since they are crucial internal processes.
2. Various libraries want to be able to fork off helper processes
   internally that do not otherwise affect the program they are used in.
   This is currently not possible.
   However, if a process invokes a waitall request the internal
   helper process of the library might get reaped, confusing the library
   which expected it to reap it itself.
   Careful programs will thus generally avoid waitall requests which is
   inefficient.
3. A general class of programs are ones that use event loops (e.g. GLib,
   systemd, and LXC etc.). Such event loops currently call focused wait
   requests iteratively on all processes they are configured to watch to
   avoid waitall request pitfalls.
   This is ugly and inefficient since it cannot be used to watch large
   numbers of file descriptors without paying the O(n) cost on each
   event loop iteration.

/* Prior art */
FreeBSD has a similar concept (cf. [1], [2]). They are currently doing
it the other way around, i.e. by default all procdescs are not visible
in waitall requests. Howver, originally, they allowed procdescs to
appear in waitall and changed it later (cf. [1]).

Currently, CLONE_WAIT_PID can only be used in conjunction with
CLONE_PIDFD since the usecases above only make sense when used in
combination with both.

/* References */
[1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201054
[2]: https://svnweb.freebsd.org/base/head/sys/kern/kern_exit.c

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-api@vger.kernel.org
---
 include/linux/sched.h      |  1 +
 include/uapi/linux/sched.h |  1 +
 kernel/exit.c              |  3 +++
 kernel/fork.c              | 11 ++++++++++-
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8dc1811487f5..f0166f630a1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1468,6 +1468,7 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_WAIT_PID		0x20000000	/* This task will not appear in generic wait requests */
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..ffb1cac18e4e 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -32,6 +32,7 @@
 #define CLONE_NEWPID		0x20000000	/* New pid namespace */
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
+#define CLONE_WAIT_PID		0x200000000ULL  /* set if process should not appear in generic wait requests */
 
 /*
  * Arguments for the clone3 syscall
diff --git a/kernel/exit.c b/kernel/exit.c
index 8086c76e1959..aa15de1108b2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1019,6 +1019,9 @@ eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
 	if (!eligible_pid(wo, p))
 		return 0;
 
+	if ((p->flags & PF_WAIT_PID) && (wo->wo_type != PIDTYPE_PID))
+		return 0;
+
 	/*
 	 * Wait for all children (clone and not) if __WALL is set or
 	 * if it is traced by us.
diff --git a/kernel/fork.c b/kernel/fork.c
index baaff6570517..a067f3876e2e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1910,6 +1910,8 @@ static __latent_entropy struct task_struct *copy_process(
 	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
 	p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
 	p->flags |= PF_FORKNOEXEC;
+	if (clone_flags & CLONE_WAIT_PID)
+		p->flags |= PF_WAIT_PID;
 	INIT_LIST_HEAD(&p->children);
 	INIT_LIST_HEAD(&p->sibling);
 	rcu_copy_process(p);
@@ -2590,7 +2592,7 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
 	 * All lower bits of the flag word are taken.
 	 * Verify that no other unknown flags are passed along.
 	 */
-	if (kargs->flags & ~CLONE_LEGACY_FLAGS)
+	if (kargs->flags & ~(CLONE_LEGACY_FLAGS | CLONE_WAIT_PID))
 		return false;
 
 	/*
@@ -2600,6 +2602,13 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
 	if (kargs->flags & (CLONE_DETACHED | CSIGNAL))
 		return false;
 
+	/*
+	 * Currently only allow CLONE_WAIT_PID for processes created as
+	 * pidfds until someone needs this feature for regular pids too.
+	 */
+	if ((kargs->flags & CLONE_WAIT_PID) && !(kargs->flags & CLONE_PIDFD))
+		return false;
+
 	if ((kargs->flags & (CLONE_THREAD | CLONE_PARENT)) &&
 	    kargs->exit_signal)
 		return false;
-- 
2.22.0

^ permalink raw reply related

* Re: [PATCH v4 0/3] initramfs: add support for xattrs in the initial ram disk
From: Roberto Sassu @ 2019-07-24 15:34 UTC (permalink / raw)
  To: Rob Landley, hpa, Arvind Sankar
  Cc: Mimi Zohar, viro, linux-security-module, linux-integrity,
	initramfs, linux-api, linux-fsdevel, linux-kernel, bug-cpio,
	zohar, silviu.vlasceanu, dmitry.kasatkin, takondra, kamensky,
	arnd, james.w.mcmechan
In-Reply-To: <f85ed711-f583-51cd-34e2-80018a592280@huawei.com>

Is there anything I didn't address in this patch set, that is delaying
the review? I would appreciate if you can give me a feedback, positive
or negative.

Thanks a lot!

Roberto


On 7/15/2019 6:54 PM, Roberto Sassu wrote:
> Rob, Peter, Arvind, did you have the chance to have a look at this
> version of the patch set?
> 
> Thanks
> 
> Roberto
> 
> 
> On 7/1/2019 4:31 PM, Mimi Zohar wrote:
>> On Mon, 2019-07-01 at 16:42 +0300, Roberto Sassu wrote:
>>> On 6/30/2019 6:39 PM, Mimi Zohar wrote:
>>>> On Wed, 2019-06-26 at 10:15 +0200, Roberto Sassu wrote:
>>>>> On 6/3/2019 8:32 PM, Rob Landley wrote:
>>>>>> On 6/3/19 4:31 AM, Roberto Sassu wrote:
>>>>>>>> This patch set aims at solving the following use case: appraise 
>>>>>>>> files from
>>>>>>>> the initial ram disk. To do that, IMA checks the signature/hash 
>>>>>>>> from the
>>>>>>>> security.ima xattr. Unfortunately, this use case cannot be 
>>>>>>>> implemented
>>>>>>>> currently, as the CPIO format does not support xattrs.
>>>>>>>>
>>>>>>>> This proposal consists in including file metadata as additional 
>>>>>>>> files named
>>>>>>>> METADATA!!!, for each file added to the ram disk. The CPIO 
>>>>>>>> parser in the
>>>>>>>> kernel recognizes these special files from the file name, and 
>>>>>>>> calls the
>>>>>>>> appropriate parser to add metadata to the previously extracted 
>>>>>>>> file. It has
>>>>>>>> been proposed to use bit 17:16 of the file mode as a way to 
>>>>>>>> recognize files
>>>>>>>> with metadata, but both the kernel and the cpio tool declare the 
>>>>>>>> file mode
>>>>>>>> as unsigned short.
>>>>>>>
>>>>>>> Any opinion on this patch set?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Roberto
>>>>>>
>>>>>> Sorry, I've had the window open since you posted it but haven't 
>>>>>> gotten around to
>>>>>> it. I'll try to build it later today.
>>>>>>
>>>>>> It does look interesting, and I have no objections to the basic 
>>>>>> approach. I
>>>>>> should be able to add support to toybox cpio over a weekend once 
>>>>>> I've got the
>>>>>> kernel doing it to test against.
>>>>>
>>>>> Ok.
>>>>>
>>>>> Let me give some instructions so that people can test this patch set.
>>>>>
>>>>> To add xattrs to the ram disk embedded in the kernel it is sufficient
>>>>> to set CONFIG_INITRAMFS_FILE_METADATA="xattr" and
>>>>> CONFIG_INITRAMFS_SOURCE="<file with xattr>" in the kernel 
>>>>> configuration.
>>>>>
>>>>> To add xattrs to the external ram disk, it is necessary to patch cpio:
>>>>>
>>>>> https://github.com/euleros/cpio/commit/531cabc88e9ecdc3231fad6e4856869baa9a91ef 
>>>>>
>>>>> (xattr-v1 branch)
>>>>>
>>>>> and dracut:
>>>>>
>>>>> https://github.com/euleros/dracut/commit/a2dee56ea80495c2c1871bc73186f7b00dc8bf3b 
>>>>>
>>>>> (digest-lists branch)
>>>>>
>>>>> The same modification can be done for mkinitramfs (add '-e xattr' 
>>>>> to the
>>>>> cpio command line).
>>>>>
>>>>> To simplify the test, it would be sufficient to replace only the cpio
>>>>> binary and the dracut script with the modified versions. For 
>>>>> dracut, the
>>>>> patch should be applied to the local dracut (after it has been renamed
>>>>> to dracut.sh).
>>>>>
>>>>> Then, run:
>>>>>
>>>>> dracut -e xattr -I <file with xattr> (add -f to overwrite the ram 
>>>>> disk)
>>>>>
>>>>> Xattrs can be seen by stopping the boot process for example by adding
>>>>> rd.break to the kernel command line.
>>>>
>>>> A simple way of testing, without needing any changes other than the
>>>> kernel patches, is to save the dracut temporary directory by supplying
>>>> "--keep" on the dracut command line, calling
>>>> usr/gen_initramfs_list.sh, followed by usr/gen_init_cpio with the "-e
>>>> xattr" option.
>>>
>>> Alternatively, follow the instructions to create the embedded ram disk
>>> with xattrs, and use the existing external ram disk created with dracut
>>> to check if xattrs are created.
>>
>> True, but this alternative is for those who normally use dracut to
>> create an initramfs, but don't want to update cpio or dracut.
>>
>> Mimi
>>
> 

-- 
HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
Managing Director: Li Peng, Li Jian, Shi Yanli

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox