Linux userland API discussions
 help / color / mirror / Atom feed
* [PATCH v5 9/9] Documentation: fpga: dfl: add descriptions for virtualization and new interfaces.
From: Wu Hao @ 2019-08-12  2:50 UTC (permalink / raw)
  To: gregkh, mdf, linux-fpga
  Cc: linux-kernel, linux-api, linux-doc, atull, Wu Hao, Xu Yilun
In-Reply-To: <1565578204-13969-1-git-send-email-hao.wu@intel.com>

This patch adds virtualization support description for DFL based
FPGA devices (based on PCIe SRIOV), and introductions to new
interfaces added by new dfl private feature drivers.

[mdf@kernel.org: Fixed up to make it work with new reStructuredText docs]
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
Signed-off-by: Moritz Fischer <mdf@kernel.org>
---
 Documentation/fpga/dfl.rst | 105 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
index 2f125ab..6fa483f 100644
--- a/Documentation/fpga/dfl.rst
+++ b/Documentation/fpga/dfl.rst
@@ -87,6 +87,8 @@ The following functions are exposed through ioctls:
 - Get driver API version (DFL_FPGA_GET_API_VERSION)
 - Check for extensions (DFL_FPGA_CHECK_EXTENSION)
 - Program bitstream (DFL_FPGA_FME_PORT_PR)
+- Assign port to PF (DFL_FPGA_FME_PORT_ASSIGN)
+- Release port from PF (DFL_FPGA_FME_PORT_RELEASE)
 
 More functions are exposed through sysfs
 (/sys/class/fpga_region/regionX/dfl-fme.n/):
@@ -102,6 +104,10 @@ More functions are exposed through sysfs
      one FPGA device may have more than one port, this sysfs interface indicates
      how many ports the FPGA device has.
 
+ Global error reporting management (errors/)
+     error reporting sysfs interfaces allow user to read errors detected by the
+     hardware, and clear the logged errors.
+
 
 FIU - PORT
 ==========
@@ -143,6 +149,10 @@ More functions are exposed through sysfs:
  Read Accelerator GUID (afu_id)
      afu_id indicates which PR bitstream is programmed to this AFU.
 
+ Error reporting (errors/)
+     error reporting sysfs interfaces allow user to read port/afu errors
+     detected by the hardware, and clear the logged errors.
+
 
 DFL Framework Overview
 ======================
@@ -218,6 +228,101 @@ the compat_id exposed by the target FPGA region. This check is usually done by
 userspace before calling the reconfiguration IOCTL.
 
 
+FPGA virtualization - PCIe SRIOV
+================================
+This section describes the virtualization support on DFL based FPGA device to
+enable accessing an accelerator from applications running in a virtual machine
+(VM). This section only describes the PCIe based FPGA device with SRIOV support.
+
+Features supported by the particular FPGA device are exposed through Device
+Feature Lists, as illustrated below:
+
+::
+
+    +-------------------------------+  +-------------+
+    |              PF               |  |     VF      |
+    +-------------------------------+  +-------------+
+        ^            ^         ^              ^
+        |            |         |              |
+  +-----|------------|---------|--------------|-------+
+  |     |            |         |              |       |
+  |  +-----+     +-------+ +-------+      +-------+   |
+  |  | FME |     | Port0 | | Port1 |      | Port2 |   |
+  |  +-----+     +-------+ +-------+      +-------+   |
+  |                  ^         ^              ^       |
+  |                  |         |              |       |
+  |              +-------+ +------+       +-------+   |
+  |              |  AFU  | |  AFU |       |  AFU  |   |
+  |              +-------+ +------+       +-------+   |
+  |                                                   |
+  |            DFL based FPGA PCIe Device             |
+  +---------------------------------------------------+
+
+FME is always accessed through the physical function (PF).
+
+Ports (and related AFUs) are accessed via PF by default, but could be exposed
+through virtual function (VF) devices via PCIe SRIOV. Each VF only contains
+1 Port and 1 AFU for isolation. Users could assign individual VFs (accelerators)
+created via PCIe SRIOV interface, to virtual machines.
+
+The driver organization in virtualization case is illustrated below:
+::
+
+    +-------++------++------+             |
+    | FME   || FME  || FME  |             |
+    | FPGA  || FPGA || FPGA |             |
+    |Manager||Bridge||Region|             |
+    +-------++------++------+             |
+    +-----------------------+  +--------+ |             +--------+
+    |          FME          |  |  AFU   | |             |  AFU   |
+    |         Module        |  | Module | |             | Module |
+    +-----------------------+  +--------+ |             +--------+
+          +-----------------------+       |       +-----------------------+
+          | FPGA Container Device |       |       | FPGA Container Device |
+          |  (FPGA Base Region)   |       |       |  (FPGA Base Region)   |
+          +-----------------------+       |       +-----------------------+
+            +------------------+          |         +------------------+
+            | FPGA PCIE Module |          | Virtual | FPGA PCIE Module |
+            +------------------+   Host   | Machine +------------------+
+   -------------------------------------- | ------------------------------
+             +---------------+            |          +---------------+
+             | PCI PF Device |            |          | PCI VF Device |
+             +---------------+            |          +---------------+
+
+FPGA PCIe device driver is always loaded first once a FPGA PCIe PF or VF device
+is detected. It:
+
+* Finishes enumeration on both FPGA PCIe PF and VF device using common
+  interfaces from DFL framework.
+* Supports SRIOV.
+
+The FME device driver plays a management role in this driver architecture, it
+provides ioctls to release Port from PF and assign Port to PF. After release
+a port from PF, then it's safe to expose this port through a VF via PCIe SRIOV
+sysfs interface.
+
+To enable accessing an accelerator from applications running in a VM, the
+respective AFU's port needs to be assigned to a VF using the following steps:
+
+#. The PF owns all AFU ports by default. Any port that needs to be
+   reassigned to a VF must first be released through the
+   DFL_FPGA_FME_PORT_RELEASE ioctl on the FME device.
+
+#. Once N ports are released from PF, then user can use command below
+   to enable SRIOV and VFs. Each VF owns only one Port with AFU.
+
+   ::
+
+      echo N > $PCI_DEVICE_PATH/sriov_numvfs
+
+#. Pass through the VFs to VMs
+
+#. The AFU under VF is accessible from applications in VM (using the
+   same driver inside the VF).
+
+Note that an FME can't be assigned to a VF, thus PR and other management
+functions are only available via the PF.
+
 Device enumeration
 ==================
 This section introduces how applications enumerate the fpga device from
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] syscalls: Update the syscall #defines to match uapi
From: Alistair Francis @ 2019-08-12  3:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alistair Francis, LKML, Linux API, Arnd Bergmann, Deepa Dinamani
In-Reply-To: <CALCETrVArr5TTbXayDZ9rz90iGoytGW2bnV5_ZOunhOsPR1u4g@mail.gmail.com>

On Sun, Aug 11, 2019 at 5:00 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Aug 9, 2019 at 6:11 PM Alistair Francis
> <alistair.francis@wdc.com> wrote:
> >
> > Update the #defines around sys_fstat64() and sys_fstatat64() to match
> > the #defines around the __NR3264_fstatat and __NR3264_fstat definitions
> > in include/uapi/asm-generic/unistd.h. This avoids compiler failures if
> > one is defined.
>
> What do you mean by "if one is defined"?

Yeah, I guess that isn't clear. What I mean is that if
__ARCH_WANT_NEW_STAT is defined but __ARCH_WANT_STAT64 isn't currently
it will fail to build.

>
> The patch seems like it's probably okay, but I can't understand the changelog.

I can send a v2 with an updated commit message.

Alistair

>
> --Andy

^ permalink raw reply

* Re: [PATCH] syscalls: Update the syscall #defines to match uapi
From: Arnd Bergmann @ 2019-08-12  9:48 UTC (permalink / raw)
  To: Alistair Francis
  Cc: Linux Kernel Mailing List, Linux API, Deepa Dinamani,
	Alistair Francis
In-Reply-To: <20190810010758.16407-1-alistair.francis@wdc.com>

On Sat, Aug 10, 2019 at 3:11 AM Alistair Francis
<alistair.francis@wdc.com> wrote:
>
> Update the #defines around sys_fstat64() and sys_fstatat64() to match
> the #defines around the __NR3264_fstatat and __NR3264_fstat definitions
> in include/uapi/asm-generic/unistd.h. This avoids compiler failures if
> one is defined.

What is the compiler failure you get?

> Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
> ---
>  include/linux/syscalls.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 2bcef4c70183..e4bf5e480d60 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -512,7 +512,7 @@ asmlinkage long sys_readlinkat(int dfd, const char __user *path, char __user *bu
>  asmlinkage long sys_newfstatat(int dfd, const char __user *filename,
>                                struct stat __user *statbuf, int flag);
>  asmlinkage long sys_newfstat(unsigned int fd, struct stat __user *statbuf);
> -#if defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_COMPAT_STAT64)
> +#if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64)
>  asmlinkage long sys_fstat64(unsigned long fd, struct stat64 __user *statbuf);
>  asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
>                                struct stat64 __user *statbuf, int flag);

I think this is wrong: when __ARCH_WANT_NEW_STAT is set, we are
on a 64-bit architecture and only want the sys_newfstat{,at} system
calls, not sys_fstat{,at}64 that gets used on 32-bit machines.

The #if check in the syscalls.h file also matches the definition of
the function.

       Arnd

^ permalink raw reply

* Re: [PATCH v7 07/16] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Theodore Y. Ts'o @ 2019-08-12 14:16 UTC (permalink / raw)
  To: linux-fscrypt, linux-fsdevel, linux-ext4, linux-f2fs-devel,
	linux-mtd, linux-api, linux-crypto, keyrings, Paul Crowley,
	Satya Tangirala
In-Reply-To: <20190802043827.GA19201@sol.localdomain>

On Thu, Aug 01, 2019 at 09:38:27PM -0700, Eric Biggers wrote:
> 
> Here's a slightly updated version (I missed removing some stale text):

Apologies for the delaying in getting back.  Thanks, this looks great.

	      	  	      	      	     - Ted

> 
> Removing keys
> -------------
> 
> Two ioctls are available for removing a key that was added by
> `FS_IOC_ADD_ENCRYPTION_KEY`_:
> 
> - `FS_IOC_REMOVE_ENCRYPTION_KEY`_
> - `FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS`_
> 
> These two ioctls differ only in cases where v2 policy keys are added
> or removed by non-root users.
> 
> These ioctls don't work on keys that were added via the legacy
> process-subscribed keyrings mechanism.
> 
> Before using these ioctls, read the `Kernel memory compromise`_
> section for a discussion of the security goals and limitations of
> these ioctls.
> 
> FS_IOC_REMOVE_ENCRYPTION_KEY
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The FS_IOC_REMOVE_ENCRYPTION_KEY ioctl removes a claim to a master
> encryption key from the filesystem, and possibly removes the key
> itself.  It can be executed on any file or directory on the target
> filesystem, but using the filesystem's root directory is recommended.
> It takes in a pointer to a :c:type:`struct fscrypt_remove_key_arg`,
> defined as follows::
> 
>     struct fscrypt_remove_key_arg {
>             struct fscrypt_key_specifier key_spec;
>     #define FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY      0x00000001
>     #define FSCRYPT_KEY_REMOVAL_STATUS_FLAG_OTHER_USERS     0x00000002
>             __u32 removal_status_flags;     /* output */
>             __u32 __reserved[5];
>     };
> 
> This structure must be zeroed, then initialized as follows:
> 
> - The key to remove is specified by ``key_spec``:
> 
>     - To remove a key used by v1 encryption policies, set
>       ``key_spec.type`` to FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR and fill
>       in ``key_spec.u.descriptor``.  To remove this type of key, the
>       calling process must have the CAP_SYS_ADMIN capability in the
>       initial user namespace.
> 
>     - To remove a key used by v2 encryption policies, set
>       ``key_spec.type`` to FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER and fill
>       in ``key_spec.u.identifier``.
> 
> For v2 policy keys, this ioctl is usable by non-root users.  However,
> to make this possible, it actually just removes the current user's
> claim to the key, undoing a single call to FS_IOC_ADD_ENCRYPTION_KEY.
> Only after all claims are removed is the key really removed.
> 
> For example, if FS_IOC_ADD_ENCRYPTION_KEY was called with uid 1000,
> then the key will be "claimed" by uid 1000, and
> FS_IOC_REMOVE_ENCRYPTION_KEY will only succeed as uid 1000.  Or, if
> both uids 1000 and 2000 added the key, then for each uid
> FS_IOC_REMOVE_ENCRYPTION_KEY will only remove their own claim.  Only
> once *both* are removed is the key really removed.  (Think of it like
> unlinking a file that may have hard links.)
> 
> If FS_IOC_REMOVE_ENCRYPTION_KEY really removes the key, it will also
> try to "lock" all files that had been unlocked with the key.  It won't
> lock files that are still in-use, so this ioctl is expected to be used
> in cooperation with userspace ensuring that none of the files are
> still open.  However, if necessary, the ioctl can be executed again
> later to retry locking any remaining files.
> 
> FS_IOC_REMOVE_ENCRYPTION_KEY returns 0 if either the key was removed
> (but may still have files remaining to be locked), the user's claim to
> the key was removed, or the key was already removed but had files
> remaining to be the locked so the ioctl retried locking them.  In any
> of these cases, ``removal_status_flags`` is filled in with the
> following informational status flags:
> 
> - ``FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY``: set if some file(s)
>   are still in-use.  Not guaranteed to be set in the case where only
>   the user's claim to the key was removed.
> - ``FSCRYPT_KEY_REMOVAL_STATUS_FLAG_OTHER_USERS``: set if only the
>   user's claim to the key was removed, not the key itself
> 
> FS_IOC_REMOVE_ENCRYPTION_KEY can fail with the following errors:
> 
> - ``EACCES``: The FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR key specifier type
>   was specified, but the caller does not have the CAP_SYS_ADMIN
>   capability in the initial user namespace
> - ``EINVAL``: invalid key specifier type, or reserved bits were set
> - ``ENOKEY``: the key object was not found at all, i.e. it was never
>   added in the first place or was already fully removed including all
>   files locked; or, the user does not have a claim to the key.
> - ``ENOTTY``: this type of filesystem does not implement encryption
> - ``EOPNOTSUPP``: the kernel was not configured with encryption
>   support for this filesystem, or the filesystem superblock has not
>   had encryption enabled on it
> 
> FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS is exactly the same as
> `FS_IOC_REMOVE_ENCRYPTION_KEY`_, except that for v2 policy keys, the
> ALL_USERS version of the ioctl will remove all users' claims to the
> key, not just the current user's.  I.e., the key itself will always be
> removed, no matter how many users have added it.  This difference is
> only meaningful if non-root users are adding and removing keys.
> 
> Because of this, FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS also requires
> "root", namely the CAP_SYS_ADMIN capability in the initial user
> namespace.  Otherwise it will fail with ``EACCES``.

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Joel Fernandes @ 2019-08-12 14:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, Alexey Dobriyan, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190808080044.GA18351@dhcp22.suse.cz>

On Thu, Aug 08, 2019 at 10:00:44AM +0200, Michal Hocko wrote:
> On Wed 07-08-19 17:31:05, Joel Fernandes wrote:
> > On Wed, Aug 07, 2019 at 01:58:40PM -0700, Andrew Morton wrote:
> > > On Wed, 7 Aug 2019 16:45:30 -0400 Joel Fernandes <joel@joelfernandes.org> wrote:
> > > 
> > > > On Wed, Aug 07, 2019 at 01:04:02PM -0700, Andrew Morton wrote:
> > > > > On Wed,  7 Aug 2019 13:15:54 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> > > > > 
> > > > > > In Android, we are using this for the heap profiler (heapprofd) which
> > > > > > profiles and pin points code paths which allocates and leaves memory
> > > > > > idle for long periods of time. This method solves the security issue
> > > > > > with userspace learning the PFN, and while at it is also shown to yield
> > > > > > better results than the pagemap lookup, the theory being that the window
> > > > > > where the address space can change is reduced by eliminating the
> > > > > > intermediate pagemap look up stage. In virtual address indexing, the
> > > > > > process's mmap_sem is held for the duration of the access.
> > > > > 
> > > > > So is heapprofd a developer-only thing?  Is heapprofd included in
> > > > > end-user android loads?  If not then, again, wouldn't it be better to
> > > > > make the feature Kconfigurable so that Android developers can enable it
> > > > > during development then disable it for production kernels?
> > > > 
> > > > Almost all of this code is already configurable with
> > > > CONFIG_IDLE_PAGE_TRACKING. If you disable it, then all of this code gets
> > > > disabled.
> > > > 
> > > > Or are you referring to something else that needs to be made configurable?
> > > 
> > > Yes - the 300+ lines of code which this patchset adds!
> > > 
> > > The impacted people will be those who use the existing
> > > idle-page-tracking feature but who will not use the new feature.  I
> > > guess we can assume this set is small...
> > 
> > Yes, I think this set should be small. The code size increase of page_idle.o
> > is from ~1KB to ~2KB. Most of the extra space is consumed by
> > page_idle_proc_generic() function which this patch adds. I don't think adding
> > another CONFIG option to disable this while keeping existing
> > CONFIG_IDLE_PAGE_TRACKING enabled, is worthwhile but I am open to the
> > addition of such an option if anyone feels strongly about it. I believe that
> > once this patch is merged, most like this new interface being added is what
> > will be used more than the old interface (for some of the usecases) so it
> > makes sense to keep it alive with CONFIG_IDLE_PAGE_TRACKING.
> 
> I would tend to agree with Joel here. The functionality falls into an
> existing IDLE_PAGE_TRACKING config option quite nicely. If there really
> are users who want to save some space and this is standing in the way
> then they can easily add a new config option with some justification so
> the savings are clear. Without that an additional config simply adds to
> the already existing configurability complexity and balkanization.

Michal, Andrew, Minchan,

Would you have any other review comments on the v5 series? This is just a new
interface that does not disrupt existing users of the older page-idle
tracking, so as such it is a safe change (as in, doesn't change existing
functionality except for the draining bug fix).

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH V38 00/29] security: Add support for locking down the kernel
From: Matthew Garrett @ 2019-08-12 17:06 UTC (permalink / raw)
  To: James Morris; +Cc: LSM List, Linux Kernel Mailing List, Linux API
In-Reply-To: <alpine.LRH.2.21.1908101608260.25186@namei.org>

On Fri, Aug 9, 2019 at 11:08 PM James Morris <jmorris@namei.org> wrote:
> Please verify and test, as I had to make a few minor fixups for my v5.2
> base.

Thanks James - there's a few small fixups required, would you like
those as separate patches or should I just send you a fixed copy of
the original patchset?

^ permalink raw reply

* Re: [PATCH V38 00/29] security: Add support for locking down the kernel
From: James Morris @ 2019-08-12 17:39 UTC (permalink / raw)
  To: Matthew Garrett; +Cc: LSM List, Linux Kernel Mailing List, Linux API
In-Reply-To: <CACdnJusx3N_ZoH4=+tqt85K9J5wmUnC-+bTtG_5qSD_TYu74+A@mail.gmail.com>

On Mon, 12 Aug 2019, Matthew Garrett wrote:

> On Fri, Aug 9, 2019 at 11:08 PM James Morris <jmorris@namei.org> wrote:
> > Please verify and test, as I had to make a few minor fixups for my v5.2
> > base.
> 
> Thanks James - there's a few small fixups required, would you like
> those as separate patches or should I just send you a fixed copy of
> the original patchset?

Given there are a few, an updated copy of the patchset will be best.

-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Jann Horn @ 2019-08-12 18:14 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: kernel list, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen,
	Daniel Colascione, fmayer, H. Peter Anvin, Ingo Molnar,
	Joel Fernandes, Jonathan Corbet, Kees Cook, kernel-team,
	Linux API, linux-doc, linux-fsdevel, Linux-MM, Michal Hocko,
	Mike Rapoport
In-Reply-To: <20190807171559.182301-1-joel@joelfernandes.org>

On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
>
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
>
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.

What happens when you use this interface on shared pages, like memory
inherited from the zygote, library file mappings and so on? If two
profilers ran concurrently for two different processes that both map
the same libraries, would they end up messing up each other's data?

Can this be used to observe which library pages other processes are
accessing, even if you don't have access to those processes, as long
as you can map the same libraries? I realize that there are already a
bunch of ways to do that with side channels and such; but if you're
adding an interface that allows this by design, it seems to me like
something that should be gated behind some sort of privilege check.

If the heap profiler is only interested in anonymous, process-private
memory, that might be an easy way out? Limit (unprivileged) use of
this interface to pages that aren't shared with any other processes?

> +/* Helper to get the start and end frame given a pos and count */
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> +                               unsigned long *start, unsigned long *end)
> +{
> +       unsigned long max_frame;
> +
> +       /* If an mm is not given, assume we want physical frames */
> +       max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> +       if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> +               return -EINVAL;
> +
> +       *start = pos * BITS_PER_BYTE;
> +       if (*start >= max_frame)
> +               return -ENXIO;
> +
> +       *end = *start + count * BITS_PER_BYTE;
> +       if (*end > max_frame)
> +               *end = max_frame;
> +       return 0;
> +}

You could add some overflow checks for the multiplications. I haven't
seen any place where it actually matters, but it seems unclean; and in
particular, on a 32-bit architecture where the maximum user address is
very high (like with a 4G:4G split), it looks like this function might
theoretically return with `*start > *end`, which could be confusing to
callers.

[...]
>         for (; pfn < end_pfn; pfn++) {
>                 bit = pfn % BITMAP_CHUNK_BITS;
>                 if (!bit)
>                         *out = 0ULL;
> -               page = page_idle_get_page(pfn);
> -               if (page) {
> -                       if (page_is_idle(page)) {
> -                               /*
> -                                * The page might have been referenced via a
> -                                * pte, in which case it is not idle. Clear
> -                                * refs and recheck.
> -                                */
> -                               page_idle_clear_pte_refs(page);
> -                               if (page_is_idle(page))
> -                                       *out |= 1ULL << bit;
> -                       }
> +               page = page_idle_get_page_pfn(pfn);
> +               if (page && page_idle_pte_check(page)) {
> +                       *out |= 1ULL << bit;
>                         put_page(page);
>                 }

The `page && !page_idle_pte_check(page)` case looks like it's missing
a put_page(); you probably intended to write something like this?

    page = page_idle_get_page_pfn(pfn);
    if (page) {
        if (page_idle_pte_check(page))
            *out |= 1ULL << bit;
        put_page(page);
    }

[...]
> +/*  page_idle tracking for /proc/<pid>/page_idle */
> +
> +struct page_node {
> +       struct page *page;
> +       unsigned long addr;
> +       struct list_head list;
> +};
> +
> +struct page_idle_proc_priv {
> +       unsigned long start_addr;
> +       char *buffer;
> +       int write;
> +
> +       /* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
> +       struct page_node *page_nodes;
> +       int cur_page_node;
> +       struct list_head *idle_page_list;
> +};

A linked list is a weird data structure to use if the list elements
are just consecutive array elements.

> +/*
> + * Add page to list to be set as idle later.
> + */
> +static void pte_page_idle_proc_add(struct page *page,
> +                              unsigned long addr, struct mm_walk *walk)
> +{
> +       struct page *page_get = NULL;
> +       struct page_node *pn;
> +       int bit;
> +       unsigned long frames;
> +       struct page_idle_proc_priv *priv = walk->private;
> +       u64 *chunk = (u64 *)priv->buffer;
> +
> +       if (priv->write) {
> +               VM_BUG_ON(!page);
> +
> +               /* Find whether this page was asked to be marked */
> +               frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> +               bit = frames % BITMAP_CHUNK_BITS;
> +               chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> +               if (((*chunk >> bit) & 1) == 0)
> +                       return;

This means that BITMAP_CHUNK_SIZE is UAPI on big-endian systems,
right? My opinion is that it would be slightly nicer to design the
UAPI such that incrementing virtual addresses are mapped to
incrementing offsets in the buffer (iow, either use bytewise access or
use little-endian), but I'm not going to ask you to redesign the UAPI
this late.

[...]
> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> +                              size_t count, loff_t *pos, int write)
> +{
[...]
> +       down_read(&mm->mmap_sem);
[...]
> +
> +       if (!write && !walk_error)
> +               ret = copy_to_user(ubuff, buffer, count);
> +
> +       up_read(&mm->mmap_sem);

I'd move the up_read() above the copy_to_user(); copy_to_user() can
block, and there's no reason to hold the mmap_sem across
copy_to_user().

Sorry about only chiming in at v5 with all this.

^ permalink raw reply

* Re: [PATCH v8 05/20] fscrypt: rename fscrypt_master_key to fscrypt_direct_key
From: Theodore Y. Ts'o @ 2019-08-12 22:20 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-6-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:06AM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> In preparation for introducing a filesystem-level keyring which will
> contain fscrypt master keys, rename the existing 'struct
> fscrypt_master_key' to 'struct fscrypt_direct_key'.  This is the
> structure in the existing table of master keys that's maintained to
> deduplicate the crypto transforms for v1 DIRECT_KEY policies.
> 
> I've chosen to keep this table as-is rather than make it automagically
> add/remove the keys to/from the filesystem-level keyring, since that
> would add a lot of extra complexity to the filesystem-level keyring.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good.  You can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

						- Ted

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply

* Re: [PATCH V38 00/29] security: Add support for locking down the kernel
From: James Morris @ 2019-08-12 22:29 UTC (permalink / raw)
  To: Matthew Garrett; +Cc: LSM List, Linux Kernel Mailing List, Linux API
In-Reply-To: <alpine.LRH.2.21.1908130339130.14197@namei.org>

On Tue, 13 Aug 2019, James Morris wrote:

> On Mon, 12 Aug 2019, Matthew Garrett wrote:
> 
> > On Fri, Aug 9, 2019 at 11:08 PM James Morris <jmorris@namei.org> wrote:
> > > Please verify and test, as I had to make a few minor fixups for my v5.2
> > > base.
> > 
> > Thanks James - there's a few small fixups required, would you like
> > those as separate patches or should I just send you a fixed copy of
> > the original patchset?
> 
> Given there are a few, an updated copy of the patchset will be best.

Actually, that's a lot of patches to resend, just send updates against my 
next-lockdown branch.

-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply

* Re: [PATCH v8 06/20] fscrypt: refactor key setup code in preparation for v2 policies
From: Theodore Y. Ts'o @ 2019-08-12 22:38 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-7-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:07AM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Do some more refactoring of the key setup code, in preparation for
> introducing a filesystem-level keyring and v2 encryption policies:
> 
> - Now that ci_inode exists, don't pass around the inode unnecessarily.
> 
> - Define a function setup_file_encryption_key() which handles the crypto
>   key setup given an under-construction fscrypt_info.  Don't pass the
>   fscrypt_context, since everything is in the fscrypt_info.
>   [This will be extended for v2 policies and the fs-level keyring.]
> 
> - Define a function fscrypt_set_derived_key() which sets the per-file
>   key, without depending on anything specific to v1 policies.
>   [This will also be used for v2 policies.]
> 
> - Define a function fscrypt_setup_v1_file_key() which takes the raw
>   master key, thus separating finding the key from using it.
>   [This will also be used if the key is found in the fs-level keyring.]
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good, you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

						- Ted

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply

* Re: [PATCH v8 07/20] fscrypt: move v1 policy key setup to keysetup_v1.c
From: Theodore Y. Ts'o @ 2019-08-12 22:53 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-8-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:08AM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> In preparation for introducing v2 encryption policies which will find
> and derive encryption keys differently from the current v1 encryption
> policies, move the v1 policy-specific key setup code from keyinfo.c into
> keysetup_v1.c.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good, you can add

Reviewed-by: Theodore Ts'o <tytso@mit.edu>


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply

* Re: [PATCH v8 08/20] fscrypt: rename keyinfo.c to keysetup.c
From: Theodore Y. Ts'o @ 2019-08-12 22:53 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-9-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:09AM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Rename keyinfo.c to keysetup.c since this better describes what the file
> does (sets up the key), and it matches the new file keysetup_v1.c.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good, you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply

* Re: [PATCH v8 10/20] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Theodore Y. Ts'o @ 2019-08-13  0:06 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-11-ebiggers@kernel.org>

> +		/* Some inodes still reference this key; try to evict them. */
> +		if (try_to_lock_encrypted_files(sb, mk) != 0)
> +			status_flags |=
> +				FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY;
> +	}

try_to_lock_encrypted_files() can return other errors besides -EBUSY;
in particular sync_filesystem() can return other errors, such as -EIO
or -EFSCORUPTED.  In that case, I think we're better off returning the
relevant status code back to the user.  We will have already wiped the
master key, but this situation will only happen in exceptional
conditions (e.g., user has ejected the sdcard, etc.), so it's not
worth it to try to undo the master key wipe to try to restore things
to the pre-ioctl execution state.

So I think we should capture the return code from
try_to_lock_encrypted_files, and if it is EBUSY, we can set FILES_BUSY
flag and return success.  Otherwise, we should return the error.

If you agree, please fix that up and then feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

						- Ted

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply

* Re: [PATCH v8 14/20] fscrypt: allow unprivileged users to add/remove keys for v2 policies
From: Theodore Y. Ts'o @ 2019-08-13  0:14 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-15-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:15AM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Allow the FS_IOC_ADD_ENCRYPTION_KEY and FS_IOC_REMOVE_ENCRYPTION_KEY
> ioctls to be used by non-root users to add and remove encryption keys
> from the filesystem-level crypto keyrings, subject to limitations.
> 
> Motivation: while privileged fscrypt key management is sufficient for
> some users (e.g. Android and Chromium OS, where a privileged process
> manages all keys), the old API by design also allows non-root users to
> set up and use encrypted directories, and we don't want to regress on
> that.  Especially, we don't want to force users to continue using the
> old API, running into the visibility mismatch between files and keyrings
> and being unable to "lock" encrypted directories.
> 
> Intuitively, the ioctls have to be privileged since they manipulate
> filesystem-level state.  However, it's actually safe to make them
> unprivileged if we very carefully enforce some specific limitations.
> 
> First, each key must be identified by a cryptographic hash so that a
> user can't add the wrong key for another user's files.  For v2
> encryption policies, we use the key_identifier for this.  v1 policies
> don't have this, so managing keys for them remains privileged.
> 
> Second, each key a user adds is charged to their quota for the keyrings
> service.  Thus, a user can't exhaust memory by adding a huge number of
> keys.  By default each non-root user is allowed up to 200 keys; this can
> be changed using the existing sysctl 'kernel.keys.maxkeys'.
> 
> Third, if multiple users add the same key, we keep track of those users
> of the key (of which there remains a single copy), and won't really
> remove the key, i.e. "lock" the encrypted files, until all those users
> have removed it.  This prevents denial of service attacks that would be
> possible under simpler schemes, such allowing the first user who added a
> key to remove it -- since that could be a malicious user who has
> compromised the key.  Of course, encryption keys should be kept secret,
> but the idea is that using encryption should never be *less* secure than
> not using encryption, even if your key was compromised.
> 
> We tolerate that a user will be unable to really remove a key, i.e.
> unable to "lock" their encrypted files, if another user has added the
> same key.  But in a sense, this is actually a good thing because it will
> avoid providing a false notion of security where a key appears to have
> been removed when actually it's still in memory, available to any
> attacker who compromises the operating system kernel.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good.  I'd probably would have used either "mk_secret_sem" or
"mk->mk_secret_sem" in the comments, instead of "->mk_securet_sem",
but that's just a personal style preference.  Since you consistently
used the latter, I assume that's a deliberate choice, which is fine.

Feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH v8 15/20] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl
From: Theodore Y. Ts'o @ 2019-08-13  0:15 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-16-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:16AM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Add a root-only variant of the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl which
> removes all users' claims of the key, not just the current user's claim.
> I.e., it always removes the key itself, no matter how many users have
> added it.
> 
> This is useful for forcing a directory to be locked, without having to
> figure out which user ID(s) the key was added under.  This is planned to
> be used by a command like 'sudo fscrypt lock DIR --all-users' in the
> fscrypt userspace tool (http://github.com/google/fscrypt).
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good, thanks.   Feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply

* Re: [PATCH v8 13/20] fscrypt: v2 encryption policy support
From: Theodore Y. Ts'o @ 2019-08-13  0:39 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-14-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:14AM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Add a new fscrypt policy version, "v2".  It has the following changes
> from the original policy version, which we call "v1" (*):
> 
> - Master keys (the user-provided encryption keys) are only ever used as
>   input to HKDF-SHA512.  This is more flexible and less error-prone, and
>   it avoids the quirks and limitations of the AES-128-ECB based KDF.
>   Three classes of cryptographically isolated subkeys are defined:
> 
>     - Per-file keys, like used in v1 policies except for the new KDF.
> 
>     - Per-mode keys.  These implement the semantics of the DIRECT_KEY
>       flag, which for v1 policies made the master key be used directly.
>       These are also planned to be used for inline encryption when
>       support for it is added.
> 
>     - Key identifiers (see below).
> 
> - Each master key is identified by a 16-byte master_key_identifier,
>   which is derived from the key itself using HKDF-SHA512.  This prevents
>   users from associating the wrong key with an encrypted file or
>   directory.  This was easily possible with v1 policies, which
>   identified the key by an arbitrary 8-byte master_key_descriptor.
> 
> - The key must be provided in the filesystem-level keyring, not in a
>   process-subscribed keyring.
> 
> The following UAPI additions are made:
> 
> - The existing ioctl FS_IOC_SET_ENCRYPTION_POLICY can now be passed a
>   fscrypt_policy_v2 to set a v2 encryption policy.  It's disambiguated
>   from fscrypt_policy/fscrypt_policy_v1 by the version code prefix.
> 
> - A new ioctl FS_IOC_GET_ENCRYPTION_POLICY_EX is added.  It allows
>   getting the v1 or v2 encryption policy of an encrypted file or
>   directory.  The existing FS_IOC_GET_ENCRYPTION_POLICY ioctl could not
>   be used because it did not have a way for userspace to indicate which
>   policy structure is expected.  The new ioctl includes a size field, so
>   it is extensible to future fscrypt policy versions.
> 
> - The ioctls FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY,
>   and FS_IOC_GET_ENCRYPTION_KEY_STATUS now support managing keys for v2
>   encryption policies.  Such keys are kept logically separate from keys
>   for v1 encryption policies, and are identified by 'identifier' rather
>   than by 'descriptor'.  The 'identifier' need not be provided when
>   adding a key, since the kernel will calculate it anyway.
> 
> This patch temporarily keeps adding/removing v2 policy keys behind the
> same permission check done for adding/removing v1 policy keys:
> capable(CAP_SYS_ADMIN).  However, the next patch will carefully take
> advantage of the cryptographically secure master_key_identifier to allow
> non-root users to add/remove v2 policy keys, thus providing a full
> replacement for v1 policies.
> 
> (*) Actually, in the API fscrypt_policy::version is 0 while on-disk
>     fscrypt_context::format is 1.  But I believe it makes the most sense
>     to advance both to '2' to have them be in sync, and to consider the
>     numbering to start at 1 except for the API quirk.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good, feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH v8 20/20] fscrypt: document the new ioctls and policy version
From: Theodore Y. Ts'o @ 2019-08-13  0:49 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190805162521.90882-21-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:21AM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Update the fscrypt documentation file to catch up to all the latest
> changes, including the new ioctls to manage master encryption keys in
> the filesystem-level keyring and the support for v2 encryption policies.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good, thanks!  Feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH V37 27/29] tracefs: Restrict tracefs when the kernel is locked down
From: Marek Szyprowski @ 2019-08-13  6:10 UTC (permalink / raw)
  To: Matthew Garrett, jmorris
  Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
	Steven Rostedt, Krzysztof Kozlowski
In-Reply-To: <20190731221617.234725-28-matthewgarrett@google.com>

Hi

On 2019-08-01 00:16, Matthew Garrett wrote:
> Tracefs may release more information about the kernel than desirable, so
> restrict it when the kernel is locked down in confidentiality mode by
> preventing open().
>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

This patch causes the following regression on various Samsung Exynos SoC 
based boards (ARM 32bit):

[   15.364422] Unable to handle kernel NULL pointer dereference at 
virtual address 00000000
[   15.368775] pgd = a530ddbe
[   15.371447] [00000000] *pgd=bcd7c831
[   15.374993] Internal error: Oops: 80000007 [#1] PREEMPT SMP ARM
[   15.380890] Modules linked in:
[   15.383929] CPU: 0 PID: 1393 Comm: perf Not tainted 
5.2.0-00027-g757ff7244358-dirty #6459
[   15.392086] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[   15.398164] PC is at 0x0
[   15.400687] LR is at do_dentry_open+0x22c/0x3b0
[   15.405193] pc : [<00000000>]    lr : [<c02977c4>]    psr: 60000053
[   15.411442] sp : e7317dd8  ip : 00000000  fp : 00000000
[   15.416650] r10: c0187e6c  r9 : c041f8cc  r8 : e72123c8
[   15.421858] r7 : e7317ec0  r6 : e7d89630  r5 : 00000000  r4 : e72123c0
[   15.428368] r3 : 00000000  r2 : 5ba370f3  r1 : e72123c0  r0 : e7d89630
[   15.434880] Flags: nZCv  IRQs on  FIQs off  Mode SVC_32  ISA ARM  
Segment none
[   15.442083] Control: 10c5387d  Table: 6726404a  DAC: 00000051
[   15.447812] Process perf (pid: 1393, stack limit = 0x17621431)
[   15.453628] Stack: (0xe7317dd8 to 0xe7318000)
...
[   15.604842] [<c02977c4>] (do_dentry_open) from [<c02aafc8>] 
(path_openat+0x5a0/0x1004)
[   15.612735] [<c02aafc8>] (path_openat) from [<c02acce8>] 
(do_filp_open+0x6c/0xd8)
[   15.620200] [<c02acce8>] (do_filp_open) from [<c0298cc4>] 
(do_sys_open+0x130/0x1f4)
[   15.627839] [<c0298cc4>] (do_sys_open) from [<c0101000>] 
(ret_fast_syscall+0x0/0x28)
[   15.635560] Exception stack(0xe7317fa8 to 0xe7317ff0)
[   15.640596] 7fa0:                   0022dc0b 001deee0 ffffff9c 
beb6d764 00020000 00000000
[   15.648756] 7fc0: 0022dc0b 001deee0 0022dba8 00000142 001ba044 
00241d68 001a13d8 beb6e78c
[   15.656913] 7fe0: b6f7e000 beb6c6f8 9a27c600 b6f69504
[   15.661952] Code: bad PC value
[   15.665105] ---[ end trace 7e8b864582108f4a ]---

This is standard ARM 32bit kernel with 
arch/arm/configs/exynos_defconfig. It is enough to run "perf list" command.

> ---
>   fs/tracefs/inode.c           | 40 +++++++++++++++++++++++++++++++++++-
>   include/linux/security.h     |  1 +
>   security/lockdown/lockdown.c |  1 +
>   3 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index 1387bcd96a79..12a325fb4cbd 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -21,6 +21,7 @@
>   #include <linux/seq_file.h>
>   #include <linux/magic.h>
>   #include <linux/slab.h>
> +#include <linux/security.h>
>   
>   #define TRACEFS_DEFAULT_MODE	0700
>   
> @@ -28,6 +29,23 @@ static struct vfsmount *tracefs_mount;
>   static int tracefs_mount_count;
>   static bool tracefs_registered;
>   
> +static int default_open_file(struct inode *inode, struct file *filp)
> +{
> +	struct dentry *dentry = filp->f_path.dentry;
> +	struct file_operations *real_fops;
> +	int ret;
> +
> +	if (!dentry)
> +		return -EINVAL;
> +
> +	ret = security_locked_down(LOCKDOWN_TRACEFS);
> +	if (ret)
> +		return ret;
> +
> +	real_fops = dentry->d_fsdata;

real_fops are NULL in my test case.

> +	return real_fops->open(inode, filp);
> +}
> +
>   static ssize_t default_read_file(struct file *file, char __user *buf,
>   				 size_t count, loff_t *ppos)
>   {
> @@ -210,6 +228,12 @@ static int tracefs_apply_options(struct super_block *sb)
>   	return 0;
>   }
>   
> +static void tracefs_destroy_inode(struct inode *inode)
> +{
> +	if (S_ISREG(inode->i_mode))
> +		kfree(inode->i_fop);
> +}
> +
>   static int tracefs_reconfigure(struct fs_context *fc)
>   {
>   	struct super_block *sb = fc->root->d_sb;
> @@ -236,6 +260,7 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
>   
>   static const struct super_operations tracefs_super_operations = {
>   	.statfs		= simple_statfs,
> +	.destroy_inode  = tracefs_destroy_inode,
>   	.show_options	= tracefs_show_options,
>   };
>   
> @@ -372,6 +397,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
>   				   struct dentry *parent, void *data,
>   				   const struct file_operations *fops)
>   {
> +	struct file_operations *proxy_fops;
>   	struct dentry *dentry;
>   	struct inode *inode;
>   
> @@ -387,8 +413,20 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
>   	if (unlikely(!inode))
>   		return failed_creating(dentry);
>   
> +	proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
> +	if (unlikely(!proxy_fops)) {
> +		iput(inode);
> +		return failed_creating(dentry);
> +	}
> +
> +	if (!fops)
> +		fops = &tracefs_file_operations;
> +
> +	dentry->d_fsdata = (void *)fops;
> +	memcpy(proxy_fops, fops, sizeof(*proxy_fops));
> +	proxy_fops->open = default_open_file;
>   	inode->i_mode = mode;
> -	inode->i_fop = fops ? fops : &tracefs_file_operations;
> +	inode->i_fop = proxy_fops;
>   	inode->i_private = data;
>   	d_instantiate(dentry, inode);
>   	fsnotify_create(dentry->d_parent->d_inode, dentry);
> diff --git a/include/linux/security.h b/include/linux/security.h
> index d92323b44a3f..807dc0d24982 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -121,6 +121,7 @@ enum lockdown_reason {
>   	LOCKDOWN_KPROBES,
>   	LOCKDOWN_BPF_READ,
>   	LOCKDOWN_PERF,
> +	LOCKDOWN_TRACEFS,
>   	LOCKDOWN_CONFIDENTIALITY_MAX,
>   };
>   
> diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
> index 88064ce1c844..173191562047 100644
> --- a/security/lockdown/lockdown.c
> +++ b/security/lockdown/lockdown.c
> @@ -36,6 +36,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
>   	[LOCKDOWN_KPROBES] = "use of kprobes",
>   	[LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM",
>   	[LOCKDOWN_PERF] = "unsafe use of perf",
> +	[LOCKDOWN_TRACEFS] = "use of tracefs",
>   	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
>   };
>   

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply

* Re: [PATCH V37 27/29] tracefs: Restrict tracefs when the kernel is locked down
From: Marek Szyprowski @ 2019-08-13  7:21 UTC (permalink / raw)
  To: Matthew Garrett, jmorris
  Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
	Steven Rostedt, Krzysztof Kozlowski
In-Reply-To: <a3caa6d3-e3f9-ae41-d87e-253d9dc53d81@samsung.com>

Hi again,

On 2019-08-13 08:10, Marek Szyprowski wrote:
> Hi
>
> On 2019-08-01 00:16, Matthew Garrett wrote:
>> Tracefs may release more information about the kernel than desirable, so
>> restrict it when the kernel is locked down in confidentiality mode by
>> preventing open().
>>
>> Signed-off-by: Matthew Garrett <mjg59@google.com>
>> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
>
> This patch causes the following regression on various Samsung Exynos 
> SoC based boards (ARM 32bit):
>
> [   15.364422] Unable to handle kernel NULL pointer dereference at 
> virtual address 00000000
> [   15.368775] pgd = a530ddbe
> [   15.371447] [00000000] *pgd=bcd7c831
> [   15.374993] Internal error: Oops: 80000007 [#1] PREEMPT SMP ARM
> [   15.380890] Modules linked in:
> [   15.383929] CPU: 0 PID: 1393 Comm: perf Not tainted 
> 5.2.0-00027-g757ff7244358-dirty #6459
> [   15.392086] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> [   15.398164] PC is at 0x0
> [   15.400687] LR is at do_dentry_open+0x22c/0x3b0
> [   15.405193] pc : [<00000000>]    lr : [<c02977c4>]    psr: 60000053
> [   15.411442] sp : e7317dd8  ip : 00000000  fp : 00000000
> [   15.416650] r10: c0187e6c  r9 : c041f8cc  r8 : e72123c8
> [   15.421858] r7 : e7317ec0  r6 : e7d89630  r5 : 00000000  r4 : e72123c0
> [   15.428368] r3 : 00000000  r2 : 5ba370f3  r1 : e72123c0  r0 : e7d89630
> [   15.434880] Flags: nZCv  IRQs on  FIQs off  Mode SVC_32  ISA ARM  
> Segment none
> [   15.442083] Control: 10c5387d  Table: 6726404a  DAC: 00000051
> [   15.447812] Process perf (pid: 1393, stack limit = 0x17621431)
> [   15.453628] Stack: (0xe7317dd8 to 0xe7318000)
> ...
> [   15.604842] [<c02977c4>] (do_dentry_open) from [<c02aafc8>] 
> (path_openat+0x5a0/0x1004)
> [   15.612735] [<c02aafc8>] (path_openat) from [<c02acce8>] 
> (do_filp_open+0x6c/0xd8)
> [   15.620200] [<c02acce8>] (do_filp_open) from [<c0298cc4>] 
> (do_sys_open+0x130/0x1f4)
> [   15.627839] [<c0298cc4>] (do_sys_open) from [<c0101000>] 
> (ret_fast_syscall+0x0/0x28)
> [   15.635560] Exception stack(0xe7317fa8 to 0xe7317ff0)
> [   15.640596] 7fa0:                   0022dc0b 001deee0 ffffff9c 
> beb6d764 00020000 00000000
> [   15.648756] 7fc0: 0022dc0b 001deee0 0022dba8 00000142 001ba044 
> 00241d68 001a13d8 beb6e78c
> [   15.656913] 7fe0: b6f7e000 beb6c6f8 9a27c600 b6f69504
> [   15.661952] Code: bad PC value
> [   15.665105] ---[ end trace 7e8b864582108f4a ]---
>
> This is standard ARM 32bit kernel with 
> arch/arm/configs/exynos_defconfig. It is enough to run "perf list" 
> command.
>
>> ---
>>   fs/tracefs/inode.c           | 40 +++++++++++++++++++++++++++++++++++-
>>   include/linux/security.h     |  1 +
>>   security/lockdown/lockdown.c |  1 +
>>   3 files changed, 41 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
>> index 1387bcd96a79..12a325fb4cbd 100644
>> --- a/fs/tracefs/inode.c
>> +++ b/fs/tracefs/inode.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/seq_file.h>
>>   #include <linux/magic.h>
>>   #include <linux/slab.h>
>> +#include <linux/security.h>
>>     #define TRACEFS_DEFAULT_MODE    0700
>>   @@ -28,6 +29,23 @@ static struct vfsmount *tracefs_mount;
>>   static int tracefs_mount_count;
>>   static bool tracefs_registered;
>>   +static int default_open_file(struct inode *inode, struct file *filp)
>> +{
>> +    struct dentry *dentry = filp->f_path.dentry;
>> +    struct file_operations *real_fops;
>> +    int ret;
>> +
>> +    if (!dentry)
>> +        return -EINVAL;
>> +
>> +    ret = security_locked_down(LOCKDOWN_TRACEFS);
>> +    if (ret)
>> +        return ret;
>> +
>> +    real_fops = dentry->d_fsdata;
>
> real_fops are NULL in my test case.

Too much of a hurry. real_fops are okay in that test case...

>
>> +    return real_fops->open(inode, filp);

... the issue is caused by NULL ->open() callback. Switching the above 
line to:

return real_fops->open ? real_fops->open(inode, filp) : 0;

fixes the issue.

>> +}
>> +
>>   static ssize_t default_read_file(struct file *file, char __user *buf,
>>                    size_t count, loff_t *ppos)
>>   {
>> @@ -210,6 +228,12 @@ static int tracefs_apply_options(struct 
>> super_block *sb)
>>       return 0;
>>   }
>>   +static void tracefs_destroy_inode(struct inode *inode)
>> +{
>> +    if (S_ISREG(inode->i_mode))
>> +        kfree(inode->i_fop);
>> +}
>> +
>>   static int tracefs_reconfigure(struct fs_context *fc)
>>   {
>>       struct super_block *sb = fc->root->d_sb;
>> @@ -236,6 +260,7 @@ static int tracefs_show_options(struct seq_file 
>> *m, struct dentry *root)
>>     static const struct super_operations tracefs_super_operations = {
>>       .statfs        = simple_statfs,
>> +    .destroy_inode  = tracefs_destroy_inode,
>>       .show_options    = tracefs_show_options,
>>   };
>>   @@ -372,6 +397,7 @@ struct dentry *tracefs_create_file(const char 
>> *name, umode_t mode,
>>                      struct dentry *parent, void *data,
>>                      const struct file_operations *fops)
>>   {
>> +    struct file_operations *proxy_fops;
>>       struct dentry *dentry;
>>       struct inode *inode;
>>   @@ -387,8 +413,20 @@ struct dentry *tracefs_create_file(const char 
>> *name, umode_t mode,
>>       if (unlikely(!inode))
>>           return failed_creating(dentry);
>>   +    proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
>> +    if (unlikely(!proxy_fops)) {
>> +        iput(inode);
>> +        return failed_creating(dentry);
>> +    }
>> +
>> +    if (!fops)
>> +        fops = &tracefs_file_operations;
>> +
>> +    dentry->d_fsdata = (void *)fops;
>> +    memcpy(proxy_fops, fops, sizeof(*proxy_fops));
>> +    proxy_fops->open = default_open_file;
>>       inode->i_mode = mode;
>> -    inode->i_fop = fops ? fops : &tracefs_file_operations;
>> +    inode->i_fop = proxy_fops;
>>       inode->i_private = data;
>>       d_instantiate(dentry, inode);
>>       fsnotify_create(dentry->d_parent->d_inode, dentry);
>> diff --git a/include/linux/security.h b/include/linux/security.h
>> index d92323b44a3f..807dc0d24982 100644
>> --- a/include/linux/security.h
>> +++ b/include/linux/security.h
>> @@ -121,6 +121,7 @@ enum lockdown_reason {
>>       LOCKDOWN_KPROBES,
>>       LOCKDOWN_BPF_READ,
>>       LOCKDOWN_PERF,
>> +    LOCKDOWN_TRACEFS,
>>       LOCKDOWN_CONFIDENTIALITY_MAX,
>>   };
>>   diff --git a/security/lockdown/lockdown.c 
>> b/security/lockdown/lockdown.c
>> index 88064ce1c844..173191562047 100644
>> --- a/security/lockdown/lockdown.c
>> +++ b/security/lockdown/lockdown.c
>> @@ -36,6 +36,7 @@ static char 
>> *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
>>       [LOCKDOWN_KPROBES] = "use of kprobes",
>>       [LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM",
>>       [LOCKDOWN_PERF] = "unsafe use of perf",
>> +    [LOCKDOWN_TRACEFS] = "use of tracefs",
>>       [LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
>>   };
>
> Best regards

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Michal Hocko @ 2019-08-13  9:14 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Andrew Morton, linux-kernel, Alexey Dobriyan, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190812145620.GB224541@google.com>

On Mon 12-08-19 10:56:20, Joel Fernandes wrote:
> On Thu, Aug 08, 2019 at 10:00:44AM +0200, Michal Hocko wrote:
> > On Wed 07-08-19 17:31:05, Joel Fernandes wrote:
> > > On Wed, Aug 07, 2019 at 01:58:40PM -0700, Andrew Morton wrote:
> > > > On Wed, 7 Aug 2019 16:45:30 -0400 Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > 
> > > > > On Wed, Aug 07, 2019 at 01:04:02PM -0700, Andrew Morton wrote:
> > > > > > On Wed,  7 Aug 2019 13:15:54 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> > > > > > 
> > > > > > > In Android, we are using this for the heap profiler (heapprofd) which
> > > > > > > profiles and pin points code paths which allocates and leaves memory
> > > > > > > idle for long periods of time. This method solves the security issue
> > > > > > > with userspace learning the PFN, and while at it is also shown to yield
> > > > > > > better results than the pagemap lookup, the theory being that the window
> > > > > > > where the address space can change is reduced by eliminating the
> > > > > > > intermediate pagemap look up stage. In virtual address indexing, the
> > > > > > > process's mmap_sem is held for the duration of the access.
> > > > > > 
> > > > > > So is heapprofd a developer-only thing?  Is heapprofd included in
> > > > > > end-user android loads?  If not then, again, wouldn't it be better to
> > > > > > make the feature Kconfigurable so that Android developers can enable it
> > > > > > during development then disable it for production kernels?
> > > > > 
> > > > > Almost all of this code is already configurable with
> > > > > CONFIG_IDLE_PAGE_TRACKING. If you disable it, then all of this code gets
> > > > > disabled.
> > > > > 
> > > > > Or are you referring to something else that needs to be made configurable?
> > > > 
> > > > Yes - the 300+ lines of code which this patchset adds!
> > > > 
> > > > The impacted people will be those who use the existing
> > > > idle-page-tracking feature but who will not use the new feature.  I
> > > > guess we can assume this set is small...
> > > 
> > > Yes, I think this set should be small. The code size increase of page_idle.o
> > > is from ~1KB to ~2KB. Most of the extra space is consumed by
> > > page_idle_proc_generic() function which this patch adds. I don't think adding
> > > another CONFIG option to disable this while keeping existing
> > > CONFIG_IDLE_PAGE_TRACKING enabled, is worthwhile but I am open to the
> > > addition of such an option if anyone feels strongly about it. I believe that
> > > once this patch is merged, most like this new interface being added is what
> > > will be used more than the old interface (for some of the usecases) so it
> > > makes sense to keep it alive with CONFIG_IDLE_PAGE_TRACKING.
> > 
> > I would tend to agree with Joel here. The functionality falls into an
> > existing IDLE_PAGE_TRACKING config option quite nicely. If there really
> > are users who want to save some space and this is standing in the way
> > then they can easily add a new config option with some justification so
> > the savings are clear. Without that an additional config simply adds to
> > the already existing configurability complexity and balkanization.
> 
> Michal, Andrew, Minchan,
> 
> Would you have any other review comments on the v5 series? This is just a new
> interface that does not disrupt existing users of the older page-idle
> tracking, so as such it is a safe change (as in, doesn't change existing
> functionality except for the draining bug fix).

I hope to find some more time to finish the review but let me point out
that "it's new it is regression safe" is not really a great argument for
a new user visible API. If the API is flawed then this is likely going
to kick us later and will be hard to fix. I am still not convinced about
the swap part of the thing TBH.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Michal Hocko @ 2019-08-13 10:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Joel Fernandes (Google), kernel list, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, Daniel Colascione, fmayer, H. Peter Anvin,
	Ingo Molnar, Joel Fernandes, Jonathan Corbet, Kees Cook,
	kernel-team, Linux API, linux-doc, linux-fsdevel, Linux-MM, Mike
In-Reply-To: <CAG48ez0ysprvRiENhBkLeV9YPTN_MB18rbu2HDa2jsWo5FYR8g@mail.gmail.com>

On Mon 12-08-19 20:14:38, Jann Horn wrote:
> On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > Looking up PFN from pagemap in Android devices is not supported by
> > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> >
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> > the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> > looking up PFN through pagemap is not needed since the interface uses
> > virtual frame numbers, and at the same time also does not require
> > SYS_ADMIN.
> >
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time. This method solves the security issue
> > with userspace learning the PFN, and while at it is also shown to yield
> > better results than the pagemap lookup, the theory being that the window
> > where the address space can change is reduced by eliminating the
> > intermediate pagemap look up stage. In virtual address indexing, the
> > process's mmap_sem is held for the duration of the access.
> 
> What happens when you use this interface on shared pages, like memory
> inherited from the zygote, library file mappings and so on? If two
> profilers ran concurrently for two different processes that both map
> the same libraries, would they end up messing up each other's data?

Yup PageIdle state is shared. That is the page_idle semantic even now
IIRC.

> Can this be used to observe which library pages other processes are
> accessing, even if you don't have access to those processes, as long
> as you can map the same libraries? I realize that there are already a
> bunch of ways to do that with side channels and such; but if you're
> adding an interface that allows this by design, it seems to me like
> something that should be gated behind some sort of privilege check.

Hmm, you need to be priviledged to get the pfn now and without that you
cannot get to any page so the new interface is weakening the rules.
Maybe we should limit setting the idle state to processes with the write
status. Or do you think that even observing idle status is useful for
practical side channel attacks? If yes, is that a problem of the
profiler which does potentially dangerous things?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Joel Fernandes @ 2019-08-13 13:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, Alexey Dobriyan, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190813091430.GE17933@dhcp22.suse.cz>

On Tue, Aug 13, 2019 at 11:14:30AM +0200, Michal Hocko wrote:
> On Mon 12-08-19 10:56:20, Joel Fernandes wrote:
> > On Thu, Aug 08, 2019 at 10:00:44AM +0200, Michal Hocko wrote:
> > > On Wed 07-08-19 17:31:05, Joel Fernandes wrote:
> > > > On Wed, Aug 07, 2019 at 01:58:40PM -0700, Andrew Morton wrote:
> > > > > On Wed, 7 Aug 2019 16:45:30 -0400 Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > 
> > > > > > On Wed, Aug 07, 2019 at 01:04:02PM -0700, Andrew Morton wrote:
> > > > > > > On Wed,  7 Aug 2019 13:15:54 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> > > > > > > 
> > > > > > > > In Android, we are using this for the heap profiler (heapprofd) which
> > > > > > > > profiles and pin points code paths which allocates and leaves memory
> > > > > > > > idle for long periods of time. This method solves the security issue
> > > > > > > > with userspace learning the PFN, and while at it is also shown to yield
> > > > > > > > better results than the pagemap lookup, the theory being that the window
> > > > > > > > where the address space can change is reduced by eliminating the
> > > > > > > > intermediate pagemap look up stage. In virtual address indexing, the
> > > > > > > > process's mmap_sem is held for the duration of the access.
> > > > > > > 
> > > > > > > So is heapprofd a developer-only thing?  Is heapprofd included in
> > > > > > > end-user android loads?  If not then, again, wouldn't it be better to
> > > > > > > make the feature Kconfigurable so that Android developers can enable it
> > > > > > > during development then disable it for production kernels?
> > > > > > 
> > > > > > Almost all of this code is already configurable with
> > > > > > CONFIG_IDLE_PAGE_TRACKING. If you disable it, then all of this code gets
> > > > > > disabled.
> > > > > > 
> > > > > > Or are you referring to something else that needs to be made configurable?
> > > > > 
> > > > > Yes - the 300+ lines of code which this patchset adds!
> > > > > 
> > > > > The impacted people will be those who use the existing
> > > > > idle-page-tracking feature but who will not use the new feature.  I
> > > > > guess we can assume this set is small...
> > > > 
> > > > Yes, I think this set should be small. The code size increase of page_idle.o
> > > > is from ~1KB to ~2KB. Most of the extra space is consumed by
> > > > page_idle_proc_generic() function which this patch adds. I don't think adding
> > > > another CONFIG option to disable this while keeping existing
> > > > CONFIG_IDLE_PAGE_TRACKING enabled, is worthwhile but I am open to the
> > > > addition of such an option if anyone feels strongly about it. I believe that
> > > > once this patch is merged, most like this new interface being added is what
> > > > will be used more than the old interface (for some of the usecases) so it
> > > > makes sense to keep it alive with CONFIG_IDLE_PAGE_TRACKING.
> > > 
> > > I would tend to agree with Joel here. The functionality falls into an
> > > existing IDLE_PAGE_TRACKING config option quite nicely. If there really
> > > are users who want to save some space and this is standing in the way
> > > then they can easily add a new config option with some justification so
> > > the savings are clear. Without that an additional config simply adds to
> > > the already existing configurability complexity and balkanization.
> > 
> > Michal, Andrew, Minchan,
> > 
> > Would you have any other review comments on the v5 series? This is just a new
> > interface that does not disrupt existing users of the older page-idle
> > tracking, so as such it is a safe change (as in, doesn't change existing
> > functionality except for the draining bug fix).
> 
> I hope to find some more time to finish the review but let me point out
> that "it's new it is regression safe" is not really a great argument for
> a new user visible API.

Actually, I think you misunderstood me and took it out of context. I never
intended to say "it is regression safe". I meant to say it is "low risk", as
in that in all likelihood should not be hurting *existing users* of the *old
interface*. Also as you know, it has been tested.

> If the API is flawed then this is likely going
> to kick us later and will be hard to fix. I am still not convinced about
> the swap part of the thing TBH.

Ok, then let us discuss it. As I mentioned before, without this we lose the
access information due to MADVISE or swapping. Minchan and Konstantin both
suggested it that's why I also added it (other than me also realizing that it
is neeed). For x86, it uses existing bits in pte so it is not adding any more
bits. For arm64, it uses unused bits that the hardware cannot use. So I
don't see why this is an issue to you.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Michal Hocko @ 2019-08-13 14:14 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Andrew Morton, linux-kernel, Alexey Dobriyan, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190813135152.GC258732@google.com>

On Tue 13-08-19 09:51:52, Joel Fernandes wrote:
> On Tue, Aug 13, 2019 at 11:14:30AM +0200, Michal Hocko wrote:
> > On Mon 12-08-19 10:56:20, Joel Fernandes wrote:
> > > On Thu, Aug 08, 2019 at 10:00:44AM +0200, Michal Hocko wrote:
> > > > On Wed 07-08-19 17:31:05, Joel Fernandes wrote:
> > > > > On Wed, Aug 07, 2019 at 01:58:40PM -0700, Andrew Morton wrote:
> > > > > > On Wed, 7 Aug 2019 16:45:30 -0400 Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > 
> > > > > > > On Wed, Aug 07, 2019 at 01:04:02PM -0700, Andrew Morton wrote:
> > > > > > > > On Wed,  7 Aug 2019 13:15:54 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> > > > > > > > 
> > > > > > > > > In Android, we are using this for the heap profiler (heapprofd) which
> > > > > > > > > profiles and pin points code paths which allocates and leaves memory
> > > > > > > > > idle for long periods of time. This method solves the security issue
> > > > > > > > > with userspace learning the PFN, and while at it is also shown to yield
> > > > > > > > > better results than the pagemap lookup, the theory being that the window
> > > > > > > > > where the address space can change is reduced by eliminating the
> > > > > > > > > intermediate pagemap look up stage. In virtual address indexing, the
> > > > > > > > > process's mmap_sem is held for the duration of the access.
> > > > > > > > 
> > > > > > > > So is heapprofd a developer-only thing?  Is heapprofd included in
> > > > > > > > end-user android loads?  If not then, again, wouldn't it be better to
> > > > > > > > make the feature Kconfigurable so that Android developers can enable it
> > > > > > > > during development then disable it for production kernels?
> > > > > > > 
> > > > > > > Almost all of this code is already configurable with
> > > > > > > CONFIG_IDLE_PAGE_TRACKING. If you disable it, then all of this code gets
> > > > > > > disabled.
> > > > > > > 
> > > > > > > Or are you referring to something else that needs to be made configurable?
> > > > > > 
> > > > > > Yes - the 300+ lines of code which this patchset adds!
> > > > > > 
> > > > > > The impacted people will be those who use the existing
> > > > > > idle-page-tracking feature but who will not use the new feature.  I
> > > > > > guess we can assume this set is small...
> > > > > 
> > > > > Yes, I think this set should be small. The code size increase of page_idle.o
> > > > > is from ~1KB to ~2KB. Most of the extra space is consumed by
> > > > > page_idle_proc_generic() function which this patch adds. I don't think adding
> > > > > another CONFIG option to disable this while keeping existing
> > > > > CONFIG_IDLE_PAGE_TRACKING enabled, is worthwhile but I am open to the
> > > > > addition of such an option if anyone feels strongly about it. I believe that
> > > > > once this patch is merged, most like this new interface being added is what
> > > > > will be used more than the old interface (for some of the usecases) so it
> > > > > makes sense to keep it alive with CONFIG_IDLE_PAGE_TRACKING.
> > > > 
> > > > I would tend to agree with Joel here. The functionality falls into an
> > > > existing IDLE_PAGE_TRACKING config option quite nicely. If there really
> > > > are users who want to save some space and this is standing in the way
> > > > then they can easily add a new config option with some justification so
> > > > the savings are clear. Without that an additional config simply adds to
> > > > the already existing configurability complexity and balkanization.
> > > 
> > > Michal, Andrew, Minchan,
> > > 
> > > Would you have any other review comments on the v5 series? This is just a new
> > > interface that does not disrupt existing users of the older page-idle
> > > tracking, so as such it is a safe change (as in, doesn't change existing
> > > functionality except for the draining bug fix).
> > 
> > I hope to find some more time to finish the review but let me point out
> > that "it's new it is regression safe" is not really a great argument for
> > a new user visible API.
> 
> Actually, I think you misunderstood me and took it out of context. I never
> intended to say "it is regression safe". I meant to say it is "low risk", as
> in that in all likelihood should not be hurting *existing users* of the *old
> interface*. Also as you know, it has been tested.

Yeah, misreading on my end.

> > If the API is flawed then this is likely going
> > to kick us later and will be hard to fix. I am still not convinced about
> > the swap part of the thing TBH.
> 
> Ok, then let us discuss it. As I mentioned before, without this we lose the
> access information due to MADVISE or swapping. Minchan and Konstantin both
> suggested it that's why I also added it (other than me also realizing that it
> is neeed).

I have described my concerns about the general idle bit behavior after
unmapping pointing to discrepancy with !anon pages. And I believe those
haven't been addressed yet. Besides that I am still not seeing any
description of the usecase that would suffer from the lack of the
functionality in changelogs.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Joel Fernandes @ 2019-08-13 14:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, kernel list, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	Daniel Colascione, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, Linux API, linux-doc,
	linux-fsdevel, Linux-MM, Mike Rapoport, Minchan Kim
In-Reply-To: <20190813100856.GF17933@dhcp22.suse.cz>

On Tue, Aug 13, 2019 at 12:08:56PM +0200, Michal Hocko wrote:
> On Mon 12-08-19 20:14:38, Jann Horn wrote:
> > On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > > The page_idle tracking feature currently requires looking up the pagemap
> > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > Looking up PFN from pagemap in Android devices is not supported by
> > > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > >
> > > This patch adds support to directly interact with page_idle tracking at
> > > the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> > > the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> > > looking up PFN through pagemap is not needed since the interface uses
> > > virtual frame numbers, and at the same time also does not require
> > > SYS_ADMIN.
> > >
> > > In Android, we are using this for the heap profiler (heapprofd) which
> > > profiles and pin points code paths which allocates and leaves memory
> > > idle for long periods of time. This method solves the security issue
> > > with userspace learning the PFN, and while at it is also shown to yield
> > > better results than the pagemap lookup, the theory being that the window
> > > where the address space can change is reduced by eliminating the
> > > intermediate pagemap look up stage. In virtual address indexing, the
> > > process's mmap_sem is held for the duration of the access.
> > 
> > What happens when you use this interface on shared pages, like memory
> > inherited from the zygote, library file mappings and so on? If two
> > profilers ran concurrently for two different processes that both map
> > the same libraries, would they end up messing up each other's data?
> 
> Yup PageIdle state is shared. That is the page_idle semantic even now
> IIRC.

Yes, that's right. This patch doesn't change that semantic. Idle page
tracking at the core is a global procedure which is based on pages that can
be shared.

One of the usecases of the heap profiler is to enable profiling of pages that
are shared between zygote and any processes that are forked. In this case,
I am told by our team working on the heap profiler, that the monitoring of
shared pages will help.

> > Can this be used to observe which library pages other processes are
> > accessing, even if you don't have access to those processes, as long
> > as you can map the same libraries? I realize that there are already a
> > bunch of ways to do that with side channels and such; but if you're
> > adding an interface that allows this by design, it seems to me like
> > something that should be gated behind some sort of privilege check.
> 
> Hmm, you need to be priviledged to get the pfn now and without that you
> cannot get to any page so the new interface is weakening the rules.
> Maybe we should limit setting the idle state to processes with the write
> status. Or do you think that even observing idle status is useful for
> practical side channel attacks? If yes, is that a problem of the
> profiler which does potentially dangerous things?

The heap profiler is currently unprivileged. Would it help the concern Jann
raised, if the new interface was limited to only anonymous private/shared
pages and not to file pages? Or, is this even a real concern?

thanks,

 - Joel

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox