Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [WIP 4/4] bpf: Allow creating all program types without privilege
From: Andy Lutomirski @ 2019-08-05 21:29 UTC (permalink / raw)
  To: LKML, Alexei Starovoitov
  Cc: Song Liu, Kees Cook, Networking, bpf, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Andy Lutomirski
In-Reply-To: <cover.1565040372.git.luto@kernel.org>

This doesn't let you *run* the programs except in test mode, so it should
be safe.  Famous last words.

This assumes that the check-privilege-to-call-privileged-functions
patch actually catches all the cases and that there's nothing else
that should need privilege lurking in the type-specific verifiers.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/bpf/syscall.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 23f8f89d2a86..730afa2be786 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1649,8 +1649,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	    attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
 		return -E2BIG;
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
-	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !capable(CAP_SYS_ADMIN))
+	    type != BPF_PROG_TYPE_CGROUP_SKB)
 		return -EPERM;
 
 	bpf_prog_load_fixup_attach_type(attr);
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-05 22:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <CALCETrVtPs8gY-H4gmzSqPboid3CB++n50SvYd6RU9YVde_-Ow@mail.gmail.com>



> On Aug 5, 2019, at 2:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Aug 5, 2019 at 12:21 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:


> 
>> What we need is to drop privileges sooner in daemons like systemd.
> 
> This is doable right now: systemd could fork off a subprocess and
> delegate its cgroup operations to it.  It would be maybe a couple
> hundred lines of code.  As an added benefit, that subprocess could
> verify that the bpf operations in question are reasonable.
> Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that
> capability and flip it on and off as needed.

I tried to look at the code and I couldn’t find it. Does systemd drop privileges at all?  Can you point me at the code you’re thinking of

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-06  1:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <CALCETrVtPs8gY-H4gmzSqPboid3CB++n50SvYd6RU9YVde_-Ow@mail.gmail.com>

On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> It tries to make the kernel respect the access modes for fds.  Without
> this patch, there seem to be some holes: nothing looked at program fds
> and, unless I missed something, you could take a readonly fd for a
> program, pin the program, and reopen it RW.

I think it's by design. iirc Daniel had a use case for something like this.

The key to understand is that bpf is not about security. It's about safety.
All features are geared to safety.
The users are trusted most of the time.
The number of unprivileged bpf use cases is tiny compared to trusted.
Hence unprivileged bpf is actually something that can be deprecated.

> Other than the issue that this patch partially fixes, can you see any
> reason that loading a program should require privilege?  Obviously the
> verifier is weakened a bit when called by privileged users, but a lot
> of that is about excessive resource usage and various less-well-tested
> features.  It seems to me that most of the value of bpf() should be
> available to programs that should not need privilege to load.  Are
> there things I'm missing?

see below.

> LPM: I don't see why this requires privilege at all.  It indeed checks
> capable(CAP_SYS_ADMIN), but I don't see why.

see below.

> 
> >
> > Attaching to a cgroup already has file based permission checks.
> > The user needs to open cgroup directory to attach.
> > acls on cgroup dir can already be used to prevent attaching to
> > certain parts of cgroup hierarchy.
> 
> The current checks seem inadequate.
> 
> $ echo 'yay' </sys/fs/cgroup/systemd/system.slice/
> 
> The ability to obtain an fd to a cgroup does *not* imply any right to
> modify that cgroup.  The ability to write to a cgroup directory
> already means something else -- it's the ability to create cgroups
> under the group in question.  I'm suggesting that a new API be added
> that allows attaching a bpf program to a cgroup without capabilities
> and that instead requires write access to a new file in the cgroup
> directory.  (It could be a single file for all bpf types or one file
> per type.  I prefer the latter -- it gives the admin finer-grained
> control.)

This is something to discuss. I don't mind something like this,
but in general bpf is not for untrusted users.
Hence I don't want to overdesign.

> 
> > What we need is to drop privileges sooner in daemons like systemd.
> 
> This is doable right now: systemd could fork off a subprocess and
> delegate its cgroup operations to it.  It would be maybe a couple
> hundred lines of code.  As an added benefit, that subprocess could
> verify that the bpf operations in question are reasonable.
> Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that
> capability and flip it on and off as needed.

See https://github.com/systemd/systemd/blob/01234e1fe777602265ffd25ab6e73823c62b0efe/src/core/bpf-firewall.c#L671-L674
bpf based IP sandboxing doesn't work in 'systemd --user'.
That is just one of the problems that people complained about.

Note that systemd bpf usage is basic. There is ongoing work to
adopt libbpf in systemd, so more features and more use cases will open up.

Inside containers and inside nested containers we need to start processes
that will use bpf. All of the processes are trusted.
We need to drop root not to be secure, but to be safe.
Consider a bug in a code that accidently did sys_kill(-1).
Dropping root is a mitigation for bugs like this.

> 
> > Container management daemon runs in the nested containers.
> > These trusted daemons need to have access to full bpf, but they
> > don't want to be root all the time.
> > They cannot flip back and forth via seteuid to root every time they
> > need to do bpf.
> > Hence the idea is to have a file that this daemon can open,
> > then drop privileges and still keep doing bpf things because FD is held.
> > Outer container daemon can pass this /dev/bpf's FD to inner daemon, etc.
> > This /dev/bpf would be accessible to root only.
> > There is no desire to open it up to non-root.
> 
> This seems extremely dangerous right now.  A program that can bypass
> *all* of the capable() checks in bpf() can do a whole lot.  Among
> other things, it can read all of kernel memory. 

It's 'dangerous' only if you think about it from security point of view.
The tracing (and sometimes networking) bpf progs need to read all of kernel
memory without being root.
That is the whole point of the /dev/bpf.

> This seems to have most of the same problems.  My main point is that
> it conflates a whole lot of different permissions, and I really don't
> think it's that much work to mostly disentangle the permissions in
> question.  My little series (if completed) plus a patch to allow
> unprivileged cgroup attach operations if you have an FMODE_WRITE fd to
> an appropriate file should get most of the way there.

I think I understand your concern. One /dev/bpf magic to by-pass
all of capable() checks in bpf doesn't look nice indeed.

Now to answer your question about capable(sys_admin) in the verifier.

See this bugfix:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=7c2e988f400e83501e0a3568250780609b7c8263

It's a bug in the JIT that was there pretty much forever,
but it got exposed due to two new bpf features.
Initially syzbot complained that bounded loops allow jmp to 1st insn.
syzbot reproducer contained 'never taken' branch, so buggy JIT
wouldn't have caused a kernel panic for this reproducer though
JITed x86 code is broken.
It took me few days to realize that with bpf2bpf calls _and_
bounded loops I can craft a test that will expose this JIT bug
and will crash the kernel.
So a combination of two recent verifier features exposed old JIT bug
that would have been a security issue if we didn't gate these verifier
features for root only.
Now it's simply a kernel bugfix.

Such subtle bugs happen all the time when new verifier features are
introduced. Hence we always start them with root only.
bpf2bpf calls, bounded loops, precision tracking, dead code elimination,
LPM maps, many programs type are root only.
We don't want cve-s to be filed for every bug like this.

Even if we start relaxing features (like dropping root from LPM map)
at any given time there will be a lot of useful verifier features,
maps and program types that are root only.
For root == for trusted users only.
Unfortunately this approach creates adoption problem.
The trusted users don't want to be root to use bpf.
Hence this /dev/bpf.

To solve your concern of bypassing all capable checks...
How about we do /dev/bpf/full_verifier first?
It will replace capable() checks in the verifier only.

How to delegate cgroup attach permission is a follow up discussion.
Could be special files in cgroup dir as you proposed or something else.
Let's table that and focus on the verifier first.

^ permalink raw reply

* Re: [PATCH] mm/mempolicy.c: Remove unnecessary nodemask check in kernel_migrate_pages()
From: Vlastimil Babka @ 2019-08-06  8:36 UTC (permalink / raw)
  To: Kefeng Wang, Andrew Morton, linux-kernel
  Cc: Andrea Arcangeli, Dan Williams, Michal Hocko, Oscar Salvador,
	linux-mm, Linux API, linux-man@vger.kernel.org
In-Reply-To: <20190806023634.55356-1-wangkefeng.wang@huawei.com>

On 8/6/19 4:36 AM, Kefeng Wang wrote:
> 1) task_nodes = cpuset_mems_allowed(current);
>    -> cpuset_mems_allowed() guaranteed to return some non-empty
>       subset of node_states[N_MEMORY].

Right, there's an explicit guarantee.

> 2) nodes_and(*new, *new, task_nodes);
>    -> after nodes_and(), the 'new' should be empty or appropriate
>       nodemask(online node and with memory).
> 
> After 1) and 2), we could remove unnecessary check whether the 'new'
> AND node_states[N_MEMORY] is empty.

Yeah looks like the check is there due to evolution of the code, where initially
it was added to prevent calling the syscall with bogus nodes, but now that's
achieved by cpuset_mems_allowed().

> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: linux-mm@kvack.org
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
> 
> [QUESTION]
> 
> SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
>                 const unsigned long __user *, old_nodes,
>                 const unsigned long __user *, new_nodes)
> {
>         return kernel_migrate_pages(pid, maxnode, old_nodes, new_nodes);
> }
> 
> The migrate_pages() takes pid argument, witch is the ID of the process
> whose pages are to be moved. should the cpuset_mems_allowed(current) be
> cpuset_mems_allowed(task)?

The check for cpuset_mems_allowed(task) is just above the code you change, so
the new nodes have to be subset of the target task's cpuset.
But they also have to be allowed by the calling task's cpuset. In manpage of
migrate_pages(2), this is hinted by the NOTES "Use get_mempolicy(2) with the
MPOL_F_MEMS_ALLOWED flag to obtain the set of nodes that are allowed by the
calling process's cpuset..."

But perhaps the manpage should be better clarified:

- the EINVAL case includes "Or, none of the node IDs specified by new_nodes are
on-line and allowed by the process's current cpuset context, or none of the
specified nodes contain memory." - this should probably say "calling process" to
disambiguate
- the EPERM case should mention that new_nodes have to be subset of the target
process' cpuset context. The caller should also have CAP_SYS_NICE and
ptrace_may_access()

>  mm/mempolicy.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f48693f75b37..fceb44066184 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1467,10 +1467,6 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>  	if (nodes_empty(*new))
>  		goto out_put;
>  
> -	nodes_and(*new, *new, node_states[N_MEMORY]);
> -	if (nodes_empty(*new))
> -		goto out_put;
> -
>  	err = security_task_movememory(task);
>  	if (err)
>  		goto out_put;
> 

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Michal Hocko @ 2019-08-06  8:42 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, minchan, namhyung,
	paulmck, Roman Gushchin, Stephen
In-Reply-To: <20190805170451.26009-3-joel@joelfernandes.org>

On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> This bit will be used by idle page tracking code to correctly identify
> if a page that was swapped out was idle before it got swapped out.
> Without this PTE bit, we lose information about if a page is idle or not
> since the page frame gets unmapped.

And why do we need that? Why cannot we simply assume all swapped out
pages to be idle? They were certainly idle enough to be reclaimed,
right? Or what does idle actualy mean here?

> In this patch we reuse PTE_DEVMAP bit since idle page tracking only
> works on user pages in the LRU. Device pages should not consitute those
> so it should be unused and safe to use.
> 
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  arch/arm64/Kconfig                    |  1 +
>  arch/arm64/include/asm/pgtable-prot.h |  1 +
>  arch/arm64/include/asm/pgtable.h      | 15 +++++++++++++++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 3adcec05b1f6..9d1412c693d7 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -128,6 +128,7 @@ config ARM64
>  	select HAVE_ARCH_MMAP_RND_BITS
>  	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
>  	select HAVE_ARCH_PREL32_RELOCATIONS
> +	select HAVE_ARCH_PTE_SWP_PGIDLE
>  	select HAVE_ARCH_SECCOMP_FILTER
>  	select HAVE_ARCH_STACKLEAK
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index 92d2e9f28f28..917b15c5d63a 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -18,6 +18,7 @@
>  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
>  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
>  #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> +#define PTE_SWP_PGIDLE		PTE_DEVMAP		 /* for idle page tracking during swapout */
>  
>  #ifndef __ASSEMBLY__
>  
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3f5461f7b560..558f5ebd81ba 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -212,6 +212,21 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>  	return set_pte_bit(pte, __pgprot(PTE_DEVMAP));
>  }
>  
> +static inline int pte_swp_page_idle(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static inline pte_t pte_swp_mkpage_idle(pte_t pte)
> +{
> +	return set_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
> +}
> +
> +static inline pte_t pte_swp_clear_page_idle(pte_t pte)
> +{
> +	return clear_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
> +}
> +
>  static inline void set_pte(pte_t *ptep, pte_t pte)
>  {
>  	WRITE_ONCE(*ptep, pte);
> -- 
> 2.22.0.770.g0f2c4a37fd-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
From: Michal Hocko @ 2019-08-06  8:43 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen
In-Reply-To: <20190805170451.26009-4-joel@joelfernandes.org>

On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> During idle tracking, we see that sometimes faulted anon pages are in
> pagevec but are not drained to LRU. Idle tracking considers pages only
> on LRU. Drain all CPU's LRU before starting idle tracking.

Please expand on why does this matter enough to introduce a potentially
expensinve draining which has to schedule a work on each CPU and wait
for them to finish.

> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  mm/page_idle.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index a5b00d63216c..2972367a599f 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -180,6 +180,8 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>  	unsigned long pfn, end_pfn;
>  	int bit, ret;
>  
> +	lru_add_drain_all();
> +
>  	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
>  	if (ret == -ENXIO)
>  		return 0;  /* Reads beyond max_pfn do nothing */
> @@ -211,6 +213,8 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
>  	unsigned long pfn, end_pfn;
>  	int bit, ret;
>  
> +	lru_add_drain_all();
> +
>  	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
>  	if (ret)
>  		return ret;
> @@ -428,6 +432,8 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
>  	walk.private = &priv;
>  	walk.mm = mm;
>  
> +	lru_add_drain_all();
> +
>  	down_read(&mm->mmap_sem);
>  
>  	/*
> -- 
> 2.22.0.770.g0f2c4a37fd-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Michal Hocko @ 2019-08-06  8:56 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen
In-Reply-To: <20190805170451.26009-1-joel@joelfernandes.org>

On Mon 05-08-19 13:04:47, Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.

As already mentioned in one of the previous versions. The interface
seems sane and the usecase as well. So I do not really have high level
objections.

>From a quick look at the patch I would just object to pulling swap idle
tracking into this patch because it makes the review harder and it is
essentially a dead code until a later patch. I am also not sure whether
that is really necessary and it really begs for an explicit
justification.

I will try to go through the patch more carefully later as time allows.

> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Joel Fernandes @ 2019-08-06 10:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Mike Rapoport, minchan, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806084203.GJ11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > This bit will be used by idle page tracking code to correctly identify
> > if a page that was swapped out was idle before it got swapped out.
> > Without this PTE bit, we lose information about if a page is idle or not
> > since the page frame gets unmapped.
> 
> And why do we need that? Why cannot we simply assume all swapped out
> pages to be idle? They were certainly idle enough to be reclaimed,
> right? Or what does idle actualy mean here?

Yes, but other than swapping, in Android a page can be forced to be swapped
out as well using the new hints that Minchan is adding?

Also, even if they were idle enough to be swapped, there is a chance that they
were marked as idle and *accessed* before the swapping. Due to swapping, the
"page was accessed since we last marked it as idle" information is lost. I am
able to verify this.

Idle in this context means the same thing as in page idle tracking terms, the
page was not accessed by userspace since we last marked it as idle (using
/proc/<pid>/page_idle).

thanks,

 - Joel


> > In this patch we reuse PTE_DEVMAP bit since idle page tracking only
> > works on user pages in the LRU. Device pages should not consitute those
> > so it should be unused and safe to use.
> > 
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  arch/arm64/Kconfig                    |  1 +
> >  arch/arm64/include/asm/pgtable-prot.h |  1 +
> >  arch/arm64/include/asm/pgtable.h      | 15 +++++++++++++++
> >  3 files changed, 17 insertions(+)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 3adcec05b1f6..9d1412c693d7 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -128,6 +128,7 @@ config ARM64
> >  	select HAVE_ARCH_MMAP_RND_BITS
> >  	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
> >  	select HAVE_ARCH_PREL32_RELOCATIONS
> > +	select HAVE_ARCH_PTE_SWP_PGIDLE
> >  	select HAVE_ARCH_SECCOMP_FILTER
> >  	select HAVE_ARCH_STACKLEAK
> >  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> > diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> > index 92d2e9f28f28..917b15c5d63a 100644
> > --- a/arch/arm64/include/asm/pgtable-prot.h
> > +++ b/arch/arm64/include/asm/pgtable-prot.h
> > @@ -18,6 +18,7 @@
> >  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
> >  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
> >  #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> > +#define PTE_SWP_PGIDLE		PTE_DEVMAP		 /* for idle page tracking during swapout */
> >  
> >  #ifndef __ASSEMBLY__
> >  
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 3f5461f7b560..558f5ebd81ba 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -212,6 +212,21 @@ static inline pte_t pte_mkdevmap(pte_t pte)
> >  	return set_pte_bit(pte, __pgprot(PTE_DEVMAP));
> >  }
> >  
> > +static inline int pte_swp_page_idle(pte_t pte)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline pte_t pte_swp_mkpage_idle(pte_t pte)
> > +{
> > +	return set_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
> > +}
> > +
> > +static inline pte_t pte_swp_clear_page_idle(pte_t pte)
> > +{
> > +	return clear_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
> > +}
> > +
> >  static inline void set_pte(pte_t *ptep, pte_t pte)
> >  {
> >  	WRITE_ONCE(*ptep, pte);
> > -- 
> > 2.22.0.770.g0f2c4a37fd-goog
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
From: Joel Fernandes @ 2019-08-06 10:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806084357.GK11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 10:43:57AM +0200, Michal Hocko wrote:
> On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> > During idle tracking, we see that sometimes faulted anon pages are in
> > pagevec but are not drained to LRU. Idle tracking considers pages only
> > on LRU. Drain all CPU's LRU before starting idle tracking.
> 
> Please expand on why does this matter enough to introduce a potentially
> expensinve draining which has to schedule a work on each CPU and wait
> for them to finish.

Sure, I can expand. I am able to find multiple issues involving this. One
issue looks like idle tracking is completely broken. It shows up in my
testing as if a page that is marked as idle is always "accessed" -- because
it was never marked as idle (due to not draining of pagevec).

The other issue shows up as a failure in my "swap test", with the following
sequence:
1. Allocate some pages
2. Write to them
3. Mark them as idle                                    <--- fails
4. Introduce some memory pressure to induce swapping.
5. Check the swap bit I introduced in this series.      <--- fails to set idle
                                                             bit in swap PTE.

Draining the pagevec in advance fixes both of these issues.

This operation even if expensive is only done once during the access of the
page_idle file. Did you have a better fix in mind?

thanks,

 - Joel


> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  mm/page_idle.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/mm/page_idle.c b/mm/page_idle.c
> > index a5b00d63216c..2972367a599f 100644
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -180,6 +180,8 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> >  	unsigned long pfn, end_pfn;
> >  	int bit, ret;
> >  
> > +	lru_add_drain_all();
> > +
> >  	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> >  	if (ret == -ENXIO)
> >  		return 0;  /* Reads beyond max_pfn do nothing */
> > @@ -211,6 +213,8 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
> >  	unsigned long pfn, end_pfn;
> >  	int bit, ret;
> >  
> > +	lru_add_drain_all();
> > +
> >  	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> >  	if (ret)
> >  		return ret;
> > @@ -428,6 +432,8 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> >  	walk.private = &priv;
> >  	walk.mm = mm;
> >  
> > +	lru_add_drain_all();
> > +
> >  	down_read(&mm->mmap_sem);
> >  
> >  	/*
> > -- 
> > 2.22.0.770.g0f2c4a37fd-goog
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Joel Fernandes @ 2019-08-06 10:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806085605.GL11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 10:56:05AM +0200, Michal Hocko wrote:
> On Mon 05-08-19 13:04:47, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > Looking up PFN from pagemap in Android devices is not supported by
> > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> > the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> > looking up PFN through pagemap is not needed since the interface uses
> > virtual frame numbers, and at the same time also does not require
> > SYS_ADMIN.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time. This method solves the security issue
> > with userspace learning the PFN, and while at it is also shown to yield
> > better results than the pagemap lookup, the theory being that the window
> > where the address space can change is reduced by eliminating the
> > intermediate pagemap look up stage. In virtual address indexing, the
> > process's mmap_sem is held for the duration of the access.
> 
> As already mentioned in one of the previous versions. The interface
> seems sane and the usecase as well. So I do not really have high level
> objections.

That is great to know.

> From a quick look at the patch I would just object to pulling swap idle
> tracking into this patch because it makes the review harder and it is
> essentially a dead code until a later patch. I am also not sure whether
> that is really necessary and it really begs for an explicit
> justification.

Ok I will split it out, and also expand on the need for it a bit more.

> 
> I will try to go through the patch more carefully later as time allows.

Thanks a lot.

> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> -- 
> Michal Hocko
> SUSE Labs

 - Joel

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Michal Hocko @ 2019-08-06 10:47 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Mike Rapoport, minchan, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806103627.GA218260@google.com>

On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > This bit will be used by idle page tracking code to correctly identify
> > > if a page that was swapped out was idle before it got swapped out.
> > > Without this PTE bit, we lose information about if a page is idle or not
> > > since the page frame gets unmapped.
> > 
> > And why do we need that? Why cannot we simply assume all swapped out
> > pages to be idle? They were certainly idle enough to be reclaimed,
> > right? Or what does idle actualy mean here?
> 
> Yes, but other than swapping, in Android a page can be forced to be swapped
> out as well using the new hints that Minchan is adding?

Yes and that is effectivelly making them idle, no?

> Also, even if they were idle enough to be swapped, there is a chance that they
> were marked as idle and *accessed* before the swapping. Due to swapping, the
> "page was accessed since we last marked it as idle" information is lost. I am
> able to verify this.
> 
> Idle in this context means the same thing as in page idle tracking terms, the
> page was not accessed by userspace since we last marked it as idle (using
> /proc/<pid>/page_idle).

Please describe a usecase and why that information might be useful.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
From: Michal Hocko @ 2019-08-06 10:51 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806104554.GB218260@google.com>

On Tue 06-08-19 06:45:54, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 10:43:57AM +0200, Michal Hocko wrote:
> > On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> > > During idle tracking, we see that sometimes faulted anon pages are in
> > > pagevec but are not drained to LRU. Idle tracking considers pages only
> > > on LRU. Drain all CPU's LRU before starting idle tracking.
> > 
> > Please expand on why does this matter enough to introduce a potentially
> > expensinve draining which has to schedule a work on each CPU and wait
> > for them to finish.
> 
> Sure, I can expand. I am able to find multiple issues involving this. One
> issue looks like idle tracking is completely broken. It shows up in my
> testing as if a page that is marked as idle is always "accessed" -- because
> it was never marked as idle (due to not draining of pagevec).
> 
> The other issue shows up as a failure in my "swap test", with the following
> sequence:
> 1. Allocate some pages
> 2. Write to them
> 3. Mark them as idle                                    <--- fails
> 4. Introduce some memory pressure to induce swapping.
> 5. Check the swap bit I introduced in this series.      <--- fails to set idle
>                                                              bit in swap PTE.
> 
> Draining the pagevec in advance fixes both of these issues.

This belongs to the changelog.

> This operation even if expensive is only done once during the access of the
> page_idle file. Did you have a better fix in mind?

Can we set the idle bit also for non-lru pages as long as they are
reachable via pte?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Minchan Kim @ 2019-08-06 11:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joel Fernandes, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen
In-Reply-To: <20190806104755.GR11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > This bit will be used by idle page tracking code to correctly identify
> > > > if a page that was swapped out was idle before it got swapped out.
> > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > since the page frame gets unmapped.
> > > 
> > > And why do we need that? Why cannot we simply assume all swapped out
> > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > right? Or what does idle actualy mean here?
> > 
> > Yes, but other than swapping, in Android a page can be forced to be swapped
> > out as well using the new hints that Minchan is adding?
> 
> Yes and that is effectivelly making them idle, no?

1. mark page-A idle which was present at that time.
2. run workload
3. page-A is touched several times
4. *sudden* memory pressure happen so finally page A is finally swapped out
5. now see the page A idle - but it's incorrect.

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Joel Fernandes @ 2019-08-06 11:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Mike Rapoport, minchan, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806104755.GR11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > This bit will be used by idle page tracking code to correctly identify
> > > > if a page that was swapped out was idle before it got swapped out.
> > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > since the page frame gets unmapped.
> > > 
> > > And why do we need that? Why cannot we simply assume all swapped out
> > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > right? Or what does idle actualy mean here?
> > 
> > Yes, but other than swapping, in Android a page can be forced to be swapped
> > out as well using the new hints that Minchan is adding?
> 
> Yes and that is effectivelly making them idle, no?

That depends on how you think of it. If you are thinking of a monitoring
process like a heap profiler, then from the heap profiler's (that only cares
about the process it is monitoring) perspective it will look extremely odd if
pages that are recently accessed by the process appear to be idle which would
falsely look like those processes are leaking memory. The reality being,
Android forced those pages into swap because of other reasons. I would like
for the swapping mechanism, whether forced swapping or memory reclaim, not to
interfere with the idle detection.

This is just an effort to make the idle tracking a little bit better. We
would like to not lose the 'accessed' information of the pages.

Initially, I had proposed what you are suggesting as well however the above
reasons made me to do it like this. Also Minchan and Konstantin suggested
this, so there are more people interested in the swap idle bit. Minchan, can
you provide more thoughts here? (He is on 2-week vacation from today so
hopefully replies before he vanishes ;-)).

Also assuming all swap pages as idle has other "semantic" issues. It is quite
odd if a swapped page is automatically marked as idle without userspace
telling it to. Consider the following set of events: 1. Userspace marks only
a certain memory region as idle. 2. Userspace reads back the bits
corresponding to a bigger region. Part of this bigger region is swapped.
Userspace expects all of the pages it did not mark, to have idle bit set to
'0' because it never marked them as idle. However if it is now surprised by
what it read back (not all '0' read back). Since a page is swapped, it will
be now marked "automatically" as idle as per your proposal, even if userspace
never marked it explicity before. This would be quite confusing/ambiguous.

I will include this and other information in future commit messages.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Michal Hocko @ 2019-08-06 11:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Joel Fernandes, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen
In-Reply-To: <20190806110737.GB32615@google.com>

On Tue 06-08-19 20:07:37, Minchan Kim wrote:
> On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > since the page frame gets unmapped.
> > > > 
> > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > right? Or what does idle actualy mean here?
> > > 
> > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > out as well using the new hints that Minchan is adding?
> > 
> > Yes and that is effectivelly making them idle, no?
> 
> 1. mark page-A idle which was present at that time.
> 2. run workload
> 3. page-A is touched several times
> 4. *sudden* memory pressure happen so finally page A is finally swapped out
> 5. now see the page A idle - but it's incorrect.

Could you expand on what you mean by idle exactly? Why pageout doesn't
really qualify as "mark-idle and reclaim"? Also could you describe a
usecase where the swapout distinction really matters and it would lead
to incorrect behavior?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
From: Joel Fernandes @ 2019-08-06 11:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806105149.GT11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 12:51:49PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 06:45:54, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 10:43:57AM +0200, Michal Hocko wrote:
> > > On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> > > > During idle tracking, we see that sometimes faulted anon pages are in
> > > > pagevec but are not drained to LRU. Idle tracking considers pages only
> > > > on LRU. Drain all CPU's LRU before starting idle tracking.
> > > 
> > > Please expand on why does this matter enough to introduce a potentially
> > > expensinve draining which has to schedule a work on each CPU and wait
> > > for them to finish.
> > 
> > Sure, I can expand. I am able to find multiple issues involving this. One
> > issue looks like idle tracking is completely broken. It shows up in my
> > testing as if a page that is marked as idle is always "accessed" -- because
> > it was never marked as idle (due to not draining of pagevec).
> > 
> > The other issue shows up as a failure in my "swap test", with the following
> > sequence:
> > 1. Allocate some pages
> > 2. Write to them
> > 3. Mark them as idle                                    <--- fails
> > 4. Introduce some memory pressure to induce swapping.
> > 5. Check the swap bit I introduced in this series.      <--- fails to set idle
> >                                                              bit in swap PTE.
> > 
> > Draining the pagevec in advance fixes both of these issues.
> 
> This belongs to the changelog.

Sure, will add.


> > This operation even if expensive is only done once during the access of the
> > page_idle file. Did you have a better fix in mind?
> 
> Can we set the idle bit also for non-lru pages as long as they are
> reachable via pte?

Not at the moment with the current page idle tracking code. PageLRU(page)
flag is checked in page_idle_get_page().

Even if we could set it for non-LRU, the idle bit (page flag) would not be
cleared if page is not on LRU because page-reclaim code (page_referenced() I
believe) would not clear it. This whole mechanism depends on page-reclaim. Or
did I miss your point?


thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Joel Fernandes @ 2019-08-06 11:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell <sfr@
In-Reply-To: <20190806111452.GW11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 01:14:52PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 20:07:37, Minchan Kim wrote:
> > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > since the page frame gets unmapped.
> > > > > 
> > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > right? Or what does idle actualy mean here?
> > > > 
> > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > out as well using the new hints that Minchan is adding?
> > > 
> > > Yes and that is effectivelly making them idle, no?
> > 
> > 1. mark page-A idle which was present at that time.
> > 2. run workload
> > 3. page-A is touched several times
> > 4. *sudden* memory pressure happen so finally page A is finally swapped out
> > 5. now see the page A idle - but it's incorrect.
> 
> Could you expand on what you mean by idle exactly? Why pageout doesn't
> really qualify as "mark-idle and reclaim"? Also could you describe a
> usecase where the swapout distinction really matters and it would lead
> to incorrect behavior?

Michal,
Did you read this post ? :
https://lore.kernel.org/lkml/20190806104715.GC218260@google.com/T/#m4ece68ceaf6e54d4d29e974f5f4c1080e733f6c1

Just wanted to be sure you did not miss it.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
From: Michal Hocko @ 2019-08-06 11:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806111921.GB117316@google.com>

On Tue 06-08-19 07:19:21, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 12:51:49PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 06:45:54, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 10:43:57AM +0200, Michal Hocko wrote:
> > > > On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> > > > > During idle tracking, we see that sometimes faulted anon pages are in
> > > > > pagevec but are not drained to LRU. Idle tracking considers pages only
> > > > > on LRU. Drain all CPU's LRU before starting idle tracking.
> > > > 
> > > > Please expand on why does this matter enough to introduce a potentially
> > > > expensinve draining which has to schedule a work on each CPU and wait
> > > > for them to finish.
> > > 
> > > Sure, I can expand. I am able to find multiple issues involving this. One
> > > issue looks like idle tracking is completely broken. It shows up in my
> > > testing as if a page that is marked as idle is always "accessed" -- because
> > > it was never marked as idle (due to not draining of pagevec).
> > > 
> > > The other issue shows up as a failure in my "swap test", with the following
> > > sequence:
> > > 1. Allocate some pages
> > > 2. Write to them
> > > 3. Mark them as idle                                    <--- fails
> > > 4. Introduce some memory pressure to induce swapping.
> > > 5. Check the swap bit I introduced in this series.      <--- fails to set idle
> > >                                                              bit in swap PTE.
> > > 
> > > Draining the pagevec in advance fixes both of these issues.
> > 
> > This belongs to the changelog.
> 
> Sure, will add.
> 
> 
> > > This operation even if expensive is only done once during the access of the
> > > page_idle file. Did you have a better fix in mind?
> > 
> > Can we set the idle bit also for non-lru pages as long as they are
> > reachable via pte?
> 
> Not at the moment with the current page idle tracking code. PageLRU(page)
> flag is checked in page_idle_get_page().

yes, I am aware of the current code. I strongly suspect that the PageLRU
check was there to not mark arbitrary page looked up by pfn with the
idle bit because that would be unexpected. But I might be easily wrong
here.

> Even if we could set it for non-LRU, the idle bit (page flag) would not be
> cleared if page is not on LRU because page-reclaim code (page_referenced() I
> believe) would not clear it.

Yes, it is either reclaim when checking references as you say but also
mark_page_accessed. I believe the later might still have the page on the
pcp LRU add cache. Maybe I am missing something something but it seems
that there is nothing fundamentally requiring the user mapped page to be
on the LRU list when seting the idle bit.

That being said, your big hammer approach will work more reliable but if
you do not feel like changing the underlying PageLRU assumption then
document that draining should be removed longterm.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Michal Hocko @ 2019-08-06 11:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Mike Rapoport, minchan, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806111446.GA117316@google.com>

On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > since the page frame gets unmapped.
> > > > 
> > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > right? Or what does idle actualy mean here?
> > > 
> > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > out as well using the new hints that Minchan is adding?
> > 
> > Yes and that is effectivelly making them idle, no?
> 
> That depends on how you think of it.

I would much prefer to have it documented so that I do not have to guess ;)

> If you are thinking of a monitoring
> process like a heap profiler, then from the heap profiler's (that only cares
> about the process it is monitoring) perspective it will look extremely odd if
> pages that are recently accessed by the process appear to be idle which would
> falsely look like those processes are leaking memory. The reality being,
> Android forced those pages into swap because of other reasons. I would like
> for the swapping mechanism, whether forced swapping or memory reclaim, not to
> interfere with the idle detection.

Hmm, but how are you going to handle situation when the page is unmapped
and refaulted again (e.g. a normal reclaim of a pagecache)? You are
losing that information same was as in the swapout case, no? Or am I
missing something?

> This is just an effort to make the idle tracking a little bit better. We
> would like to not lose the 'accessed' information of the pages.
> 
> Initially, I had proposed what you are suggesting as well however the above
> reasons made me to do it like this. Also Minchan and Konstantin suggested
> this, so there are more people interested in the swap idle bit. Minchan, can
> you provide more thoughts here? (He is on 2-week vacation from today so
> hopefully replies before he vanishes ;-)).

We can move on with the rest of the series in the mean time but I would
like to see a proper justification for the swap entries and why they
should be handled special.

> Also assuming all swap pages as idle has other "semantic" issues. It is quite
> odd if a swapped page is automatically marked as idle without userspace
> telling it to. Consider the following set of events: 1. Userspace marks only
> a certain memory region as idle. 2. Userspace reads back the bits
> corresponding to a bigger region. Part of this bigger region is swapped.
> Userspace expects all of the pages it did not mark, to have idle bit set to
> '0' because it never marked them as idle. However if it is now surprised by
> what it read back (not all '0' read back). Since a page is swapped, it will
> be now marked "automatically" as idle as per your proposal, even if userspace
> never marked it explicity before. This would be quite confusing/ambiguous.

OK, I see. I guess the primary question I have is how do you distinguish
Idle page which got unmapped and faulted in again from swapped out page
and refaulted - including the time the pte is not present.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Joel Fernandes @ 2019-08-06 13:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Mike Rapoport, minchan, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806115703.GY11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 01:57:03PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > since the page frame gets unmapped.
> > > > > 
> > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > right? Or what does idle actualy mean here?
> > > > 
> > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > out as well using the new hints that Minchan is adding?
> > > 
> > > Yes and that is effectivelly making them idle, no?
> > 
> > That depends on how you think of it.
> 
> I would much prefer to have it documented so that I do not have to guess ;)

Sure :)

> > If you are thinking of a monitoring
> > process like a heap profiler, then from the heap profiler's (that only cares
> > about the process it is monitoring) perspective it will look extremely odd if
> > pages that are recently accessed by the process appear to be idle which would
> > falsely look like those processes are leaking memory. The reality being,
> > Android forced those pages into swap because of other reasons. I would like
> > for the swapping mechanism, whether forced swapping or memory reclaim, not to
> > interfere with the idle detection.
> 
> Hmm, but how are you going to handle situation when the page is unmapped
> and refaulted again (e.g. a normal reclaim of a pagecache)? You are
> losing that information same was as in the swapout case, no? Or am I
> missing something?

Yes you are right, it would have the same issue, thanks for bringing it up.
Should we rename this bit to PTE_IDLE and do the same thing that we are doing
for swap?

i.e. if (page_idle(page)) and page is a file page, then we write state
into the PTE of the page. Later on refault, the PTE bit would automatically
get cleared (just like it does on swap-in). But before refault, the idle
tracking code sees the page as still marked idle. Do you see any issue with that?


> > This is just an effort to make the idle tracking a little bit better. We
> > would like to not lose the 'accessed' information of the pages.
> > 
> > Initially, I had proposed what you are suggesting as well however the above
> > reasons made me to do it like this. Also Minchan and Konstantin suggested
> > this, so there are more people interested in the swap idle bit. Minchan, can
> > you provide more thoughts here? (He is on 2-week vacation from today so
> > hopefully replies before he vanishes ;-)).
> 
> We can move on with the rest of the series in the mean time but I would
> like to see a proper justification for the swap entries and why they
> should be handled special.

Ok, I will improve the changelog.


> > Also assuming all swap pages as idle has other "semantic" issues. It is quite
> > odd if a swapped page is automatically marked as idle without userspace
> > telling it to. Consider the following set of events: 1. Userspace marks only
> > a certain memory region as idle. 2. Userspace reads back the bits
> > corresponding to a bigger region. Part of this bigger region is swapped.
> > Userspace expects all of the pages it did not mark, to have idle bit set to
> > '0' because it never marked them as idle. However if it is now surprised by
> > what it read back (not all '0' read back). Since a page is swapped, it will
> > be now marked "automatically" as idle as per your proposal, even if userspace
> > never marked it explicity before. This would be quite confusing/ambiguous.
> 
> OK, I see. I guess the primary question I have is how do you distinguish
> Idle page which got unmapped and faulted in again from swapped out page
> and refaulted - including the time the pte is not present.

Ok, lets discuss more.

thanks Michal!

 - Joel

^ permalink raw reply

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
From: Joel Fernandes @ 2019-08-06 13:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806114402.GX11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 01:44:02PM +0200, Michal Hocko wrote:
[snip]
> > > > This operation even if expensive is only done once during the access of the
> > > > page_idle file. Did you have a better fix in mind?
> > > 
> > > Can we set the idle bit also for non-lru pages as long as they are
> > > reachable via pte?
> > 
> > Not at the moment with the current page idle tracking code. PageLRU(page)
> > flag is checked in page_idle_get_page().
> 
> yes, I am aware of the current code. I strongly suspect that the PageLRU
> check was there to not mark arbitrary page looked up by pfn with the
> idle bit because that would be unexpected. But I might be easily wrong
> here.

Yes, quite possible.

> > Even if we could set it for non-LRU, the idle bit (page flag) would not be
> > cleared if page is not on LRU because page-reclaim code (page_referenced() I
> > believe) would not clear it.
> 
> Yes, it is either reclaim when checking references as you say but also
> mark_page_accessed. I believe the later might still have the page on the
> pcp LRU add cache. Maybe I am missing something something but it seems
> that there is nothing fundamentally requiring the user mapped page to be
> on the LRU list when seting the idle bit.
> 
> That being said, your big hammer approach will work more reliable but if
> you do not feel like changing the underlying PageLRU assumption then
> document that draining should be removed longterm.

Yes, at the moment I am in preference of keeping the underlying assumption
same. I am Ok with adding of a comment on the drain call that it is to be
removed longterm.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Michal Hocko @ 2019-08-06 14:09 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Mike Rapoport, minchan, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell
In-Reply-To: <20190806134321.GA15167@google.com>

On Tue 06-08-19 09:43:21, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 01:57:03PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > > since the page frame gets unmapped.
> > > > > > 
> > > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > > right? Or what does idle actualy mean here?
> > > > > 
> > > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > > out as well using the new hints that Minchan is adding?
> > > > 
> > > > Yes and that is effectivelly making them idle, no?
> > > 
> > > That depends on how you think of it.
> > 
> > I would much prefer to have it documented so that I do not have to guess ;)
> 
> Sure :)
> 
> > > If you are thinking of a monitoring
> > > process like a heap profiler, then from the heap profiler's (that only cares
> > > about the process it is monitoring) perspective it will look extremely odd if
> > > pages that are recently accessed by the process appear to be idle which would
> > > falsely look like those processes are leaking memory. The reality being,
> > > Android forced those pages into swap because of other reasons. I would like
> > > for the swapping mechanism, whether forced swapping or memory reclaim, not to
> > > interfere with the idle detection.
> > 
> > Hmm, but how are you going to handle situation when the page is unmapped
> > and refaulted again (e.g. a normal reclaim of a pagecache)? You are
> > losing that information same was as in the swapout case, no? Or am I
> > missing something?
> 
> Yes you are right, it would have the same issue, thanks for bringing it up.
> Should we rename this bit to PTE_IDLE and do the same thing that we are doing
> for swap?

What if we decide to tear the page table down as well? E.g. because we
can reclaim file backed mappings and free some memory used for page
tables. We do not do that right now but I can see that really large
mappings might push us that direction. Sure this is mostly a theoretical
concern but I am wondering whether promissing to keep the idle bit over
unmapping is not too much.

I am not sure how to deal with this myself, TBH. In any case the current
semantic - via pfn - will lose the idle bit already so can we mimic it
as well? We only have 1 bit for each address which makes it challenging.
The easiest way would be to declare that the idle bit might disappear on
activating or reclaiming the page. How well that suits different
usecases is a different question. I would be interested in hearing from
other people about this of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Minchan Kim @ 2019-08-06 14:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joel Fernandes, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen
In-Reply-To: <20190806115703.GY11812@dhcp22.suse.cz>

On Tue, Aug 06, 2019 at 01:57:03PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > since the page frame gets unmapped.
> > > > > 
> > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > right? Or what does idle actualy mean here?
> > > > 
> > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > out as well using the new hints that Minchan is adding?
> > > 
> > > Yes and that is effectivelly making them idle, no?
> > 
> > That depends on how you think of it.
> 
> I would much prefer to have it documented so that I do not have to guess ;)
> 
> > If you are thinking of a monitoring
> > process like a heap profiler, then from the heap profiler's (that only cares
> > about the process it is monitoring) perspective it will look extremely odd if
> > pages that are recently accessed by the process appear to be idle which would
> > falsely look like those processes are leaking memory. The reality being,
> > Android forced those pages into swap because of other reasons. I would like
> > for the swapping mechanism, whether forced swapping or memory reclaim, not to
> > interfere with the idle detection.
> 
> Hmm, but how are you going to handle situation when the page is unmapped
> and refaulted again (e.g. a normal reclaim of a pagecache)? You are
> losing that information same was as in the swapout case, no? Or am I
> missing something?

If page is unmapped, it's not a idle memory any longer because it's
free memory. We could detect the pte is not present.

If page is refaulted, it's not a idle memory any longer because it's
accessed again. We could detect it because the newly allocated page
doesn't have a PG_idle page flag.

Both case, idle page tracking couldn't report them as IDLE so it's okay.

^ permalink raw reply

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
From: Joel Fernandes @ 2019-08-06 15:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell <sfr@
In-Reply-To: <20190806144747.GA72938@google.com>

On Tue, Aug 06, 2019 at 11:47:47PM +0900, Minchan Kim wrote:
> On Tue, Aug 06, 2019 at 01:57:03PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > > since the page frame gets unmapped.
> > > > > > 
> > > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > > right? Or what does idle actualy mean here?
> > > > > 
> > > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > > out as well using the new hints that Minchan is adding?
> > > > 
> > > > Yes and that is effectivelly making them idle, no?
> > > 
> > > That depends on how you think of it.
> > 
> > I would much prefer to have it documented so that I do not have to guess ;)
> > 
> > > If you are thinking of a monitoring
> > > process like a heap profiler, then from the heap profiler's (that only cares
> > > about the process it is monitoring) perspective it will look extremely odd if
> > > pages that are recently accessed by the process appear to be idle which would
> > > falsely look like those processes are leaking memory. The reality being,
> > > Android forced those pages into swap because of other reasons. I would like
> > > for the swapping mechanism, whether forced swapping or memory reclaim, not to
> > > interfere with the idle detection.
> > 
> > Hmm, but how are you going to handle situation when the page is unmapped
> > and refaulted again (e.g. a normal reclaim of a pagecache)? You are
> > losing that information same was as in the swapout case, no? Or am I
> > missing something?
> 
> If page is unmapped, it's not a idle memory any longer because it's
> free memory. We could detect the pte is not present.

I think Michal is not talking of explictly being unmapped, but about the case
where a file-backed mapped page is unmapped due to memory pressure ? This is
similar to the swap situation.

Basically... file page is marked idle, then it is accessed by userspace. Then
memory pressure drops it off the page cache so the idle information is lost.
Next time we check the page_idle, we miss that it was accessed indeed.

It is not an issue for the heap profiler or anonymous memory per-se. But is
similar to the swap situation.

> If page is refaulted, it's not a idle memory any longer because it's
> accessed again. We could detect it because the newly allocated page
> doesn't have a PG_idle page flag.

In the refault case, yes it should not be a problem.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v13 1/6] sched/core: uclamp: Extend CPU's cgroup controller
From: Michal Koutný @ 2019-08-06 16:11 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, linux-api, cgroups, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Rafael J . Wysocki, Vincent Guittot,
	Viresh Kumar, Paul Turner, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
	Steve Muckle, Suren Baghdasaryan, Alessio Balsini
In-Reply-To: <20190802090853.4810-2-patrick.bellasi@arm.com>

On Fri, Aug 02, 2019 at 10:08:48AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> +static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> +				size_t nbytes, loff_t off,
> +				enum uclamp_id clamp_id)
> +{
> +	struct uclamp_request req;
> +	struct task_group *tg;
> +
> +	req = capacity_from_percent(buf);
> +	if (req.ret)
> +		return req.ret;
> +
> +	rcu_read_lock();
This should be the uclamp_mutex.

(The compound results of the series is correct as the lock is introduced
in "sched/core: uclamp: Propagate parent clamps".
This is just for the happiness of cherry-pickers/bisectors.)

> +static inline void cpu_uclamp_print(struct seq_file *sf,
> +				    enum uclamp_id clamp_id)
> +{
> [...]
> +	rcu_read_lock();
> +	tg = css_tg(seq_css(sf));
> +	util_clamp = tg->uclamp_req[clamp_id].value;
> +	rcu_read_unlock();
Why is the rcu_read_lock() needed here? (I'm considering the comment in
of_css() that should apply here (and it seems that similar uses in other
seq_file handlers also skip this).)

> @@ -7369,6 +7506,20 @@ static struct cftype cpu_legacy_files[] = {
> [...]
> +		.name = "uclamp.min",
> [...]
> +		.name = "uclamp.max",
I don't see technical reasons why uclamp couldn't work on legacy
hierarchy and Tejun acked the series, despite that I'll ask -- should
the new attributes be exposed in v1 controller hierarchy (taking into
account the legacy API is sort of frozen and potential maintenance needs
spanning both hierarchies)?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox