Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary
From: Yann Droneaud @ 2019-05-31 10:06 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190531064313.193437-7-minchan@kernel.org>

Hi,

Le vendredi 31 mai 2019 à 15:43 +0900, Minchan Kim a écrit :
> 
> diff --git a/include/uapi/asm-generic/mman-common.h
> b/include/uapi/asm-generic/mman-common.h
> index 92e347a89ddc..220c2b5eb961 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -75,4 +75,15 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +struct pr_madvise_param {
> +	int size;		/* the size of this structure */
> +	int cookie;		/* reserved to support atomicity */
> +	int nr_elem;		/* count of below arrary fields */

Those should be unsigned.

There's an implicit hole here on ABI with 64bits aligned pointers

> +	int __user *hints;	/* hints for each range */
> +	/* to store result of each operation */
> +	const struct iovec __user *results;
> +	/* input address ranges */
> +	const struct iovec __user *ranges;

Using pointer type in uAPI structure require a 'compat' version of the
syscall need to be provided.

If using iovec too.

> +};
> +
>  #endif /* __ASM_GENERIC_MMAN_COMMON_H */

Regards.

-- 
Yann Droneaud
OPTEYA

^ permalink raw reply

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Renzo Davoli @ 2019-05-31 10:45 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner
In-Reply-To: <480f1bda66b67f740f5da89189bbfca3@suse.de>

HI Roman,

On Fri, May 31, 2019 at 11:34:08AM +0200, Roman Penyaev wrote:
> On 2019-05-27 15:36, Renzo Davoli wrote:
> > Unfortunately this approach cannot be applied to
> > poll/select/ppoll/pselect/epoll.
> 
> If you have to override other systemcalls, what is the problem to override
> poll family?  It will add, let's say, 50 extra code lines complexity to your
> userspace code.  All you need is to be woken up by *any* event and check
> one mask variable, in order to understand what you need to do: read or
> write,
> basically exactly what you do in your eventfd modification, but only in
> userspace.

This approach would not scale. If I want to use both a (user-space) network stack
and a (emulated) device (or more stacks and devices) which (overridden) poll would I use?

The poll of the first stack is not able to to deal with the third device.

> 
> 
> > > Why can it not be less than 64?
> > This is the imeplementation of 'write'. The 64 bits include the
> > 'command'
> > EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the
> > most
> > significant 32 bits) and the set of events (in the lowest 32 bits).
> 
> Do you really need add/del/mod semantics?  Userspace still has to keep mask
> somewhere, so you can have one simple command, which does:
>    ctx->count = events;
> in kernel, so no masks and this games with bits are needed.  That will
> simplify API.

It is true, at the price to have more complex code in user space.
Other system calls could have beeen implemented as "set the value", instead there are
ADD/DEL modification flags.
I mean for example sigprocmask (SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK), or even epoll_ctl.
While poll requires the program to keep the struct pollfd array stored somewhere,
epoll is more powerful and flexible as different file descriptors can be added
and deleted by different modules/components.

If I have two threads implementing the send and receive path of a socket in a user-space
network stack implementation the epoll pending bitmap is shared so I have to create
critical sections like the following one any time I need to set or reset a bit.
	pthread_mutex_lock(mylock)
	events |= EPOLLIN
	write(efd, &events, sizeof(events));
	pthread_mutex_unlock(mylock)
Using add/del semantics locking is not required as the send path thread deals with EPOLLOUT while
its siblings receive thread uses EPOLLIN or EPOLLPRI

I would prefer the add/del/mod semantics, but if this is generally perceived as a unnecessary 
complexity in the kernel code I can update my patch.  

Thank you Roman,

			renzo

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Peter Zijlstra @ 2019-05-31 11:14 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening
In-Reply-To: <16193.1559163763@warthog.procyon.org.uk>

On Wed, May 29, 2019 at 10:02:43PM +0100, David Howells wrote:
> Jann Horn <jannh@google.com> wrote:
> 
> > Does this mean that refcount_read() isn't sufficient for what you want
> > to do with tracing (because for some reason you actually need to know
> > the values atomically at the time of increment/decrement)?
> 
> Correct.  There's a gap and if an interrupt or something occurs, it's
> sufficiently big for the refcount trace to go weird.
> 
> I've seen it in afs/rxrpc where the incoming network packets that are part of
> the rxrpc call flow disrupt the refcounts noted in trace lines.

Can you re-iterate the exact problem? I konw we talked about this in the
past, but I seem to have misplaced those memories :/

FWIW I agree that kref is useless fluff, but I've long ago given up on
that fight.

^ permalink raw reply

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Roman Penyaev @ 2019-05-31 11:48 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner
In-Reply-To: <20190531104502.GE3661@cs.unibo.it>

On 2019-05-31 12:45, Renzo Davoli wrote:
> HI Roman,
> 
> On Fri, May 31, 2019 at 11:34:08AM +0200, Roman Penyaev wrote:
>> On 2019-05-27 15:36, Renzo Davoli wrote:
>> > Unfortunately this approach cannot be applied to
>> > poll/select/ppoll/pselect/epoll.
>> 
>> If you have to override other systemcalls, what is the problem to 
>> override
>> poll family?  It will add, let's say, 50 extra code lines complexity 
>> to your
>> userspace code.  All you need is to be woken up by *any* event and 
>> check
>> one mask variable, in order to understand what you need to do: read or
>> write,
>> basically exactly what you do in your eventfd modification, but only 
>> in
>> userspace.
> 
> This approach would not scale. If I want to use both a (user-space)
> network stack
> and a (emulated) device (or more stacks and devices) which
> (overridden) poll would I use?
> 
> The poll of the first stack is not able to to deal with the third 
> device.

Since each such a stack has a set of read/write/etc functions you always
can extend you stack with another call which returns you event mask,
specifying what exactly you have to do, e.g.:

     nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
     for (n = 0; n < nfds; ++n) {
          struct sock *sock;

          sock = events[n].data.ptr;
          events = sock->get_events(sock, &events[n]);

          if (events & EPOLLIN)
              sock->read(sock);
          if (events & EPOLLOUT)
              sock->write(sock);
     }


With such a virtual table you can mix all userspace stacks and even
with normal sockets, for which 'get_events' function can be declared as

static poll_t kernel_sock_get_events(struct sock *sock, struct 
epoll_event *ev)
{
     return ev->events;
}

Do I miss something?


>> > > Why can it not be less than 64?
>> > This is the imeplementation of 'write'. The 64 bits include the
>> > 'command'
>> > EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the
>> > most
>> > significant 32 bits) and the set of events (in the lowest 32 bits).
>> 
>> Do you really need add/del/mod semantics?  Userspace still has to keep 
>> mask
>> somewhere, so you can have one simple command, which does:
>>    ctx->count = events;
>> in kernel, so no masks and this games with bits are needed.  That will
>> simplify API.
> 
> It is true, at the price to have more complex code in user space.
> Other system calls could have beeen implemented as "set the value",
> instead there are
> ADD/DEL modification flags.
> I mean for example sigprocmask (SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK),
> or even epoll_ctl.
> While poll requires the program to keep the struct pollfd array stored
> somewhere,
> epoll is more powerful and flexible as different file descriptors can 
> be added
> and deleted by different modules/components.
> 
> If I have two threads implementing the send and receive path of a
> socket in a user-space

Eventually you come up with such a lock to protect your tcp or whatever
state machine.  Or you have a real example where read and write paths
can work completely independently?

--
Roman

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: David Howells @ 2019-05-31 12:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, Kees Cook, Kernel Hardening
In-Reply-To: <20190531111445.GO2677@hirez.programming.kicks-ass.net>

Peter Zijlstra <peterz@infradead.org> wrote:

> Can you re-iterate the exact problem? I konw we talked about this in the
> past, but I seem to have misplaced those memories :/

Take this for example:

	void afs_put_call(struct afs_call *call)
	{
		struct afs_net *net = call->net;
		int n = atomic_dec_return(&call->usage);
		int o = atomic_read(&net->nr_outstanding_calls);

		trace_afs_call(call, afs_call_trace_put, n + 1, o,
			       __builtin_return_address(0));

		ASSERTCMP(n, >=, 0);
		if (n == 0) {
			...
		}
	}

I am printing the usage count in the afs_call tracepoint so that I can use it
to debug refcount bugs.  If I do it like this:

	void afs_put_call(struct afs_call *call)
	{
		int n = refcount_read(&call->usage);
		int o = atomic_read(&net->nr_outstanding_calls);

		trace_afs_call(call, afs_call_trace_put, n, o,
			       __builtin_return_address(0));

		if (refcount_dec_and_test(&call->usage)) {
			...
		}
	}

then there's a temporal gap between the usage count being read and the actual
atomic decrement in which another CPU can alter the count.  This can be
exacerbated by an interrupt occurring, a softirq occurring or someone enabling
the tracepoint.

I can't do the tracepoint after the decrement if refcount_dec_and_test()
returns false unless I save all the values from the object that I might need
as the object could be destroyed any time from that point on.  In this
particular case, that's just call->debug_id, but it could be other things in
other cases.

Note that I also can't touch the afs_net object in that situation either, and
the outstanding calls count that I record will potentially be out of date -
but there's not a lot I can do about that.

David

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: David Howells @ 2019-05-31 12:42 UTC (permalink / raw)
  To: Greg KH
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel
In-Reply-To: <20190529230954.GA3164@kroah.com>

Greg KH <gregkh@linuxfoundation.org> wrote:

> > kref_put() enforces a very specific destructor signature.  I know of places
> > where that doesn't work because the destructor takes more than one argument
> > (granted that this is not the case here).  So why does kref_put() exist at
> > all?  Why not kref_dec_and_test()?
>
> The destructor only takes one object pointer as you are finally freeing
> that object.  What more do you need/want to "know" at that point in
> time?

Imagine that I have an object that's on a list rooted in a namespace and that
I have a lot of these objects.  Imagine further that any time I want to put a
ref on one of these objects, it's in a context that has the namespace pinned.
I therefore don't need to store a pointer to the namespace in every object
because I can pass that in to the put function

Indeed, I can still access the namespace even after the decrement didn't
reduce the usage count to 0 - say for doing statistics.

> What would kref_dec_and_test() be needed for?

Why do you need kref_put() to take a destructor function pointer?  Why cannot
that be replaced with, say:

	static inline bool __kref_put(struct kref *k)
	{
		return refcount_dec_and_test(&k->refcount);
	}

and then one could do:

	void put_foo(struct foo_net *ns, struct foo *f)
	{
		if (__kref_put(&f->refcount)) {
			// destroy foo
		}
	}

that way the destruction code does not have to be offloaded into its own
function and you still have your pattern to look for.

For tracing purposes, I could live with something like:

	static inline
	bool __kref_put_return(struct kref *k, unsigned int *_usage)
	{
		return refcount_dec_and_test_return(&k->refcount, _usage);
	}

and then I could do:

	void put_foo(struct foo_net *ns, struct foo *f)
	{
		unsigned int u;
		bool is_zero = __kref_put_return(&f->refcount, &u);

		trace_foo_refcount(f, u);
		if (is_zero) {
			// destroy foo
		}
	}

then it could be made such that you can disable the ability of
refcount_dec_and_test_return() to pass back a useful refcount value if you
want a bit of extra speed.

Or even if refcount_dec_return() is guaranteed to return 0 if the count hits
the floor and non-zero otherwise and there's a config switch to impose a
stronger guarantee that it will return a value that's appropriately
transformed to look as if I was using atomic_dec_return().

Similarly for refcount_inc_return() - it could just return gibberish unless
the same config switch is enabled.

Question for AMD/Intel guys: I'm curious if LOCK DECL faster than LOCK XADD -1
on x86_64?

> > Why doesn't refcount_t get merged into kref, or vice versa?  Having both
> > would seem redundant.
>
> kref uses refcount_t and provides a different functionality on top of
> it.  Not all uses of a refcount in the kernel is for object lifecycle
> reference counting, as you know :)

I do?  I can't think of one offhand.  Not that I'm saying you're wrong on
that - there's an awful lot of kernel.

David

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-05-31 12:44 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Tycho Andersen, Serge E. Hallyn, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <20190531002058.tsddah4edcazkuzs@madcap2.tricolour.ca>

On Thu, May 30, 2019 at 8:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 19:26, Paul Moore wrote:
> > On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > > >
> > > > [REMINDER: It is an "*audit* container ID" and not a general
> > > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > >
> > > This sort of seems like a distinction without a difference; presumably
> > > audit is going to want to differentiate between everything that people
> > > in userspace call a container. So you'll have to support all this
> > > insanity anyway, even if it's "not a container ID".
> >
> > That's not quite right.  Audit doesn't care about what a container is,
> > or is not, it also doesn't care if the "audit container ID" actually
> > matches the ID used by the container engine in userspace and I think
> > that is a very important line to draw.  Audit is simply given a value
> > which it calls the "audit container ID", it ensures that the value is
> > inherited appropriately (e.g. children inherit their parent's audit
> > container ID), and it uses the value in audit records to provide some
> > additional context for log analysis.  The distinction isn't limited to
> > the value itself, but also to how it is used; it is an "audit
> > container ID" and not a "container ID" because this value is
> > exclusively for use by the audit subsystem.  We are very intentionally
> > not adding a generic container ID to the kernel.  If the kernel does
> > ever grow a general purpose container ID we will be one of the first
> > ones in line to make use of it, but we are not going to be the ones to
> > generically add containers to the kernel.  Enough people already hate
> > audit ;)
> >
> > > > I'm not interested in supporting/merging something that isn't useful;
> > > > if this doesn't work for your use case then we need to figure out what
> > > > would work.  It sounds like nested containers are much more common in
> > > > the lxc world, can you elaborate a bit more on this?
> > > >
> > > > As far as the possible solutions you mention above, I'm not sure I
> > > > like the per-userns audit container IDs, I'd much rather just emit the
> > > > necessary tracking information via the audit record stream and let the
> > > > log analysis tools figure it out.  However, the bigger question is how
> > > > to limit (re)setting the audit container ID when you are in a non-init
> > > > userns.  For reasons already mentioned, using capable() is a non
> > > > starter for everything but the initial userns, and using ns_capable()
> > > > is equally poor as it essentially allows any userns the ability to
> > > > munge it's audit container ID (obviously not good).  It appears we
> > > > need a different method for controlling access to the audit container
> > > > ID.
> > >
> > > One option would be to make it a string, and have it be append only.
> > > That should be safe with no checks.
> > >
> > > I know there was a long thread about what type to make this thing. I
> > > think you could accomplish the append-only-ness with a u64 if you had
> > > some rule about only allowing setting lower order bits than those that
> > > are already set. With 4 bits for simplicity:
> > >
> > > 1100         # initial container id
> > > 1100 -> 1011 # not allowed
> > > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > >              # no lower order bits left
> > >
> > > There are probably fancier ways to do it if you actually understand
> > > math :)
> >
> >  ;)
> >
> > > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > > you have 64 bits, this might be reasonable. You could just teach
> > > container engines to use the first say N bits for themselves, with a 1
> > > bit for the barrier at the end.
> >
> > I like the creativity, but I worry that at some point these
> > limitations are going to be raised (limits have a funny way of doing
> > that over time) and we will be in trouble.  I say "trouble" because I
> > want to be able to quickly do an audit container ID comparison and
> > we're going to pay a penalty for these larger values (we'll need this
> > when we add multiple auditd support and the requisite record routing).
> >
> > Thinking about this makes me also realize we probably need to think a
> > bit longer about audit container ID conflicts between orchestrators.
> > Right now we just take the value that is given to us by the
> > orchestrator, but if we want to allow multiple container orchestrators
> > to work without some form of cooperation in userspace (I think we have
> > to assume the orchestrators will not talk to each other) we likely
> > need to have some way to block reuse of an audit container ID.  We
> > would either need to prevent the orchestrator from explicitly setting
> > an audit container ID to a currently in use value, or instead generate
> > the audit container ID in the kernel upon an event triggered by the
> > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > start looking at the idr code, I think we will need to make use of it.
>
> My first reaction to using the IDR code is that once an idr is given up,
> it can be reused.  I suppose we request IDRs and then never give them up
> to avoid reuse...

I'm not sure we ever what to guarantee that an audit container ID
won't be reused during the lifetime of the system, it is a discrete
integer after all.  What I think we do want to guarantee is that we
won't allow an unintentional audit container ID collision between
different orchestrators; if a single orchestrator wants to reuse an
audit container ID then that is its choice.

> I already had some ideas of preventing an existing ID from being reused,

Cool.  I only made the idr suggestion since it is used for PIDs and
solves a very similar problem.

> but that makes the practice of some container engines injecting
> processes into existing containers difficult if not impossible.

Yes, we'll need some provision to indicate which orchestrator
"controls" that particular audit container ID, and allow that
orchestrator to reuse that particular audit container ID (until all
those containers disappear and the audit container ID is given back to
the pool).

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
From: Minchan Kim @ 2019-05-31 13:12 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, hdanton
In-Reply-To: <20190531070420.m7sxybbzzayig44o@butterfly.localdomain>

On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote:
> On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> > This patch factor out madvise's core functionality so that upcoming
> > patch can reuse it without duplication. It shouldn't change any behavior.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> >  1 file changed, 101 insertions(+), 87 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 9d749a1420b4..466623ea8c36 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> >  	struct page *page;
> >  	int isolated = 0;
> >  	struct vm_area_struct *vma = walk->vma;
> > +	struct task_struct *task = walk->private;
> >  	unsigned long next;
> >  
> > -	if (fatal_signal_pending(current))
> > +	if (fatal_signal_pending(task))
> >  		return -EINTR;
> >  
> >  	next = pmd_addr_end(addr, end);
> > @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> >  }
> >  
> >  static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > -			     struct vm_area_struct *vma,
> > -			     unsigned long addr, unsigned long end)
> > +				struct task_struct *task,
> > +				struct vm_area_struct *vma,
> > +				unsigned long addr, unsigned long end)
> >  {
> >  	struct mm_walk warm_walk = {
> >  		.pmd_entry = madvise_pageout_pte_range,
> >  		.mm = vma->vm_mm,
> > +		.private = task,
> >  	};
> >  
> >  	tlb_start_vma(tlb, vma);
> > @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> >  }
> >  
> >  
> > -static long madvise_pageout(struct vm_area_struct *vma,
> > -			struct vm_area_struct **prev,
> > -			unsigned long start_addr, unsigned long end_addr)
> > +static long madvise_pageout(struct task_struct *task,
> > +		struct vm_area_struct *vma, struct vm_area_struct **prev,
> > +		unsigned long start_addr, unsigned long end_addr)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct mmu_gather tlb;
> > @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
> >  
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > -	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> > +	madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> >  	tlb_finish_mmu(&tlb, start_addr, end_addr);
> >  
> >  	return 0;
> > @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > +static long madvise_dontneed_free(struct mm_struct *mm,
> > +				  struct vm_area_struct *vma,
> >  				  struct vm_area_struct **prev,
> >  				  unsigned long start, unsigned long end,
> >  				  int behavior)
> > @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  	if (!userfaultfd_remove(vma, start, end)) {
> >  		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
> >  
> > -		down_read(&current->mm->mmap_sem);
> > -		vma = find_vma(current->mm, start);
> > +		down_read(&mm->mmap_sem);
> > +		vma = find_vma(mm, start);
> >  		if (!vma)
> >  			return -ENOMEM;
> >  		if (start < vma->vm_start) {
> > @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >   * Application wants to free up the pages and associated backing store.
> >   * This is effectively punching a hole into the middle of a file.
> >   */
> > -static long madvise_remove(struct vm_area_struct *vma,
> > +static long madvise_remove(struct mm_struct *mm,
> > +				struct vm_area_struct *vma,
> >  				struct vm_area_struct **prev,
> >  				unsigned long start, unsigned long end)
> >  {
> > @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> >  	get_file(f);
> >  	if (userfaultfd_remove(vma, start, end)) {
> >  		/* mmap_sem was not released by userfaultfd_remove() */
> > -		up_read(&current->mm->mmap_sem);
> > +		up_read(&mm->mmap_sem);
> >  	}
> >  	error = vfs_fallocate(f,
> >  				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> >  				offset, end - start);
> >  	fput(f);
> > -	down_read(&current->mm->mmap_sem);
> > +	down_read(&mm->mmap_sem);
> >  	return error;
> >  }
> >  
> > @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> >  #endif
> >  
> >  static long
> > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> > +		struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		unsigned long start, unsigned long end, int behavior)
> >  {
> >  	switch (behavior) {
> >  	case MADV_REMOVE:
> > -		return madvise_remove(vma, prev, start, end);
> > +		return madvise_remove(mm, vma, prev, start, end);
> >  	case MADV_WILLNEED:
> >  		return madvise_willneed(vma, prev, start, end);
> >  	case MADV_COLD:
> >  		return madvise_cold(vma, prev, start, end);
> >  	case MADV_PAGEOUT:
> > -		return madvise_pageout(vma, prev, start, end);
> > +		return madvise_pageout(task, vma, prev, start, end);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> > -		return madvise_dontneed_free(vma, prev, start, end, behavior);
> > +		return madvise_dontneed_free(mm, vma, prev, start,
> > +						end, behavior);
> >  	default:
> >  		return madvise_behavior(vma, prev, start, end, behavior);
> >  	}
> > @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> >  	}
> >  }
> >  
> > -/*
> > - * The madvise(2) system call.
> > - *
> > - * Applications can use madvise() to advise the kernel how it should
> > - * handle paging I/O in this VM area.  The idea is to help the kernel
> > - * use appropriate read-ahead and caching techniques.  The information
> > - * provided is advisory only, and can be safely disregarded by the
> > - * kernel without affecting the correct operation of the application.
> > - *
> > - * behavior values:
> > - *  MADV_NORMAL - the default behavior is to read clusters.  This
> > - *		results in some read-ahead and read-behind.
> > - *  MADV_RANDOM - the system should read the minimum amount of data
> > - *		on any access, since it is unlikely that the appli-
> > - *		cation will need more than what it asks for.
> > - *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
> > - *		once, so they can be aggressively read ahead, and
> > - *		can be freed soon after they are accessed.
> > - *  MADV_WILLNEED - the application is notifying the system to read
> > - *		some pages ahead.
> > - *  MADV_DONTNEED - the application is finished with the given range,
> > - *		so the kernel can free resources associated with it.
> > - *  MADV_FREE - the application marks pages in the given range as lazy free,
> > - *		where actual purges are postponed until memory pressure happens.
> > - *  MADV_REMOVE - the application wants to free up the given range of
> > - *		pages and associated backing store.
> > - *  MADV_DONTFORK - omit this area from child's address space when forking:
> > - *		typically, to avoid COWing pages pinned by get_user_pages().
> > - *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> > - *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
> > - *              range after a fork.
> > - *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> > - *  MADV_HWPOISON - trigger memory error handler as if the given memory range
> > - *		were corrupted by unrecoverable hardware memory failure.
> > - *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> > - *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> > - *		this area with pages of identical content from other such areas.
> > - *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> > - *  MADV_HUGEPAGE - the application wants to back the given range by transparent
> > - *		huge pages in the future. Existing pages might be coalesced and
> > - *		new pages might be allocated as THP.
> > - *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > - *		transparent huge pages so the existing pages will not be
> > - *		coalesced into THP and new pages will not be allocated as THP.
> > - *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> > - *		from being included in its core dump.
> > - *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > - *
> > - * return values:
> > - *  zero    - success
> > - *  -EINVAL - start + len < 0, start is not page-aligned,
> > - *		"behavior" is not a valid value, or application
> > - *		is attempting to release locked or shared pages,
> > - *		or the specified address range includes file, Huge TLB,
> > - *		MAP_SHARED or VMPFNMAP range.
> > - *  -ENOMEM - addresses in the specified range are not currently
> > - *		mapped, or are outside the AS of the process.
> > - *  -EIO    - an I/O error occurred while paging in data.
> > - *  -EBADF  - map exists, but area maps something that isn't a file.
> > - *  -EAGAIN - a kernel resource was temporarily unavailable.
> > - */
> > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> > +			unsigned long start, size_t len_in, int behavior)
> 
> Just a minor nitpick, but can we please have it named madvise_common,
> not madvise_core? This would follow a usual naming scheme, when some
> common functionality is factored out (like, for mutexes, semaphores
> etc), and within the kernel "core" usually means something completely
> different.

Sure.

> 
> >  {
> >  	unsigned long end, tmp;
> >  	struct vm_area_struct *vma, *prev;
> > @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  
> >  #ifdef CONFIG_MEMORY_FAILURE
> >  	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > -		return madvise_inject_error(behavior, start, start + len_in);
> > +		return madvise_inject_error(behavior,
> > +					start, start + len_in);
> 
> Not sure what this change is about except changing the line length.
> Note, madvise_inject_error() still operates on "current" through
> get_user_pages_fast() and gup_pgd_range(), but that was not changed
> here. I Know you've filtered out this hint later, so technically this
> is not an issue, but, maybe, this needs some attention too since we've
> already spotted it?

It is leftover I had done. I actually modified it to handle remote
task but changed my mind not to fix it because process_madvise
will not support it at this moment. I'm not sure it's a good idea
to change it for *might-be-done-in-future* at this moment even though
we have spotted.

> 
> >  #endif
> >  
> >  	write = madvise_need_mmap_write(behavior);
> >  	if (write) {
> > -		if (down_write_killable(&current->mm->mmap_sem))
> > +		if (down_write_killable(&mm->mmap_sem))
> >  			return -EINTR;
> 
> Do you still need that trick with mmget_still_valid() here?
> Something like:

Since MADV_COLD|PAGEOUT doesn't change address space layout or
vma->vm_flags, technically, we don't need it if I understand
correctly. Right?

> 
> if (current->mm != mm && !mmget_still_valid(mm))
>    goto skip_mm;
> 
> and that skip_mm label would be before
> 
> if (write)
>    up_write(&mm->mmap_sem);
> 
> below.
> 
> (see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)

^ permalink raw reply

* Re: [RFCv2 5/6] mm: introduce external memory hinting API
From: Minchan Kim @ 2019-05-31 13:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190531083757.GH6896@dhcp22.suse.cz>

On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > There is some usecase that centralized userspace daemon want to give
> > a memory hint like MADV_[COLD|PAGEEOUT] to other process. Android's
> > ActivityManagerService is one of them.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> > 
> > To solve the issue, this patch introduces new syscall process_madvise(2).
> > It could give a hint to the exeternal process of pidfd.
> > 
> >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > 			unsigned long cookie, unsigned long flag);
> > 
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrace the process could use it successfully.
> > 
> > The syscall has a cookie argument to privode atomicity(i.e., detect
> > target process's address space change since monitor process has parsed
> > the address range of target process so the operaion could fail in case
> > of happening race). Although there is no interface to get a cookie
> > at this moment, it could be useful to consider it as argument to avoid
> > introducing another new syscall in future. It could support *atomicity*
> > for disruptive hint(e.g., MADV_DONTNEED|FREE).
> > flag argument is reserved for future use if we need to extend the API.
> 
> Providing an API that is incomplete will not fly. Really. As this really
> begs for much more discussion and it would be good to move on with the
> core idea of the pro active memory memory management from userspace
> usecase. Could you split out the core change so that we can move on and
> leave the external for a later discussion. I believe this would lead to
> a smoother integration.

No problem but I need to understand what you want a little bit more because
I thought this patchset is already step by step so if we reach the agreement
of part of them like [1-5/6], it could be merged first.

Could you say how you want to split the patchset for forward progress?

Thanks.

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Peter Zijlstra @ 2019-05-31 13:26 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening
In-Reply-To: <21942.1559304135@warthog.procyon.org.uk>

On Fri, May 31, 2019 at 01:02:15PM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Can you re-iterate the exact problem? I konw we talked about this in the
> > past, but I seem to have misplaced those memories :/
> 
> Take this for example:
> 
> 	void afs_put_call(struct afs_call *call)
> 	{
> 		struct afs_net *net = call->net;
> 		int n = atomic_dec_return(&call->usage);
> 		int o = atomic_read(&net->nr_outstanding_calls);
> 
> 		trace_afs_call(call, afs_call_trace_put, n + 1, o,
> 			       __builtin_return_address(0));
> 
> 		ASSERTCMP(n, >=, 0);
> 		if (n == 0) {
> 			...
> 		}
> 	}
> 
> I am printing the usage count in the afs_call tracepoint so that I can use it
> to debug refcount bugs.  If I do it like this:
> 
> 	void afs_put_call(struct afs_call *call)
> 	{
> 		int n = refcount_read(&call->usage);
> 		int o = atomic_read(&net->nr_outstanding_calls);
> 
> 		trace_afs_call(call, afs_call_trace_put, n, o,
> 			       __builtin_return_address(0));
> 
> 		if (refcount_dec_and_test(&call->usage)) {
> 			...
> 		}
> 	}
> 
> then there's a temporal gap between the usage count being read and the actual
> atomic decrement in which another CPU can alter the count.  This can be
> exacerbated by an interrupt occurring, a softirq occurring or someone enabling
> the tracepoint.
> 
> I can't do the tracepoint after the decrement if refcount_dec_and_test()
> returns false unless I save all the values from the object that I might need
> as the object could be destroyed any time from that point on.

Is it not the responsibility of the task that affects the 1->0
transition to actually free the memory?

That is, I'm expecting the '...' in both cases above the include the
actual freeing of the object. If this is not the case, then @usage is
not a reference count.

(and it has already been established that refcount_t doesn't work for
usage count scenarios)

Aside from that, is the problem that refcount_dec_and_test() returns a
boolean (true - last put, false - not last) instead of the refcount
value? This does indeed make it hard to print the exact count value for
the event.

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-05-31 13:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190531084752.GI6896@dhcp22.suse.cz>

On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to kernel that the pages can be reclaimed when memory pressure
> > happens but data should be preserved for future use.  This could reduce
> > workingset eviction so it ends up increasing performance.
> > 
> > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help kernel in deciding which
> > pages to evict early during memory pressure.
> > 
> > Internally, it works via deactivating pages from active list to inactive's
> > head if the page is private because inactive list could be full of
> > used-once pages which are first candidate for the reclaiming and that's a
> > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > if the memory pressure happens, they will be reclaimed earlier than other
> > active pages unless there is no access until the time.
> 
> [I am intentionally not looking at the implementation because below
> points should be clear from the changelog - sorry about nagging ;)]
> 
> What kind of pages can be deactivated? Anonymous/File backed.
> Private/shared? If shared, are there any restrictions?

Both file and private pages could be deactived from each active LRU
to each inactive LRU if the page has one map_count. In other words,

    if (page_mapcount(page) <= 1)
        deactivate_page(page);

> 
> Are there any restrictions on mappings? E.g. what would be an effect of
> this operation on hugetlbfs mapping?

VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED

> 
> Also you are talking about inactive LRU but what kind of LRU is that? Is
> it the anonymous LRU? If yes, don't we have the same problem as with the

active file page -> inactive file LRU
active anon page -> inacdtive anon LRU

> early MADV_FREE implementation when enough page cache causes that
> deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> worse never when there is no swap available?

I think MADV_COLD is a little bit different symantic with MAVD_FREE.
MADV_FREE means it's okay to discard when the memory pressure because
the content of the page is *garbage*. Furthemore, freeing such pages is
almost zero overhead since we don't need to swap out and access
afterward causes minor fault. Thus, it would make sense to put those
freeable pages in inactive file LRU to compete other used-once pages.

However, MADV_COLD doesn't means it's a garbage and freeing requires
swap out/swap in afterward. So, it would be better to move inactive
anon's LRU list, not file LRU. Furthermore, it would avoid unnecessary
scanning of those cold anonymous if system doesn't have a swap device.

^ permalink raw reply

* Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT
From: Minchan Kim @ 2019-05-31 13:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190531085044.GJ6896@dhcp22.suse.cz>

On Fri, May 31, 2019 at 10:50:44AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:10, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > for a long time, it could hint kernel that the pages can be
> > reclaimed instantly but data should be preserved for future use.
> > This could reduce workingset eviction so it ends up increasing
> > performance.
> > 
> > This patch introduces the new MADV_PAGEOUT hint to madvise(2)
> > syscall. MADV_PAGEOUT can be used by a process to mark a memory
> > range as not expected to be used for a long time so that kernel
> > reclaims the memory instantly. The hint can help kernel in deciding
> > which pages to evict proactively.
> 
> Again, are there any restictions on what kind of memory can be paged out?
> Private/Shared, anonymous/file backed. Any restrictions on mapping type.
> Etc. Please make sure all that is in the changelog.

It's same with MADV_COLD. Yes, I will include all detail in the
description.

> 
> What are the failure modes? E.g. what if the swap is full, does the call
> fails or it silently ignores the error?

In such case, just ignore the swapout. It returns -EINVAL only if the
vma is one of (VM_LOCKED|VM_HUGETLB|VM_PFNMAP) at this moment.

> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply

* Re: [RFCv2 5/6] mm: introduce external memory hinting API
From: Michal Hocko @ 2019-05-31 14:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190531131859.GB195463@google.com>

On Fri 31-05-19 22:19:00, Minchan Kim wrote:
> On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > > There is some usecase that centralized userspace daemon want to give
> > > a memory hint like MADV_[COLD|PAGEEOUT] to other process. Android's
> > > ActivityManagerService is one of them.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > > 
> > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > It could give a hint to the exeternal process of pidfd.
> > > 
> > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > 			unsigned long cookie, unsigned long flag);
> > > 
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrace the process could use it successfully.
> > > 
> > > The syscall has a cookie argument to privode atomicity(i.e., detect
> > > target process's address space change since monitor process has parsed
> > > the address range of target process so the operaion could fail in case
> > > of happening race). Although there is no interface to get a cookie
> > > at this moment, it could be useful to consider it as argument to avoid
> > > introducing another new syscall in future. It could support *atomicity*
> > > for disruptive hint(e.g., MADV_DONTNEED|FREE).
> > > flag argument is reserved for future use if we need to extend the API.
> > 
> > Providing an API that is incomplete will not fly. Really. As this really
> > begs for much more discussion and it would be good to move on with the
> > core idea of the pro active memory memory management from userspace
> > usecase. Could you split out the core change so that we can move on and
> > leave the external for a later discussion. I believe this would lead to
> > a smoother integration.
> 
> No problem but I need to understand what you want a little bit more because
> I thought this patchset is already step by step so if we reach the agreement
> of part of them like [1-5/6], it could be merged first.
> 
> Could you say how you want to split the patchset for forward progress?

I would start with new madvise modes and once they are in a shape to be
merged then we can start the remote madvise API. I believe that even
local process reclaim modes are interesting and useful. I haven't heard
anybody objecting to them without having a remote API so far.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Michal Hocko @ 2019-05-31 14:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190531133904.GC195463@google.com>

On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range, it could
> > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > happens but data should be preserved for future use.  This could reduce
> > > workingset eviction so it ends up increasing performance.
> > > 
> > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > to be used in the near future. The hint can help kernel in deciding which
> > > pages to evict early during memory pressure.
> > > 
> > > Internally, it works via deactivating pages from active list to inactive's
> > > head if the page is private because inactive list could be full of
> > > used-once pages which are first candidate for the reclaiming and that's a
> > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > if the memory pressure happens, they will be reclaimed earlier than other
> > > active pages unless there is no access until the time.
> > 
> > [I am intentionally not looking at the implementation because below
> > points should be clear from the changelog - sorry about nagging ;)]
> > 
> > What kind of pages can be deactivated? Anonymous/File backed.
> > Private/shared? If shared, are there any restrictions?
> 
> Both file and private pages could be deactived from each active LRU
> to each inactive LRU if the page has one map_count. In other words,
> 
>     if (page_mapcount(page) <= 1)
>         deactivate_page(page);

Why do we restrict to pages that are single mapped?

> > Are there any restrictions on mappings? E.g. what would be an effect of
> > this operation on hugetlbfs mapping?
> 
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED

OK documenting that this is restricted to the same vmas as MADV_FREE|DONTNEED
is really useful to mention.

> 
> > 
> > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > it the anonymous LRU? If yes, don't we have the same problem as with the
> 
> active file page -> inactive file LRU
> active anon page -> inacdtive anon LRU
> 
> > early MADV_FREE implementation when enough page cache causes that
> > deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> > worse never when there is no swap available?
> 
> I think MADV_COLD is a little bit different symantic with MAVD_FREE.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage*. Furthemore, freeing such pages is
> almost zero overhead since we don't need to swap out and access
> afterward causes minor fault. Thus, it would make sense to put those
> freeable pages in inactive file LRU to compete other used-once pages.
> 
> However, MADV_COLD doesn't means it's a garbage and freeing requires
> swap out/swap in afterward. So, it would be better to move inactive
> anon's LRU list, not file LRU. Furthermore, it would avoid unnecessary
> scanning of those cold anonymous if system doesn't have a swap device.

Please document this, if this is really a desirable semantic because
then you have the same set of problems as we've had with the early
MADV_FREE implementation mentioned above.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFCv2 5/6] mm: introduce external memory hinting API
From: Minchan Kim @ 2019-05-31 14:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190531140050.GS6896@dhcp22.suse.cz>

On Fri, May 31, 2019 at 04:00:50PM +0200, Michal Hocko wrote:
> On Fri 31-05-19 22:19:00, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > > > There is some usecase that centralized userspace daemon want to give
> > > > a memory hint like MADV_[COLD|PAGEEOUT] to other process. Android's
> > > > ActivityManagerService is one of them.
> > > > 
> > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > required to make the reclaim decision is not known to the app. Instead,
> > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > and that daemon must be able to initiate reclaim on its own without
> > > > any app involvement.
> > > > 
> > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > It could give a hint to the exeternal process of pidfd.
> > > > 
> > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > 			unsigned long cookie, unsigned long flag);
> > > > 
> > > > Since it could affect other process's address range, only privileged
> > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > gives it the right to ptrace the process could use it successfully.
> > > > 
> > > > The syscall has a cookie argument to privode atomicity(i.e., detect
> > > > target process's address space change since monitor process has parsed
> > > > the address range of target process so the operaion could fail in case
> > > > of happening race). Although there is no interface to get a cookie
> > > > at this moment, it could be useful to consider it as argument to avoid
> > > > introducing another new syscall in future. It could support *atomicity*
> > > > for disruptive hint(e.g., MADV_DONTNEED|FREE).
> > > > flag argument is reserved for future use if we need to extend the API.
> > > 
> > > Providing an API that is incomplete will not fly. Really. As this really
> > > begs for much more discussion and it would be good to move on with the
> > > core idea of the pro active memory memory management from userspace
> > > usecase. Could you split out the core change so that we can move on and
> > > leave the external for a later discussion. I believe this would lead to
> > > a smoother integration.
> > 
> > No problem but I need to understand what you want a little bit more because
> > I thought this patchset is already step by step so if we reach the agreement
> > of part of them like [1-5/6], it could be merged first.
> > 
> > Could you say how you want to split the patchset for forward progress?
> 
> I would start with new madvise modes and once they are in a shape to be
> merged then we can start the remote madvise API. I believe that even
> local process reclaim modes are interesting and useful. I haven't heard
> anybody objecting to them without having a remote API so far.

Okay, let's focus on [1-3/6] at this moment.

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: David Howells @ 2019-05-31 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, Kees Cook, Kernel Hardening
In-Reply-To: <20190531132620.GC2606@hirez.programming.kicks-ass.net>

Peter Zijlstra <peterz@infradead.org> wrote:

> Is it not the responsibility of the task that affects the 1->0
> transition to actually free the memory?
> 
> That is, I'm expecting the '...' in both cases above the include the
> actual freeing of the object. If this is not the case, then @usage is
> not a reference count.

Yes.  The '...' does the freeing.  It seemed unnecessary to include the code
ellipsised there since it's not the point of the discussion, but if you want
the full function:

	void afs_put_call(struct afs_call *call)
	{
		struct afs_net *net = call->net;
		int n = atomic_dec_return(&call->usage);
		int o = atomic_read(&net->nr_outstanding_calls);

		trace_afs_call(call, afs_call_trace_put, n + 1, o,
			       __builtin_return_address(0));

		ASSERTCMP(n, >=, 0);
		if (n == 0) {
			ASSERT(!work_pending(&call->async_work));
			ASSERT(call->type->name != NULL);

			if (call->rxcall) {
				rxrpc_kernel_end_call(net->socket, call->rxcall);
				call->rxcall = NULL;
			}
			if (call->type->destructor)
				call->type->destructor(call);

			afs_put_server(call->net, call->server);
			afs_put_cb_interest(call->net, call->cbi);
			afs_put_addrlist(call->alist);
			kfree(call->request);

			trace_afs_call(call, afs_call_trace_free, 0, o,
				       __builtin_return_address(0));
			kfree(call);

			o = atomic_dec_return(&net->nr_outstanding_calls);
			if (o == 0)
				wake_up_var(&net->nr_outstanding_calls);
		}
	}

You can see the kfree(call) in there.
Peter Zijlstra <peterz@infradead.org> wrote:

> (and it has already been established that refcount_t doesn't work for
> usage count scenarios)

?

Does that mean struct kref doesn't either?

> Aside from that, is the problem that refcount_dec_and_test() returns a
> boolean (true - last put, false - not last) instead of the refcount
> value? This does indeed make it hard to print the exact count value for
> the event.

That is the problem, yes - well, one of them: refcount_inc() doesn't either.

David

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-05-31 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190531140332.GT6896@dhcp22.suse.cz>

On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range, it could
> > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > happens but data should be preserved for future use.  This could reduce
> > > > workingset eviction so it ends up increasing performance.
> > > > 
> > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > to be used in the near future. The hint can help kernel in deciding which
> > > > pages to evict early during memory pressure.
> > > > 
> > > > Internally, it works via deactivating pages from active list to inactive's
> > > > head if the page is private because inactive list could be full of
> > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > active pages unless there is no access until the time.
> > > 
> > > [I am intentionally not looking at the implementation because below
> > > points should be clear from the changelog - sorry about nagging ;)]
> > > 
> > > What kind of pages can be deactivated? Anonymous/File backed.
> > > Private/shared? If shared, are there any restrictions?
> > 
> > Both file and private pages could be deactived from each active LRU
> > to each inactive LRU if the page has one map_count. In other words,
> > 
> >     if (page_mapcount(page) <= 1)
> >         deactivate_page(page);
> 
> Why do we restrict to pages that are single mapped?

Because page table in one of process shared the page would have access bit
so finally we couldn't reclaim the page. The more process it is shared,
the more fail to reclaim.

> 
> > > Are there any restrictions on mappings? E.g. what would be an effect of
> > > this operation on hugetlbfs mapping?
> > 
> > VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED
> 
> OK documenting that this is restricted to the same vmas as MADV_FREE|DONTNEED
> is really useful to mention.

Sure.

> 
> > 
> > > 
> > > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > > it the anonymous LRU? If yes, don't we have the same problem as with the
> > 
> > active file page -> inactive file LRU
> > active anon page -> inacdtive anon LRU
> > 
> > > early MADV_FREE implementation when enough page cache causes that
> > > deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> > > worse never when there is no swap available?
> > 
> > I think MADV_COLD is a little bit different symantic with MAVD_FREE.
> > MADV_FREE means it's okay to discard when the memory pressure because
> > the content of the page is *garbage*. Furthemore, freeing such pages is
> > almost zero overhead since we don't need to swap out and access
> > afterward causes minor fault. Thus, it would make sense to put those
> > freeable pages in inactive file LRU to compete other used-once pages.
> > 
> > However, MADV_COLD doesn't means it's a garbage and freeing requires
> > swap out/swap in afterward. So, it would be better to move inactive
> > anon's LRU list, not file LRU. Furthermore, it would avoid unnecessary
> > scanning of those cold anonymous if system doesn't have a swap device.
> 
> Please document this, if this is really a desirable semantic because
> then you have the same set of problems as we've had with the early
> MADV_FREE implementation mentioned above.

IIRC, the problem of MADV_FREE was that we couldn't discard freeable
pages because VM never scan anonymous LRU with swapless system.
However, it's not the our case because we should reclaim them, not
discarding.

I will include it in the description.

Thanks.

^ permalink raw reply

* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
From: Oleksandr Natalenko @ 2019-05-31 14:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, hdanton
In-Reply-To: <20190531131226.GA195463@google.com>

On Fri, May 31, 2019 at 10:12:26PM +0900, Minchan Kim wrote:
> On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote:
> > On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> > > This patch factor out madvise's core functionality so that upcoming
> > > patch can reuse it without duplication. It shouldn't change any behavior.
> > > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> > >  1 file changed, 101 insertions(+), 87 deletions(-)
> > > 
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 9d749a1420b4..466623ea8c36 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > >  	struct page *page;
> > >  	int isolated = 0;
> > >  	struct vm_area_struct *vma = walk->vma;
> > > +	struct task_struct *task = walk->private;
> > >  	unsigned long next;
> > >  
> > > -	if (fatal_signal_pending(current))
> > > +	if (fatal_signal_pending(task))
> > >  		return -EINTR;
> > >  
> > >  	next = pmd_addr_end(addr, end);
> > > @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > >  }
> > >  
> > >  static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > > -			     struct vm_area_struct *vma,
> > > -			     unsigned long addr, unsigned long end)
> > > +				struct task_struct *task,
> > > +				struct vm_area_struct *vma,
> > > +				unsigned long addr, unsigned long end)
> > >  {
> > >  	struct mm_walk warm_walk = {
> > >  		.pmd_entry = madvise_pageout_pte_range,
> > >  		.mm = vma->vm_mm,
> > > +		.private = task,
> > >  	};
> > >  
> > >  	tlb_start_vma(tlb, vma);
> > > @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > >  }
> > >  
> > >  
> > > -static long madvise_pageout(struct vm_area_struct *vma,
> > > -			struct vm_area_struct **prev,
> > > -			unsigned long start_addr, unsigned long end_addr)
> > > +static long madvise_pageout(struct task_struct *task,
> > > +		struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > +		unsigned long start_addr, unsigned long end_addr)
> > >  {
> > >  	struct mm_struct *mm = vma->vm_mm;
> > >  	struct mmu_gather tlb;
> > > @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
> > >  
> > >  	lru_add_drain();
> > >  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > > -	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> > > +	madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> > >  	tlb_finish_mmu(&tlb, start_addr, end_addr);
> > >  
> > >  	return 0;
> > > @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > >  	return 0;
> > >  }
> > >  
> > > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > +static long madvise_dontneed_free(struct mm_struct *mm,
> > > +				  struct vm_area_struct *vma,
> > >  				  struct vm_area_struct **prev,
> > >  				  unsigned long start, unsigned long end,
> > >  				  int behavior)
> > > @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > >  	if (!userfaultfd_remove(vma, start, end)) {
> > >  		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
> > >  
> > > -		down_read(&current->mm->mmap_sem);
> > > -		vma = find_vma(current->mm, start);
> > > +		down_read(&mm->mmap_sem);
> > > +		vma = find_vma(mm, start);
> > >  		if (!vma)
> > >  			return -ENOMEM;
> > >  		if (start < vma->vm_start) {
> > > @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > >   * Application wants to free up the pages and associated backing store.
> > >   * This is effectively punching a hole into the middle of a file.
> > >   */
> > > -static long madvise_remove(struct vm_area_struct *vma,
> > > +static long madvise_remove(struct mm_struct *mm,
> > > +				struct vm_area_struct *vma,
> > >  				struct vm_area_struct **prev,
> > >  				unsigned long start, unsigned long end)
> > >  {
> > > @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > >  	get_file(f);
> > >  	if (userfaultfd_remove(vma, start, end)) {
> > >  		/* mmap_sem was not released by userfaultfd_remove() */
> > > -		up_read(&current->mm->mmap_sem);
> > > +		up_read(&mm->mmap_sem);
> > >  	}
> > >  	error = vfs_fallocate(f,
> > >  				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > >  				offset, end - start);
> > >  	fput(f);
> > > -	down_read(&current->mm->mmap_sem);
> > > +	down_read(&mm->mmap_sem);
> > >  	return error;
> > >  }
> > >  
> > > @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> > >  #endif
> > >  
> > >  static long
> > > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> > > +		struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >  		unsigned long start, unsigned long end, int behavior)
> > >  {
> > >  	switch (behavior) {
> > >  	case MADV_REMOVE:
> > > -		return madvise_remove(vma, prev, start, end);
> > > +		return madvise_remove(mm, vma, prev, start, end);
> > >  	case MADV_WILLNEED:
> > >  		return madvise_willneed(vma, prev, start, end);
> > >  	case MADV_COLD:
> > >  		return madvise_cold(vma, prev, start, end);
> > >  	case MADV_PAGEOUT:
> > > -		return madvise_pageout(vma, prev, start, end);
> > > +		return madvise_pageout(task, vma, prev, start, end);
> > >  	case MADV_FREE:
> > >  	case MADV_DONTNEED:
> > > -		return madvise_dontneed_free(vma, prev, start, end, behavior);
> > > +		return madvise_dontneed_free(mm, vma, prev, start,
> > > +						end, behavior);
> > >  	default:
> > >  		return madvise_behavior(vma, prev, start, end, behavior);
> > >  	}
> > > @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> > >  	}
> > >  }
> > >  
> > > -/*
> > > - * The madvise(2) system call.
> > > - *
> > > - * Applications can use madvise() to advise the kernel how it should
> > > - * handle paging I/O in this VM area.  The idea is to help the kernel
> > > - * use appropriate read-ahead and caching techniques.  The information
> > > - * provided is advisory only, and can be safely disregarded by the
> > > - * kernel without affecting the correct operation of the application.
> > > - *
> > > - * behavior values:
> > > - *  MADV_NORMAL - the default behavior is to read clusters.  This
> > > - *		results in some read-ahead and read-behind.
> > > - *  MADV_RANDOM - the system should read the minimum amount of data
> > > - *		on any access, since it is unlikely that the appli-
> > > - *		cation will need more than what it asks for.
> > > - *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
> > > - *		once, so they can be aggressively read ahead, and
> > > - *		can be freed soon after they are accessed.
> > > - *  MADV_WILLNEED - the application is notifying the system to read
> > > - *		some pages ahead.
> > > - *  MADV_DONTNEED - the application is finished with the given range,
> > > - *		so the kernel can free resources associated with it.
> > > - *  MADV_FREE - the application marks pages in the given range as lazy free,
> > > - *		where actual purges are postponed until memory pressure happens.
> > > - *  MADV_REMOVE - the application wants to free up the given range of
> > > - *		pages and associated backing store.
> > > - *  MADV_DONTFORK - omit this area from child's address space when forking:
> > > - *		typically, to avoid COWing pages pinned by get_user_pages().
> > > - *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> > > - *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
> > > - *              range after a fork.
> > > - *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> > > - *  MADV_HWPOISON - trigger memory error handler as if the given memory range
> > > - *		were corrupted by unrecoverable hardware memory failure.
> > > - *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> > > - *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> > > - *		this area with pages of identical content from other such areas.
> > > - *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> > > - *  MADV_HUGEPAGE - the application wants to back the given range by transparent
> > > - *		huge pages in the future. Existing pages might be coalesced and
> > > - *		new pages might be allocated as THP.
> > > - *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > > - *		transparent huge pages so the existing pages will not be
> > > - *		coalesced into THP and new pages will not be allocated as THP.
> > > - *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > - *		from being included in its core dump.
> > > - *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > - *
> > > - * return values:
> > > - *  zero    - success
> > > - *  -EINVAL - start + len < 0, start is not page-aligned,
> > > - *		"behavior" is not a valid value, or application
> > > - *		is attempting to release locked or shared pages,
> > > - *		or the specified address range includes file, Huge TLB,
> > > - *		MAP_SHARED or VMPFNMAP range.
> > > - *  -ENOMEM - addresses in the specified range are not currently
> > > - *		mapped, or are outside the AS of the process.
> > > - *  -EIO    - an I/O error occurred while paging in data.
> > > - *  -EBADF  - map exists, but area maps something that isn't a file.
> > > - *  -EAGAIN - a kernel resource was temporarily unavailable.
> > > - */
> > > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > > +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> > > +			unsigned long start, size_t len_in, int behavior)
> > 
> > Just a minor nitpick, but can we please have it named madvise_common,
> > not madvise_core? This would follow a usual naming scheme, when some
> > common functionality is factored out (like, for mutexes, semaphores
> > etc), and within the kernel "core" usually means something completely
> > different.
> 
> Sure.
> 
> > 
> > >  {
> > >  	unsigned long end, tmp;
> > >  	struct vm_area_struct *vma, *prev;
> > > @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > >  
> > >  #ifdef CONFIG_MEMORY_FAILURE
> > >  	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > > -		return madvise_inject_error(behavior, start, start + len_in);
> > > +		return madvise_inject_error(behavior,
> > > +					start, start + len_in);
> > 
> > Not sure what this change is about except changing the line length.
> > Note, madvise_inject_error() still operates on "current" through
> > get_user_pages_fast() and gup_pgd_range(), but that was not changed
> > here. I Know you've filtered out this hint later, so technically this
> > is not an issue, but, maybe, this needs some attention too since we've
> > already spotted it?
> 
> It is leftover I had done. I actually modified it to handle remote
> task but changed my mind not to fix it because process_madvise
> will not support it at this moment. I'm not sure it's a good idea
> to change it for *might-be-done-in-future* at this moment even though
> we have spotted.

I'd expect to have at least some comments in code on why other hints
are disabled, so if we already know some shortcomings, this information
would not be lost.

Of course, I don't care much about memory poisoning, but if it can be
addressed now, let's address it now.

> 
> > 
> > >  #endif
> > >  
> > >  	write = madvise_need_mmap_write(behavior);
> > >  	if (write) {
> > > -		if (down_write_killable(&current->mm->mmap_sem))
> > > +		if (down_write_killable(&mm->mmap_sem))
> > >  			return -EINTR;
> > 
> > Do you still need that trick with mmget_still_valid() here?
> > Something like:
> 
> Since MADV_COLD|PAGEOUT doesn't change address space layout or
> vma->vm_flags, technically, we don't need it if I understand
> correctly. Right?

I'd expect so, yes. But.

Since we want this interface to be universal and to be able to cover
various needs, and since my initial intention with working in this
direction involved KSM, I'd ask you to enable KSM hints too, and once
(and if) that happens, the work there is done under write lock, and
you'll need this trick to be applied.

Of course, I can do that myself later in a subsequent patch series once
(and, again, if) your series is merged, but, maybe, we can cover this
already especially given the fact that KSM hinting is a relatively easy
task in this pile. I did some preliminary tests with it, and so far no
dragons have started to roar.

> 
> > 
> > if (current->mm != mm && !mmget_still_valid(mm))
> >    goto skip_mm;
> > 
> > and that skip_mm label would be before
> > 
> > if (write)
> >    up_write(&mm->mmap_sem);
> > 
> > below.
> > 
> > (see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Mathieu Desnoyers @ 2019-05-31 14:48 UTC (permalink / raw)
  To: Florian Weimer
  Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <875zprm4jo.fsf@oldenburg2.str.redhat.com>



----- On May 31, 2019, at 4:06 AM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers:
> 
>> I found that it's because touching a __thread variable from
>> ld-linux-x86-64.so.2 ends up setting the DF_STATIC_TLS flag
>> for that .so, which is really not expected.
>>
>> Even if I tweak the assert to make it more lenient there,
>> touching the __thread variable ends up triggering a SIGFPE.
> 
> Sorry, I got distracted at this critical juncture.  Yes, I forgot that
> there isn't TLS support in the dynamic loader today.
> 
>> So rather than touching the TLS from ld-linux-x86-64.so.2,
>> I've rather experimented with moving the rseq initialization
>> for both SHARED and !SHARED cases to a library constructor
>> within libc.so.
>>
>> Are you aware of any downside to this approach ?
> 
> The information whether the kernel supports rseq would not be available
> to IFUNC resolvers.  And in some cases, ELF constructors for application
> libraries could run before the libc.so.6 constructor, so applications
> would see a transition from lack of kernel support to kernel support.
> 
>> +static
>> +__attribute__ ((constructor))
>> +void __rseq_libc_init (void)
>> +{
>> +  rseq_init ();
>> +  /* Register rseq ABI to the kernel.   */
>> +  (void) rseq_register_current_thread ();
>> +}
> 
> I think the call to rseq_init (and the __rseq_handled variable) should
> still be part of the dynamic loader.  Otherwise there could be confusion
> about whether glibc handles the registration (due the constructor
> ordering issue).

Let's break this down into the various sub-issues involved:

1) How early do we need to setup rseq ? Should it be setup before:
   - LD_PRELOAD .so constructors ?
     - Without circular dependency,
     - With circular dependency,
   - audit libraries initialization ?
   - IFUNC resolvers ?
   - other callbacks ?
   - memory allocator calls ?

We may end up in a situation where we need memory allocation to be setup
in order to initialize TLS before rseq can be registered for the main
thread. I suspect we will end up needing a fallbacks which always work
for the few cases that would try to use rseq too early in dl/libc startup.

2) Do we need to setup __rseq_handled and __rseq_abi at the same stage of
   startup, or is it OK to setup __rseq_handled before __rseq_abi ?

3) Which shared object owns __rseq_handled and __rseq_abi ?
   - libc.so ?
   - ld-linux-*.so.2 ?
   - Should both symbols be owned by the same .so ?
   - What about the !SHARED case ? I think this would end up in libc.a in all cases.

4) Inability to touch a TLS variable (__rseq_abi) from ld-linux-*.so.2
   - Should we extend the dynamic linker to allow such TLS variable to be
     accessed ? If so, how much effort is required ?
   - Can we find an alternative way to initialize rseq early during
     dl init stages while still performing the TLS access from a function
     implemented within libc.so ?

So far, I got rseq to be initialized before LD_PRELOADed library constructors
by doing the initialization in a constructor within libc.so. I don't particularly
like this approach, because the constructor order is not guaranteed.

One possible solution would be to somehow expose a rseq initialization function
symbol from libc.so, look it up from ld-linux-*.so.2, and invoke it after libc.so
has been loaded. It would end up being similar to a constructor, but with a
fixed invocation order.

I'm just not sure we have everything we need to do this in ld-linux-*.so.2
init stages.

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: David Howells @ 2019-05-31 14:55 UTC (permalink / raw)
  To: Greg KH
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel
In-Reply-To: <20190529231112.GB3164@kroah.com>

Greg KH <gregkh@linuxfoundation.org> wrote:

> So, if that's all that needs to be fixed, can you use the same
> buffer/code if that patch is merged?

I really don't know.  The perf code is complex, partially in hardware drivers
and is tricky to understand - though a chunk of that is the "aux" buffer part;
PeterZ used words like "special" and "magic" and the comments in the code talk
about the hardware writing into the buffer.

__perf_output_begin() does not appear to be SMP safe.  It uses local_cmpxchg()
and local_add() which on x86 lack the LOCK prefix.

stracing the perf command on my test machine, it calls perf_event_open(2) four
times and mmap's each fd it gets back.  I'm guessing that each one maps a
separate buffer for each CPU.

So to use watch_queue based on perf's buffering, you would have to have a
(2^N)+1 pages-sized buffer for each CPU.  So that would be a minimum of 64K of
unswappable memory for my desktop machine, say).  Multiply that by each
process that wants to listen for events...

What I'm aiming for is something that has a single buffer used by all CPUs for
each instance of /dev/watch_queue opened and I'd also like to avoid having to
allocate the metadata page and the aux buffer to save space.  This is locked
memory and cannot be swapped.

Also, perf has to leave a gap in the ring because it uses CIRC_SPACE(), though
that's a minor detail that I guess can't be fixed now.

I'm also slightly concerned that __perf_output_begin() doesn't check if
rb->user->tail has got ahead of rb->user->head or that it's lagging too far
behind.  I doubt it's a serious problem for the kernel since it won't write
outside of the buffer, but userspace might screw up.  I think the worst that
will happen is that userspace will get confused.

One thing I would like is to waive the 2^N size requirement.  I understand
*why* we do that, but I wonder how expensive DIV instructions are for
relatively small divisors.

David

^ permalink raw reply

* Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA
From: Amir Goldstein @ 2019-05-31 15:21 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Darrick J . Wong, Dave Chinner, Chris Mason, Al Viro,
	linux-fsdevel, linux-xfs, Ext4, Linux Btrfs, Linux API
In-Reply-To: <CAOQ4uxgo5jmwQbLAKQre9=7pLQw=CwMgDaWPaJxi-5NGnPEVPQ@mail.gmail.com>

> >
> > So instead of saying "A filesystem that accepts this flag will
> > guaranty, that old inode data will not be exposed in the new linked
> > name."  It's much clearer to state this in the affirmative:
> >
> >         A filesystem which accepts this flag will guarantee that if
> >         the new pathname exists after a crash, all of the data written
> >         to the file at the time of the linkat(2) call will be visible.
> >
>
> Sounds good to me. I will take a swing at another patch.
>

So I am down to single flag documented with 3 tweets ;-)

What do you think of:

"AT_ATOMIC_DATA (since Linux 5.x)
A filesystem which accepts this flag will guarantee that if the linked file
name exists after a system crash, then all of the data written to the file
and all of the file's metadata at the time of the linkat(2) call will be
visible.

The way to achieve this guarantee on old kernels is to call fsync (2)
before linking the file, but doing so will also results in flushing of
volatile disk caches.

A filesystem which accepts this flag does NOT
guarantee that any of the file hardlinks will exist after a system crash,
nor that the last observed value of st_nlink (see stat (2)) will persist."


Thanks,
Amir.

^ permalink raw reply

* Re: [PATCH v9 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Tejun Heo @ 2019-05-31 15:35 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190515094459.10317-13-patrick.bellasi@arm.com>

Hello, Patrick.

On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> Extend the CPU controller with a couple of new attributes util.{min,max}
> which allows to enforce utilization boosting and capping for all the
> tasks in a group. Specifically:
> 
> - util.min: defines the minimum utilization which should be considered
> 	    i.e. the RUNNABLE tasks of this group will run at least at a
> 		 minimum frequency which corresponds to the util.min
> 		 utilization
> 
> - util.max: defines the maximum utilization which should be considered
> 	    i.e. the RUNNABLE tasks of this group will run up to a
> 		 maximum frequency which corresponds to the util.max
> 		 utilization

Let's please use a prefix which is more specific.  It's clamping the
utilization estimates of the member tasks which in turn affect
scheduling / frequency decisions but cpu.util.max reads like it's
gonna limit the cpu utilization directly.  Maybe just use uclamp?

> These attributes:
> 
> a) are available only for non-root nodes, both on default and legacy
>    hierarchies, while system wide clamps are defined by a generic
>    interface which does not depends on cgroups. This system wide
>    interface enforces constraints on tasks in the root node.

I'd much prefer if they weren't entangled this way.  The system wide
limits should work the same regardless of cgroup's existence.  cgroup
can put further restriction on top but mere creation of cgroups with
cpu controller enabled shouldn't take them out of the system-wide
limits.

> b) enforce effective constraints at each level of the hierarchy which
>    are a restriction of the group requests considering its parent's
>    effective constraints. Root group effective constraints are defined
>    by the system wide interface.
>    This mechanism allows each (non-root) level of the hierarchy to:
>    - request whatever clamp values it would like to get
>    - effectively get only up to the maximum amount allowed by its parent

I'll come back to this later.

> c) have higher priority than task-specific clamps, defined via
>    sched_setattr(), thus allowing to control and restrict task requests

This sounds good.

> Add two new attributes to the cpu controller to collect "requested"
> clamp values. Allow that at each non-root level of the hierarchy.
> Validate local consistency by enforcing util.min < util.max.
> Keep it simple by do not caring now about "effective" values computation
> and propagation along the hierarchy.

So, the followings are what we're doing for hierarchical protection
and limit propgations.

* Limits (high / max) default to max.  Protections (low / min) 0.  A
  new cgroup by default doesn't constrain itself further and doesn't
  have any protection.

* A limit defines the upper ceiling for the subtree.  If an ancestor
  has a limit of X, none of its descendants can have more than X.

* A protection defines the upper ceiling of protections for the
  subtree.  If an andester has a protection of X, none of its
  descendants can have more protection than X.

Note that there's no way for an ancestor to enforce protection its
descendants.  It can only allow them to claim some.  This is
intentional as the other end of the spectrum is either descendants
losing the ability to further distribute protections as they see fit.

For proportions (as opposed to weights), we use percentage rational
numbers - e.g. 38.44 for 38.44%.  I have parser and doc update commits
pending.  I'll put them on cgroup/for-5.3.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Florian Weimer @ 2019-05-31 15:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <732661684.21584.1559314109886.JavaMail.zimbra@efficios.com>

* Mathieu Desnoyers:

> Let's break this down into the various sub-issues involved:
>
> 1) How early do we need to setup rseq ? Should it be setup before:
>    - LD_PRELOAD .so constructors ?
>      - Without circular dependency,
>      - With circular dependency,
>    - audit libraries initialization ?
>    - IFUNC resolvers ?
>    - other callbacks ?
>    - memory allocator calls ?
>
> We may end up in a situation where we need memory allocation to be setup
> in order to initialize TLS before rseq can be registered for the main
> thread. I suspect we will end up needing a fallbacks which always work
> for the few cases that would try to use rseq too early in dl/libc startup.

I think the answer to that depends on whether it's okay to have an
observable transition from “no rseq kernel support” to “kernel supports
rseq”.

> 2) Do we need to setup __rseq_handled and __rseq_abi at the same stage of
>    startup, or is it OK to setup __rseq_handled before __rseq_abi ?

I think we should be able to set __rseq_handle early, even if we can
perform the rseq area registration later.  (The distinction does not
matter if the registration needs to be performed early as well.)

Setting __rseq_handle in ld.so is easy if the variable is defined in
ld.so, which is not a problem at all.

> 3) Which shared object owns __rseq_handled and __rseq_abi ?
>    - libc.so ?
>    - ld-linux-*.so.2 ?
>    - Should both symbols be owned by the same .so ?

I think we can pick whatever works, based on the requirements from (1).
It's an implementation detail (altough it currently becomes part of the
ABI for weird reasons, but the choice itself is arbitrary).

>    - What about the !SHARED case ? I think this would end up in libc.a
>    in all cases.

Correct.

> 4) Inability to touch a TLS variable (__rseq_abi) from ld-linux-*.so.2
>    - Should we extend the dynamic linker to allow such TLS variable to be
>      accessed ? If so, how much effort is required ?
>    - Can we find an alternative way to initialize rseq early during
>      dl init stages while still performing the TLS access from a function
>      implemented within libc.so ?

This is again related to the answer for (1).  There are various hacks we
could implement to make the initialization invisible (e.g., computing
the address of the variable using the equivalent of dlsym, after loading
all the initial objects and before starting relocation).  If it's not
too hard to add TLS support to ld.so, we can consider that as well.
(The allocation side should be pretty easy, relocation support it could
be more tricky.)

> So far, I got rseq to be initialized before LD_PRELOADed library
> constructors by doing the initialization in a constructor within
> libc.so. I don't particularly like this approach, because the
> constructor order is not guaranteed.

Right.

> One possible solution would be to somehow expose a rseq initialization
> function symbol from libc.so, look it up from ld-linux-*.so.2, and
> invoke it after libc.so has been loaded. It would end up being similar
> to a constructor, but with a fixed invocation order.

This would still expose lack of rseq support to IFUNC resolvers
initially.  I don't know if this is a problem (again, it comes down to
(1) above).  There is a school of thought that you can't reference
__rseq_abi from an IFUNC resolver because it needs a relocation.

Thanks,
Florian

^ permalink raw reply

* Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA
From: Theodore Ts'o @ 2019-05-31 16:41 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Darrick J . Wong, Dave Chinner, Chris Mason, Al Viro,
	linux-fsdevel, linux-xfs, Ext4, Linux Btrfs, Linux API
In-Reply-To: <CAOQ4uxgj94WR82iHE4PDGSD0UDxG5sCtr+Sv+t1sOHHmnXFYzQ@mail.gmail.com>

On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> What do you think of:
> 
> "AT_ATOMIC_DATA (since Linux 5.x)
> A filesystem which accepts this flag will guarantee that if the linked file
> name exists after a system crash, then all of the data written to the file
> and all of the file's metadata at the time of the linkat(2) call will be
> visible.

".... will be visible after the the file system is remounted".  (Never
hurts to be explicit.)

> The way to achieve this guarantee on old kernels is to call fsync (2)
> before linking the file, but doing so will also results in flushing of
> volatile disk caches.
>
> A filesystem which accepts this flag does NOT
> guarantee that any of the file hardlinks will exist after a system crash,
> nor that the last observed value of st_nlink (see stat (2)) will persist."
> 

This is I think more precise:

    This guarantee can be achieved by calling fsync(2) before linking
    the file, but there may be more performant ways to provide these
    semantics.  In particular, note that the use of the AT_ATOMIC_DATA
    flag does *not* guarantee that the new link created by linkat(2)
    will be persisted after a crash.

We should also document that a file system which does not implement
this flag MUST return EINVAL if it is passed this flag to linkat(2).

     	       	      	     	      - Ted

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Peter Zijlstra @ 2019-05-31 16:44 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening
In-Reply-To: <606.1559312412@warthog.procyon.org.uk>

On Fri, May 31, 2019 at 03:20:12PM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:

> > (and it has already been established that refcount_t doesn't work for
> > usage count scenarios)
> 
> ?
> 
> Does that mean struct kref doesn't either?

Indeed, since kref is just a pointless wrapper around refcount_t it does
not either.

The main distinction between a reference count and a usage count is that
0 means different things. For a refcount 0 means dead. For a usage count
0 is merely unused but valid.

Incrementing a 0 refcount is a serious bug -- use-after-free (and hence
refcount_t will refuse this and splat), for a usage count this is no
problem.

Now, it is sort-of possible to merge the two, by basically stating
something like: usage = refcount - 1. But that can get tricky and people
have not really liked the result much for the few times I tried.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox