Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v4 1/4] mm: introduce MADV_COLD
From: Johannes Weiner @ 2019-07-11 15:25 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190711012528.176050-2-minchan@kernel.org>

On Thu, Jul 11, 2019 at 10:25:25AM +0900, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use.  This could reduce
> workingset eviction so it ends up increasing performance.
> 
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
> 
> It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> 
> 	active file page -> inactive file LRU
> 	active anon page -> inacdtive anon LRU
> 
> Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> file LRU's head because MADV_COLD is a little bit different symantic.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage* so freeing such pages is almost zero
> overhead since we don't need to swap out and access afterward causes just
> minor fault. Thus, it would make sense to put those freeable pages in
> inactive file LRU to compete other used-once pages. It makes sense for
> implmentaion point of view, too because it's not swapbacked memory any
> longer until it would be re-dirtied. Even, it could give a bonus to make
> them be reclaimed on swapless system. However, MADV_COLD doesn't mean
> garbage so reclaiming them requires swap-out/in in the end so it's bigger
> cost. Since we have designed VM LRU aging based on cost-model, anonymous
> cold pages would be better to position inactive anon's LRU list, not file
> LRU. Furthermore, it would help to avoid unnecessary scanning if system
> doesn't have a swap device. Let's start simpler way without adding
> complexity at this moment. However, keep in mind, too that it's a caveat
> that workloads with a lot of pages cache are likely to ignore MADV_COLD
> on anonymous memory because we rarely age anonymous LRU lists.
> 
> * man-page material
> 
> MADV_COLD (since Linux x.x)
> 
> Pages in the specified regions will be treated as less-recently-accessed
> compared to pages in the system with similar access frequencies.
> In contrast to MADV_FREE, the contents of the region are preserved
> regardless of subsequent writes to pages.
> 
> MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
> pages.
> 
> * v2
>  * add up the warn with lots of page cache workload - mhocko
>  * add man page stuff - dave
> 
> * v1
>  * remove page_mapcount filter - hannes, mhocko
>  * remove idle page handling - joelaf
> 
> * RFCv2
>  * add more description - mhocko
> 
> * RFCv1
>  * renaming from MADV_COOL to MADV_COLD - hannes
> 
> * internal review
>  * use clear_page_youn in deactivate_page - joelaf
>  * Revise the description - surenb
>  * Renaming from MADV_WARM to MADV_COOL - surenb
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply

* Re: [PATCH v4 3/4] mm: account nr_isolated_xxx in [isolate|putback]_lru_page
From: Johannes Weiner @ 2019-07-11 18:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190711012528.176050-4-minchan@kernel.org>

On Thu, Jul 11, 2019 at 10:25:27AM +0900, Minchan Kim wrote:
> The isolate counting is pecpu counter so it would be not huge gain
> to work them by batch. Rather than complicating to make them batch,
> let's make it more stright-foward via adding the counting logic
> into [isolate|putback]_lru_page API.
> 
> * v1
>  * fix accounting bug - Hillf
> 
> Link: http://lkml.kernel.org/r/20190531165927.GA20067@cmpxchg.org
> Acked-by: Michal Hocko <mhocko@suse.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

This is tricky to review, but fwiw it looks correct to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply

* Re: [PATCH v4 4/4] mm: introduce MADV_PAGEOUT
From: Johannes Weiner @ 2019-07-11 18:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190711012528.176050-5-minchan@kernel.org>

On Thu, Jul 11, 2019 at 10:25:28AM +0900, Minchan Kim wrote:
> @@ -480,6 +482,198 @@ static long madvise_cold(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	struct mmu_gather *tlb = walk->private;
> +	struct mm_struct *mm = tlb->mm;
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	LIST_HEAD(page_list);
> +	struct page *page;
> +	unsigned long next;
> +
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		pmd_t orig_pmd;
> +
> +		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		orig_pmd = *pmd;
> +		if (is_huge_zero_pmd(orig_pmd))
> +			goto huge_unlock;
> +
> +		if (unlikely(!pmd_present(orig_pmd))) {
> +			VM_BUG_ON(thp_migration_supported() &&
> +					!is_pmd_migration_entry(orig_pmd));
> +			goto huge_unlock;
> +		}
> +
> +		page = pmd_page(orig_pmd);
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			if (page_mapcount(page) != 1)
> +				goto huge_unlock;
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		if (isolate_lru_page(page))
> +			goto huge_unlock;
> +
> +		if (pmd_young(orig_pmd)) {
> +			pmdp_invalidate(vma, addr, pmd);
> +			orig_pmd = pmd_mkold(orig_pmd);
> +
> +			set_pmd_at(mm, addr, pmd, orig_pmd);
> +			tlb_remove_tlb_entry(tlb, pmd, addr);
> +		}
> +
> +		ClearPageReferenced(page);
> +		test_and_clear_page_young(page);
> +		list_add(&page->lru, &page_list);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		reclaim_pages(&page_list);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +regular_page:
> +	tlb_change_page_size(tlb, PAGE_SIZE);
> +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	flush_tlb_batched_pending(mm);
> +	arch_enter_lazy_mmu_mode();
> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		/*
> +		 * creating a THP page is expensive so split it only if we
> +		 * are sure it's worth. Split it if we are only owner.
> +		 */
> +		if (PageTransCompound(page)) {
> +			if (page_mapcount(page) != 1)
> +				break;
> +			get_page(page);
> +			if (!trylock_page(page)) {
> +				put_page(page);
> +				break;
> +			}
> +			pte_unmap_unlock(orig_pte, ptl);
> +			if (split_huge_page(page)) {
> +				unlock_page(page);
> +				put_page(page);
> +				pte_offset_map_lock(mm, pmd, addr, &ptl);
> +				break;
> +			}
> +			unlock_page(page);
> +			put_page(page);
> +			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +			pte--;
> +			addr -= PAGE_SIZE;
> +			continue;
> +		}
> +
> +		VM_BUG_ON_PAGE(PageTransCompound(page), page);
> +
> +		if (isolate_lru_page(page))
> +			continue;
> +
> +		if (pte_young(ptent)) {
> +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> +							tlb->fullmm);
> +			ptent = pte_mkold(ptent);
> +			set_pte_at(mm, addr, pte, ptent);
> +			tlb_remove_tlb_entry(tlb, pte, addr);
> +		}
> +		ClearPageReferenced(page);
> +		test_and_clear_page_young(page);
> +		list_add(&page->lru, &page_list);
> +	}
> +
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(orig_pte, ptl);
> +	reclaim_pages(&page_list);
> +	cond_resched();
> +
> +	return 0;
> +}

I know you have briefly talked about code sharing already.

While I agree that sharing with MADV_FREE is maybe a stretch, I
applied these patches and compared the pageout and the cold page table
functions, and they are line for line the same EXCEPT for 2-3 lines at
the very end, where one reclaims and the other deactivates. It would
be good to share here, it shouldn't be hard or result in fragile code.

Something like int madvise_cold_or_pageout_range(..., bool pageout)?

^ permalink raw reply

* On
From: Carlo Wood @ 2019-07-11 23:42 UTC (permalink / raw)
  To: wangyun
  Cc: palewis, linux-kernel, linux-api, viro, akpm, jbaron, pholland,
	davidel, mtk.manpages, paulmck, ncardwell

Hello everyone,

this is a follow up to a discussion from 2012 about EPOLL_CTL_DISABLE,
which ended up NOT being accepted in the kernel.

I believe however that this issue wasn't addressed.
Michael Kerrisk did an excellent job writing a summary of the problem:
https://lkml.org/lkml/2012/10/16/302 (the first four listed points).

In the past two years I've been working on "yet another" event driven
I/O library, exclusively for linux however, with the general aim to
be highly scalable for many cores. The I/O part (around epoll) is only
part of that library - but it is what I spent the last two years on.

I ran independently into the same problem as described by Paton Lewis
and only afterwards, when I couldn't think of a working solution and
started to do heavy research, found related articles about the problem.

Reading through the thread on lkml.org that given above, it still
isn't clear to me that this would be a non-issue that can be solved
in user space (without severe (unnecessary) performance penalties).

Let me summarize the problem in my own way (probably less general,
but more related to my personal case):

1. The epoll interest list stores a pointer in epoll_event::data.ptr
   that points to a specific file descriptor related object, which
   indeed can be seen as a user-space cache of data related to that
   file descriptor. In my case this includes one mutex that protects
   an unsigned int whose bits keep track of things like whether or
   not that object (pointer) is in the epoll interest list (was added
   with epoll_ctl) and which events are being watched specifically,
   as well as whether or not the fd was closed (after removing it
   from the interest list with EPOLL_CTL_DEL) etc.

2. At least one thread (only one, in my case) calls epoll_wait
   (epoll_pwait) in a loop. I use no timeout because that seems too
   inaccurate, my timers are signal based. When epoll_wait returns, the
   pointer epoll_event::data.ptr is used to call a member function on
   the dereferenced FileDescriptor class. For non-C++ coders, that
   means a function is passed as first parameter the value of
   epoll_event::data.ptr (the 'this' pointer) which is subsequently
   dereferenced inside that function. The exact function depends on the
   actual event. In most cases the handling of the event is passed on
   to a thread pool queue for handling by another thread - but before
   that happens the pointer was already dereferenced (which, among
   others, allows me to increment a reference count for that object).
   This obviously means that epoll_event::data.ptr may never point to
   freed data.

3. As almost all processing happens by worker threads of a thread
   pool - it will be some other thread that decides that an object
   is done and needs to be freed. This thread first removes the object
   from the epoll interest list, and then deletes the object.
   Unfortunately this does not work -- and apart from adding a delay
   before the object is really deleted (which isn't a real solution
   as several of you already pointed out in 2012), I do not see any
   possible alternative.

The reason it can't work is because the Event Loop Thread (that loops
around epoll_wait) may already be in the process of returning from
epoll_wait, lets say -- it already returned from epoll_wait, but
wasn't able to execute ANY code following it. And because this is
ALWAYS possible there is NEVER a safe moment to delete the object.

The scenario is, for example:

   Event Loop Thread                           Worker thread

   Returns from epoll_wait() passing
   back to user space a data.ptr pointing
   to a user allocated object.

                                               Removes the object
                                               from the epoll interest
                                               list with EPOLL_CTL_DEL

                                               <arbitrary delay here,
                                                for example, no delay>

                                               Free the object pointed
                                               to by data.ptr.

   Dereferences data.ptr and crashes.


The solution proposed by Andy Lutomirski
(https://lkml.org/lkml/2012/10/18/434) does not work here:
In order use RCU the ptr must be removed from the protected
"list" before the grace period is started, which must start
before a read-side critical area ends. But the "list" here
is the epoll interest list - and removing it from that list
requires the call to epoll_ctl(..., EPOLL_CTL_DEL, ...) to 
finish. In other words, the Worker thread is the RCU "updater"
and the "arbitrary delay" must be the rcu grace period.
No problems there. The event loop thread however must call
rcu_read_lock() before accessing the epoll_event structure,
which is not possible because that happens inside epoll_wait(),
which doesn't provide a hook to add such call.

As far as the Worker Thread is concerned there are no readers,
and the grace period can finish instantly, simply because
there is no way to register that data.ptr was already copied.

If one tries to begin a read-side critical area after epoll_wait()
returns, then that won't work: in that case you should not
be able to access that ptr when it was already removed from
the interest list. The only way that RCU would work here is
when a reader subscribes *before* the kernel copies the
corresponding epoll_event structure to user space, in a way
that that will never happen when the EPOLL_CTL_DEL finished
before it got to that point.


I believe that the only safe solution is to let the Event Loop
Thread do the deleting. So, if all else fails I'll have to add
objects that a Worker Thread thinks need to be deleted to a
FIFO that is processed by the Event Loop Thread before entering
epoll_wait(). But that is a lot of extra code for something
that could be achieved with just a small change to epoll:


I propose to add a new EPOLL event EPOLLCLOSED that will cause
epoll_wait to return (for that event) whenever a file descriptor is
closed.

The Worker Thread then does not remove an object from the
interest list, but either adds (if it was removed before) or
modifies the event (using EPOLL_CTL_MOD) to watch that fd
for closing, and then closes it.

Aka,

  Working Thread:

  epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
  close(fd);

where event contains the new EPOLLCLOSED (compare EPOLLOUT, EPOLLIN
etc).

This must then guarantee the event EPOLLCLOSED to be reported
by exactly one epoll_wait(), the caller thread of which can then
proceed with deleting the resources.

Note that close(fd) must cause the removal from the interest list
of any epoll struct before causing the event - and that the
EPOLLCLOSED event may only be reported after all other events
for that fd have already been reported (although it would be
ok to report them at the same time: simply handle the other
events first).

I am not sure how this will work when more than one thread
calls epoll_wait and more than one watch the same fd: in
that case it is less trivial because then it seems always
possible that the EPOLLCLOSED event will be reported before
another event in another thread.


What are your thoughts? Is the addition of EPOLLCLOSED
event feasible?

-- 
Carlo Wood <carlo@alinoe.com>

^ permalink raw reply

* Re: On
From: Andy Lutomirski @ 2019-07-12  0:32 UTC (permalink / raw)
  To: Carlo Wood
  Cc: wangyun, palewis, LKML, Linux API, Al Viro, Andrew Morton, jbaron,
	pholland, Davide Libenzi, Michael Kerrisk, Paul E. McKenney,
	Neal Cardwell
In-Reply-To: <20190712014223.66326995@hikaru>

On Thu, Jul 11, 2019 at 5:01 PM Carlo Wood <carlo@alinoe.com> wrote:
>
> I believe that the only safe solution is to let the Event Loop
> Thread do the deleting. So, if all else fails I'll have to add
> objects that a Worker Thread thinks need to be deleted to a
> FIFO that is processed by the Event Loop Thread before entering
> epoll_wait(). But that is a lot of extra code for something
> that could be achieved with just a small change to epoll:

This doesn't seem so bad at all.

>
>
> I propose to add a new EPOLL event EPOLLCLOSED that will cause
> epoll_wait to return (for that event) whenever a file descriptor is
> closed.

This totally falls apart if you ever want to add a feature to your
library to detach the handler for a given fd without closing the fd.

>
> The Worker Thread then does not remove an object from the
> interest list, but either adds (if it was removed before) or
> modifies the event (using EPOLL_CTL_MOD) to watch that fd
> for closing, and then closes it.
>
> Aka,
>
>   Working Thread:
>
>   epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
>   close(fd);
>
> where event contains the new EPOLLCLOSED (compare EPOLLOUT, EPOLLIN
> etc).
>
> This must then guarantee the event EPOLLCLOSED to be reported
> by exactly one epoll_wait(), the caller thread of which can then
> proceed with deleting the resources.
>
> Note that close(fd) must cause the removal from the interest list
> of any epoll struct before causing the event - and that the
> EPOLLCLOSED event may only be reported after all other events
> for that fd have already been reported (although it would be
> ok to report them at the same time: simply handle the other
> events first).

This is a bunch of subtle semantics in the kernel to support your
particular use case.

>
> I am not sure how this will work when more than one thread
> calls epoll_wait and more than one watch the same fd: in
> that case it is less trivial because then it seems always
> possible that the EPOLLCLOSED event will be reported before
> another event in another thread.

But this case is fairly straightforward with the user mode approach --
for example, add it to the list for all threads calling epoll_wait.
Or otherwise defer the deletion until all epoll_wait threads have
woken.

^ permalink raw reply

* Re: Is a new EPOLLCLOSED a solution to the problem that EPOLL_CTL_DISABLE tried to solve?
From: Carlo Wood @ 2019-07-12  1:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: wangyun, palewis, LKML, Linux API, Al Viro, Andrew Morton, jbaron,
	pholland, Davide Libenzi, Michael Kerrisk, Paul E. McKenney,
	Neal Cardwell, carlo
In-Reply-To: <CALCETrWkmaB5K4AR0R6CYcdq8mJwKtbWbmYEq6oxeaoqdA3ZWA@mail.gmail.com>

Hi Andy,

thank you for you quick reply.

On Thu, 11 Jul 2019 17:32:21 -0700
Andy Lutomirski <luto@kernel.org> wrote:

> > I propose to add a new EPOLL event EPOLLCLOSED that will cause
> > epoll_wait to return (for that event) whenever a file descriptor is
> > closed.  
> 
> This totally falls apart if you ever want to add a feature to your
> library to detach the handler for a given fd without closing the fd.

Another way to cause epoll_wait() to wake up for that specific fd is
okay too, of course.

For example, since the new event basically will mean "resources
can now be deleted", the event could be called EPOLLDELETE.
It is just needed to have some easy way to trigger this event.

Nevertheless, in the more exceptional case that one wants to destroy
the object/rss that data.ptr points to without closing the fd it is
probably possible to first dup the fd and then close the old one.

> > The Worker Thread then does not remove an object from the
> > interest list, but either adds (if it was removed before) or
> > modifies the event (using EPOLL_CTL_MOD) to watch that fd
> > for closing, and then closes it.
> >
> > Aka,
> >
> >   Working Thread:
> >
> >   epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
> >   close(fd);
> >
> > where event contains the new EPOLLCLOSED (compare EPOLLOUT, EPOLLIN
> > etc).
> >
> > This must then guarantee the event EPOLLCLOSED to be reported
> > by exactly one epoll_wait(), the caller thread of which can then
> > proceed with deleting the resources.
> >
> > Note that close(fd) must cause the removal from the interest list
> > of any epoll struct before causing the event - and that the
> > EPOLLCLOSED event may only be reported after all other events
> > for that fd have already been reported (although it would be
> > ok to report them at the same time: simply handle the other
> > events first).  
> 
> This is a bunch of subtle semantics in the kernel to support your
> particular use case.

My particular use case? How so? The problem I'm trying to address
is the fact that "It is not currently possible to reliably delete epoll
items when using the same epoll set from multiple threads", end quote
of Paton Lewis' email from 2012.

If there is a simple, generally accepted solution for this in user-space
then of course there is no reason to change the kernel; but despite
all my efforts to research the net for a solution for this, all can
find are people with the same question and no good answers.

If there was a way to pass a special event to the thread waiting in
epoll_wait() that it now is safe to free the memory that data.ptr is
pointing to, then problem would evaporate to something trivally simple.

Lets say we would not be using close(2), but instead some
epoll_destruct(epoll_fd, fd). Then the worker thread, instead of,

  if (last reference to object has gone)
  {
    epoll_ctl(epoll_fd, EPOLL_CTL_DEL, object->fd, NULL);
    delete object;  // Unsafe
  }

could do,

  if (last reference to object has gone)
    epoll_destuct(epoll_fd, object->fd);
        // Or [optionally dup() and] close(object->fd);

Whereas the thread that waits for epoll_fd would take care
of the deletion:

  for (;;)
  {
    int ready = epoll_wait(epoll_fd, s_events, maxevents, -1);
    while (ready > 0)
    {
      epoll_event& event(s_events[--ready]);
      if ((event.events & EPOLLDELETE)) // Or EPOLLCLOSED, or
                                        // whatever the name is.
      {
        delete event.data.ptr;
        break;
      }
      // Handle other events.
    }
  }

In this case, if epoll_wait() had returned just prior to
the call to epoll_destruct()/close(), the object will not be deleted;
The returned events would be handled, epoll_wait() reentered,
and only once EPOLLDELETE is returned the object would be deleted.

The bunch of subtle requirements as you call it is just about how
to implement this in a way that it will do what it is supposed to do,
and in no way specific for my particular use case.

The requirements are, namely:

1. Only one epoll_fd may have added an epoll_event structure with
   a pointer to the resource (if more than one need to point to it,
   you could add a pointer to a smart pointer to the object to
   the epoll_event structure instead). I'd be surprised if this
   is not ALREADY a required for normal implementations.

2. The call to epoll_destruct() (as introduced as example above)
   must remove the fd from the epoll's interest list. Of course
   it doesn't HAVE to - but then the user MUST call epoll_ctl(epoll_fd,
   EPOLL_CTL_DEL, ...) right before it, so why not? The reason
   I opted close(2) is because that ALREADY has this behavior,
   hence it seems a good candidate for epoll_destruct.

3. After an EPOLLDELETE (formly EPOLLCLOSED) has been returned by
   an epoll_wait() no other events may be returned for that fd.
   This is obvious, and should be easy to implement. I just added it for
   completeness ;).
 
> But this case is fairly straightforward with the user mode approach --
> for example, add it to the list for all threads calling epoll_wait.
> Or otherwise defer the deletion until all epoll_wait threads have
> woken.

That really seems a hell of lot more complex (involving mutexes and
updating a queue that might grow till unknown sizes, hence requiring
possibly calls to malloc) then my proposed solution, for something that
basically every application that uses epoll needs.

-- 
Carlo Wood <carlo@alinoe.com>

^ permalink raw reply

* Re: [PATCH v9 01/10] namei: obey trailing magic-link DAC permissions
From: Al Viro @ 2019-07-12  4:14 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Andy Lutomirski, Christian Brauner,
	Eric Biederman, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha
In-Reply-To: <20190706145737.5299-2-cyphar@cyphar.com>

On Sun, Jul 07, 2019 at 12:57:28AM +1000, Aleksa Sarai wrote:

> @@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
>  	p->stack = p->internal;
>  	p->dfd = dfd;
>  	p->name = name;
> -	p->total_link_count = old ? old->total_link_count : 0;
> +	p->total_link_count = 0;
> +	p->acc_mode = 0;
> +	p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
> +	if (old) {
> +		p->total_link_count = old->total_link_count;
> +		p->acc_mode = old->acc_mode;
> +		p->opath_mask = old->opath_mask;
> +	}

Huh?  Could somebody explain why traversals of NFS4 referrals should inherit
->acc_mode and ->opath_mask?

>  static __always_inline
> -const char *get_link(struct nameidata *nd)
> +const char *get_link(struct nameidata *nd, bool trailing)
>  {
>  	struct saved *last = nd->stack + nd->depth - 1;
>  	struct dentry *dentry = last->link.dentry;
> @@ -1081,6 +1134,44 @@ const char *get_link(struct nameidata *nd)
>  		} else {
>  			res = get(dentry, inode, &last->done);
>  		}
> +		/* If we just jumped it was because of a magic-link. */
> +		if (unlikely(nd->flags & LOOKUP_JUMPED)) {

That's not quite guaranteed (it is possible to bind a symlink on top
of a regular file, and you will get LOOKUP_JUMPED on the entry into
trailing_symlink() when looking the result up).  Moreover, why bother
with LOOKUP_JUMPED here?  See that
	nd->last_type = LAST_BIND;
several lines prior?  That's precisely to be able to recognize those
suckers.

And _that_ would've avoided another piece of ugliness - your LOOKUP_JUMPED
kludge forces you to handle that cra^Wsclero^Wvaluable security hardening
in get_link(), instead of trailing_symlink() where you apparently want
it to be.  Simply because nd_jump_root() done later in get_link() will set
LOOKUP_JUMPED for absolute symlinks, confusing your test.

Moreover, I'm not sure that trailing_symlink() is the right place for
that either - I would be rather tempted to fold do_o_path() into
path_openat(), inline path_lookupat() there (as in
        s = path_init(nd, flags);

        while (!(error = link_path_walk(s, nd))
                && ((error = lookup_last(nd)) > 0)) {
                s = trailing_symlink(nd);
        }
        if (!error)
                error = complete_walk(nd);
        if (!error && nd->flags & LOOKUP_DIRECTORY)
                if (!d_can_lookup(nd->path.dentry))
                        error = -ENOTDIR;
        if (!error) {
                audit_inode(nd->name, nd->path.dentry, 0);
                error = vfs_open(&nd->path, file);
        }
        terminate_walk(nd);
- we don't need LOOKUP_DOWN there) and then we only care about the
two callers of trailing_symlink() that are in path_openat().  Which
is where you have your ->acc_mode and ->opath_mask without the need
to dump them into nameidata.  Or to bring that mess into the
things like stat(2) et.al. - it simply doesn't belong there.

In any case, this "bool trailing" is completely wrong; whether that
check belongs in trailing_symlink() or (some of) its callers, putting
it into get_link() is a mistake, forced by kludgy check for procfs-style
symlinks.

^ permalink raw reply

* Re: [PATCH v9 04/10] namei: split out nd->dfd handling to dirfd_path_init
From: Al Viro @ 2019-07-12  4:20 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Christian Brauner, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha
In-Reply-To: <20190706145737.5299-5-cyphar@cyphar.com>

On Sun, Jul 07, 2019 at 12:57:31AM +1000, Aleksa Sarai wrote:
> Previously, path_init's handling of *at(dfd, ...) was only done once,
> but with LOOKUP_BENEATH (and LOOKUP_IN_ROOT) we have to parse the
> initial nd->path at different times (before or after absolute path
> handling) depending on whether we have been asked to scope resolution
> within a root.

>  	if (*s == '/') {
> -		set_root(nd);
> -		if (likely(!nd_jump_root(nd)))
> -			return s;
> -		return ERR_PTR(-ECHILD);

> +		if (likely(!nd->root.mnt))
> +			set_root(nd);

How can we get there with non-NULL nd->root.mnt, when LOOKUP_ROOT case
has been already handled by that point?

> +		error = nd_jump_root(nd);
> +		if (unlikely(error))
> +			s = ERR_PTR(error);

^ permalink raw reply

* Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags
From: Al Viro @ 2019-07-12  4:33 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: linux-ia64, linux-sh, Alexei Starovoitov, linux-kernel,
	David Howells, linux-kselftest, sparclinux, Shuah Khan,
	linux-arch, linux-s390, Tycho Andersen, Aleksa Sarai,
	linux-arm-kernel, linux-mips, linux-xtensa, Kees Cook,
	Arnd Bergmann, Jann Horn, linuxppc-dev, linux-m68k,
	Andy Lutomirski, Shuah Khan, David Drysdale, Christian Brauner,
	J. Bruce Fields, linux-parisc, linux-api
In-Reply-To: <20190706145737.5299-6-cyphar@cyphar.com>

On Sun, Jul 07, 2019 at 12:57:32AM +1000, Aleksa Sarai wrote:

> @@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  	struct inode *inode = nd->inode;
>  
>  	while (1) {
> -		if (path_equal(&nd->path, &nd->root))
> +		if (path_equal(&nd->path, &nd->root)) {
> +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> +				return -EXDEV;

> @@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  				return -ECHILD;
>  			if (&mparent->mnt == nd->path.mnt)
>  				break;
> +			if (unlikely(nd->flags & LOOKUP_XDEV))
> +				return -EXDEV;
>  			/* we know that mountpoint was pinned */
>  			nd->path.dentry = mountpoint;
>  			nd->path.mnt = &mparent->mnt;
> @@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  			return -ECHILD;
>  		if (!mounted)
>  			break;
> +		if (unlikely(nd->flags & LOOKUP_XDEV))
> +			return -EXDEV;

Are you sure these failure exits in follow_dotdot_rcu() won't give
suprious hard errors?

> +	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
> +		error = dirfd_path_init(nd);
> +		if (unlikely(error))
> +			return ERR_PTR(error);
> +		nd->root = nd->path;
> +		if (!(nd->flags & LOOKUP_RCU))
> +			path_get(&nd->root);
> +	}
>  	if (*s == '/') {
>  		if (likely(!nd->root.mnt))
>  			set_root(nd);
> @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  			s = ERR_PTR(error);
>  		return s;
>  	}
> -	error = dirfd_path_init(nd);
> -	if (unlikely(error))
> -		return ERR_PTR(error);
> +	if (likely(!nd->path.mnt)) {

Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?

^ permalink raw reply

* Re: [PATCH v4 4/4] mm: introduce MADV_PAGEOUT
From: Minchan Kim @ 2019-07-12  5:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190711184223.GD20341@cmpxchg.org>

Hi Johannes,

On Thu, Jul 11, 2019 at 02:42:23PM -0400, Johannes Weiner wrote:
> On Thu, Jul 11, 2019 at 10:25:28AM +0900, Minchan Kim wrote:
> > @@ -480,6 +482,198 @@ static long madvise_cold(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> > +	struct mmu_gather *tlb = walk->private;
> > +	struct mm_struct *mm = tlb->mm;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	pte_t *orig_pte, *pte, ptent;
> > +	spinlock_t *ptl;
> > +	LIST_HEAD(page_list);
> > +	struct page *page;
> > +	unsigned long next;
> > +
> > +	if (fatal_signal_pending(current))
> > +		return -EINTR;
> > +
> > +	next = pmd_addr_end(addr, end);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		pmd_t orig_pmd;
> > +
> > +		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > +		if (!ptl)
> > +			return 0;
> > +
> > +		orig_pmd = *pmd;
> > +		if (is_huge_zero_pmd(orig_pmd))
> > +			goto huge_unlock;
> > +
> > +		if (unlikely(!pmd_present(orig_pmd))) {
> > +			VM_BUG_ON(thp_migration_supported() &&
> > +					!is_pmd_migration_entry(orig_pmd));
> > +			goto huge_unlock;
> > +		}
> > +
> > +		page = pmd_page(orig_pmd);
> > +		if (next - addr != HPAGE_PMD_SIZE) {
> > +			int err;
> > +
> > +			if (page_mapcount(page) != 1)
> > +				goto huge_unlock;
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			lock_page(page);
> > +			err = split_huge_page(page);
> > +			unlock_page(page);
> > +			put_page(page);
> > +			if (!err)
> > +				goto regular_page;
> > +			return 0;
> > +		}
> > +
> > +		if (isolate_lru_page(page))
> > +			goto huge_unlock;
> > +
> > +		if (pmd_young(orig_pmd)) {
> > +			pmdp_invalidate(vma, addr, pmd);
> > +			orig_pmd = pmd_mkold(orig_pmd);
> > +
> > +			set_pmd_at(mm, addr, pmd, orig_pmd);
> > +			tlb_remove_tlb_entry(tlb, pmd, addr);
> > +		}
> > +
> > +		ClearPageReferenced(page);
> > +		test_and_clear_page_young(page);
> > +		list_add(&page->lru, &page_list);
> > +huge_unlock:
> > +		spin_unlock(ptl);
> > +		reclaim_pages(&page_list);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +regular_page:
> > +	tlb_change_page_size(tlb, PAGE_SIZE);
> > +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	flush_tlb_batched_pending(mm);
> > +	arch_enter_lazy_mmu_mode();
> > +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		/*
> > +		 * creating a THP page is expensive so split it only if we
> > +		 * are sure it's worth. Split it if we are only owner.
> > +		 */
> > +		if (PageTransCompound(page)) {
> > +			if (page_mapcount(page) != 1)
> > +				break;
> > +			get_page(page);
> > +			if (!trylock_page(page)) {
> > +				put_page(page);
> > +				break;
> > +			}
> > +			pte_unmap_unlock(orig_pte, ptl);
> > +			if (split_huge_page(page)) {
> > +				unlock_page(page);
> > +				put_page(page);
> > +				pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +				break;
> > +			}
> > +			unlock_page(page);
> > +			put_page(page);
> > +			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +			pte--;
> > +			addr -= PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		VM_BUG_ON_PAGE(PageTransCompound(page), page);
> > +
> > +		if (isolate_lru_page(page))
> > +			continue;
> > +
> > +		if (pte_young(ptent)) {
> > +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> > +							tlb->fullmm);
> > +			ptent = pte_mkold(ptent);
> > +			set_pte_at(mm, addr, pte, ptent);
> > +			tlb_remove_tlb_entry(tlb, pte, addr);
> > +		}
> > +		ClearPageReferenced(page);
> > +		test_and_clear_page_young(page);
> > +		list_add(&page->lru, &page_list);
> > +	}
> > +
> > +	arch_leave_lazy_mmu_mode();
> > +	pte_unmap_unlock(orig_pte, ptl);
> > +	reclaim_pages(&page_list);
> > +	cond_resched();
> > +
> > +	return 0;
> > +}
> 
> I know you have briefly talked about code sharing already.
> 
> While I agree that sharing with MADV_FREE is maybe a stretch, I
> applied these patches and compared the pageout and the cold page table
> functions, and they are line for line the same EXCEPT for 2-3 lines at
> the very end, where one reclaims and the other deactivates. It would
> be good to share here, it shouldn't be hard or result in fragile code.

Fair enough if we leave MADV_FREE.

> 
> Something like int madvise_cold_or_pageout_range(..., bool pageout)?

How about this?

>From 41592f23e876ec21e49dc3c76dc89538e2bb16be Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 12 Jul 2019 14:05:36 +0900
Subject: [PATCH] mm: factor out common parts between MADV_COLD and
 MADV_PAGEOUT

There are many common parts between MADV_COLD and MADV_PAGEOUT.
This patch factor them out to save code duplication.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 201 +++++++++++++--------------------------------------
 1 file changed, 52 insertions(+), 149 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index bc2f0138982e..3d3d14517cc8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -30,6 +30,11 @@
 
 #include "internal.h"
 
+struct madvise_walk_private {
+	struct mmu_gather *tlb;
+	bool pageout;
+};
+
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
  * take mmap_sem for writing. Others, which simply traverse vmas, need
@@ -310,16 +315,23 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
-static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
-				unsigned long end, struct mm_walk *walk)
+static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
+				unsigned long addr, unsigned long end,
+				struct mm_walk *walk)
 {
-	struct mmu_gather *tlb = walk->private;
+	struct madvise_walk_private *private = walk->private;
+	struct mmu_gather *tlb = private->tlb;
+	bool pageout = private->pageout;
 	struct mm_struct *mm = tlb->mm;
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *orig_pte, *pte, ptent;
 	spinlock_t *ptl;
-	struct page *page;
 	unsigned long next;
+	struct page *page = NULL;
+	LIST_HEAD(page_list);
+
+	if (fatal_signal_pending(current))
+		return -EINTR;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -358,6 +370,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 			return 0;
 		}
 
+		if (pageout) {
+			if (isolate_lru_page(page))
+				goto huge_unlock;
+			list_add(&page->lru, &page_list);
+		}
+
 		if (pmd_young(orig_pmd)) {
 			pmdp_invalidate(vma, addr, pmd);
 			orig_pmd = pmd_mkold(orig_pmd);
@@ -366,10 +384,14 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 		}
 
+		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
-		deactivate_page(page);
 huge_unlock:
 		spin_unlock(ptl);
+		if (pageout)
+			reclaim_pages(&page_list);
+		else
+			deactivate_page(page);
 		return 0;
 	}
 
@@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
+		if (pageout) {
+			if (isolate_lru_page(page))
+				continue;
+			list_add(&page->lru, &page_list);
+		}
+
 		if (pte_young(ptent)) {
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
@@ -437,12 +465,16 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 		 * As a side effect, it makes confuse idle-page tracking
 		 * because they will miss recent referenced history.
 		 */
+		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
-		deactivate_page(page);
+		if (!pageout)
+			deactivate_page(page);
 	}
 
 	arch_enter_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
+	if (pageout)
+		reclaim_pages(&page_list);
 	cond_resched();
 
 	return 0;
@@ -452,10 +484,15 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end)
 {
+	struct madvise_walk_private walk_private = {
+		.tlb = tlb,
+		.pageout = false,
+	};
+
 	struct mm_walk cold_walk = {
-		.pmd_entry = madvise_cold_pte_range,
+		.pmd_entry = madvise_cold_or_pageout_pte_range,
 		.mm = vma->vm_mm,
-		.private = tlb,
+		.private = &walk_private,
 	};
 
 	tlb_start_vma(tlb, vma);
@@ -482,153 +519,19 @@ static long madvise_cold(struct vm_area_struct *vma,
 	return 0;
 }
 
-static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
-				unsigned long end, struct mm_walk *walk)
-{
-	struct mmu_gather *tlb = walk->private;
-	struct mm_struct *mm = tlb->mm;
-	struct vm_area_struct *vma = walk->vma;
-	pte_t *orig_pte, *pte, ptent;
-	spinlock_t *ptl;
-	LIST_HEAD(page_list);
-	struct page *page;
-	unsigned long next;
-
-	if (fatal_signal_pending(current))
-		return -EINTR;
-
-	next = pmd_addr_end(addr, end);
-	if (pmd_trans_huge(*pmd)) {
-		pmd_t orig_pmd;
-
-		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
-		ptl = pmd_trans_huge_lock(pmd, vma);
-		if (!ptl)
-			return 0;
-
-		orig_pmd = *pmd;
-		if (is_huge_zero_pmd(orig_pmd))
-			goto huge_unlock;
-
-		if (unlikely(!pmd_present(orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-					!is_pmd_migration_entry(orig_pmd));
-			goto huge_unlock;
-		}
-
-		page = pmd_page(orig_pmd);
-		if (next - addr != HPAGE_PMD_SIZE) {
-			int err;
-
-			if (page_mapcount(page) != 1)
-				goto huge_unlock;
-			get_page(page);
-			spin_unlock(ptl);
-			lock_page(page);
-			err = split_huge_page(page);
-			unlock_page(page);
-			put_page(page);
-			if (!err)
-				goto regular_page;
-			return 0;
-		}
-
-		if (isolate_lru_page(page))
-			goto huge_unlock;
-
-		if (pmd_young(orig_pmd)) {
-			pmdp_invalidate(vma, addr, pmd);
-			orig_pmd = pmd_mkold(orig_pmd);
-
-			set_pmd_at(mm, addr, pmd, orig_pmd);
-			tlb_remove_tlb_entry(tlb, pmd, addr);
-		}
-
-		ClearPageReferenced(page);
-		test_and_clear_page_young(page);
-		list_add(&page->lru, &page_list);
-huge_unlock:
-		spin_unlock(ptl);
-		reclaim_pages(&page_list);
-		return 0;
-	}
-
-	if (pmd_trans_unstable(pmd))
-		return 0;
-regular_page:
-	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	flush_tlb_batched_pending(mm);
-	arch_enter_lazy_mmu_mode();
-	for (; addr < end; pte++, addr += PAGE_SIZE) {
-		ptent = *pte;
-		if (!pte_present(ptent))
-			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
-		if (!page)
-			continue;
-
-		/*
-		 * creating a THP page is expensive so split it only if we
-		 * are sure it's worth. Split it if we are only owner.
-		 */
-		if (PageTransCompound(page)) {
-			if (page_mapcount(page) != 1)
-				break;
-			get_page(page);
-			if (!trylock_page(page)) {
-				put_page(page);
-				break;
-			}
-			pte_unmap_unlock(orig_pte, ptl);
-			if (split_huge_page(page)) {
-				unlock_page(page);
-				put_page(page);
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
-				break;
-			}
-			unlock_page(page);
-			put_page(page);
-			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
-			pte--;
-			addr -= PAGE_SIZE;
-			continue;
-		}
-
-		VM_BUG_ON_PAGE(PageTransCompound(page), page);
-
-		if (isolate_lru_page(page))
-			continue;
-
-		if (pte_young(ptent)) {
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			ptent = pte_mkold(ptent);
-			set_pte_at(mm, addr, pte, ptent);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-		}
-		ClearPageReferenced(page);
-		test_and_clear_page_young(page);
-		list_add(&page->lru, &page_list);
-	}
-
-	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(orig_pte, ptl);
-	reclaim_pages(&page_list);
-	cond_resched();
-
-	return 0;
-}
-
 static void madvise_pageout_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end)
 {
+	struct madvise_walk_private walk_private = {
+		.pageout = true,
+		.tlb = tlb,
+	};
+
 	struct mm_walk pageout_walk = {
-		.pmd_entry = madvise_pageout_pte_range,
+		.pmd_entry = madvise_cold_or_pageout_pte_range,
 		.mm = vma->vm_mm,
-		.private = tlb,
+		.private = &walk_private,
 	};
 
 	tlb_start_vma(tlb, vma);
-- 
2.22.0.410.gd8fdbe21b5-goog

^ permalink raw reply related

* Re: [PATCH v9 01/10] namei: obey trailing magic-link DAC permissions
From: Al Viro @ 2019-07-12  6:36 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Andy Lutomirski, Christian Brauner,
	Eric Biederman, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha
In-Reply-To: <20190712041454.GG17978@ZenIV.linux.org.uk>

On Fri, Jul 12, 2019 at 05:14:54AM +0100, Al Viro wrote:

> That's not quite guaranteed (it is possible to bind a symlink on top
> of a regular file, and you will get LOOKUP_JUMPED on the entry into
> trailing_symlink() when looking the result up).  Moreover, why bother
> with LOOKUP_JUMPED here?  See that
> 	nd->last_type = LAST_BIND;
> several lines prior?  That's precisely to be able to recognize those
> suckers.

... except that this won't work these days (once upon a time it used
to, but that had been a long time ago).  However, that doesn't change
the fact that the test is really wrong.  So let's do it right:

* set a new flag along with LOOKUP_JUMPED in nd_jump_link()
* clear it in get_link() right before
	res = READ_ONCE(inode->i_link);
* check it in trailing_symlink() (or callers thereof)

The rest of comments stand, AFAICS.

^ permalink raw reply

* Re: [PATCH v4 4/4] mm: introduce MADV_PAGEOUT
From: Michal Hocko @ 2019-07-12  7:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, linux-mm, LKML, linux-api,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190712051828.GA128252@google.com>

On Fri 12-07-19 14:18:28, Minchan Kim wrote:
[...]
> >From 41592f23e876ec21e49dc3c76dc89538e2bb16be Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Fri, 12 Jul 2019 14:05:36 +0900
> Subject: [PATCH] mm: factor out common parts between MADV_COLD and
>  MADV_PAGEOUT
> 
> There are many common parts between MADV_COLD and MADV_PAGEOUT.
> This patch factor them out to save code duplication.

This looks better indeed. I still hope that this can get improved even
further but let's do that in a follow up patch.

> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/madvise.c | 201 +++++++++++++--------------------------------------
>  1 file changed, 52 insertions(+), 149 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bc2f0138982e..3d3d14517cc8 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -30,6 +30,11 @@
>  
>  #include "internal.h"
>  
> +struct madvise_walk_private {
> +	struct mmu_gather *tlb;
> +	bool pageout;
> +};
> +
>  /*
>   * Any behaviour which results in changes to the vma->vm_flags needs to
>   * take mmap_sem for writing. Others, which simply traverse vmas, need
> @@ -310,16 +315,23 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> -static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> -				unsigned long end, struct mm_walk *walk)
> +static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> +				unsigned long addr, unsigned long end,
> +				struct mm_walk *walk)
>  {
> -	struct mmu_gather *tlb = walk->private;
> +	struct madvise_walk_private *private = walk->private;
> +	struct mmu_gather *tlb = private->tlb;
> +	bool pageout = private->pageout;
>  	struct mm_struct *mm = tlb->mm;
>  	struct vm_area_struct *vma = walk->vma;
>  	pte_t *orig_pte, *pte, ptent;
>  	spinlock_t *ptl;
> -	struct page *page;
>  	unsigned long next;
> +	struct page *page = NULL;
> +	LIST_HEAD(page_list);
> +
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
> @@ -358,6 +370,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  			return 0;
>  		}
>  
> +		if (pageout) {
> +			if (isolate_lru_page(page))
> +				goto huge_unlock;
> +			list_add(&page->lru, &page_list);
> +		}
> +
>  		if (pmd_young(orig_pmd)) {
>  			pmdp_invalidate(vma, addr, pmd);
>  			orig_pmd = pmd_mkold(orig_pmd);
> @@ -366,10 +384,14 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>  		}
>  
> +		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
> -		deactivate_page(page);
>  huge_unlock:
>  		spin_unlock(ptl);
> +		if (pageout)
> +			reclaim_pages(&page_list);
> +		else
> +			deactivate_page(page);
>  		return 0;
>  	}
>  
> @@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  		VM_BUG_ON_PAGE(PageTransCompound(page), page);
>  
> +		if (pageout) {
> +			if (isolate_lru_page(page))
> +				continue;
> +			list_add(&page->lru, &page_list);
> +		}
> +
>  		if (pte_young(ptent)) {
>  			ptent = ptep_get_and_clear_full(mm, addr, pte,
>  							tlb->fullmm);
> @@ -437,12 +465,16 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  		 * As a side effect, it makes confuse idle-page tracking
>  		 * because they will miss recent referenced history.
>  		 */
> +		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
> -		deactivate_page(page);
> +		if (!pageout)
> +			deactivate_page(page);
>  	}
>  
>  	arch_enter_lazy_mmu_mode();
>  	pte_unmap_unlock(orig_pte, ptl);
> +	if (pageout)
> +		reclaim_pages(&page_list);
>  	cond_resched();
>  
>  	return 0;
> @@ -452,10 +484,15 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end)
>  {
> +	struct madvise_walk_private walk_private = {
> +		.tlb = tlb,
> +		.pageout = false,
> +	};
> +
>  	struct mm_walk cold_walk = {
> -		.pmd_entry = madvise_cold_pte_range,
> +		.pmd_entry = madvise_cold_or_pageout_pte_range,
>  		.mm = vma->vm_mm,
> -		.private = tlb,
> +		.private = &walk_private,
>  	};
>  
>  	tlb_start_vma(tlb, vma);
> @@ -482,153 +519,19 @@ static long madvise_cold(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> -static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> -				unsigned long end, struct mm_walk *walk)
> -{
> -	struct mmu_gather *tlb = walk->private;
> -	struct mm_struct *mm = tlb->mm;
> -	struct vm_area_struct *vma = walk->vma;
> -	pte_t *orig_pte, *pte, ptent;
> -	spinlock_t *ptl;
> -	LIST_HEAD(page_list);
> -	struct page *page;
> -	unsigned long next;
> -
> -	if (fatal_signal_pending(current))
> -		return -EINTR;
> -
> -	next = pmd_addr_end(addr, end);
> -	if (pmd_trans_huge(*pmd)) {
> -		pmd_t orig_pmd;
> -
> -		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> -		ptl = pmd_trans_huge_lock(pmd, vma);
> -		if (!ptl)
> -			return 0;
> -
> -		orig_pmd = *pmd;
> -		if (is_huge_zero_pmd(orig_pmd))
> -			goto huge_unlock;
> -
> -		if (unlikely(!pmd_present(orig_pmd))) {
> -			VM_BUG_ON(thp_migration_supported() &&
> -					!is_pmd_migration_entry(orig_pmd));
> -			goto huge_unlock;
> -		}
> -
> -		page = pmd_page(orig_pmd);
> -		if (next - addr != HPAGE_PMD_SIZE) {
> -			int err;
> -
> -			if (page_mapcount(page) != 1)
> -				goto huge_unlock;
> -			get_page(page);
> -			spin_unlock(ptl);
> -			lock_page(page);
> -			err = split_huge_page(page);
> -			unlock_page(page);
> -			put_page(page);
> -			if (!err)
> -				goto regular_page;
> -			return 0;
> -		}
> -
> -		if (isolate_lru_page(page))
> -			goto huge_unlock;
> -
> -		if (pmd_young(orig_pmd)) {
> -			pmdp_invalidate(vma, addr, pmd);
> -			orig_pmd = pmd_mkold(orig_pmd);
> -
> -			set_pmd_at(mm, addr, pmd, orig_pmd);
> -			tlb_remove_tlb_entry(tlb, pmd, addr);
> -		}
> -
> -		ClearPageReferenced(page);
> -		test_and_clear_page_young(page);
> -		list_add(&page->lru, &page_list);
> -huge_unlock:
> -		spin_unlock(ptl);
> -		reclaim_pages(&page_list);
> -		return 0;
> -	}
> -
> -	if (pmd_trans_unstable(pmd))
> -		return 0;
> -regular_page:
> -	tlb_change_page_size(tlb, PAGE_SIZE);
> -	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> -	flush_tlb_batched_pending(mm);
> -	arch_enter_lazy_mmu_mode();
> -	for (; addr < end; pte++, addr += PAGE_SIZE) {
> -		ptent = *pte;
> -		if (!pte_present(ptent))
> -			continue;
> -
> -		page = vm_normal_page(vma, addr, ptent);
> -		if (!page)
> -			continue;
> -
> -		/*
> -		 * creating a THP page is expensive so split it only if we
> -		 * are sure it's worth. Split it if we are only owner.
> -		 */
> -		if (PageTransCompound(page)) {
> -			if (page_mapcount(page) != 1)
> -				break;
> -			get_page(page);
> -			if (!trylock_page(page)) {
> -				put_page(page);
> -				break;
> -			}
> -			pte_unmap_unlock(orig_pte, ptl);
> -			if (split_huge_page(page)) {
> -				unlock_page(page);
> -				put_page(page);
> -				pte_offset_map_lock(mm, pmd, addr, &ptl);
> -				break;
> -			}
> -			unlock_page(page);
> -			put_page(page);
> -			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> -			pte--;
> -			addr -= PAGE_SIZE;
> -			continue;
> -		}
> -
> -		VM_BUG_ON_PAGE(PageTransCompound(page), page);
> -
> -		if (isolate_lru_page(page))
> -			continue;
> -
> -		if (pte_young(ptent)) {
> -			ptent = ptep_get_and_clear_full(mm, addr, pte,
> -							tlb->fullmm);
> -			ptent = pte_mkold(ptent);
> -			set_pte_at(mm, addr, pte, ptent);
> -			tlb_remove_tlb_entry(tlb, pte, addr);
> -		}
> -		ClearPageReferenced(page);
> -		test_and_clear_page_young(page);
> -		list_add(&page->lru, &page_list);
> -	}
> -
> -	arch_leave_lazy_mmu_mode();
> -	pte_unmap_unlock(orig_pte, ptl);
> -	reclaim_pages(&page_list);
> -	cond_resched();
> -
> -	return 0;
> -}
> -
>  static void madvise_pageout_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end)
>  {
> +	struct madvise_walk_private walk_private = {
> +		.pageout = true,
> +		.tlb = tlb,
> +	};
> +
>  	struct mm_walk pageout_walk = {
> -		.pmd_entry = madvise_pageout_pte_range,
> +		.pmd_entry = madvise_cold_or_pageout_pte_range,
>  		.mm = vma->vm_mm,
> -		.private = tlb,
> +		.private = &walk_private,
>  	};
>  
>  	tlb_start_vma(tlb, vma);
> -- 
> 2.22.0.410.gd8fdbe21b5-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags
From: Aleksa Sarai @ 2019-07-12 10:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Christian Brauner, David Drysdale,
	Andy Lutomirski, Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Aleksa Sarai, containers, linux-alpha
In-Reply-To: <20190712043341.GI17978@ZenIV.linux.org.uk>

[-- Attachment #1: Type: text/plain, Size: 2295 bytes --]

On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Jul 07, 2019 at 12:57:32AM +1000, Aleksa Sarai wrote:
> > @@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  	struct inode *inode = nd->inode;
> >  
> >  	while (1) {
> > -		if (path_equal(&nd->path, &nd->root))
> > +		if (path_equal(&nd->path, &nd->root)) {
> > +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> > +				return -EXDEV;
> 
> > @@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  				return -ECHILD;
> >  			if (&mparent->mnt == nd->path.mnt)
> >  				break;
> > +			if (unlikely(nd->flags & LOOKUP_XDEV))
> > +				return -EXDEV;
> >  			/* we know that mountpoint was pinned */
> >  			nd->path.dentry = mountpoint;
> >  			nd->path.mnt = &mparent->mnt;
> > @@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  			return -ECHILD;
> >  		if (!mounted)
> >  			break;
> > +		if (unlikely(nd->flags & LOOKUP_XDEV))
> > +			return -EXDEV;
> 
> Are you sure these failure exits in follow_dotdot_rcu() won't give
> suprious hard errors?

I could switch to -ECHILD for the *_rcu() checks if you'd prefer that.
Though, I'd have (probably naively) thought that you'd have already
gotten -ECHILD from the seqlock checks if there was a race during ".."
handling.

> > +	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
> > +		error = dirfd_path_init(nd);
> > +		if (unlikely(error))
> > +			return ERR_PTR(error);
> > +		nd->root = nd->path;
> > +		if (!(nd->flags & LOOKUP_RCU))
> > +			path_get(&nd->root);
> > +	}
> >  	if (*s == '/') {
> >  		if (likely(!nd->root.mnt))
> >  			set_root(nd);
> > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >  			s = ERR_PTR(error);
> >  		return s;
> >  	}
> > -	error = dirfd_path_init(nd);
> > -	if (unlikely(error))
> > -		return ERR_PTR(error);
> > +	if (likely(!nd->path.mnt)) {
> 
> Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?

Yes. I did it to be more consistent with the other "have we got the
root" checks elsewhere. Is there another way you'd prefer I do it?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v9 04/10] namei: split out nd->dfd handling to dirfd_path_init
From: Aleksa Sarai @ 2019-07-12 12:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Christian Brauner, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha
In-Reply-To: <20190712042050.GH17978@ZenIV.linux.org.uk>

[-- Attachment #1: Type: text/plain, Size: 960 bytes --]

On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Jul 07, 2019 at 12:57:31AM +1000, Aleksa Sarai wrote:
> > Previously, path_init's handling of *at(dfd, ...) was only done once,
> > but with LOOKUP_BENEATH (and LOOKUP_IN_ROOT) we have to parse the
> > initial nd->path at different times (before or after absolute path
> > handling) depending on whether we have been asked to scope resolution
> > within a root.
> 
> >  	if (*s == '/') {
> > -		set_root(nd);
> > -		if (likely(!nd_jump_root(nd)))
> > -			return s;
> > -		return ERR_PTR(-ECHILD);
> 
> > +		if (likely(!nd->root.mnt))
> > +			set_root(nd);
> 
> How can we get there with non-NULL nd->root.mnt, when LOOKUP_ROOT case
> has been already handled by that point?

Yup, you're completely right. I will remove the
  if (!nd->root.mnt)
in the next version.


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v9 04/10] namei: split out nd->dfd handling to dirfd_path_init
From: Aleksa Sarai @ 2019-07-12 12:12 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-ia64, linux-sh, Alexei Starovoitov, linux-kernel,
	David Howells, linux-kselftest, sparclinux, Shuah Khan,
	linux-arch, linux-s390, Tycho Andersen, Aleksa Sarai,
	linux-arm-kernel, linux-mips, linux-xtensa, Kees Cook,
	Arnd Bergmann, Jann Horn, linuxppc-dev, linux-m68k,
	Andy Lutomirski, Shuah Khan, David Drysdale, Christian Brauner,
	J. Bruce Fields, linux-parisc, linux-api
In-Reply-To: <20190712120743.mka3vl5t4zndc5wj@yavin>


[-- Attachment #1.1: Type: text/plain, Size: 1395 bytes --]

On 2019-07-12, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Sun, Jul 07, 2019 at 12:57:31AM +1000, Aleksa Sarai wrote:
> > > Previously, path_init's handling of *at(dfd, ...) was only done once,
> > > but with LOOKUP_BENEATH (and LOOKUP_IN_ROOT) we have to parse the
> > > initial nd->path at different times (before or after absolute path
> > > handling) depending on whether we have been asked to scope resolution
> > > within a root.
> > 
> > >  	if (*s == '/') {
> > > -		set_root(nd);
> > > -		if (likely(!nd_jump_root(nd)))
> > > -			return s;
> > > -		return ERR_PTR(-ECHILD);
> > 
> > > +		if (likely(!nd->root.mnt))
> > > +			set_root(nd);
> > 
> > How can we get there with non-NULL nd->root.mnt, when LOOKUP_ROOT case
> > has been already handled by that point?
> 
> Yup, you're completely right. I will remove the
>   if (!nd->root.mnt)
> in the next version.

Ah sorry, there is a reason for it -- later in the series the
LOOKUP_BENEATH case means that you might end up with a non-NULL
nd->root.mnt. If you want, I can move the addition of the conditional to
later in the series (it was easier to split the patch by-hunk back when
you originally asked me to split out dirfd_path_init()).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply

* Re: [PATCH v9 01/10] namei: obey trailing magic-link DAC permissions
From: Aleksa Sarai @ 2019-07-12 12:20 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-ia64, linux-sh, Alexei Starovoitov, linux-kernel,
	David Howells, linux-kselftest, sparclinux, Shuah Khan,
	linux-arch, linux-s390, Tycho Andersen, Aleksa Sarai,
	linux-arm-kernel, linux-mips, linux-xtensa, Kees Cook,
	Arnd Bergmann, Jann Horn, linuxppc-dev, linux-m68k,
	Andy Lutomirski, Shuah Khan, David Drysdale, Christian Brauner,
	J. Bruce Fields, linux-parisc, linux-api
In-Reply-To: <20190712041454.GG17978@ZenIV.linux.org.uk>


[-- Attachment #1.1: Type: text/plain, Size: 2349 bytes --]

On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Jul 07, 2019 at 12:57:28AM +1000, Aleksa Sarai wrote:
> > @@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> >  	p->stack = p->internal;
> >  	p->dfd = dfd;
> >  	p->name = name;
> > -	p->total_link_count = old ? old->total_link_count : 0;
> > +	p->total_link_count = 0;
> > +	p->acc_mode = 0;
> > +	p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
> > +	if (old) {
> > +		p->total_link_count = old->total_link_count;
> > +		p->acc_mode = old->acc_mode;
> > +		p->opath_mask = old->opath_mask;
> > +	}
> 
> Huh?  Could somebody explain why traversals of NFS4 referrals should inherit
> ->acc_mode and ->opath_mask?

I'll be honest -- I don't understand what set_nameidata() did so I just
did what I thought would be an obvious change (to just copy the
contents). I thought it was related to some aspect of the symlink stack
handling.

In that case, should they both be set to 0 on set_nameidata()? This will
mean that fd re-opening (or magic-link opening) through a
set_nameidata() would always fail.

> >  static __always_inline
> > -const char *get_link(struct nameidata *nd)
> > +const char *get_link(struct nameidata *nd, bool trailing)
> >  {
> >  	struct saved *last = nd->stack + nd->depth - 1;
> >  	struct dentry *dentry = last->link.dentry;
> > @@ -1081,6 +1134,44 @@ const char *get_link(struct nameidata *nd)
> >  		} else {
> >  			res = get(dentry, inode, &last->done);
> >  		}
> > +		/* If we just jumped it was because of a magic-link. */
> > +		if (unlikely(nd->flags & LOOKUP_JUMPED)) {
> [...]
> In any case, this "bool trailing" is completely wrong; whether that
> check belongs in trailing_symlink() or (some of) its callers, putting
> it into get_link() is a mistake, forced by kludgy check for procfs-style
> symlinks.

The error path for LOOKUP_JUMPED comes from the old O_BENEATH patchset,
but all of the "bool trailing" logic is definitely my gaff (I was
quietly hoping you'd have a much better solution than the whole
get_link() thing -- it definitely felt very kludgey to write).

I will work on the suggestion in your follow-up email. Thanks!

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply

* Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags
From: Al Viro @ 2019-07-12 12:39 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: linux-ia64, linux-sh, Alexei Starovoitov, linux-kernel,
	David Howells, linux-kselftest, sparclinux, Shuah Khan,
	linux-arch, linux-s390, Tycho Andersen, Aleksa Sarai,
	linux-arm-kernel, linux-mips, linux-xtensa, Kees Cook,
	Arnd Bergmann, Jann Horn, linuxppc-dev, linux-m68k,
	Andy Lutomirski, Shuah Khan, David Drysdale, Christian Brauner,
	J. Bruce Fields, linux-parisc, linux-api
In-Reply-To: <20190712105745.nruaftgeat6irhzr@yavin>

On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:

> > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  			s = ERR_PTR(error);
> > >  		return s;
> > >  	}
> > > -	error = dirfd_path_init(nd);
> > > -	if (unlikely(error))
> > > -		return ERR_PTR(error);
> > > +	if (likely(!nd->path.mnt)) {
> > 
> > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
> 
> Yes. I did it to be more consistent with the other "have we got the
> root" checks elsewhere. Is there another way you'd prefer I do it?

"Have we got the root" checks are inevitable evil; here you are making the
control flow in a single function hard to follow.

I *think* what you are doing is
	absolute pathname, no LOOKUP_BENEATH:
		set_root
		error = nd_jump_root(nd)
	else
		error = dirfd_path_init(nd)
	return unlikely(error) ? ERR_PTR(error) : s;
which should be a lot easier to follow (not to mention shorter), but I might
be missing something in all of that.

^ permalink raw reply

* Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags
From: Al Viro @ 2019-07-12 12:55 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Christian Brauner, David Drysdale,
	Andy Lutomirski, Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Aleksa Sarai, containers, linux-alpha
In-Reply-To: <20190712123924.GK17978@ZenIV.linux.org.uk>

On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
> On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
> 
> > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > >  			s = ERR_PTR(error);
> > > >  		return s;
> > > >  	}
> > > > -	error = dirfd_path_init(nd);
> > > > -	if (unlikely(error))
> > > > -		return ERR_PTR(error);
> > > > +	if (likely(!nd->path.mnt)) {
> > > 
> > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
> > 
> > Yes. I did it to be more consistent with the other "have we got the
> > root" checks elsewhere. Is there another way you'd prefer I do it?
> 
> "Have we got the root" checks are inevitable evil; here you are making the
> control flow in a single function hard to follow.
> 
> I *think* what you are doing is
> 	absolute pathname, no LOOKUP_BENEATH:
> 		set_root
> 		error = nd_jump_root(nd)
> 	else
> 		error = dirfd_path_init(nd)
> 	return unlikely(error) ? ERR_PTR(error) : s;
> which should be a lot easier to follow (not to mention shorter), but I might
> be missing something in all of that.

PS: if that's what's going on, I would be tempted to turn the entire
path_init() part into this:
	if (flags & LOOKUP_BENEATH)
		while (*s == '/')
			s++;
in the very beginning (plus the handling of nd_jump_root() prototype
change, but that belongs with nd_jump_root() change itself, obviously).
Again, I might be missing something here...

^ permalink raw reply

* Re: [PATCH v9 01/10] namei: obey trailing magic-link DAC permissions
From: Al Viro @ 2019-07-12 13:10 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Andy Lutomirski, Christian Brauner,
	Eric Biederman, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha
In-Reply-To: <20190712122017.xkowq2cjreylpotm@yavin>

On Fri, Jul 12, 2019 at 10:20:17PM +1000, Aleksa Sarai wrote:
> On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Sun, Jul 07, 2019 at 12:57:28AM +1000, Aleksa Sarai wrote:
> > > @@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> > >  	p->stack = p->internal;
> > >  	p->dfd = dfd;
> > >  	p->name = name;
> > > -	p->total_link_count = old ? old->total_link_count : 0;
> > > +	p->total_link_count = 0;
> > > +	p->acc_mode = 0;
> > > +	p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
> > > +	if (old) {
> > > +		p->total_link_count = old->total_link_count;
> > > +		p->acc_mode = old->acc_mode;
> > > +		p->opath_mask = old->opath_mask;
> > > +	}
> > 
> > Huh?  Could somebody explain why traversals of NFS4 referrals should inherit
> > ->acc_mode and ->opath_mask?
> 
> I'll be honest -- I don't understand what set_nameidata() did so I just
> did what I thought would be an obvious change (to just copy the
> contents). I thought it was related to some aspect of the symlink stack
> handling.

No.  It's handling of (very rare) nested pathwalk.  The only case I can think
of is handling of NFS4 referrals - they are triggered by ->d_automount()
and include NFS4 mount.  Which does internal pathwalk of its own, to get
to the root of subtree being automounted.

NFS has its own recursion protection on that path (no deeper nesting than
one level of referral traversals), but there some nesting is inevitable;
we do get another nameidata instance on stack.  And for nd_jump_link() we
need to keep track of the innermost one.

For symlinks nothing of that sort happens - they are dealt with on the same
struct nameidata.  ->total_link_count copying is there for one reason only -
we want the total amount of symlinks traversed during the pathwalk (including
the referral processing, etc.) to count towards MAXSYMLINKS check.  It could've
been moved from nameidata to task_struct, but it's cheaper to handle it that
way.

Again, nesting is *rare*.

> In that case, should they both be set to 0 on set_nameidata()? This will
> mean that fd re-opening (or magic-link opening) through a
> set_nameidata() would always fail.

Huh?  set_nameidata() is done for *all* instances - it's pretty much the
constructor of that object (and restore_nameidata() - a destructor).
Everything goes through it.

And again, I'm not sure we want these fields in nameidata - IMO they belong
in open_flags.  Things like e.g. stat() don't need them at all.

Incidentally, O_PATH opening of symlinks combined with subsequent procfs
symlink traversals is worth testing - that's where the things get subtle
and that's where it's easy to get in trouble on modifications.

^ permalink raw reply

* Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags
From: Al Viro @ 2019-07-12 13:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: linux-ia64, linux-sh, Alexei Starovoitov, linux-kernel,
	David Howells, linux-kselftest, sparclinux, Shuah Khan,
	linux-arch, linux-s390, Tycho Andersen, Aleksa Sarai,
	linux-arm-kernel, linux-mips, linux-xtensa, Kees Cook,
	Arnd Bergmann, Jann Horn, linuxppc-dev, linux-m68k,
	Andy Lutomirski, Shuah Khan, David Drysdale, Christian Brauner,
	J. Bruce Fields, linux-parisc, linux-api
In-Reply-To: <20190712125552.GL17978@ZenIV.linux.org.uk>

On Fri, Jul 12, 2019 at 01:55:52PM +0100, Al Viro wrote:
> On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
> > 
> > > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > > >  			s = ERR_PTR(error);
> > > > >  		return s;
> > > > >  	}
> > > > > -	error = dirfd_path_init(nd);
> > > > > -	if (unlikely(error))
> > > > > -		return ERR_PTR(error);
> > > > > +	if (likely(!nd->path.mnt)) {
> > > > 
> > > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
> > > 
> > > Yes. I did it to be more consistent with the other "have we got the
> > > root" checks elsewhere. Is there another way you'd prefer I do it?
> > 
> > "Have we got the root" checks are inevitable evil; here you are making the
> > control flow in a single function hard to follow.
> > 
> > I *think* what you are doing is
> > 	absolute pathname, no LOOKUP_BENEATH:
> > 		set_root
> > 		error = nd_jump_root(nd)
> > 	else
> > 		error = dirfd_path_init(nd)
> > 	return unlikely(error) ? ERR_PTR(error) : s;
> > which should be a lot easier to follow (not to mention shorter), but I might
> > be missing something in all of that.
> 
> PS: if that's what's going on, I would be tempted to turn the entire
> path_init() part into this:
> 	if (flags & LOOKUP_BENEATH)
> 		while (*s == '/')
> 			s++;
> in the very beginning (plus the handling of nd_jump_root() prototype
> change, but that belongs with nd_jump_root() change itself, obviously).
> Again, I might be missing something here...

Argh... I am, at that - you have setting path->root (and grabbing it)
in LOOKUP_BENEATH cases and you do it after dirfd_path_init().  So
how about
	if (flags & LOOKUP_BENEATH)
		while (*s == '/')
			s++;
before the whole thing and
        if (*s == '/') { /* can happen only without LOOKUP_BENEATH */
                set_root(nd);
		error = nd_jump_root(nd);
		if (unlikely(error))
			return ERR_PTR(error);
        } else if (nd->dfd == AT_FDCWD) {
                if (flags & LOOKUP_RCU) {
                        struct fs_struct *fs = current->fs;
                        unsigned seq;

                        do {
                                seq = read_seqcount_begin(&fs->seq);
                                nd->path = fs->pwd;
                                nd->inode = nd->path.dentry->d_inode;
                                nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
                        } while (read_seqcount_retry(&fs->seq, seq));
                } else {
                        get_fs_pwd(current->fs, &nd->path);
                        nd->inode = nd->path.dentry->d_inode;
                }  
        } else {
                /* Caller must check execute permissions on the starting path component */
                struct fd f = fdget_raw(nd->dfd);
                struct dentry *dentry;

                if (!f.file)
                        return ERR_PTR(-EBADF);

                dentry = f.file->f_path.dentry;

                if (*s && unlikely(!d_can_lookup(dentry))) {
                        fdput(f);
                        return ERR_PTR(-ENOTDIR);
                }

                nd->path = f.file->f_path;
                if (flags & LOOKUP_RCU) {
                        nd->inode = nd->path.dentry->d_inode;
                        nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
                } else {
                        path_get(&nd->path);
                        nd->inode = nd->path.dentry->d_inode;
                }
                fdput(f);
        }
	if (flags & LOOKUP_BENEATH) {
		nd->root = nd->path;
		if (!(flags & LOOKUP_RCU))
			path_get(&nd->root);
		else
			nd->root_seq = nd->seq;
	}
	return s;
replacing the part in the end?  Makes for much smaller change; it might
very well still make sense to add dirfd_path_init() as a separate
cleanup (perhaps with the *s == '/' case included), though.

^ permalink raw reply

* Re: [PATCH v4 4/4] mm: introduce MADV_PAGEOUT
From: Johannes Weiner @ 2019-07-12 13:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190712051828.GA128252@google.com>

On Fri, Jul 12, 2019 at 02:18:28PM +0900, Minchan Kim wrote:
> Hi Johannes,
> 
> On Thu, Jul 11, 2019 at 02:42:23PM -0400, Johannes Weiner wrote:
> > On Thu, Jul 11, 2019 at 10:25:28AM +0900, Minchan Kim wrote:
> > > @@ -480,6 +482,198 @@ static long madvise_cold(struct vm_area_struct *vma,
> > >  	return 0;
> > >  }
> > >  
> > > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > +				unsigned long end, struct mm_walk *walk)
> > > +{
> > > +	struct mmu_gather *tlb = walk->private;
> > > +	struct mm_struct *mm = tlb->mm;
> > > +	struct vm_area_struct *vma = walk->vma;
> > > +	pte_t *orig_pte, *pte, ptent;
> > > +	spinlock_t *ptl;
> > > +	LIST_HEAD(page_list);
> > > +	struct page *page;
> > > +	unsigned long next;
> > > +
> > > +	if (fatal_signal_pending(current))
> > > +		return -EINTR;
> > > +
> > > +	next = pmd_addr_end(addr, end);
> > > +	if (pmd_trans_huge(*pmd)) {
> > > +		pmd_t orig_pmd;
> > > +
> > > +		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > +		if (!ptl)
> > > +			return 0;
> > > +
> > > +		orig_pmd = *pmd;
> > > +		if (is_huge_zero_pmd(orig_pmd))
> > > +			goto huge_unlock;
> > > +
> > > +		if (unlikely(!pmd_present(orig_pmd))) {
> > > +			VM_BUG_ON(thp_migration_supported() &&
> > > +					!is_pmd_migration_entry(orig_pmd));
> > > +			goto huge_unlock;
> > > +		}
> > > +
> > > +		page = pmd_page(orig_pmd);
> > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > +			int err;
> > > +
> > > +			if (page_mapcount(page) != 1)
> > > +				goto huge_unlock;
> > > +			get_page(page);
> > > +			spin_unlock(ptl);
> > > +			lock_page(page);
> > > +			err = split_huge_page(page);
> > > +			unlock_page(page);
> > > +			put_page(page);
> > > +			if (!err)
> > > +				goto regular_page;
> > > +			return 0;
> > > +		}
> > > +
> > > +		if (isolate_lru_page(page))
> > > +			goto huge_unlock;
> > > +
> > > +		if (pmd_young(orig_pmd)) {
> > > +			pmdp_invalidate(vma, addr, pmd);
> > > +			orig_pmd = pmd_mkold(orig_pmd);
> > > +
> > > +			set_pmd_at(mm, addr, pmd, orig_pmd);
> > > +			tlb_remove_tlb_entry(tlb, pmd, addr);
> > > +		}
> > > +
> > > +		ClearPageReferenced(page);
> > > +		test_and_clear_page_young(page);
> > > +		list_add(&page->lru, &page_list);
> > > +huge_unlock:
> > > +		spin_unlock(ptl);
> > > +		reclaim_pages(&page_list);
> > > +		return 0;
> > > +	}
> > > +
> > > +	if (pmd_trans_unstable(pmd))
> > > +		return 0;
> > > +regular_page:
> > > +	tlb_change_page_size(tlb, PAGE_SIZE);
> > > +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +	flush_tlb_batched_pending(mm);
> > > +	arch_enter_lazy_mmu_mode();
> > > +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> > > +		ptent = *pte;
> > > +		if (!pte_present(ptent))
> > > +			continue;
> > > +
> > > +		page = vm_normal_page(vma, addr, ptent);
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * creating a THP page is expensive so split it only if we
> > > +		 * are sure it's worth. Split it if we are only owner.
> > > +		 */
> > > +		if (PageTransCompound(page)) {
> > > +			if (page_mapcount(page) != 1)
> > > +				break;
> > > +			get_page(page);
> > > +			if (!trylock_page(page)) {
> > > +				put_page(page);
> > > +				break;
> > > +			}
> > > +			pte_unmap_unlock(orig_pte, ptl);
> > > +			if (split_huge_page(page)) {
> > > +				unlock_page(page);
> > > +				put_page(page);
> > > +				pte_offset_map_lock(mm, pmd, addr, &ptl);
> > > +				break;
> > > +			}
> > > +			unlock_page(page);
> > > +			put_page(page);
> > > +			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > > +			pte--;
> > > +			addr -= PAGE_SIZE;
> > > +			continue;
> > > +		}
> > > +
> > > +		VM_BUG_ON_PAGE(PageTransCompound(page), page);
> > > +
> > > +		if (isolate_lru_page(page))
> > > +			continue;
> > > +
> > > +		if (pte_young(ptent)) {
> > > +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > +							tlb->fullmm);
> > > +			ptent = pte_mkold(ptent);
> > > +			set_pte_at(mm, addr, pte, ptent);
> > > +			tlb_remove_tlb_entry(tlb, pte, addr);
> > > +		}
> > > +		ClearPageReferenced(page);
> > > +		test_and_clear_page_young(page);
> > > +		list_add(&page->lru, &page_list);
> > > +	}
> > > +
> > > +	arch_leave_lazy_mmu_mode();
> > > +	pte_unmap_unlock(orig_pte, ptl);
> > > +	reclaim_pages(&page_list);
> > > +	cond_resched();
> > > +
> > > +	return 0;
> > > +}
> > 
> > I know you have briefly talked about code sharing already.
> > 
> > While I agree that sharing with MADV_FREE is maybe a stretch, I
> > applied these patches and compared the pageout and the cold page table
> > functions, and they are line for line the same EXCEPT for 2-3 lines at
> > the very end, where one reclaims and the other deactivates. It would
> > be good to share here, it shouldn't be hard or result in fragile code.
> 
> Fair enough if we leave MADV_FREE.
> 
> > 
> > Something like int madvise_cold_or_pageout_range(..., bool pageout)?
> 
> How about this?
> 
> From 41592f23e876ec21e49dc3c76dc89538e2bb16be Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Fri, 12 Jul 2019 14:05:36 +0900
> Subject: [PATCH] mm: factor out common parts between MADV_COLD and
>  MADV_PAGEOUT
> 
> There are many common parts between MADV_COLD and MADV_PAGEOUT.
> This patch factor them out to save code duplication.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

This looks much better, thanks!

> @@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  		VM_BUG_ON_PAGE(PageTransCompound(page), page);
>  
> +		if (pageout) {
> +			if (isolate_lru_page(page))
> +				continue;
> +			list_add(&page->lru, &page_list);
> +		}
> +
>  		if (pte_young(ptent)) {
>  			ptent = ptep_get_and_clear_full(mm, addr, pte,
>  							tlb->fullmm);

One thought on the ordering here.

When LRU isolation fails, it would still make sense to clear the young
bit: we cannot reclaim the page as we wanted to, but the user still
provided a clear hint that the page is cold and she won't be touching
it for a while. MADV_PAGEOUT is basically MADV_COLD + try_to_reclaim.
So IMO isolation should go to the end next to deactivate_page().

^ permalink raw reply

* Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags
From: Al Viro @ 2019-07-12 15:00 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Christian Brauner, David Drysdale,
	Andy Lutomirski, Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Aleksa Sarai, containers, linux-alpha
In-Reply-To: <20190712132553.GN17978@ZenIV.linux.org.uk>

On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:

> 	if (flags & LOOKUP_BENEATH) {
> 		nd->root = nd->path;
> 		if (!(flags & LOOKUP_RCU))
> 			path_get(&nd->root);
> 		else
> 			nd->root_seq = nd->seq;

BTW, this assignment is needed for LOOKUP_RCU case.  Without it
you are pretty much guaranteed that lazy pathwalk will fail,
when it comes to complete_walk().

Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH
combination would someday get passed?

^ permalink raw reply

* Re: [PATCH v4 4/4] mm: introduce MADV_PAGEOUT
From: Michal Hocko @ 2019-07-12 15:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190712135809.GB31107@cmpxchg.org>

On Fri 12-07-19 09:58:09, Johannes Weiner wrote:
[...]
> > @@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >  
> >  		VM_BUG_ON_PAGE(PageTransCompound(page), page);
> >  
> > +		if (pageout) {
> > +			if (isolate_lru_page(page))
> > +				continue;
> > +			list_add(&page->lru, &page_list);
> > +		}
> > +
> >  		if (pte_young(ptent)) {
> >  			ptent = ptep_get_and_clear_full(mm, addr, pte,
> >  							tlb->fullmm);
> 
> One thought on the ordering here.
> 
> When LRU isolation fails, it would still make sense to clear the young
> bit: we cannot reclaim the page as we wanted to, but the user still
> provided a clear hint that the page is cold and she won't be touching
> it for a while. MADV_PAGEOUT is basically MADV_COLD + try_to_reclaim.
> So IMO isolation should go to the end next to deactivate_page().

Make sense to me

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v9 00/10] namei: openat2(2) path resolution restrictions
From: Al Viro @ 2019-07-12 15:11 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Eric Biederman, Andy Lutomirski,
	Jann Horn, Christian Brauner, David Drysdale, Tycho Andersen,
	Kees Cook, Linus Torvalds, containers, linux-fsdevel, linux-api,
	Andrew Morton, Alexei Starovoitov, Chanho Min,
	Oleg Nesterov <ole>
In-Reply-To: <20190706145737.5299-1-cyphar@cyphar.com>

On Sun, Jul 07, 2019 at 12:57:27AM +1000, Aleksa Sarai wrote:
> Patch changelog:
>   v9:
>     * Replace resolveat(2) with openat2(2). [Linus]
>     * Output a warning to dmesg if may_open_magiclink() is violated.
>     * Add an openat2(O_CREAT) testcase.

One general note for the future, BTW: for such series it's generally
a good idea to put it into a public git tree somewhere and mention that
in the announcement...

^ permalink raw reply

* Re: [PATCH v9 00/10] namei: openat2(2) path resolution restrictions
From: Aleksa Sarai @ 2019-07-12 15:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Eric Biederman, Andy Lutomirski,
	Jann Horn, Christian Brauner, David Drysdale, Tycho Andersen,
	Kees Cook, Linus Torvalds, containers, linux-fsdevel, linux-api,
	Andrew Morton, Alexei Starovoitov, Chanho Min,
	Oleg Nesterov <ole>
In-Reply-To: <20190712151118.GP17978@ZenIV.linux.org.uk>

[-- Attachment #1: Type: text/plain, Size: 734 bytes --]

On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Jul 07, 2019 at 12:57:27AM +1000, Aleksa Sarai wrote:
> > Patch changelog:
> >   v9:
> >     * Replace resolveat(2) with openat2(2). [Linus]
> >     * Output a warning to dmesg if may_open_magiclink() is violated.
> >     * Add an openat2(O_CREAT) testcase.
> 
> One general note for the future, BTW: for such series it's generally
> a good idea to put it into a public git tree somewhere and mention that
> in the announcement...

Sure, I'll mention it next time. For the record the tree is
  <https://github.com/cyphar/linux/tree/resolveat/master>

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox