From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Raphaël Beamonte" <raphael.beamonte@gmail.com>,
linux-kernel@vger.kernel.org, stable@vger.kernel.org,
"lttng-dev@lists.lttng.org" <lttng-dev@lists.lttng.org>,
"Mel Gorman" <mgorman@suse.de>,
"Rob van der Heij" <rvdheij@gmail.com>
Subject: Re: [lttng-dev] [-stable 3.8.1 performance regression] madvise POSIX_FADV_DONTNEED
Date: Mon, 17 Jun 2013 22:15:28 -0400 [thread overview]
Message-ID: <20130618021528.GA29506@Krystal> (raw)
In-Reply-To: <20130617145714.8032ba33fd3e4e6887755209@linux-foundation.org>
* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Mon, 17 Jun 2013 17:39:36 -0400 Rapha__l Beamonte <raphael.beamonte@gmail.com> wrote:
>
> > 2013/6/17 Andrew Morton <akpm@linux-foundation.org>
> >
> > > That change wasn't terribly efficient - if there are any unpopulated
> > > pages in the range (which is quite likely), fadvise() will now always
> > > call invalidate_mapping_pages() a second time.
> > >
> > > Perhaps this is fixable. Say, make lru_add_drain_all() return a
> > > success code, or even teach lru_add_drain_all() to return a code
> > > indicating that one of the spilled pages was (or might have been) on a
> > > particular mapping.
> > >
> >
> > Following our tests results, that was the call to lru_add_drain_all() that
> > causes the problem. The second call to invalidate_mapping_pages() isn't
> > really important. We tried to compile a kernel with the commit introducing
> > this change but with the "lru_add_drain_all()" line removed, and the
> > problem disappeared, even if we called two times invalidate_mapping_pages()
> > (as the rest of the commit was still here).
>
> Ah, OK, schedule_on_each_cpu() could certainly do that - it has to wait
> for every CPU to context switch and schedule the worker function.
>
> There's a lot we could do here. Such as not doing the schedule_work()
> at all for a cpu which has an empty lru_add_pvec.
First approach proposed, submitted as RFC. Compile-tested only.
---
mm/swap.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -652,7 +652,40 @@ static void lru_add_drain_per_cpu(struct
*/
int lru_add_drain_all(void)
{
- return schedule_on_each_cpu(lru_add_drain_per_cpu);
+ int cpu;
+ struct work_struct __percpu *works;
+
+ works = alloc_percpu(struct work_struct);
+ if (!works)
+ return -ENOMEM;
+
+ get_online_cpus();
+
+ for_each_online_cpu(cpu) {
+ struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+ struct work_struct *work = per_cpu_ptr(works, cpu);
+ struct pagevec *pvec;
+ int lru;
+
+ INIT_WORK(work, lru_add_drain_per_cpu);
+ for_each_lru(lru) {
+ pvec = &pvecs[lru - LRU_BASE];
+ if (pagevec_count(pvec)) {
+ schedule_work_on(cpu, work);
+ break;
+ }
+ }
+ }
+
+ for_each_online_cpu(cpu) {
+ struct work_struct *work = per_cpu_ptr(works, cpu);
+ if (work_pending(work))
+ flush_work(work);
+ }
+
+ put_online_cpus();
+ free_percpu(works);
+ return 0;
}
/*
> Or even pass down
> the address_space and only schedule the work for CPUs which have a page
> from *this mapping* in their lru_add_pvec. That will all be highly
> racy, but as long as the failure mode is "flushed unnecessarily" then
> that's OK.
Second approach, submitted as RFC with some questions left unanswered
in the code. Compile-tested only.
---
include/linux/swap.h | 1
mm/fadvise.c | 2 -
mm/swap.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+), 1 deletion(-)
Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h
+++ linux/include/linux/swap.h
@@ -242,6 +242,7 @@ extern void mark_page_accessed(struct pa
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern int lru_add_drain_all(void);
+extern int lru_add_drain_mapping(struct address_space *mapping);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_page(struct page *page);
extern void swap_setup(void);
Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -689,6 +689,68 @@ int lru_add_drain_all(void)
}
/*
+ * Returns 0 for success
+ */
+int lru_add_drain_mapping(struct address_space *mapping)
+{
+ int cpu;
+ struct work_struct __percpu *works;
+
+ works = alloc_percpu(struct work_struct);
+ if (!works)
+ return -ENOMEM;
+
+ get_online_cpus();
+
+ for_each_online_cpu(cpu) {
+ struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+ struct work_struct *work = per_cpu_ptr(works, cpu);
+ struct pagevec *pvec;
+ int lru;
+
+ INIT_WORK(work, lru_add_drain_per_cpu);
+ for_each_lru(lru) {
+ int i;
+
+ pvec = &pvecs[lru - LRU_BASE];
+ /*
+ * Race failure mode is to flush unnecessarily.
+ * We use PAGEVEC_SIZE rather than pvec->nr to
+ * stay on the safe-side wrt pvec resize.
+ */
+ for (i = 0; i < PAGEVEC_SIZE; i++) {
+ struct page *page;
+
+ /*
+ * Racy access to page. TODO: is it OK
+ * to access it from the remote CPU's
+ * lru without any kind of ownership or
+ * synchronization ?
+ */
+ page = ACCESS_ONCE(pvec->pages[i]);
+ if (!pvec->pages[i])
+ continue;
+ if (page->mapping == mapping) {
+ schedule_work_on(cpu, work);
+ goto next_cpu;
+ }
+ }
+ }
+ next_cpu: ; /* TODO: coding ugliness */
+ }
+
+ for_each_online_cpu(cpu) {
+ struct work_struct *work = per_cpu_ptr(works, cpu);
+ if (work_pending(work))
+ flush_work(work);
+ }
+
+ put_online_cpus();
+ free_percpu(works);
+ return 0;
+}
+
+/*
* Batched page_cache_release(). Decrement the reference count on all the
* passed pages. If it fell to zero then remove the page from the LRU and
* free it.
Index: linux/mm/fadvise.c
===================================================================
--- linux.orig/mm/fadvise.c
+++ linux/mm/fadvise.c
@@ -132,7 +132,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, l
* pagevecs and try again.
*/
if (count < (end_index - start_index + 1)) {
- lru_add_drain_all();
+ lru_add_drain_mapping(mapping);
invalidate_mapping_pages(mapping, start_index,
end_index);
}
Feedback is welcome,
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
next prev parent reply other threads:[~2013-06-18 2:15 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <51BE1828.3060206@gmail.com>
2013-06-17 14:13 ` [-stable 3.8.1 performance regression] madvise POSIX_FADV_DONTNEED Mathieu Desnoyers
2013-06-17 21:24 ` Andrew Morton
2013-06-17 21:39 ` Raphaël Beamonte
[not found] ` <CAE_Gge34HCroSgNgiXL1j7Le3CNKRXR=7TZQhJSmY+wfWniKug@mail.gmail.com>
2013-06-17 21:57 ` [lttng-dev] " Andrew Morton
2013-06-18 2:15 ` Mathieu Desnoyers [this message]
2013-06-18 2:44 ` Andrew Morton
2013-06-18 9:29 ` Mel Gorman
2013-06-18 10:11 ` Mel Gorman
2013-06-19 19:25 ` Mathieu Desnoyers
2013-06-20 6:36 ` Rob van der Heij
[not found] ` <CAJCc=kijujORhPUmPvzHj-MMdyVbf-iHEK0Jx-VHbTO8q4ESFA@mail.gmail.com>
2013-06-20 12:20 ` Mathieu Desnoyers
2013-06-25 1:56 ` Dave Chinner
2013-07-02 13:58 ` Mathieu Desnoyers
2013-07-03 0:55 ` Mathieu Desnoyers
2013-07-03 8:47 ` Mel Gorman
2013-07-03 14:53 ` Jeff Moyer
2013-07-03 14:53 ` Jeff Moyer
2013-07-04 0:03 ` Dave Chinner
2013-07-04 0:31 ` Mathieu Desnoyers
2013-07-04 21:11 ` Rob van der Heij
2013-07-05 1:42 ` Dave Chinner
2013-07-05 2:34 ` Mathieu Desnoyers
2013-07-03 18:47 ` Yannick Brosseau
2013-07-05 14:18 ` Mel Gorman
2013-07-05 14:18 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130618021528.GA29506@Krystal \
--to=mathieu.desnoyers@efficios.com \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lttng-dev@lists.lttng.org \
--cc=mgorman@suse.de \
--cc=raphael.beamonte@gmail.com \
--cc=rvdheij@gmail.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.