Re: [-stable 3.8.1 performance regression] madvise POSIX_FADV_DONTNEED

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@suse.de>
To: Yannick Brosseau <yannick.brosseau@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Dave Chinner <david@fromorbit.com>,
	Rob van der Heij <rvdheij@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	stable@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	"lttng-dev@lists.lttng.org" <lttng-dev@lists.lttng.org>
Subject: Re: [-stable 3.8.1 performance regression] madvise POSIX_FADV_DONTNEED
Date: Fri, 5 Jul 2013 15:18:51 +0100	[thread overview]
Message-ID: <20130705141851.GW1875@suse.de> (raw)
In-Reply-To: <51D471DE.3020800@gmail.com>

On Wed, Jul 03, 2013 at 02:47:58PM -0400, Yannick Brosseau wrote:
> On 2013-07-03 04:47, Mel Gorman wrote:
> >> Given it is just a hint, we should be allowed to perform page
> >> > deactivation lazily. Is there any fundamental reason to wait for worker
> >> > threads on each CPU to complete their lru drain before returning from
> >> > fadvise() to user-space ?
> >> > 
> > Only to make sure they pages are actually dropped as requested. The reason
> > the wait was introduced in the first place was that page cache was filling
> > up even with the fadvise calls and causing disruption. In 3.11 disruption
> > due to this sort of parallel IO should be reduced but making fadvise work
> > properly is reasonable in itself. Was that patch I posted ever tested or
> > did I manage to miss it?
>
> I did test it. On our test case, we get a worst result with it.
> 

That's useful to know. Please test the following replacement patch
instead.

---8<---
mm: Only drain CPUs whose pagevecs contain pages backed by a given mapping

cha-cha-cha-changelog

Not-signed-off-by David Bowie
---
 include/linux/swap.h      |  1 +
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c        | 29 +++++++++++++++++++++++------
 mm/fadvise.c              |  2 +-
 mm/swap.c                 | 43 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1701ce4..7625d2a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -242,6 +242,7 @@ extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern int lru_add_drain_all(void);
+extern int lru_add_drain_mapping(struct address_space *mapping);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 623488f..89c97e8 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -435,6 +435,7 @@ extern void drain_workqueue(struct workqueue_struct *wq);
 extern void flush_scheduled_work(void);
 
 extern int schedule_on_each_cpu(work_func_t func);
+extern int schedule_on_cpumask(work_func_t func, const cpumask_t *mask);
 
 int execute_in_process_context(work_func_t fn, struct execute_work *);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ee8e29a..c359e30 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2949,17 +2949,18 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork)
 EXPORT_SYMBOL(cancel_delayed_work_sync);
 
 /**
- * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * schedule_on_cpumask - execute a function synchronously on each online CPU
  * @func: the function to call
+ * @mask: the cpumask of CPUs to execute the function on
  *
- * schedule_on_each_cpu() executes @func on each online CPU using the
+ * schedule_on_each_cpu() executes @func on each CPU specified in the mask the
  * system workqueue and blocks until all CPUs have completed.
- * schedule_on_each_cpu() is very slow.
+ * schedule_on_cpumask() is potentially very slow.
  *
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-int schedule_on_each_cpu(work_func_t func)
+int schedule_on_cpumask(work_func_t func, const cpumask_t *mask)
 {
 	int cpu;
 	struct work_struct __percpu *works;
@@ -2970,14 +2971,14 @@ int schedule_on_each_cpu(work_func_t func)
 
 	get_online_cpus();
 
-	for_each_online_cpu(cpu) {
+	for_each_cpu_mask(cpu, *mask) {
 		struct work_struct *work = per_cpu_ptr(works, cpu);
 
 		INIT_WORK(work, func);
 		schedule_work_on(cpu, work);
 	}
 
-	for_each_online_cpu(cpu)
+	for_each_cpu_mask(cpu, *mask)
 		flush_work(per_cpu_ptr(works, cpu));
 
 	put_online_cpus();
@@ -2986,6 +2987,22 @@ int schedule_on_each_cpu(work_func_t func)
 }
 
 /**
+ * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * @func: the function to call
+ *
+ * schedule_on_each_cpu() executes @func on each online CPU using the
+ * system workqueue and blocks until all CPUs have completed.
+ * schedule_on_each_cpu() is very slow.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int schedule_on_each_cpu(work_func_t func)
+{
+	return schedule_on_cpumask(func, cpu_online_mask);
+}
+
+/**
  * flush_scheduled_work - ensure that any scheduled work has run to completion.
  *
  * Forces execution of the kernel-global workqueue and blocks until its
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..70c4a3d 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -132,7 +132,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 			 * pagevecs and try again.
 			 */
 			if (count < (end_index - start_index + 1)) {
-				lru_add_drain_all();
+				lru_add_drain_mapping(mapping);
 				invalidate_mapping_pages(mapping, start_index,
 						end_index);
 			}
diff --git a/mm/swap.c b/mm/swap.c
index dfd7d71..97de395 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -656,6 +656,49 @@ int lru_add_drain_all(void)
 }
 
 /*
+ * Drains LRU of some CPUs but only if the associated CPUs pagevec contain
+ * pages managed by the given mapping. This is racy and there is no guarantee
+ * that when it returns that there will still not be pages belonging to the
+ * mapping on a pagevec.
+ */
+int lru_add_drain_mapping(struct address_space *mapping)
+{
+	static cpumask_t pagevec_cpus;
+	int cpu;
+
+	cpumask_clear(&pagevec_cpus);
+
+	/*
+	 * get_online_cpus unnecessary as pagevecs persist after hotplug.
+	 * Count may change underneath us but at worst some pages are
+	 * missed or there is the occasional false positive.
+	 */
+	for_each_online_cpu(cpu) {
+		struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+		struct pagevec *pvec;
+		int lru;
+
+		for_each_lru(lru) {
+			int i;
+			pvec = &pvecs[lru - LRU_BASE];
+
+			for (i = 0; i < pagevec_count(pvec); i++) {
+				struct page *page = pvec->pages[i];
+
+				if (page->mapping == mapping) {
+					cpumask_set_cpu(cpu, &pagevec_cpus);
+					goto next_cpu;
+				}
+			}
+		}
+next_cpu:
+		;
+	}
+
+	return schedule_on_cpumask(lru_add_drain_per_cpu, &pagevec_cpus);
+}
+
+/*
  * Batched page_cache_release().  Decrement the reference count on all the
  * passed pages.  If it fell to zero then remove the page from the LRU and
  * free it.

next prev parent reply	other threads:[~2013-07-05 14:19 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <51BE1828.3060206@gmail.com>
2013-06-17 14:13 ` [-stable 3.8.1 performance regression] madvise POSIX_FADV_DONTNEED Mathieu Desnoyers
2013-06-17 21:24   ` Andrew Morton
2013-06-17 21:39     ` Raphaël Beamonte
     [not found]     ` <CAE_Gge34HCroSgNgiXL1j7Le3CNKRXR=7TZQhJSmY+wfWniKug@mail.gmail.com>
2013-06-17 21:57       ` [lttng-dev] " Andrew Morton
2013-06-18  2:15         ` Mathieu Desnoyers
2013-06-18  2:44           ` Andrew Morton
2013-06-18  9:29     ` Mel Gorman
2013-06-18 10:11       ` Mel Gorman
2013-06-19 19:25         ` Mathieu Desnoyers
2013-06-20  6:36           ` Rob van der Heij
     [not found]           ` <CAJCc=kijujORhPUmPvzHj-MMdyVbf-iHEK0Jx-VHbTO8q4ESFA@mail.gmail.com>
2013-06-20 12:20             ` Mathieu Desnoyers
2013-06-25  1:56               ` Dave Chinner
2013-07-02 13:58                 ` Mathieu Desnoyers
2013-07-03  0:55                   ` Mathieu Desnoyers
2013-07-03  8:47                     ` Mel Gorman
2013-07-03 14:53                       ` Jeff Moyer
2013-07-03 14:53                         ` Jeff Moyer
2013-07-04  0:03                         ` Dave Chinner
2013-07-04  0:31                           ` Mathieu Desnoyers
2013-07-04 21:11                             ` Rob van der Heij
2013-07-05  1:42                             ` Dave Chinner
2013-07-05  2:34                               ` Mathieu Desnoyers
2013-07-03 18:47                       ` Yannick Brosseau
2013-07-05 14:18                         ` Mel Gorman [this message]
2013-07-05 14:18                         ` Mel Gorman

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:1701ce4 dfblob:7625d2a dfblob:623488f dfblob:89c97e8
dfblob:ee8e29a dfblob:c359e30 dfblob:3bcfd81 dfblob:70c4a3d
dfblob:dfd7d71 dfblob:97de395 )
 OR (
bs:"Re: [-stable 3.8.1 performance regression] madvise POSIX_FADV_DONTNEED" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130705141851.GW1875@suse.de \
    --to=mgorman@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lttng-dev@lists.lttng.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=rvdheij@gmail.com \
    --cc=stable@vger.kernel.org \
    --cc=yannick.brosseau@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.