From: Mel Gorman <mgorman@techsingularity.net>
To: Dave Chinner <david@fromorbit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Michal Hocko <mhocko@suse.cz>, Minchan Kim <minchan@kernel.org>,
Vladimir Davydov <vdavydov@virtuozzo.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Vlastimil Babka <vbabka@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
Bob Peterson <rpeterso@redhat.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"Huang, Ying" <ying.huang@intel.com>,
Christoph Hellwig <hch@lst.de>,
Wu Fengguang <fengguang.wu@intel.com>, LKP <lkp@01.org>,
Tejun Heo <tj@kernel.org>, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Date: Fri, 19 Aug 2016 16:08:34 +0100 [thread overview]
Message-ID: <20160819150834.GP8119@techsingularity.net> (raw)
In-Reply-To: <20160818071111.GD22388@dastard>
On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > (ie we could move the locking to the caller, and then make
> > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > the mapping doesn't change), and that might result in fewer crazy
> > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > kind of workaround.
> > > >
> > >
> > > Even if such batching was implemented, it would be very specific to the
> > > case of a single large file filling LRUs on multiple nodes.
> > >
> >
> > The latest Jason Bourne movie was sufficiently bad that I spent time
> > thinking how the tree_lock could be batched during reclaim. It's not
> > straight-forward but this prototype did not blow up on UMA and may be
> > worth considering if Dave can test either approach has a positive impact.
>
> SO, I just did a couple of tests. I'll call the two patches "sleepy"
> for the contention backoff patch and "bourney" for the Jason Bourne
> inspired batching patch. This is an average of 3 runs, overwriting
> a 47GB file on a machine with 16GB RAM:
>
> IO throughput wall time __pv_queued_spin_lock_slowpath
> vanilla 470MB/s 1m42s 25-30%
> sleepy 295MB/s 2m43s <1%
> bourney 425MB/s 1m53s 25-30%
>
This is another blunt-force patch that

a) stalls all but one kswapd instance when tree_lock contention occurs
b) marks a pgdat congested when tree_lock contention is encountered,
   which may cause direct reclaimers to wait_iff_congested until
   kswapd finishes balancing the node
I tested this on a KVM instance running on a 4-socket box. The vCPUs
were bound to pCPUs and the KVM memory nodes were mapped to physical
memory nodes. Without the patch, 3% of kswapd cycles were spent acquiring
the tree_lock; with the patch, that dropped to 0.23%.

tree_lock contention in xfs_io was reduced from 0.63% to 0.39%, which is
not perfect. It could be reduced further by stalling all contended kswapd
instances, but then xfs_io falls into direct reclaim and throughput drops.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..f6d3e886f405 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -532,6 +532,7 @@ enum pgdat_flags {
* many pages under writeback
*/
PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */
+ PGDAT_CONTENDED, /* kswapd contending on tree_lock */
};
static inline unsigned long zone_end_pfn(const struct zone *zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 374d95d04178..64ca2148755c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -621,19 +621,43 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
return PAGE_CLEAN;
}
+static atomic_t kswapd_contended = ATOMIC_INIT(0);
+
/*
* Same as remove_mapping, but if the page is removed from the mapping, it
* gets returned with a refcount of 0.
*/
static int __remove_mapping(struct address_space *mapping, struct page *page,
- bool reclaimed)
+ bool reclaimed, unsigned long *nr_contended)
{
unsigned long flags;
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
- spin_lock_irqsave(&mapping->tree_lock, flags);
+ if (!nr_contended || !current_is_kswapd())
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ else {
+ /* Account for trylock contentions in kswapd */
+ if (!spin_trylock_irqsave(&mapping->tree_lock, flags)) {
+ pg_data_t *pgdat = page_pgdat(page);
+ int nr_kswapd;
+
+ /* Account for contended pages and contended kswapds */
+ (*nr_contended)++;
+ if (!test_and_set_bit(PGDAT_CONTENDED, &pgdat->flags))
+ nr_kswapd = atomic_inc_return(&kswapd_contended);
+ else
+ nr_kswapd = atomic_read(&kswapd_contended);
+ BUG_ON(nr_kswapd > nr_online_nodes || nr_kswapd < 0);
+
+ /* Stall kswapd if multiple kswapds are contending */
+ if (nr_kswapd > 1)
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ }
+ }
/*
* The non racy check for a busy page.
*
@@ -719,7 +743,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
*/
int remove_mapping(struct address_space *mapping, struct page *page)
{
- if (__remove_mapping(mapping, page, false)) {
+ if (__remove_mapping(mapping, page, false, NULL)) {
/*
* Unfreezing the refcount with 1 rather than 2 effectively
* drops the pagecache ref for us without requiring another
@@ -906,6 +930,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long *ret_nr_congested,
unsigned long *ret_nr_writeback,
unsigned long *ret_nr_immediate,
+ unsigned long *ret_nr_contended,
bool force_reclaim)
{
LIST_HEAD(ret_pages);
@@ -917,6 +942,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long nr_reclaimed = 0;
unsigned long nr_writeback = 0;
unsigned long nr_immediate = 0;
+ unsigned long nr_contended = 0;
cond_resched();
@@ -1206,7 +1232,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
lazyfree:
- if (!mapping || !__remove_mapping(mapping, page, true))
+ if (!mapping || !__remove_mapping(mapping, page, true, &nr_contended))
goto keep_locked;
/*
@@ -1263,6 +1289,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*ret_nr_unqueued_dirty += nr_unqueued_dirty;
*ret_nr_writeback += nr_writeback;
*ret_nr_immediate += nr_immediate;
+ *ret_nr_contended += nr_contended;
return nr_reclaimed;
}
@@ -1274,7 +1301,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
.priority = DEF_PRIORITY,
.may_unmap = 1,
};
- unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5;
+ unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5, dummy6;
struct page *page, *next;
LIST_HEAD(clean_pages);
@@ -1288,7 +1315,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
TTU_UNMAP|TTU_IGNORE_ACCESS,
- &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
+ &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, &dummy6, true);
list_splice(&clean_pages, page_list);
mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
return ret;
@@ -1693,6 +1720,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
unsigned long nr_unqueued_dirty = 0;
unsigned long nr_writeback = 0;
unsigned long nr_immediate = 0;
+ unsigned long nr_contended = 0;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1738,7 +1766,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
&nr_dirty, &nr_unqueued_dirty, &nr_congested,
- &nr_writeback, &nr_immediate,
+ &nr_writeback, &nr_immediate, &nr_contended,
false);
spin_lock_irq(&pgdat->lru_lock);
@@ -1789,6 +1817,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
set_bit(PGDAT_CONGESTED, &pgdat->flags);
/*
+ * Tag the node as congested if kswapd encounters contended pages
+ * as it may indicate contention with a heavy writer or
+ * other kswapd instances. The tag may stall direct reclaimers
+ * in wait_iff_congested.
+ */
+ if (nr_contended && current_is_kswapd())
+ set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
+ /*
* If dirty pages are scanned that are not queued for IO, it
* implies that flushers are not keeping up. In this case, flag
* the pgdat PGDAT_DIRTY and kswapd will start writing pages from
@@ -1805,6 +1842,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
*/
if (nr_immediate && current_may_throttle())
congestion_wait(BLK_RW_ASYNC, HZ/10);
+
}
/*
@@ -3109,6 +3147,9 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+ if (test_and_clear_bit(PGDAT_CONTENDED, &zone->zone_pgdat->flags))
+ atomic_dec(&kswapd_contended);
+
return true;
}