From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1948481AbcHRNYX (ORCPT <rfc822;w@1wt.eu>);
	Thu, 18 Aug 2016 09:24:23 -0400
Received: from outbound-smtp09.blacknight.com ([46.22.139.14]:32891 "EHLO
	outbound-smtp09.blacknight.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1947219AbcHRNYT (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 18 Aug 2016 09:24:19 -0400
Date: Thu, 18 Aug 2016 14:24:14 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: Dave Chinner <david@fromorbit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
        Michal Hocko <mhocko@suse.cz>, Minchan Kim <minchan@kernel.org>,
        Vladimir Davydov <vdavydov@virtuozzo.com>,
        Johannes Weiner <hannes@cmpxchg.org>, Vlastimil Babka <vbabka@suse.cz>,
        Andrew Morton <akpm@linux-foundation.org>,
        Bob Peterson <rpeterso@redhat.com>,
        "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
        "Huang, Ying" <ying.huang@intel.com>, Christoph Hellwig <hch@lst.de>,
        Wu Fengguang <fengguang.wu@intel.com>, LKP <lkp@01.org>,
        Tejun Heo <tj@kernel.org>, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Message-ID: <20160818132414.GK8119@techsingularity.net>
References: <CA+55aFxp+rLehC8c157uRbH459wUC1rRPfCVgvmcq5BrG9gkyg@mail.gmail.com>
 <20160815222211.GA19025@dastard>
 <20160815224259.GB19025@dastard>
 <CA+55aFzOAorMxCsv3uyyyhS8c5xteVnZVEm+bGyBjkjWVT5Zag@mail.gmail.com>
 <CA+55aFwp-Aeu-6j2MfMgEDoUwq+1vThL4nBdMj-p5TqDMA5RrA@mail.gmail.com>
 <20160816150500.GH8119@techsingularity.net>
 <CA+55aFxCSU=Hy7OqRxHoJDx1ruMD3H2qvmy4hdZ0Bjx94dwDug@mail.gmail.com>
 <20160817154907.GI8119@techsingularity.net>
 <20160818004517.GJ8119@techsingularity.net>
 <20160818071111.GD22388@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20160818071111.GD22388@dastard>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > (ie we could move the locking to the caller, and then make
> > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > the mapping doesn't change), and that might result in fewer crazy
> > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > kind of workaround.
> > > > 
> > > 
> > > Even if such batching was implemented, it would be very specific to the
> > > case of a single large file filling LRUs on multiple nodes.
> > > 
> > 
> > The latest Jason Bourne movie was sufficiently bad that I spent time
> > thinking how the tree_lock could be batched during reclaim. It's not
> > straightforward but this prototype did not blow up on UMA and may be
> > worth considering if Dave can test whether either approach has a
> > positive impact.
> 
> So, I just did a couple of tests. I'll call the two patches "sleepy"
> for the contention backoff patch and "bourney" for the Jason Bourne
> inspired batching patch. This is an average of 3 runs, overwriting
> a 47GB file on a machine with 16GB RAM:
> 
> 		IO throughput	wall time __pv_queued_spin_lock_slowpath
> vanilla		470MB/s		1m42s		25-30%
> sleepy		295MB/s		2m43s		<1%
> bourney		425MB/s		1m53s		25-30%
> 

Thanks. I updated the tests today and reran them trying to reproduce what
you saw but I'm simply not seeing it on bare metal with a spinning disk.

xfsio Throughput
                          4.8.0-rc2             4.8.0-rc2             4.8.0-rc2
                            vanilla                sleepy               bourney
Min      tput    147.4450 (  0.00%)    147.2580 (  0.13%)    147.3900 (  0.04%)
Hmean    tput    147.5853 (  0.00%)    147.5101 (  0.05%)    147.6121 ( -0.02%)
Stddev   tput      0.1041 (  0.00%)      0.1785 (-71.47%)      0.2036 (-95.63%)
CoeffVar tput      0.0705 (  0.00%)      0.1210 (-71.56%)      0.1379 (-95.59%)
Max      tput    147.6940 (  0.00%)    147.6420 (  0.04%)    147.8820 ( -0.13%)
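
The percentages in parentheses appear to be computed relative to the
vanilla column as (baseline - value) / baseline, so a positive figure
means the value is below vanilla. That formula is inferred from the
numbers in the table, not taken from the reporting tool. A minimal
Python sketch, using a hypothetical helper delta_pct(), recomputes a
few of the entries under that assumption:

```python
def delta_pct(baseline, value):
    """Percent by which value is below baseline (negative if above).

    Assumed formula, inferred from the table; not from the test tool.
    """
    return (baseline - value) / baseline * 100.0


# (vanilla, sleepy, bourney) values copied from the table above.
rows = {
    "Min": (147.4450, 147.2580, 147.3900),
    "Hmean": (147.5853, 147.5101, 147.6121),
    "Stddev": (0.1041, 0.1785, 0.2036),
    "Max": (147.6940, 147.6420, 147.8820),
}

for name, (vanilla, sleepy, bourney) in rows.items():
    print(f"{name:8s} sleepy {delta_pct(vanilla, sleepy):8.2f}%"
          f"  bourney {delta_pct(vanilla, bourney):8.2f}%")
```

On that reading every throughput figure is within 0.13% of vanilla;
only the run-to-run variance differs, and the recomputed Stddev deltas
can be off in the last digits because the table values are rounded.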

I'm currently setting up a KVM instance that may fare better. Due to
quirks of where machines are, I have to set up the KVM instance on real
NUMA hardware but maybe that'll make the problem even more obvious.

> The overall CPU usage of sleepy was much lower than the others, but
> it was also much slower. Too much sleeping and not enough reclaim
> work being done, I think.
> 

Looks like it. On my initial test, there was barely any sleeping.

> As for bourney, it's not immediately clear as to why it's nearly as
> bad as the movie. At worst I would have expected it to have no
> noticeable impact, but maybe we are delaying freeing of pages too
> long and so stalling allocation of new pages? It also doesn't do
> much to reduce contention, especially considering the reduction in
> throughput.
> 
> On a hunch that the batch list isn't all one mapping, I sorted it.
> Patch is below if you're curious.
> 

The fact that sorting makes such a difference makes me think that it's
the wrong direction. It's far too specific to this test case and does
nothing to throttle a reclaimer. It's also fairly complex and I expected
that normal users of remove_mapping(), such as truncation, would take a hit.

The hit of bouncing the lock around just hurts too much.
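
A toy model of why sorting the batch changes the picture, assuming the
lock is held across consecutive pages that share a mapping and dropped
whenever the mapping changes. The function name and the sample batch
are hypothetical, and the model only counts lock acquisitions within
one batch; it says nothing about cross-CPU contention or about
throttling a reclaimer.

```python
from itertools import groupby


def lock_acquisitions(mappings):
    """Count lock acquisitions for a batch of pages.

    Each run of consecutive pages with the same mapping costs one
    acquisition, since the lock is kept until the mapping changes.
    """
    return sum(1 for _ in groupby(mappings))


# Hypothetical batch: pages from two mappings, partly interleaved.
batch = ["A", "B", "A", "B", "A", "A", "B", "B"]

unbatched = len(batch)                          # one lock per page: 8
batched = lock_acquisitions(batch)              # runs as queued: 6
batched_sorted = lock_acquisitions(sorted(batch))  # one run per mapping: 2

print(unbatched, batched, batched_sorted)
```

In this model batching alone saves little when mappings interleave, and
sorting only helps when the batch really does mix mappings, which is
what makes the approach look specific to this workload.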

> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.
> 

That is a terrifying "fix" for this problem. It just happens to work
because there is no spillover to other nodes so only one kswapd instance
is potentially active.

> Anyway, I've burnt enough erase cycles on this SSD for today....
> 

I'll continue looking at getting KVM up and running and then consider
other possibilities for throttling.

-- 
Mel Gorman
SUSE Labs