From: Dave Chinner <david@fromorbit.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
	Joonsoo Kim <js1304@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests
Date: Thu, 13 Oct 2016 11:29:24 +1100
Message-ID: <20161013002924.GO23194@dastard>
In-Reply-To: <20161007131814.GL18439@dhcp22.suse.cz>

On Fri, Oct 07, 2016 at 03:18:14PM +0200, Michal Hocko wrote:
> On Thu 06-10-16 13:11:42, Dave Chinner wrote:
> > On Wed, Oct 05, 2016 at 01:38:45PM +0200, Michal Hocko wrote:
> > > On Wed 05-10-16 07:32:02, Dave Chinner wrote:
> > > > On Tue, Oct 04, 2016 at 10:12:15AM +0200, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
> > > > > direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
> > > > > direct compact when a high-order allocation fails"). The main reason
> > > > > is that the migration of page cache pages might recurse back into the
> > > > > fs/io layer and we could potentially deadlock. This is overly
> > > > > conservative, because all the anonymous memory is migratable in the
> > > > > GFP_NOFS context just fine. This might be a large portion of the
> > > > > memory in many/most workloads.
> > > > > 
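> > > > > For reference, the gate being relaxed here looks roughly like the
> > > > > following (a sketch modeled on the check added by 56de7263fcf3; exact
> > > > > names and placement vary between kernel versions):
> > > > > 
> > > > > 	/* mm/compaction.c, try_to_compact_pages() */
> > > > > 	int may_enter_fs = gfp_mask & __GFP_FS;
> > > > > 	int may_perform_io = gfp_mask & __GFP_IO;
> > > > > 
> > > > > 	/*
> > > > > 	 * Compaction might migrate page cache pages, which could
> > > > > 	 * recurse into the fs/io layer, so skip it entirely for
> > > > > 	 * NOFS/NOIO requests.
> > > > > 	 */
> > > > > 	if (!order || !may_enter_fs || !may_perform_io)
> > > > > 		return COMPACT_SKIPPED;
> > > > > 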
> > > > > Remove the GFP_NOFS restriction and make sure that we skip all fs pages
> > > > > (those with a mapping) while isolating pages to be migrated. We cannot
> > > > > consider even clean fs pages, because they might need a metadata
> > > > > update, so we only isolate pages without any mapping for NOFS requests.
> > > > > 
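> > > > > A minimal sketch of that check (assuming the gfp mask is available
> > > > > through struct compact_control in the isolation loop; the exact hook
> > > > > point in isolate_migratepages_block() may differ):
> > > > > 
> > > > > 	/* mm/compaction.c, isolate_migratepages_block() */
> > > > > 	/*
> > > > > 	 * In NOFS context only anonymous (mapping-less) pages are
> > > > > 	 * known to be safe to migrate; fs-backed pages may need fs
> > > > > 	 * locks or a metadata update to be moved.
> > > > > 	 */
> > > > > 	if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
> > > > > 		goto isolate_fail;
> > > > > 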
> > > > > The effect of this patch will probably be very limited in many/most
> > > > > workloads because higher-order GFP_NOFS requests are quite rare,
> > > > 
> > > > You say they are rare only because you don't know how to trigger
> > > > them easily.  :/
> > > 
> > > true
> > > 
> > > > Try this:
> > > > 
> > > > # mkfs.xfs -f -n size=64k <dev>
> > > > # mount <dev> /mnt/scratch
> > > > # time ./fs_mark  -D  10000  -S0  -n  100000  -s  0  -L  32 \
> > > >         -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
> > > >         -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
> > > >         -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
> > > >         -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
> > > >         -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
> > > >         -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
> > > >         -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
> > > >         -d  /mnt/scratch/14  -d  /mnt/scratch/15
> > > 
> > > Does this simulate a standard or usual fs workload/configuration?  I am
> > 
> > Unfortunately, there was an era of cargo-cult configuration tweaks
> > in the Ceph community that resulted in a large number of
> > production machines with XFS filesystems configured this way. And a
> > lot of them store large numbers of small files and run under
> > significant sustained memory pressure.
> 
> I see
> 
> > I'm slowly working towards getting rid of these high-order allocations
> > and replacing them with the equivalent number of single-page
> > allocations, but I haven't got that (complex) change working yet.
> 
> Definitely a good plan!
> 
> Anyway I was playing with this in my virtual machine (4 CPUs, 512MB of
> RAM split into two NUMA nodes). Started on a freshly created fs after
> boot, no other load in the guest. The performance numbers should be
> taken with a grain of salt, though, because the host has 4 CPUs as well
> and it wasn't completely idle, but they should be good enough to give
> us at least some picture. This is what fs_mark told me:
> Unpatched kernel:
> #       Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 2016
> #       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> #       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> #       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> #       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
> #       App overhead is time in microseconds spent in the test not doing file writing related system calls.
> #
> FSUse%        Count         Size    Files/sec     App Overhead
>      1      1600000            0       4300.1         20745838
>      3      3200000            0       4239.9         23849857
>      5      4800000            0       4243.4         25939543
>      6      6400000            0       4248.4         19514050
>      8      8000000            0       4262.1         20796169
>      9      9600000            0       4257.6         21288675
>     11     11200000            0       4259.7         19375120
>     13     12800000            0       4220.7         22734141
>     14     14400000            0       4238.5         31936458
>     16     16000000            0       4231.5         23409901
>     18     17600000            0       4045.3         23577700
>     19     19200000            0       2783.4         58299526
>     21     20800000            0       2678.2         40616302
>     23     22400000            0       2693.5         83973996
> Ctrl+C because it just took too long.

Try running it on a larger filesystem, or configure the fs with more
AGs and a larger log (e.g. mkfs.xfs -f -d agcount=24 -l size=512m
<dev>). That will speed up modifications and increase concurrency.
This test should be able to run 5-10x faster than this (it
sustains 55,000 files/s @ 300MB/s write on my test fs on a cheap
SSD).


> while it doesn't seem to drop the Files/sec numbers starting with
> Count=19200000. I also see only a single
> 
> [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)

Remember that this warning is emitted only after /100/ consecutive
allocation failures. So the fact that it is still being emitted means
the situation is not drastically better....
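
For reference, the retry loop emitting it looks roughly like this (a
sketch along the lines of fs/xfs/kmem.c of this era; details vary by
kernel version):

	void *
	kmem_alloc(size_t size, xfs_km_flags_t flags)
	{
		int	retries = 0;
		gfp_t	lflags = kmem_flags_convert(flags);
		void	*ptr;

		do {
			ptr = kmalloc(size, lflags);
			if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
				return ptr;
			/* Warn only once per 100 consecutive failures. */
			if (!(++retries % 100))
				xfs_err(NULL,
	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
					current->comm, current->pid,
					(unsigned int)size, __func__, lflags);
			congestion_wait(BLK_RW_ASYNC, HZ/50);
		} while (1);
	}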

> Unpatched kernel
> all orders
> begin:44.718798 end:5774.618736 allocs:15019288
> order > 0 
> begin:44.718798 end:5773.587195 allocs:10438610
> 
> Patched kernel
> all orders
> begin:64.612804 end:5794.193619 allocs:16110081 [107.2%]
> order > 0
> begin:64.612804 end:5794.193619 allocs:11741492 [112.5%]
> 
> which would suggest that diving into compaction, rather than backing
> off and waiting for kcompactd to do the work for us, was indeed the
> better strategy and helped the throughput.

Well, without a success/failure ratio being calculated it's hard to
tell what improvement it made. Did it increase the success rate, or
reduce failure latency so retries happened faster?
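
One way to get that ratio is to sample the direct compaction counters
in /proc/vmstat (compact_success and compact_fail) before and after
the run. A minimal userspace sketch, assuming those counter names as
exported by kernels of this vintage:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		FILE *f = fopen("/proc/vmstat", "r");
		char name[64];
		unsigned long long val, success = 0, fail = 0;

		if (!f) {
			perror("/proc/vmstat");
			return 1;
		}
		/* Each line of /proc/vmstat is "<name> <value>". */
		while (fscanf(f, "%63s %llu", name, &val) == 2) {
			if (!strcmp(name, "compact_success"))
				success = val;
			else if (!strcmp(name, "compact_fail"))
				fail = val;
		}
		fclose(f);
		if (success + fail)
			printf("direct compaction success: %.1f%%\n",
			       100.0 * success / (success + fail));
		return 0;
	}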

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

