From: Dave Hansen <dave.hansen@intel.com>
To: Dave Chinner <david@fromorbit.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andi Kleen <ak@linux.intel.com>, Hugh Dickins <hughd@google.com>,
	Michal Hocko <mhocko@kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] shmem: avoid huge pages for small files
Date: Mon, 24 Oct 2016 13:34:53 -0700
Message-ID: <580E706D.6030905@intel.com>
In-Reply-To: <20161021225013.GS14023@dastard>

On 10/21/2016 03:50 PM, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
>> To me, most of the things you're talking about are highly dependent on access
>> pattern generated by userspace:
>>
>>   - we may want to allocate huge pages from byte 1 if we know that file
>>     will grow;
> 
> delayed allocation takes care of that. We use a growing speculative
> delalloc size that kicks in at specific sizes and can be used
> directly to determine if a large page should be allocated. This code
> is aware of sparse files, sparse writes, etc.

OK, so somebody does a write() of 1 byte.  We can delay the underlying
block allocation for a long time, but we can *not* delay the memory
allocation.  We've got to decide before the write() returns.

How does delayed allocation help with that decision?

I guess we could (always?) allocate small pages up front, and then only
bother promoting them once the FS delayed-allocation code kicks in and
is *also* giving us underlying large allocations.  That punts the logic
to the filesystem, which is a bit counterintuitive, but it seems
relatively sane.
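
A minimal sketch of that "start small, promote later" idea, purely for
illustration. Every name here (sketch_extent_hint, sketch_can_promote,
and so on) is hypothetical and not an existing kernel interface, and the
4 KB / 2 MB sizes are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sizes; real values depend on the architecture. */
#define SKETCH_SMALL_PAGE (4UL * 1024)
#define SKETCH_HUGE_PAGE  (2UL * 1024 * 1024)

/*
 * Hypothetical hint from the filesystem: one contiguous extent that its
 * delayed-allocation code has decided to back, as a byte range in the file.
 */
struct sketch_extent_hint {
	size_t start;
	size_t len;
};

/* The write() path cannot wait for the filesystem, so it always starts small. */
static size_t sketch_page_size_for_write(void)
{
	return SKETCH_SMALL_PAGE;
}

/*
 * Later, allow promotion of the huge-page-aligned range number huge_index
 * only if the hinted extent covers all of it.
 */
static bool sketch_can_promote(const struct sketch_extent_hint *hint,
			       size_t huge_index)
{
	size_t range_start = huge_index * SKETCH_HUGE_PAGE;
	size_t range_end = range_start + SKETCH_HUGE_PAGE;

	assert(hint != NULL);
	return hint->start <= range_start &&
	       hint->start + hint->len >= range_end;
}
```

With a 1-byte write the hint covers no full 2 MB range, so nothing gets
promoted; only once the filesystem reports a large enough extent does the
check pass for the ranges that extent fully covers.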

>>> As such, there is no way we should be considering different
>>> interfaces and methods for configuring the /same functionality/ just
>>> because DAX is enabled or not. It's the /same decision/ that needs
>>> to be made, and the filesystem knows an awful lot more about whether
>>> huge pages can be used efficiently at the time of access than just
>>> about any other actor you can name....
>>
>> I'm not convinced that the filesystem is in a better position to see access
>> patterns than mm for page cache. It's not all about on-disk layout.
> 
> Spoken like a true mm developer. IO performance is all about IO
> patterns, and the primary contributor to bad IO patterns is bad
> filesystem allocation patterns.... :P

For writes, I think you have a good point.  Managing a horribly
fragmented file with larger pages and eating the associated write
magnification that comes along with it seems like a recipe for disaster.
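
To put a rough number on that, here is a small worked example. It assumes
dirty tracking happens per page, so that a dirtied page is written back
whole, and that the dirty bytes are contiguous from a page boundary; the
helper name and the 4 KB / 2 MB sizes are illustrative assumptions only:

```c
#include <stddef.h>

/*
 * Bytes that must be written back when dirty_bytes contiguous bytes,
 * starting at a page boundary, are dirtied and writeback works in whole
 * pages of page_size bytes.
 */
static size_t sketch_writeback_bytes(size_t dirty_bytes, size_t page_size)
{
	size_t pages = (dirty_bytes + page_size - 1) / page_size;

	return pages * page_size;
}
```

Under those assumptions, a 1-byte write costs 4096 bytes of writeback with
4 KB pages and 2097152 bytes with a 2 MB page, a factor of 512, before
counting any extra IOs caused by the fragmented on-disk layout.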

But isn't some level of disconnection between the page cache and the
underlying IO patterns a *good* thing?  Once we've gone to the trouble
of bringing some (potentially very fragmented) data into the page cache,
why _not_ manage it in a lower-overhead way if we can?  For read-only
data it seems like a no-brainer that we'd want things in as large of a
management unit as we can get.

IOW, why let the underlying block allocation layout hamstring how the
memory is managed?

