Subject: Re: [PATCH] shmem: avoid huge pages for small files
To: Dave Chinner, "Kirill A. Shutemov"
Cc: Andi Kleen, Hugh Dickins, Michal Hocko, "Kirill A. Shutemov",
 Andrea Arcangeli, Andrew Morton, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
From: Dave Hansen
Date: Mon, 24 Oct 2016 13:34:53 -0700
Message-ID: <580E706D.6030905@intel.com>
In-Reply-To: <20161021225013.GS14023@dastard>
References: <20161017145539.GA26930@node.shutemov.name>
 <20161018142007.GL12092@dhcp22.suse.cz>
 <20161018143207.GA5833@node.shutemov.name>
 <20161018183023.GC27792@dhcp22.suse.cz>
 <20161020103946.GA3881@node.shutemov.name>
 <20161020224630.GO23194@dastard>
 <20161021020116.GD1075@tassilo.jf.intel.com>
 <20161021050118.GR23194@dastard>
 <20161021150007.GA13597@node.shutemov.name>
 <20161021225013.GS14023@dastard>

On 10/21/2016 03:50 PM, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
>> To me, most of things you're talking about is highly dependent on access
>> pattern generated by userspace:
>>
>> - we may want to allocate huge pages from byte 1 if we know that file
>> will grow;
>
> delayed allocation takes care of that.
> We use a growing speculative
> delalloc size that kicks in at specific sizes and can be used
> directly to determine if a large page should be allocated. This code
> is aware of sparse files, sparse writes, etc.

OK, so somebody does a write() of 1 byte.  We can delay the underlying
block allocation for a long time, but we can *not* delay the memory
allocation.  We've got to decide before the write() returns.

How does delayed allocation help with that decision?

I guess we could (always?) allocate small pages up front, and then only
bother promoting them once the FS delayed-allocation code kicks in and
is *also* giving us underlying large allocations.  That punts the logic
to the filesystem, which is a bit counterintuitive, but it seems
relatively sane.

>>> As such, there is no way we should be considering different
>>> interfaces and methods for configuring the /same functionality/ just
>>> because DAX is enabled or not. It's the /same decision/ that needs
>>> to be made, and the filesystem knows an awful lot more about whether
>>> huge pages can be used efficiently at the time of access than just
>>> about any other actor you can name....
>>
>> I'm not convinced that filesystem is in better position to see access
>> patterns than mm for page cache. It's not all about on-disk layout.
>
> Spoken like a true mm developer. IO performance is all about IO
> patterns, and the primary contributor to bad IO patterns is bad
> filesystem allocation patterns.... :P

For writes, I think you have a good point.  Managing a horribly
fragmented file with larger pages and eating the associated write
magnification that comes along with it seems like a recipe for disaster.

But isn't some level of disconnection between the page cache and the
underlying IO patterns a *good* thing?  Once we've gone to the trouble
of bringing some (potentially very fragmented) data into the page cache,
why _not_ manage it in a lower-overhead way if we can?
For read-only data it seems like a no-brainer that we'd want things in
as large a management unit as we can get.

IOW, why let the underlying block allocation layout hamstring how the
memory is managed?
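
To make the "small pages up front, promote later" idea above concrete,
here is a rough toy model in Python.  It is only a sketch and not kernel
code: the ToyPageCache class, the fs_allocated_extent() hook, and the
4 KiB / 2 MiB sizes are all assumptions made up for illustration.
write() always gets small pages immediately, since the memory has to
exist before write() returns, and a range is promoted only once the
(hypothetical) filesystem hook reports a large, aligned allocation that
covers it.

```python
SMALL = 4096               # assumed base page size
HUGE = 2 * 1024 * 1024     # assumed huge page size
PER_HUGE = HUGE // SMALL   # small pages per huge page


class ToyPageCache:
    """Toy model of the policy sketched in the mail; purely illustrative."""

    def __init__(self):
        self.small = set()  # indices of cached small pages
        self.huge = set()   # indices of huge-page-sized ranges already promoted

    def write(self, offset, length):
        # Memory cannot be delayed: allocate small pages before returning.
        if length <= 0:
            return
        first = offset // SMALL
        last = (offset + length - 1) // SMALL
        for idx in range(first, last + 1):
            if idx // PER_HUGE not in self.huge:
                self.small.add(idx)

    def fs_allocated_extent(self, start, length):
        # Hypothetical hook: the filesystem's delayed allocation has just
        # produced a contiguous extent [start, start + length).
        first = -(-start // HUGE)        # first huge-aligned range, rounded up
        end = (start + length) // HUGE   # one past the last fully covered range
        for h in range(first, end):
            pages = set(range(h * PER_HUGE, (h + 1) * PER_HUGE))
            # Promote only ranges that already have something cached.
            if self.small & pages:
                self.small -= pages
                self.huge.add(h)
```

In this model a 1-byte write() leaves a single small page behind, and
nothing is promoted until the filesystem reports an extent covering the
whole aligned 2 MiB range; an extent smaller than that leaves the small
pages alone.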