From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Chinner
Subject: Re: Subtle races between DAX mmap fault and write path
Date: Mon, 1 Aug 2016 17:39:06 +1000
Message-ID: <20160801073906.GK16044@dastard>
References: <20160727221949.GU16044@dastard> <20160728081033.GC4094@quack2.suse.cz> <20160729022152.GZ16044@dastard> <20160730001249.GE16044@dastard> <20160801014645.GI16044@dastard> <86k2g15gh8.fsf@hiro.keithp.com> <20160801040737.GJ16044@dastard>
To: Dan Williams
Cc: Keith Packard, Jan Kara, "linux-nvdimm@lists.01.org", XFS Developers, linux-fsdevel, linux-ext4
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

On Sun, Jul 31, 2016 at 09:39:38PM -0700, Dan Williams wrote:
> On Sun, Jul 31, 2016 at 9:07 PM, Dave Chinner wrote:
> > OTOH, DAX directly exposes the physical layout to the filesystem.
> > And because it's DAX-based pmem and not cached struct pages, we
> > can't run vm_map_ram() to virtually map the range we need to see as
> > a contiguous range, as we do in XFS for large objects such as directory
> > blocks and log buffers. For other large objects such as inode
> > clusters, we can directly map each page as the objects within the
> > clusters are page aligned and never overlap page boundaries, but
> > that only works for inode and dquot buffers. Hence DAX as it stands
> > makes it extremely difficult to "retrofit" DAX into all aspects of
> > existing filesystems because exposing physical discontiguities breaks
> > code that assumes they don't exist.
>
> On this specific point about page remapping, the administrator can
> configure struct pages for pmem and you can detect whether they are
> present in the filesystem with pfn_t_has_page(). I.e. you could
> require pages be present for XFS, if that helps...

It's kinda silly to require struct pages for the entire pmem device
if they are only needed for accessing a (comparatively) small amount
of metadata. Besides, now that I look at it more deeply, we can't
use virtually mapped pmem for the log buffers. We can't allocate
memory at the point in time where we work out what LBA in the log we
need to map to physical pmem for the current log write. Hence calls
to vm_map_ram() can't be used, and so that rules out using mapped
page based pmem for log buffers.

I'll probably have to rewrite the xlog_write() engine completely to be
able to handle discontiguous pages in the iclog buffers before we
can consider mapping them via DAX now, and I'm really not sure it's
worth the effort. I'd much prefer to spend time designing a native
pmem filesystem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
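
For readers following the thread, the discontiguity problem discussed
above can be illustrated with a minimal user-space C sketch. It is not
XFS or kernel code, and it does not show how xlog_write() works. It
only shows the general idea that a writer which can no longer assume
one virtually contiguous buffer has to split each copy at page
boundaries. The names struct page_vec, pv_write() and SKETCH_PAGE_SIZE
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed page size for the sketch; real systems may differ. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical buffer made of separately mapped pages. The pages are
 * logically consecutive but need not be adjacent in the address space.
 */
struct page_vec {
	unsigned char **pages;	/* one pointer per page */
	size_t npages;		/* number of entries in pages[] */
};

/*
 * Copy len bytes from src into the page vector starting at logical
 * byte offset off, splitting the copy wherever it crosses a page
 * boundary. Returns 0 on success, -1 if the range does not fit.
 */
static int pv_write(struct page_vec *pv, size_t off,
		    const void *src, size_t len)
{
	const unsigned char *p = src;
	size_t total;

	assert(pv != NULL && src != NULL);

	total = pv->npages * SKETCH_PAGE_SIZE;
	if (off > total || len > total - off)
		return -1;

	while (len > 0) {
		size_t idx = off / SKETCH_PAGE_SIZE;	  /* which page */
		size_t in_page = off % SKETCH_PAGE_SIZE;  /* offset in page */
		size_t chunk = SKETCH_PAGE_SIZE - in_page; /* room left in page */

		if (chunk > len)
			chunk = len;

		memcpy(pv->pages[idx] + in_page, p, chunk);
		p += chunk;
		off += chunk;
		len -= chunk;
	}
	return 0;
}
```

With a virtually contiguous mapping the loop collapses to a single
memcpy(). Every piece of code that formats data across a page boundary
needs this kind of splitting once the mapping is gone, which is why the
message above talks about a rewrite rather than a small patch.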