From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id E8F1E6B0005 for ; Tue, 23 Feb 2016 12:25:26 -0500 (EST) Received: by mail-pa0-f53.google.com with SMTP id fy10so113391368pac.1 for ; Tue, 23 Feb 2016 09:25:26 -0800 (PST) Received: from mga03.intel.com (mga03.intel.com. [134.134.136.65]) by mx.google.com with ESMTP id uk9si18869493pac.166.2016.02.23.09.25.25 for ; Tue, 23 Feb 2016 09:25:26 -0800 (PST) Date: Tue, 23 Feb 2016 10:25:12 -0700 From: Ross Zwisler Subject: Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag Message-ID: <20160223172512.GC15877@linux.intel.com> References: <56CA1CE7.6050309@plexistor.com> <56CA2AC9.7030905@plexistor.com> <20160221223157.GC25832@dastard> <20160222174426.GA30110@infradead.org> <257B23E37BCB93459F4D566B5EBAEAC550098A32@FMSMSX106.amr.corp.intel.com> <20160223095225.GB32294@infradead.org> <56CC686A.9040909@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <56CC686A.9040909@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Christoph Hellwig , "Rudoff, Andy" , Dave Chinner , Jeff Moyer , Arnd Bergmann , linux-nvdimm , Oleg Nesterov , linux-mm , Mel Gorman , Johannes Weiner , "Kirill A. Shutemov" On Tue, Feb 23, 2016 at 04:10:50PM +0200, Boaz Harrosh wrote: > On 02/23/2016 11:52 AM, Christoph Hellwig wrote: > <> > > > > And this is BS. Using msync or fsync might not perform as well as not > > actually using them, but without them you do not get persistence. If > > you use your pmem as a throw away cache that's fine, but for most people > > that is not the case. > > > > Hi Christoph > > So is exactly my suggestion. My approach is *not* the we do not call > m/fsync to let the FS clean up. > > In my model we still do that, only we eliminate the m/fsync slowness > and the all page faults overhead by being instructed by the application > that we do not need to track the data modified cachelines. Since the > application is telling us that it will do so. > > In my model the job is split: > App will take care of data persistence by instructing a MAP_PMEM_AWARE, > and doing its own cl_flushing / movnt. > Which is the heavy cost > > The FS will keep track of the Meta-Data persistence as it already does, via the > call to m/fsync. Which is marginal performance compared to the above heavy > IO. > > Note that the FS is still free to move blocks around, as Dave said: > lockout pagefaultes, unmap from user space, let app fault again on a new > block. this will still work as before, already in COW we flush the old > block so there will be no persistence lost. > > So this all thread started with my patches, and my patches do not say > "no m/fsync" they say, make this 3-8 times faster than today if the app > is participating in the heavy lifting. > > Please tell me what you find wrong with my approach? It seems like we are trying to solve a couple of different problems: 1) Make page faults faster by skipping any radix tree insertions, tag updates, etc. 2) Make fsync/msync faster by not flushing data that the application says it is already making durable from userspace. I agree that your approach seems to improve both of these problems, but I would argue that it is an incomplete solution for problem #2 because a fsync/msync from the PMEM aware application would still flush any radix tree entries from *other* threads that were writing to the same file. It seems like a more direct solution for #2 above would be to have a metadata-only equivalent of fsync/fdatasync, say "fmetasync", which says "I'll make the writes I do to my mmaps durable from userspace, but I need you to sync all filesystem metadata for me, please". This would allow a complete separation of data synchronization in userspace from metadata synchronization in kernel space by the filesystem code. By itself a fmetasync() type solution of course would do nothing for issue #1 - if that was a compelling issue you'd need something like the mmap tag you're proposing to skip work on page faults. All that being said, though, I agree with others in the thread that we should still be focused on correctness, as we have a lot of correctness issues remaining. When we eventually get to the place where we are trying to do performance optimizations, those optimizations should be measurement driven. - Ross -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org