From: Dave Chinner
Subject: Re: async O_DIRECT vs. buffered synchronization (or lack thereof)
Date: Sun, 25 Oct 2015 08:50:14 +1100
Message-ID: <20151024215014.GF8773@dastard>
References: <20151022234245.GA20005@kmo-pixel>
In-Reply-To: <20151022234245.GA20005@kmo-pixel>
To: Kent Overstreet
Cc: linux-fsdevel@vger.kernel.org, Alexander Viro, Andrew Morton, Theodore Ts'o

On Thu, Oct 22, 2015 at 03:42:45PM -0800, Kent Overstreet wrote:
> While off reading code, I noticed something that didn't look quite right...
>
> Look at generic_file_direct_write(), in mm/filemap.c. What the code there is
> doing is:
>
>  - dropping the range we're writing to from the page cache (writing it first if
>    necessary), then
>  - doing the write, then
>  - invalidating that range in the pagecache again.
....
> Yet _another_ fun fact: I mentioned that for the filemap_write_and_wait_range();
> invalidate_inode_pages2() sequence to work we have to be preventing pages from
> being redirtied. Well, i_mutex does the job for buffered writes, but not
> page faults - AFAICT page_mkwrite() would have to take i_mutex for this code to
> not race with page faults, and the default page_mkwrite implementation
> (filemap_page_mkwrite()) definitely does not.
>
> It does _lock_ the page though, so if we had something that combined
> filemap_write_and_wait_range() with invalidating pages, making sure to have the
> page still locked when removing it from the page cache - that ought to work.
>
> XFS does seem to attempt to get this right - its .page_mkwrite takes the inode
> XFS_MMAPLOCK_SHARED lock, and the xfs truncate and fallocate code both take
> XFS_MMAPLOCK_EXCL (truncate and (in particular) fcollapse also need to drop
> ranges from the page cache, fcollapse is where I first noticed this particular
> issue).
>
> But AFAICT xfs's dio path does _not_ take the correct lock for this to work -
> although if you look at xfs_file_dio_aio_write() they were clearly thinking
> about page cache synchronization, so perhaps I'm missing something about how
> xfs's locking works.

We can't take the XFS_MMAPLOCK across direct IO submission/completion
because that creates a mmap_sem/XFS_MMAPLOCK inversion due to the
direct IO code calling get_user_pages(). We can't put locks in the
page fault path to solve this problem - I created the XFS_MMAPLOCK
locking to solve the "page faults race with extent manipulation
operations" problem, knowing that it couldn't be used to solve the
DIO vs mmap race conditions.

There's a reason why XFS developers still say "if you mix
buffered/mmap IO on the same file as direct IO, you get to keep all
the corrupted bits" and then point users at the open(2) man page:

    Applications should avoid mixing O_DIRECT and normal I/O to
    the same file, and especially to overlapping byte regions in
    the same file. Even when the filesystem correctly handles
    the coherency issues in this situation, overall I/O
    throughput is likely to be slower than using either mode
    alone. Likewise, applications should avoid mixing mmap(2) of
    files with direct I/O to the same files.

We attempt best effort at maintaining coherency and preventing data
corruption, but we cannot guarantee coherency...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com