From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id p0H4ZY5c231609 for ; Sun, 16 Jan 2011 22:35:34 -0600 Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id C24F71005D65 for ; Sun, 16 Jan 2011 20:37:49 -0800 (PST) Received: from mail.internode.on.net (bld-mail13.adl6.internode.on.net [150.101.137.98]) by cuda.sgi.com with ESMTP id WmqkfJyqyMmktYWO for ; Sun, 16 Jan 2011 20:37:49 -0800 (PST) Date: Mon, 17 Jan 2011 15:37:40 +1100 From: Dave Chinner Subject: Re: Issues with delalloc->real extent allocation Message-ID: <20110117043740.GK16267@dastard> References: <20110114214334.GN28274@sgi.com> <1414867864.5452.1295224107104.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1414867864.5452.1295224107104.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: xfs-bounces@oss.sgi.com Errors-To: xfs-bounces@oss.sgi.com To: Lachlan McIlroy Cc: bpm@sgi.com, xfs@oss.sgi.com On Sun, Jan 16, 2011 at 07:28:27PM -0500, Lachlan McIlroy wrote: > ----- Original Message ----- > > Hi Dave, > > > > On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote: > > > I've noticed a few suspicious things trying to reproduce the > > > allocate-in-the-middle-of-a-delalloc-extent, > > ... > > > Secondly, I think we have the same expose-the-entire-delalloc-extent > > > -to-stale-data-exposure problem in ->writepage. This onnne, however, > > > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent > > > the first time any part of it is written to. Even if we are only > > > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc > > > extent covers gigabytes. So, same problem when we crash. > > > > > > Finally, I think the extsize based problem exposed by test 229 is a > > > also a result of allocating space we have no pages covering in the > > > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated > > > space is never zeroed and hence exposes stale data. > > > > This is precisely the bug I was going after when I hit the > > allocate-in-the-middle-of-a-delalloc-extent bug. This is a race > > between > > block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state > > convert. When xfs_page_state_convert allocates a real extent for a > > page > > toward the beginning of a delalloc extent, XFS_BMAPI converts the > > entire > > delalloc extent. Any subsequent writes into the page cache toward the > > end of this freshly allocated extent will see a written extent instead > > of delalloc and read the block from disk into the page before writing > > over it. If the write does not cover the entire page garbage from disk > > will be exposed into the page cache. > > Ben, take a look at the XFS gap list code in IRIX - this code was > designed specifically to handle this problem. It's also implemented in > several of the cxfs clients too. On entry to a write() it will create a > list of all the holes that exist in the write range before any delayed > allocations are created. The first call to xfs_bmapi() sets up the delayed > allocation for the entire write (even if the bmaps returned don't cover the > full write range). If the write results in a fault from disk then it checks > the gap list to see if any sections of the buffer used to cover a hole and > if so it ignores the state of the extent and zeroes those region(s) in the > buffer that match the pre-existing holes. If the buffer has multiple > non-hole sections that need to be read from disk the whole buffer will be > read from disk and the zeroing of the holes is done post I/O - this reduces > the number of I/Os to be done. The whole delayed allocation can be safely > converted at any time without risk of reading exposed data (assuming no > crash that is). IMO, that caveat exposes the problem with this approach - it is not persistent. It adds a lot of complexity but doesn't solve the underlying problem: that we are exposing real extents via allocation without having written any data to them. > As the write progresses through the range it removes > sections from the front of the gap list so by the time the write is complete > the gap list is empty. The gap list does not have a dedicated lock to > protect it but instead relies on the iolock to ensure that only one write > operation occurs at once (so it's not appropriate for direct I/O). > > > > > > > > > > I'm sure there are other ways to solve these problems, but these two > > > are the ones that come to mind for me. I'm open to other solutions > > > or ways to improve on these ones, especially if they are simpler. ;) > > > Anyone got any ideas or improvements? > > > > The direction I've been taking is to use XFS_BMAPI_EXACT in > > *xfs_iomap_write_allocate to ensure that an extent covering exactly > > the > > pages we're prepared to write out immediately is allocated and the > > rest > > of the delalloc extent is left as is. This exercises some of the btree > > code more heavily and led to the discovery of the > > allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a > > performance issue which I've tried to resolve by extending > > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages > > to be converted before performing the allocation and hold those locks > > until they are submitted for writeback. It's not very pretty but it > > resolves the corruption. > > > > There is still the issue of crashes... This could be solved by > > converting from delalloc to unwritten in xfs_page_state_convert in > > this > > very exact way and then to written in the io completion handler. Never > > go delalloc->written directly. > > That solution would probably make the gap list redundant too. Yes, I think you are right. The "gaps" would get mapped as unwritten rather than as normal extents and hence get zereod appropriately. It would also preventing exposure after a crash due to being persistent. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs