From: Dave Chinner
Subject: Re: [RFC] new ->perform_write fop
Date: Fri, 21 May 2010 09:05:24 +1000
Message-ID: <20100520230524.GU8120@dastard>
In-Reply-To: <20100520201232.GP3395@quack.suse.cz>
References: <20100514064145.GJ13617@dastard> <20100514072219.GC4706@laptop>
 <20100514083821.GL13617@dastard> <20100518063647.GD2516@laptop>
 <20100518080503.GF2150@dastard> <20100518104351.GF2516@laptop>
 <20100518122714.GG2150@dastard> <20100518150912.GI2516@laptop>
 <20100519235054.GS8120@dastard> <20100520201232.GP3395@quack.suse.cz>
To: Jan Kara
Cc: Nick Piggin, Josef Bacik, linux-fsdevel@vger.kernel.org,
 chris.mason@oracle.com, hch@infradead.org, akpm@linux-foundation.org,
 linux-kernel@vger.kernel.org

On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> On Thu 20-05-10 09:50:54, Dave Chinner wrote:
> > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote:
> > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote:
> > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote:
> > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > > > > > > Well you could do a large span block allocation at the beginning,
> > > > > > > and then dirty the pagecache one by one like we do right now.
> > > > > >
> > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or
> > > > > > fail the copy (EFAULT) after the block allocation, we have to undo
> > > > > > the allocation we have already completed. If we don't, we leave
> > > > > > uninitialised allocations on disk that will expose stale data.
> > > > > >
> > > > > > In the second case (EFAULT) we might be able to zero the pages to
> > > > > > avoid punching out blocks, but the first case where pages can't be
> > > > > > allocated to cover the block allocated range makes it very
> > > > > > difficult without being able to punch holes in allocated block
> > > > > > ranges.
> > > > > >
> > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> > > > > > ranges of allocated blocks from an inode - there is no VFS method
> > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP).
> > > > > >
> > > > > > Hence the way to avoid needing hole punching is to allocate and
> > > > > > lock down all the pages into the page cache first, then do the
> > > > > > copy so they fail before the allocation is done if they are going
> > > > > > to fail. That makes it much, much easier to handle failures....
> > > > >
> > > > > So it is just a matter of what is exposed as a vfs interface?
> > > >
> > > > More a matter of utilising the functionality most filesystems
> > > > already have and minimising the amount of churn in critical areas
> > > > of filesystem code. Hole punching is not simple, and bugs will
> > > > likely result in a corrupted filesystem. And the hole punching
> > > > will only occur in a hard to trigger corner case, so it's likely
> > > > that bugs will go undetected and filesystems will suffer from
> > > > random, impossible to track down corruptions as a result.
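[ For anyone following along who hasn't driven it: punching today
means the XFS ioctl. From userspace it looks roughly like the sketch
below - written from memory and untested, so check xfs_fs.h for the
real structure and ioctl definitions: ]

	#include <xfs/xfs.h>	/* xfs_flock64_t, XFS_IOC_UNRESVSP */
	#include <sys/ioctl.h>
	#include <unistd.h>	/* SEEK_SET */

	/*
	 * Deallocate [start, start + len) without changing the file
	 * size; subsequent reads of the punched range return zeros.
	 */
	static int punch_range(int fd, off_t start, off_t len)
	{
		xfs_flock64_t	fl = {
			.l_whence = SEEK_SET,	/* l_start is absolute */
			.l_start  = start,	/* byte offset to punch from */
			.l_len	  = len,	/* bytes to deallocate */
		};

		return ioctl(fd, XFS_IOC_UNRESVSP, &fl);
	}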
> > > >
> > > > In comparison, adding reserve/unreserve functionality might cause
> > > > block accounting issues if there is a bug, but it won't cause
> > > > on-disk corruption that results in data loss. Hole punching is not
> > > > simple or easy - it's a damn complex way to handle errors and if
> > > > that's all it's required for then we've failed already.
> > >
> > > As I said, we can have a dumb fallback path for filesystems that
> > > don't implement hole punching. Clear the blocks past i_size, and
> > > zero out the allocated but not initialized blocks.
> > >
> > > There does not have to be pagecache allocated in order to do this,
> > > you could do direct IO from the zero page in order to do it.
> >
> > I don't see that as a good solution - it's once again a fairly
> > complex way of dealing with the problem, especially as it now means
> > that direct IO would fall back to buffered which would fall back to
> > direct IO....
> >
> > > Hole punching is not only useful there, it is already exposed to
> > > userspace via MADV_REMOVE.
> >
> > That interface is *totally broken*. It has all the same problems as
> > vmtruncate() for removing file blocks (because it uses vmtruncate).
> > It also has the fundamental problem of being called under the
> > mmap_sem, which means that inode locks and therefore de-allocation
> > cannot be executed without the possibility of deadlocks.
> > Fundamentally, hole punching is an inode operation, not a VM
> > operation....
> >
> >
> > > > > > > Basically, once pagecache is marked uptodate, I don't think
> > > > > > > we should ever put maybe-invalid data into it -- the way to
> > > > > > > do it is to invalidate that page and put a *new* page in
> > > > > > > there.
> > > > > >
> > > > > > Ok, so let's do that...
> > > > > >
> > > > > > > Why? Because user mappings are just one problem, but once
> > > > > > > you had a user mapping, you can have been subject to
> > > > > > > get_user_pages, so it could be in the middle of a DMA
> > > > > > > operation or something.
> > > > > >
> > > > > > ... because we already know this behaviour causes problems for
> > > > > > high end enterprise level features like hardware checksumming
> > > > > > IO paths.
> > > > > >
> > > > > > Hence it seems that a multipage write needs to:
> > > > > >
> > > > > > 1. allocate new pages
> > > > > > 2. attach bufferheads/mapping structures to pages (if required)
> > > > > > 3. copy data into pages
> > > > > > 4. allocate space
> > > > > > 5. for each old page in the range:
> > > > > >      lock page
> > > > > >      invalidate mappings
> > > > > >      clear page uptodate flag
> > > > > >      remove page from page cache
> > > > > > 6. for each new page:
> > > > > >      map new page to allocated space
> > > > > >      lock new page
> > > > > >      insert new page into pagecache
> > > > > >      update new page state (write_end equivalent)
> > > > > >      unlock new page
> > > > > > 7. free old pages
> > > > > >
> > > > > > Steps 1-4 can all fail, and can all be backed out from without
> > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we
> > > > > > should be able to run this safely after the allocation without
> > > > > > needing significant error unwinding...
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > Possibly. The importance of hot cache is reduced, because we are
> > > > > doing full-page copies, and bulk copies, by definition. But it
> > > > > could still be an issue. The allocations and deallocations could
> > > > > cost a little as well.
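To make the ordering above concrete, it's something like the
following. Pseudo-code only - every mpw_*()/fs_*() helper is a
made-up placeholder for whatever the filesystem and VFS would need to
grow - but it shows the property I care about: everything that can
fail completes before we touch the old pages:

	/*
	 * Steps 1-4 can fail and unwind without changing visible
	 * state; steps 5-7 must not fail.
	 */
	static int multipage_write(struct inode *inode, loff_t pos,
				   size_t len)
	{
		int err;

		err = mpw_alloc_new_pages(inode, pos, len);	/* step 1 */
		if (err)
			return err;
		err = mpw_attach_buffers(inode, pos, len);	/* step 2 */
		if (err)
			goto free_new_pages;
		err = mpw_copy_in_data(inode, pos, len);	/* step 3 */
		if (err)
			goto free_new_pages;
		err = fs_allocate_space(inode, pos, len);	/* step 4 */
		if (err)
			goto free_new_pages;

		/* Point of no return - nothing below may fail. */
		mpw_invalidate_old_pages(inode, pos, len);	/* step 5 */
		mpw_install_new_pages(inode, pos, len);		/* step 6 */
		mpw_free_old_pages(inode, pos, len);		/* step 7 */
		return 0;

	free_new_pages:
		mpw_free_new_pages(inode, pos, len);
		return err;
	}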
> > > >
> > > > They will cost far less than the reduction in allocation overhead
> > > > saves us, and there are potential optimisations there
> > >
> > > An API that doesn't require that, though, should be less overhead
> > > and simpler.
> > >
> > > Is it really going to be a problem to implement block hole punching
> > > in ext4 and gfs2?
> >
> > I can't follow the ext4 code - it's an intricate maze of weird entry
> > and exit points, so I'm not even going to attempt to comment on it.
>   Hmm, I was thinking about it and I see two options for how to get
> out of these problems:
> a) Filesystems which are not able to handle hole punching will allow
>    multipage writes only after EOF (which can be easily undone by
>    truncate in case of failure). That should actually cover lots of
>    the cases we are interested in (I don't expect multipage writes to
>    holes to be a common case).

Multipage writes into holes are a relatively common operation in the
HPC space that XFS is designed for (e.g. calculations on huge sparse
matrices), so I'm not really fond of this idea....

> b) E.g. ext4 can manage even without hole punching. It can allocate
>    an extent as 'unwritten', and when something during the write
>    fails, it just leaves the extent allocated and the 'unwritten'
>    flag makes sure that any read will see zeros. I suppose that other
>    filesystems that care about multipage writes are able to do
>    similar things (e.g. btrfs can do the same as far as I remember,
>    I'm not sure about gfs2).

Allocating multipage writes as unwritten extents turns off delayed
allocation and hence we'd lose all the benefits that delayed
allocation gives us...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com