From: Ric Wheeler
Subject: Re: [RFC] extending splice for copy offloading
Date: Mon, 30 Sep 2013 10:48:42 -0400
Message-ID: <52498F4A.2040809@redhat.com>
References: <20130925210742.GG30372@lenny.home.zabbo.net>
 <20130926185508.GO30372@lenny.home.zabbo.net>
 <5244A68F.906@redhat.com>
 <20130927200550.GA22640@fieldses.org>
 <20130927205013.GZ30372@lenny.home.zabbo.net>
 <4FA345DA4F4AE44899BD2B03EEEC2FA9467EF2D7@SACEXCMBX04-PRD.hq.netapp.com>
 <52474839.2080201@redhat.com>
 <20130930143432.GG16579@fieldses.org>
In-Reply-To: <20130930143432.GG16579@fieldses.org>
To: "J. Bruce Fields"
Cc: Miklos Szeredi, "Myklebust, Trond", Zach Brown, Anna Schumaker,
 Kernel Mailing List, Linux-Fsdevel, "linux-nfs@vger.kernel.org",
 "Schumaker, Bryan", "Martin K. Petersen", Jens Axboe, Mark Fasheh,
 Joel Becker, Eric Wong
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On 09/30/2013 10:34 AM, J. Bruce Fields wrote:
> On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote:
>> On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler wrote:
>>
>>>>> I don't see the safety argument very compelling either. There are real
>>>>> semantic differences, however: ENOSPC on a write to a
>>>>> (apparently) already allocated block. That could be a bit unexpected.
>>>>> Do we need a fallocate extension to deal with shared blocks?
>>>> The above has been the case for all enterprise storage arrays ever since
>>>> the invention of snapshots. The NFSv4.2 spec does allow you to set a
>>>> per-file attribute that causes the storage server to always preallocate
>>>> enough buffers to guarantee that you can rewrite the entire file, however
>>>> the fact that we've lived without it for said 20 years leads me to believe
>>>> that demand for it is going to be limited. I haven't put it top of the list
>>>> of features we care to implement...
>>>>
>>>> Cheers,
>>>> Trond
>>>
>>> I agree - this has been common behaviour for a very long time in the array
>>> space. Even without an array, this is the same as overwriting a block in
>>> btrfs or any file system with a read-write LVM snapshot.
>> Okay, I'm convinced.
>>
>> So I suggest
>>
>>  - mount(..., MNT_REFLINK): *allow* splice to reflink. If this is not
>>    set, fall back to page cache copy.
>>  - splice(... SPLICE_REFLINK): fail non-reflink copy. With this app
>>    can force reflink.
>>
>> Both are trivial to implement and make sure that no backward
>> incompatibility surprises happen.
>>
>> My other worry is about interruptibility/restartability. Ideas?
>>
>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>> Can the page cache copy be made restartable? Or should splice() be
>> allowed to return a short count? What happens on (non-reflink) remote
>> copies and huge request sizes?
> If I were writing an application that required copies to be restartable,
> I'd probably use the largest possible range in the reflink case but
> break the copy into smaller chunks in the splice case.
>
> For that reason I don't like the idea of a mount option--the choice is
> something that the application probably wants to make (or at least to
> know about).
>
> The NFS COPY operation, as specified in current drafts, allows for
> asynchronous copies but leaves the state of the file undefined in the
> case of an aborted COPY. I worry that agreeing on standard behavior in
> the case of an abort might be difficult.
>
> --b.

I think that this is still confusing - reflink and array copy offload should
not be differentiated. In effect, they should often be the same order of
magnitude in performance and possibly even use the same or very similar
techniques (just on different sides of the initiator/target transaction!).

It is much simpler to let the request fail if the offload (or reflink) is
not supported and let the application fall back to the traditional copy.
Then you always send the largest possible offload operation and do whatever
you do now if that fails.

thanks!

Ric
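
A minimal sketch of the application-side policy described above: ask for the largest possible offload first, and fall back to an ordinary chunked copy if that fails. The `try_offload` callable is hypothetical and stands in for whatever offload or reflink interface ends up being exposed; nothing in this thread defines it. The set of errno values treated as "not supported" is likewise an assumption.

```python
import errno
import os

# Errors treated here as "offload/reflink not available". Which errno a real
# interface would return is an assumption, not something this thread settles.
UNSUPPORTED = {errno.EOPNOTSUPP, errno.ENOSYS, errno.EXDEV, errno.EINVAL}


def copy_range(src_fd, dst_fd, length, try_offload, chunk=1 << 20):
    """Copy `length` bytes from src_fd to dst_fd, both starting at offset 0.

    try_offload(src_fd, dst_fd, length) is a hypothetical hook: it either
    completes the whole copy or raises OSError.
    Returns (path_taken, bytes_copied).
    """
    try:
        # Always send the largest possible offload request first.
        try_offload(src_fd, dst_fd, length)
        return ("offload", length)
    except OSError as e:
        if e.errno not in UNSUPPORTED:
            raise  # a real I/O error, not "unsupported"

    # Traditional copy in smaller chunks, so an interrupted copy could be
    # resumed from the last completed offset.
    done = 0
    while done < length:
        buf = os.pread(src_fd, min(chunk, length - done), done)
        if not buf:
            break  # source is shorter than the requested length
        written = 0
        while written < len(buf):
            # pwrite may write fewer bytes than asked; loop until the chunk is out.
            written += os.pwrite(dst_fd, buf[written:], done + written)
        done += len(buf)
    return ("fallback", done)
```

The small chunk size applies only to the fallback path, which matches the earlier suggestion in the thread to use the largest range for reflink and smaller pieces for a plain copy.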