From: Ric Wheeler <rwheeler@redhat.com>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
"Myklebust, Trond" <Trond.Myklebust@netapp.com>,
Zach Brown <zab@redhat.com>,
Anna Schumaker <schumaker.anna@gmail.com>,
Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux-Fsdevel <linux-fsdevel@vger.kernel.org>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
"Schumaker, Bryan" <Bryan.Schumaker@netapp.com>,
"Martin K. Petersen" <mkp@mkp.net>, Jens Axboe <axboe@kernel.dk>,
Mark Fasheh <mfasheh@suse.com>, Joel Becker <jlbec@evilplan.org>,
Eric Wong <normalperson@yhbt.net>
Subject: Re: [RFC] extending splice for copy offloading
Date: Mon, 30 Sep 2013 09:41:58 -0500
Message-ID: <52498DB6.7060901@redhat.com>
In-Reply-To: <CAJfpegv_C6cLOuA-mNtgtf2QbmmmcHwjQVo8mAnhf_wbJ8iRhg@mail.gmail.com>
On 09/30/2013 10:38 AM, Miklos Szeredi wrote:
> On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler <rwheeler@redhat.com> wrote:
>> On 09/30/2013 10:24 AM, Miklos Szeredi wrote:
>>> On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <rwheeler@redhat.com> wrote:
>>>> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
>>>>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>>>>>> My other worry is about interruptibility/restartability. Ideas?
>>>>>>>
>>>>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>>>>>>> Can the page cache copy be made restartable? Or should splice() be
>>>>>>> allowed to return a short count? What happens on (non-reflink) remote
>>>>>>> copies and huge request sizes?
>>>>>> If I were writing an application that required copies to be restartable,
>>>>>> I'd probably use the largest possible range in the reflink case but
>>>>>> break the copy into smaller chunks in the splice case.
>>>>>>
>>>>> The app really doesn't want to care about that. And it doesn't want
>>>>> to care about restartability, etc.. It's something the *kernel* has
>>>>> to care about. You just can't have uninterruptible syscalls that
>>>>> sleep for a "long" time, otherwise first you'll just have annoyed
>>>>> users pressing ^C in vain; then, if the sleep is even longer, warnings
>>>>> about task sleeping too long.
>>>>>
>>>>> One idea is letting splice() return a short count, and so the app can
>>>>> safely issue SIZE_MAX requests and the kernel can decide if it can
>>>>> copy the whole file in one go or if it wants to do it in smaller
>>>>> chunks.
>>>>>
>>>> You cannot rely on a short count. That implies that an offloaded copy starts
>>>> at byte 0 and the short count first bytes are all valid.
>>> Huh?
>>>
>>> - app calls splice(from, 0, to, 0, SIZE_MAX)
>>> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
>>> 1.a) fs reflinks the whole file in a jiffy and returns the size of
>>> the file
>>> 1.b) fs does copy offload of, say, 64MB and returns 64M
>>> 2) VFS does page copy of, say, 1MB and returns 1MB
>>> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
>>> ...
>>>
>>> The point is: the app is always doing the same (incrementing offset
>>> with the return value from splice) and the kernel can decide what is
>>> the best size it can service within a single uninterruptible syscall.
>>>
>>> Wouldn't that work?
>>>
>> No.
>>
>> Keep in mind that the offload operation in (1) might fail partially. The
>> target file (the copy) is allocated, the question is what ranges have valid
>> data.
> You are talking about case 1.a, right? So if the offload copy 0-64MB
> fails partially, we return failure from splice, yet some of the copy
> did succeed. Is that the problem? Why?
>
> Thanks,
> Miklos
The way array-based offload (and some software-side reflink) works is not a
byte-by-byte copy. We cannot assume that a valid count can be returned, or that
such a count would indicate a sequential segment of good data. The
whole thing would normally have to be reissued.
To make that a true assumption, you would have to mandate it in each of the
specifications (and sw targets)...
ric
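
The two models argued over in this message can be contrasted in a small simulation. This is only a sketch under stated assumptions: `copy_range` is a hypothetical stand-in for the proposed splice(from, off, to, off, len) call, not a real kernel or library interface, and the two fake back ends exist purely for illustration. The first function is the short-count loop Miklos describes, which advances the offset by each return value. The second is the behaviour Ric describes for array-based offload, where a failure says nothing about which bytes are valid, so the whole range is reissued.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the proposed splice(from, off, to, off, len):
# takes (offset, length), returns the bytes copied or raises OSError.
CopyFn = Callable[[int, int], int]

SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def copy_with_short_counts(copy_range: CopyFn, file_size: int) -> int:
    """Short-count model: the app always asks for SIZE_MAX and advances
    its offset by whatever the call reports as copied."""
    offset = 0
    while offset < file_size:
        copied = copy_range(offset, SIZE_MAX)
        if copied == 0:  # nothing left to copy
            break
        offset += copied
    return offset


def copy_all_or_nothing(copy_range: CopyFn, file_size: int,
                        max_attempts: int = 3) -> bool:
    """Offload model: a failure gives no trustworthy prefix, so the
    whole range is reissued from offset 0 on every attempt."""
    for _ in range(max_attempts):
        try:
            if copy_range(0, file_size) == file_size:
                return True
        except OSError:
            pass  # target contents are undefined; start over
    return False


def make_chunked_kernel(file_size: int, chunk: int) -> Tuple[CopyFn, List[Tuple[int, int]]]:
    """Fake back end that services at most `chunk` bytes per call."""
    calls: List[Tuple[int, int]] = []

    def copy_range(offset: int, length: int) -> int:
        n = min(chunk, length, file_size - offset)
        calls.append((offset, n))
        return n

    return copy_range, calls


def make_flaky_offload(file_size: int, failures: int) -> CopyFn:
    """Fake offload that fails `failures` times, then copies everything."""
    state = {"left": failures}

    def copy_range(offset: int, length: int) -> int:
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError("offload failed; valid ranges unknown")
        return min(length, file_size - offset)

    return copy_range
```

The short-count loop is only correct if every returned count describes a contiguous run of good data starting at the requested offset, which is exactly the guarantee this message says the offload specifications do not give.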