* Note about rbd_aio_write usage
@ 2017-07-06 12:26 Piotr Dałek
2017-07-06 13:03 ` Jason Dillaman
0 siblings, 1 reply; 12+ messages in thread
From: Piotr Dałek @ 2017-07-06 12:26 UTC (permalink / raw)
To: ceph-devel, ceph-users
Hi,
If you're using "rbd_aio_write()" in your code, be aware of the fact that
before Luminous release, this function expects buffer to remain unchanged
until write op ends, and on Luminous and later this function internally
copies the buffer, allocating memory where needed, freeing it once write is
done.
If you write an app that may need to work with Luminous *and* pre-Luminous
versions of librbd, you may want to provide a version check (using
rbd_version() for example) so either your buffers won't change before write
is done or you don't incur a penalty for unnecessary memory allocation and
copy on your side (though it's probably unavoidable with current state of
Luminous).
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-06 12:26 Note about rbd_aio_write usage Piotr Dałek
@ 2017-07-06 13:03 ` Jason Dillaman
[not found] ` <CA+aFP1A5UVZzSDfFvphsVq9Jra9PXbJHUXhaJ9tH8LGx5OKJfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2017-07-06 13:03 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel, ceph-users
Pre-Luminous also copies the provided buffer when using the C API --
it just copies it at a later point and not immediately. The eventual
goal is to eliminate the copy completely, but that requires some
additional plumbing work deep down within the librados messenger
layer.
On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> Hi,
>
> If you're using "rbd_aio_write()" in your code, be aware of the fact that
> before Luminous release, this function expects buffer to remain unchanged
> until write op ends, and on Luminous and later this function internally
> copies the buffer, allocating memory where needed, freeing it once write is
> done.
>
> If you write an app that may need to work with Luminous *and* pre-Luminous
> versions of librbd, you may want to provide a version check (using
> rbd_version() for example) so either your buffers won't change before write
> is done or you don't incur a penalty for unnecessary memory allocation and
> copy on your side (though it's probably unavoidable with current state of
> Luminous).
>
> --
> Piotr Dałek
> piotr.dalek@corp.ovh.com
> https://www.ovh.com/us/
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
[not found] ` <CA+aFP1A5UVZzSDfFvphsVq9Jra9PXbJHUXhaJ9tH8LGx5OKJfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-07-06 13:33 ` Piotr Dałek
2017-07-06 13:43 ` Jason Dillaman
0 siblings, 1 reply; 12+ messages in thread
From: Piotr Dałek @ 2017-07-06 13:33 UTC (permalink / raw)
To: dillaman-H+wXaHxf7aLQT0dZR+AlfA; +Cc: ceph-devel, ceph-users
On 17-07-06 03:03 PM, Jason Dillaman wrote:
> On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
>> Hi,
>>
>> If you're using "rbd_aio_write()" in your code, be aware of the fact that
>> before Luminous release, this function expects buffer to remain unchanged
>> until write op ends, and on Luminous and later this function internally
>> copies the buffer, allocating memory where needed, freeing it once write is
>> done.
> Pre-Luminous also copies the provided buffer when using the C API --
> it just copies it at a later point and not immediately. The eventual
> goal is to eliminate the copy completely, but that requires some
> additional plumbing work deep down within the librados messenger
> layer.
I've learned the hard way that pre-luminous, even if it copies the buffer,
it does so too late. In my specific case, my FUSE module does enter the
write call and issues rbd_aio_write there, then exits the write - expecting
the buffer provided by FUSE to be copied by librbd (as it happens now in
Luminous). I didn't expect that it's a new behavior and once my code was
deployed to use Jewel librbd, it started to consistently corrupt data during
write.
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-06 13:33 ` Piotr Dałek
@ 2017-07-06 13:43 ` Jason Dillaman
2017-07-06 14:22 ` Piotr Dałek
0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2017-07-06 13:43 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel, ceph-users
The correct (POSIX-style) program behavior should treat the buffer as
immutable until the IO operation completes. It is never safe to assume
the buffer can be re-used while the IO is in-flight. You should not
add any logic to assume the buffer is safely copied prior to the
completion of the IO.
On Thu, Jul 6, 2017 at 9:33 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> On 17-07-06 03:03 PM, Jason Dillaman wrote:
>>
>> On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek <piotr.dalek@corp.ovh.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> If you're using "rbd_aio_write()" in your code, be aware of the fact that
>>> before Luminous release, this function expects buffer to remain unchanged
>>> until write op ends, and on Luminous and later this function internally
>>> copies the buffer, allocating memory where needed, freeing it once write
>>> is
>>> done.
>
>
>> Pre-Luminous also copies the provided buffer when using the C API --
>> it just copies it at a later point and not immediately. The eventual
>> goal is to eliminate the copy completely, but that requires some
>> additional plumbing work deep down within the librados messenger
>> layer.
>
>
> I've learned the hard way that pre-luminous, even if it copies the buffer,
> it does so too late. In my specific case, my FUSE module does enter the
> write call and issues rbd_aio_write there, then exits the write - expecting
> the buffer provided by FUSE to be copied by librbd (as it happens now in
> Luminous). I didn't expect that it's a new behavior and once my code was
> deployed to use Jewel librbd, it started to consistently corrupt data during
> write.
>
>
> --
> Piotr Dałek
> piotr.dalek@corp.ovh.com
> https://www.ovh.com/us/
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-06 13:43 ` Jason Dillaman
@ 2017-07-06 14:22 ` Piotr Dałek
2017-07-06 14:40 ` Jason Dillaman
0 siblings, 1 reply; 12+ messages in thread
From: Piotr Dałek @ 2017-07-06 14:22 UTC (permalink / raw)
To: dillaman; +Cc: ceph-devel, ceph-users
On 17-07-06 03:43 PM, Jason Dillaman wrote:
>> I've learned the hard way that pre-luminous, even if it copies the buffer,
>> it does so too late. In my specific case, my FUSE module does enter the
>> write call and issues rbd_aio_write there, then exits the write - expecting
>> the buffer provided by FUSE to be copied by librbd (as it happens now in
>> Luminous). I didn't expect that it's a new behavior and once my code was
>> deployed to use Jewel librbd, it started to consistently corrupt data during
>> write.
> The correct (POSIX-style) program behavior should treat the buffer as
> immutable until the IO operation completes. It is never safe to assume
> the buffer can be re-used while the IO is in-flight. You should not
> add any logic to assume the buffer is safely copied prior to the
> completion of the IO.
Indeed, most systems - not only POSIX ones - supporting asynchronous writes
expect that buffer remain unchanged until the write is done. I wasn't sure
how rbd_aio_write operates and consulted the source, as there's no docs for
the api itself. That intermediate copy in librbd deceived me -- because if
librbd copies the data, why should I do the same before calling
rbd_aio_write? To stress-test memory bus? So I really see two problems here:
lack of API docs and backwards-incompatible change in API behavior.
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-06 14:22 ` Piotr Dałek
@ 2017-07-06 14:40 ` Jason Dillaman
2017-07-06 15:46 ` Piotr Dałek
0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2017-07-06 14:40 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel, ceph-users
On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> So I really see two problems here: lack of API docs and
> backwards-incompatible change in API behavior.
Docs are always in need of update, so any pull requests would be
greatly appreciated.
However, I disagree that the behavior has substantively changed -- it
was always possible for pre-Luminous to (sometimes) copy the buffer
before the "rbd_aio_write" method completed. With Luminous, this
behavior is more consistent -- but in a future release memory may be
zero-copied. If your application can properly conform to the
(unwritten) contract that the buffers should remain unchanged, there
would be no need for the application to pre-copy the buffers.
If the libfuse implementation requires that the memory is not-in-use
by the time you return control to it (i.e. it's a synchronous API and
you are using async methods), you will always need to copy it. The C++
API allows you to control the copying since you need to pass
"bufferlist"s to the API methods and since they utilize a reference
counter, there is no internal copying within librbd / librados.
--
Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-06 14:40 ` Jason Dillaman
@ 2017-07-06 15:46 ` Piotr Dałek
[not found] ` <e39eba47-ae4d-05bb-41cd-928565324d6c-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Piotr Dałek @ 2017-07-06 15:46 UTC (permalink / raw)
To: dillaman; +Cc: ceph-devel, ceph-users
On 17-07-06 04:40 PM, Jason Dillaman wrote:
> On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
>> So I really see two problems here: lack of API docs and
>> backwards-incompatible change in API behavior.
>
> Docs are always in need of update, so any pull requests would be
> greatly appreciated.
>
> However, I disagree that the behavior has substantively changed -- it
> was always possible for pre-Luminous to (sometimes) copy the buffer
> before the "rbd_aio_write" method completed.
But that copy was buried somewhere deep in the librbd internals and -
looking at Jewel version - most would assume that it's not really copied and
user is responsible for keeping buffer intact until write is complete. API
user doesn't really care about what's going on internally and is beyond
their control.
> With Luminous, this
> behavior is more consistent -- but in a future release memory may be
> zero-copied. If your application can properly conform to the
> (unwritten) contract that the buffers should remain unchanged, there
> would be no need for the application to pre-copy the buffers.
So far I am forced to do a copy anyway (see below). The question is whether
it's me doing it, or librbd. It doesn't make sense to have it both do the
same -- especially if it's going to handle tens of terabytes of data, which
could mean for 10TB of data at least 83 886 080 memory allocations, releases
and copies plus 2 684 354 560 page faults (assuming 4KB pages) -- and these
are the best case scenario numbers assuming 128KB I/O size. What I
understand that you expect from me, is to have at least number of memory
copies doubled and push not "just" 20TB over the memory bus (reading 10TB
from one buffer and writing these 10TB to another), but 40.
In other words, if I'd write my code considering how Jewel librbd works,
there would be no real issue, apart from the fact that suddenly my program
would consume more memory and would burn more CPU cycles once librbd is
upgraded to Luminous which, considering the amount of data, would be
noticeable change.
> If the libfuse implementation requires that the memory is not-in-use
> by the time you return control to it (i.e. it's a synchronous API and
> you are using async methods), you will always need to copy it.
Yes, libfuse expects that once I leave entrypoint, it is free to do anything
it wishes with previously provided buffers -- and that's what it actually does.
> The C++
> API allows you to control the copying since you need to pass
> "bufferlist"s to the API methods and since they utilize a reference
> counter, there is no internal copying within librbd / librados.
How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy
the buffer with the assumption that it won't change) and instead of
constructing bufferlist containing bufferptr to copied data, construct a
bufferlist containing bufferptr made with create_static(user_buffer)?
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
[not found] ` <e39eba47-ae4d-05bb-41cd-928565324d6c-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
@ 2017-07-06 18:02 ` Jason Dillaman
2017-07-06 19:25 ` Piotr Dałek
0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2017-07-06 18:02 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel, ceph-users
On Thu, Jul 6, 2017 at 11:46 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy
> the buffer with the assumption that it won't change) and instead of
> constructing bufferlist containing bufferptr to copied data, construct a
> bufferlist containing bufferptr made with create_static(user_buffer)?
We must be talking past each other -- there was never such a
pre-Lumunous contract since (1) it did copy the buffer on every IO and
(2) it could have potentially copied before the 'rbd_aio_write' call
returned or after (but at least before the completion was fired). Just
because it works some times doesn't mean it would always work since it
would be a race between two threads.
Unfortunately, until the librados issue is solved, you will still have
to copy the data once when using the C++ API. The only advantage is
that you would be responsible for the copying and it would only need
to be performed once instead of twice.
--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-06 18:02 ` Jason Dillaman
@ 2017-07-06 19:25 ` Piotr Dałek
2017-07-06 19:39 ` Jason Dillaman
0 siblings, 1 reply; 12+ messages in thread
From: Piotr Dałek @ 2017-07-06 19:25 UTC (permalink / raw)
To: dillaman; +Cc: Piotr Dałek, ceph-devel, ceph-users
On Thu, Jul 06, 2017 at 02:02:39PM -0400, Jason Dillaman wrote:
> On Thu, Jul 6, 2017 at 11:46 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> > How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy
> > the buffer with the assumption that it won't change) and instead of
> > constructing bufferlist containing bufferptr to copied data, construct a
> > bufferlist containing bufferptr made with create_static(user_buffer)?
>
> We must be talking past each other -- there was never such a
> pre-Lumunous contract since (1) it did copy the buffer on every IO and
> (2) it could have potentially copied before the 'rbd_aio_write' call
> returned or after (but at least before the completion was fired). Just
> because it works some times doesn't mean it would always work since it
> would be a race between two threads.
>
> Unfortunately, until the librados issue is solved, you will still have
> to copy the data once when using the C++ API. The only advantage is
> that you would be responsible for the copying and it would only need
> to be performed once instead of twice.
Even if it did copied the buffer, it did at unspecified point in time (as
you noted above!) requiring rbd_aio_write caller to follow usual AIO
practice of keeping buffers passed for as long as AIO is in progress. With
Luminous, not only it's no longer necessary, but it also hinders the user
program performance as buffers are always deep-copied, regardless of whether
caller follows AIO rules or not. Is that deep copy an equivalent of what
Jewel librbd did at unspecified point of time, or extra one?
--
Piotr Dałek
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-06 19:25 ` Piotr Dałek
@ 2017-07-06 19:39 ` Jason Dillaman
2017-07-07 6:48 ` Piotr Dałek
0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2017-07-06 19:39 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel, ceph-users
On Thu, Jul 6, 2017 at 3:25 PM, Piotr Dałek <branch@predictor.org.pl> wrote:
> Is that deep copy an equivalent of what
> Jewel librbd did at unspecified point of time, or extra one?
It's equivalent / replacement -- not an additional copy. This was
changed to support scatter/gather IO API methods which the latest
version of QEMU now directly utilizes (eliminating the need for a
bounce-buffer copy on every IO).
Once we get that librados issue resolved, that initial librbd IO
buffer copy will be dropped and librbd will become zero-copy for IO
(at least that's the goal). That's why I am recommending that you just
assume normal AIO semantics and not try to optimize for Luminous since
perhaps the next release will have that implementation detail of the
extra copy removed.
--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-06 19:39 ` Jason Dillaman
@ 2017-07-07 6:48 ` Piotr Dałek
2017-07-07 13:14 ` Jason Dillaman
0 siblings, 1 reply; 12+ messages in thread
From: Piotr Dałek @ 2017-07-07 6:48 UTC (permalink / raw)
Cc: ceph-devel, ceph-users
On 17-07-06 09:39 PM, Jason Dillaman wrote:
> On Thu, Jul 6, 2017 at 3:25 PM, Piotr Dałek <branch@predictor.org.pl> wrote:
>> Is that deep copy an equivalent of what
>> Jewel librbd did at unspecified point of time, or extra one?
>
> It's equivalent / replacement -- not an additional copy. This was
> changed to support scatter/gather IO API methods which the latest
> version of QEMU now directly utilizes (eliminating the need for a
> bounce-buffer copy on every IO).
OK, that makes more sense now.
> Once we get that librados issue resolved, that initial librbd IO
> buffer copy will be dropped and librbd will become zero-copy for IO
> (at least that's the goal). That's why I am recommending that you just
> assume normal AIO semantics and not try to optimize for Luminous since
> perhaps the next release will have that implementation detail of the
> extra copy removed.
Is this:
https://github.com/yuyuyu101/ceph/commit/794b49b5b860c538a349bdadb16bb6ae97ad9c20#commitcomment-15707924
the issue you mention? Because at this point I'm considering switching to
C++ API and passing static bufferptr buried in my bufferlist instead of
having extra copy done by C API rbd_aio_write (that way I'd at least control
the allocations).
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage
2017-07-07 6:48 ` Piotr Dałek
@ 2017-07-07 13:14 ` Jason Dillaman
0 siblings, 0 replies; 12+ messages in thread
From: Jason Dillaman @ 2017-07-07 13:14 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel, ceph-users
On Fri, Jul 7, 2017 at 2:48 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> Is this:
> https://github.com/yuyuyu101/ceph/commit/794b49b5b860c538a349bdadb16bb6ae97ad9c20#commitcomment-15707924
> the issue you mention? Because at this point I'm considering switching to
> C++ API and passing static bufferptr buried in my bufferlist instead of
> having extra copy done by C API rbd_aio_write (that way I'd at least control
> the allocations).
Yes, that is the issue along with the comment from Sage in your
previous thread [1] where I proposed a potential workaround w/o the
need to deeply perturb the messaging layer.
[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015687.html
--
Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2017-07-07 13:14 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-07-06 12:26 Note about rbd_aio_write usage Piotr Dałek
2017-07-06 13:03 ` Jason Dillaman
[not found] ` <CA+aFP1A5UVZzSDfFvphsVq9Jra9PXbJHUXhaJ9tH8LGx5OKJfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-07-06 13:33 ` Piotr Dałek
2017-07-06 13:43 ` Jason Dillaman
2017-07-06 14:22 ` Piotr Dałek
2017-07-06 14:40 ` Jason Dillaman
2017-07-06 15:46 ` Piotr Dałek
[not found] ` <e39eba47-ae4d-05bb-41cd-928565324d6c-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
2017-07-06 18:02 ` Jason Dillaman
2017-07-06 19:25 ` Piotr Dałek
2017-07-06 19:39 ` Jason Dillaman
2017-07-07 6:48 ` Piotr Dałek
2017-07-07 13:14 ` Jason Dillaman
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.