* Note about rbd_aio_write usage @ 2017-07-06 12:26 Piotr Dałek 2017-07-06 13:03 ` Jason Dillaman 0 siblings, 1 reply; 12+ messages in thread From: Piotr Dałek @ 2017-07-06 12:26 UTC (permalink / raw) To: ceph-devel, ceph-users Hi, If you're using "rbd_aio_write()" in your code, be aware of the fact that before Luminous release, this function expects buffer to remain unchanged until write op ends, and on Luminous and later this function internally copies the buffer, allocating memory where needed, freeing it once write is done. If you write an app that may need to work with Luminous *and* pre-Luminous versions of librbd, you may want to provide a version check (using rbd_version() for example) so either your buffers won't change before write is done or you don't incur a penalty for unnecessary memory allocation and copy on your side (though it's probably unavoidable with current state of Luminous). -- Piotr Dałek piotr.dalek@corp.ovh.com https://www.ovh.com/us/ _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-06 12:26 Note about rbd_aio_write usage Piotr Dałek @ 2017-07-06 13:03 ` Jason Dillaman [not found] ` <CA+aFP1A5UVZzSDfFvphsVq9Jra9PXbJHUXhaJ9tH8LGx5OKJfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 12+ messages in thread From: Jason Dillaman @ 2017-07-06 13:03 UTC (permalink / raw) To: Piotr Dałek; +Cc: ceph-devel, ceph-users Pre-Luminous also copies the provided buffer when using the C API -- it just copies it at a later point and not immediately. The eventual goal is to eliminate the copy completely, but that requires some additional plumbing work deep down within the librados messenger layer. On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote: > Hi, > > If you're using "rbd_aio_write()" in your code, be aware of the fact that > before Luminous release, this function expects buffer to remain unchanged > until write op ends, and on Luminous and later this function internally > copies the buffer, allocating memory where needed, freeing it once write is > done. > > If you write an app that may need to work with Luminous *and* pre-Luminous > versions of librbd, you may want to provide a version check (using > rbd_version() for example) so either your buffers won't change before write > is done or you don't incur a penalty for unnecessary memory allocation and > copy on your side (though it's probably unavoidable with current state of > Luminous). > > -- > Piotr Dałek > piotr.dalek@corp.ovh.com > https://www.ovh.com/us/ > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jason ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <CA+aFP1A5UVZzSDfFvphsVq9Jra9PXbJHUXhaJ9tH8LGx5OKJfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Note about rbd_aio_write usage [not found] ` <CA+aFP1A5UVZzSDfFvphsVq9Jra9PXbJHUXhaJ9tH8LGx5OKJfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-07-06 13:33 ` Piotr Dałek 2017-07-06 13:43 ` Jason Dillaman 0 siblings, 1 reply; 12+ messages in thread From: Piotr Dałek @ 2017-07-06 13:33 UTC (permalink / raw) To: dillaman-H+wXaHxf7aLQT0dZR+AlfA; +Cc: ceph-devel, ceph-users On 17-07-06 03:03 PM, Jason Dillaman wrote: > On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote: >> Hi, >> >> If you're using "rbd_aio_write()" in your code, be aware of the fact that >> before Luminous release, this function expects buffer to remain unchanged >> until write op ends, and on Luminous and later this function internally >> copies the buffer, allocating memory where needed, freeing it once write is >> done. > Pre-Luminous also copies the provided buffer when using the C API -- > it just copies it at a later point and not immediately. The eventual > goal is to eliminate the copy completely, but that requires some > additional plumbing work deep down within the librados messenger > layer. I've learned the hard way that pre-luminous, even if it copies the buffer, it does so too late. In my specific case, my FUSE module does enter the write call and issues rbd_aio_write there, then exits the write - expecting the buffer provided by FUSE to be copied by librbd (as it happens now in Luminous). I didn't expect that it's a new behavior and once my code was deployed to use Jewel librbd, it started to consistently corrupt data during write. -- Piotr Dałek piotr.dalek@corp.ovh.com https://www.ovh.com/us/ _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-06 13:33 ` Piotr Dałek @ 2017-07-06 13:43 ` Jason Dillaman 2017-07-06 14:22 ` Piotr Dałek 0 siblings, 1 reply; 12+ messages in thread From: Jason Dillaman @ 2017-07-06 13:43 UTC (permalink / raw) To: Piotr Dałek; +Cc: ceph-devel, ceph-users The correct (POSIX-style) program behavior should treat the buffer as immutable until the IO operation completes. It is never safe to assume the buffer can be re-used while the IO is in-flight. You should not add any logic to assume the buffer is safely copied prior to the completion of the IO. On Thu, Jul 6, 2017 at 9:33 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote: > On 17-07-06 03:03 PM, Jason Dillaman wrote: >> >> On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> >> wrote: >>> >>> Hi, >>> >>> If you're using "rbd_aio_write()" in your code, be aware of the fact that >>> before Luminous release, this function expects buffer to remain unchanged >>> until write op ends, and on Luminous and later this function internally >>> copies the buffer, allocating memory where needed, freeing it once write >>> is >>> done. > > >> Pre-Luminous also copies the provided buffer when using the C API -- >> it just copies it at a later point and not immediately. The eventual >> goal is to eliminate the copy completely, but that requires some >> additional plumbing work deep down within the librados messenger >> layer. > > > I've learned the hard way that pre-luminous, even if it copies the buffer, > it does so too late. In my specific case, my FUSE module does enter the > write call and issues rbd_aio_write there, then exits the write - expecting > the buffer provided by FUSE to be copied by librbd (as it happens now in > Luminous). I didn't expect that it's a new behavior and once my code was > deployed to use Jewel librbd, it started to consistently corrupt data during > write. > > > -- > Piotr Dałek > piotr.dalek@corp.ovh.com > https://www.ovh.com/us/ > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jason ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-06 13:43 ` Jason Dillaman @ 2017-07-06 14:22 ` Piotr Dałek 2017-07-06 14:40 ` Jason Dillaman 0 siblings, 1 reply; 12+ messages in thread From: Piotr Dałek @ 2017-07-06 14:22 UTC (permalink / raw) To: dillaman; +Cc: ceph-devel, ceph-users On 17-07-06 03:43 PM, Jason Dillaman wrote: >> I've learned the hard way that pre-luminous, even if it copies the buffer, >> it does so too late. In my specific case, my FUSE module does enter the >> write call and issues rbd_aio_write there, then exits the write - expecting >> the buffer provided by FUSE to be copied by librbd (as it happens now in >> Luminous). I didn't expect that it's a new behavior and once my code was >> deployed to use Jewel librbd, it started to consistently corrupt data during >> write. > The correct (POSIX-style) program behavior should treat the buffer as > immutable until the IO operation completes. It is never safe to assume > the buffer can be re-used while the IO is in-flight. You should not > add any logic to assume the buffer is safely copied prior to the > completion of the IO. Indeed, most systems - not only POSIX ones - supporting asynchronous writes expect that buffer remain unchanged until the write is done. I wasn't sure how rbd_aio_write operates and consulted the source, as there's no docs for the api itself. That intermediate copy in librbd deceived me -- because if librbd copies the data, why should I do the same before calling rbd_aio_write? To stress-test memory bus? So I really see two problems here: lack of API docs and backwards-incompatible change in API behavior. -- Piotr Dałek piotr.dalek@corp.ovh.com https://www.ovh.com/us/ ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-06 14:22 ` Piotr Dałek @ 2017-07-06 14:40 ` Jason Dillaman 2017-07-06 15:46 ` Piotr Dałek 0 siblings, 1 reply; 12+ messages in thread From: Jason Dillaman @ 2017-07-06 14:40 UTC (permalink / raw) To: Piotr Dałek; +Cc: ceph-devel, ceph-users On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote: > So I really see two problems here: lack of API docs and > backwards-incompatible change in API behavior. Docs are always in need of update, so any pull requests would be greatly appreciated. However, I disagree that the behavior has substantively changed -- it was always possible for pre-Luminous to (sometimes) copy the buffer before the "rbd_aio_write" method completed. With Luminous, this behavior is more consistent -- but in a future release memory may be zero-copied. If your application can properly conform to the (unwritten) contract that the buffers should remain unchanged, there would be no need for the application to pre-copy the buffers. If the libfuse implementation requires that the memory is not-in-use by the time you return control to it (i.e. it's a synchronous API and you are using async methods), you will always need to copy it. The C++ API allows you to control the copying since you need to pass "bufferlist"s to the API methods and since they utilize a reference counter, there is no internal copying within librbd / librados. -- Jason ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-06 14:40 ` Jason Dillaman @ 2017-07-06 15:46 ` Piotr Dałek [not found] ` <e39eba47-ae4d-05bb-41cd-928565324d6c-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 12+ messages in thread From: Piotr Dałek @ 2017-07-06 15:46 UTC (permalink / raw) To: dillaman; +Cc: ceph-devel, ceph-users On 17-07-06 04:40 PM, Jason Dillaman wrote: > On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote: >> So I really see two problems here: lack of API docs and >> backwards-incompatible change in API behavior. > > Docs are always in need of update, so any pull requests would be > greatly appreciated. > > However, I disagree that the behavior has substantively changed -- it > was always possible for pre-Luminous to (sometimes) copy the buffer > before the "rbd_aio_write" method completed. But that copy was buried somewhere deep in the librbd internals and - looking at Jewel version - most would assume that it's not really copied and user is responsible for keeping buffer intact until write is complete. API user doesn't really care about what's going on internally and is beyond their control. > With Luminous, this > behavior is more consistent -- but in a future release memory may be > zero-copied. If your application can properly conform to the > (unwritten) contract that the buffers should remain unchanged, there > would be no need for the application to pre-copy the buffers. So far I am forced to do a copy anyway (see below). The question is whether it's me doing it, or librbd. It doesn't make sense to have it both do the same -- especially if it's going to handle tens of terabytes of data, which could mean for 10TB of data at least 83 886 080 memory allocations, releases and copies plus 2 684 354 560 page faults (assuming 4KB pages) -- and these are the best case scenario numbers assuming 128KB I/O size. What I understand that you expect from me, is to have at least number of memory copies doubled and push not "just" 20TB over the memory bus (reading 10TB from one buffer and writing these 10TB to another), but 40. In other words, if I'd write my code considering how Jewel librbd works, there would be no real issue, apart from the fact that suddenly my program would consume more memory and would burn more CPU cycles once librbd is upgraded to Luminous which, considering the amount of data, would be noticeable change. > If the libfuse implementation requires that the memory is not-in-use > by the time you return control to it (i.e. it's a synchronous API and > you are using async methods), you will always need to copy it. Yes, libfuse expects that once I leave entrypoint, it is free to do anything it wishes with previously provided buffers -- and that's what it actually does. > The C++ > API allows you to control the copying since you need to pass > "bufferlist"s to the API methods and since they utilize a reference > counter, there is no internal copying within librbd / librados. How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy the buffer with the assumption that it won't change) and instead of constructing bufferlist containing bufferptr to copied data, construct a bufferlist containing bufferptr made with create_static(user_buffer)? -- Piotr Dałek piotr.dalek@corp.ovh.com https://www.ovh.com/us/ ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <e39eba47-ae4d-05bb-41cd-928565324d6c-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>]
* Re: Note about rbd_aio_write usage [not found] ` <e39eba47-ae4d-05bb-41cd-928565324d6c-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org> @ 2017-07-06 18:02 ` Jason Dillaman 2017-07-06 19:25 ` Piotr Dałek 0 siblings, 1 reply; 12+ messages in thread From: Jason Dillaman @ 2017-07-06 18:02 UTC (permalink / raw) To: Piotr Dałek; +Cc: ceph-devel, ceph-users On Thu, Jul 6, 2017 at 11:46 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote: > How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy > the buffer with the assumption that it won't change) and instead of > constructing bufferlist containing bufferptr to copied data, construct a > bufferlist containing bufferptr made with create_static(user_buffer)? We must be talking past each other -- there was never such a pre-Lumunous contract since (1) it did copy the buffer on every IO and (2) it could have potentially copied before the 'rbd_aio_write' call returned or after (but at least before the completion was fired). Just because it works some times doesn't mean it would always work since it would be a race between two threads. Unfortunately, until the librados issue is solved, you will still have to copy the data once when using the C++ API. The only advantage is that you would be responsible for the copying and it would only need to be performed once instead of twice. -- Jason _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-06 18:02 ` Jason Dillaman @ 2017-07-06 19:25 ` Piotr Dałek 2017-07-06 19:39 ` Jason Dillaman 0 siblings, 1 reply; 12+ messages in thread From: Piotr Dałek @ 2017-07-06 19:25 UTC (permalink / raw) To: dillaman; +Cc: Piotr Dałek, ceph-devel, ceph-users On Thu, Jul 06, 2017 at 02:02:39PM -0400, Jason Dillaman wrote: > On Thu, Jul 6, 2017 at 11:46 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote: > > How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy > > the buffer with the assumption that it won't change) and instead of > > constructing bufferlist containing bufferptr to copied data, construct a > > bufferlist containing bufferptr made with create_static(user_buffer)? > > We must be talking past each other -- there was never such a > pre-Lumunous contract since (1) it did copy the buffer on every IO and > (2) it could have potentially copied before the 'rbd_aio_write' call > returned or after (but at least before the completion was fired). Just > because it works some times doesn't mean it would always work since it > would be a race between two threads. > > Unfortunately, until the librados issue is solved, you will still have > to copy the data once when using the C++ API. The only advantage is > that you would be responsible for the copying and it would only need > to be performed once instead of twice. Even if it did copied the buffer, it did at unspecified point in time (as you noted above!) requiring rbd_aio_write caller to follow usual AIO practice of keeping buffers passed for as long as AIO is in progress. With Luminous, not only it's no longer necessary, but it also hinders the user program performance as buffers are always deep-copied, regardless of whether caller follows AIO rules or not. Is that deep copy an equivalent of what Jewel librbd did at unspecified point of time, or extra one? -- Piotr Dałek ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-06 19:25 ` Piotr Dałek @ 2017-07-06 19:39 ` Jason Dillaman 2017-07-07 6:48 ` Piotr Dałek 0 siblings, 1 reply; 12+ messages in thread From: Jason Dillaman @ 2017-07-06 19:39 UTC (permalink / raw) To: Piotr Dałek; +Cc: ceph-devel, ceph-users On Thu, Jul 6, 2017 at 3:25 PM, Piotr Dałek <branch@predictor.org.pl> wrote: > Is that deep copy an equivalent of what > Jewel librbd did at unspecified point of time, or extra one? It's equivalent / replacement -- not an additional copy. This was changed to support scatter/gather IO API methods which the latest version of QEMU now directly utilizes (eliminating the need for a bounce-buffer copy on every IO). Once we get that librados issue resolved, that initial librbd IO buffer copy will be dropped and librbd will become zero-copy for IO (at least that's the goal). That's why I am recommending that you just assume normal AIO semantics and not try to optimize for Luminous since perhaps the next release will have that implementation detail of the extra copy removed. -- Jason _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-06 19:39 ` Jason Dillaman @ 2017-07-07 6:48 ` Piotr Dałek 2017-07-07 13:14 ` Jason Dillaman 0 siblings, 1 reply; 12+ messages in thread From: Piotr Dałek @ 2017-07-07 6:48 UTC (permalink / raw) Cc: ceph-devel, ceph-users On 17-07-06 09:39 PM, Jason Dillaman wrote: > On Thu, Jul 6, 2017 at 3:25 PM, Piotr Dałek <branch@predictor.org.pl> wrote: >> Is that deep copy an equivalent of what >> Jewel librbd did at unspecified point of time, or extra one? > > It's equivalent / replacement -- not an additional copy. This was > changed to support scatter/gather IO API methods which the latest > version of QEMU now directly utilizes (eliminating the need for a > bounce-buffer copy on every IO). OK, that makes more sense now. > Once we get that librados issue resolved, that initial librbd IO > buffer copy will be dropped and librbd will become zero-copy for IO > (at least that's the goal). That's why I am recommending that you just > assume normal AIO semantics and not try to optimize for Luminous since > perhaps the next release will have that implementation detail of the > extra copy removed. Is this: https://github.com/yuyuyu101/ceph/commit/794b49b5b860c538a349bdadb16bb6ae97ad9c20#commitcomment-15707924 the issue you mention? Because at this point I'm considering switching to C++ API and passing static bufferptr buried in my bufferlist instead of having extra copy done by C API rbd_aio_write (that way I'd at least control the allocations). -- Piotr Dałek piotr.dalek@corp.ovh.com https://www.ovh.com/us/ ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Note about rbd_aio_write usage 2017-07-07 6:48 ` Piotr Dałek @ 2017-07-07 13:14 ` Jason Dillaman 0 siblings, 0 replies; 12+ messages in thread From: Jason Dillaman @ 2017-07-07 13:14 UTC (permalink / raw) To: Piotr Dałek; +Cc: ceph-devel, ceph-users On Fri, Jul 7, 2017 at 2:48 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote: > Is this: > https://github.com/yuyuyu101/ceph/commit/794b49b5b860c538a349bdadb16bb6ae97ad9c20#commitcomment-15707924 > the issue you mention? Because at this point I'm considering switching to > C++ API and passing static bufferptr buried in my bufferlist instead of > having extra copy done by C API rbd_aio_write (that way I'd at least control > the allocations). Yes, that is the issue along with the comment from Sage in your previous thread [1] where I proposed a potential workaround w/o the need to deeply perturb the messaging layer. [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015687.html -- Jason ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2017-07-07 13:14 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-07-06 12:26 Note about rbd_aio_write usage Piotr Dałek
2017-07-06 13:03 ` Jason Dillaman
[not found] ` <CA+aFP1A5UVZzSDfFvphsVq9Jra9PXbJHUXhaJ9tH8LGx5OKJfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-07-06 13:33 ` Piotr Dałek
2017-07-06 13:43 ` Jason Dillaman
2017-07-06 14:22 ` Piotr Dałek
2017-07-06 14:40 ` Jason Dillaman
2017-07-06 15:46 ` Piotr Dałek
[not found] ` <e39eba47-ae4d-05bb-41cd-928565324d6c-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
2017-07-06 18:02 ` Jason Dillaman
2017-07-06 19:25 ` Piotr Dałek
2017-07-06 19:39 ` Jason Dillaman
2017-07-07 6:48 ` Piotr Dałek
2017-07-07 13:14 ` Jason Dillaman
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.