* O_DIRECT logic in CephFS, ceph-fuse / Performance
From: Kasper Dieter @ 2014-03-12 20:27 UTC
To: ceph-devel@vger.kernel.org; +Cc: Kasper Dieter, mark.nelson, Sage Weil

The 'man 2 open' states
---snip---
The behaviour of O_DIRECT with NFS will differ from local file systems. (...)
The NFS protocol does not support passing the flag to the server,
so O_DIRECT I/O will bypass the page cache only on the client;
the server may still cache the I/O.
---snip---

Q1: How do CephFS and ceph-fuse handle the O_DIRECT flag?
    (Like NFS, Ceph is a network FS too, with a client and a server.)

Some test cases with O_DIRECT & io_submit() on 4K
(65536, 262144, 1048576 and 4194304 are the different obj_size values):

out.rand.fuse.ssd2-r2-1-1-1048576:    Max. throughput read : 7.22768MB/s
out.rand.fuse.ssd2-r2-1-1-262144:     Max. throughput read : 7.18318MB/s
out.rand.fuse.ssd2-r2-1-1-65536:      Max. throughput read : 7.25543MB/s
out.sequ.fuse.ssd2-r2-1-1-1048576:    Max. throughput read : 118.092MB/s
out.sequ.fuse.ssd2-r2-1-1-262144:     Max. throughput read : 111.073MB/s
out.sequ.fuse.ssd2-r2-1-1-65536:      Max. throughput read : 95.4332MB/s

out.rand.cephfs.ssd2-r2-1-1-1048576:  Max. throughput read : 11.2144MB/s
out.rand.cephfs.ssd2-r2-1-1-262144:   Max. throughput read : 11.0371MB/s
out.rand.cephfs.ssd2-r2-1-1-65536:    Max. throughput read : 11.017MB/s
out.sequ.cephfs.ssd2-r2-1-1-1048576:  Max. throughput read : 11.2299MB/s
out.sequ.cephfs.ssd2-r2-1-1-262144:   Max. throughput read : 10.9488MB/s
out.sequ.cephfs.ssd2-r2-1-1-65536:    Max. throughput read : 10.5669MB/s

out.rand.t3-ssd2-v2-1-1048576-20:     Max. throughput read : 81.9598MB/s
out.rand.t3-ssd2-v2-1-262144-18:      Max. throughput read : 140.45MB/s
out.rand.t3-ssd2-v2-1-4194304-22:     Max. throughput read : 55.8478MB/s
out.rand.t3-ssd2-v2-1-65536-16:       Max. throughput read : 158.441MB/s
out.sequ.t3-ssd2-v2-1-1048576-20:     Max. throughput read : 74.3693MB/s
out.sequ.t3-ssd2-v2-1-262144-18:      Max. throughput read : 140.444MB/s
out.sequ.t3-ssd2-v2-1-4194304-22:     Max. throughput read : 42.7327MB/s
out.sequ.t3-ssd2-v2-1-65536-16:       Max. throughput read : 165.434MB/s

t3 = XFS on rbd.ko

CephFS and ceph-fuse seem to use no caching at all on random reads.
Ceph-fuse seems to use some caching on sequential reads.
rbd.ko seems to use caching on all reads (because only XFS knows about O_DIRECT ;-))

Q2: How can the read-caching logic be enabled for ceph-fuse / CephFS?

BTW I'm aware of the "O_DIRECT (...) designed by a deranged monkey" text in the open(2) man page ;-)

-Dieter

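For scale, the 4K throughput figures above can be turned into approximate operations per second. This is only a rough sketch: it assumes that "MB/s" in the results means MiB/s (2**20 bytes), which the message does not state, and the helper name `iops` is made up for illustration.

```python
# Convert a reported throughput into approximate 4K operations per second.
# Assumption: "MB/s" means MiB/s (2**20 bytes); the thread does not say.

BLOCK_SIZE = 4096  # 4K request size stated for the test cases


def iops(throughput_mb_s: float, block_size: int = BLOCK_SIZE, mb: int = 2**20) -> float:
    """Return the I/O operations per second implied by a throughput figure."""
    return throughput_mb_s * mb / block_size


# A few figures copied from the results above (1048576-byte objects).
samples = {
    "rand.fuse": 7.22768,
    "sequ.fuse": 118.092,
    "rand.cephfs": 11.2144,
    "sequ.cephfs": 11.2299,
}

for name, mb_s in samples.items():
    print(f"{name:12s} {mb_s:9.4f} MB/s  ~{iops(mb_s):8.0f} IOPS")
```

Under that assumption, ceph-fuse random reads come to roughly 1,850 4K reads per second and kernel CephFS to roughly 2,870, for random and sequential access alike. Pass `mb=10**6` if the tool reported decimal megabytes.
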
* Re: O_DIRECT logic in CephFS, ceph-fuse / Performance
From: Milosz Tanski @ 2014-03-12 22:38 UTC
To: Kasper Dieter; +Cc: ceph-devel@vger.kernel.org, mark.nelson, Sage Weil

Kasper,

I only know about the kernel CephFS client, but it has special code
paths for O_DIRECT reads and writes. Both reads and writes bypass the
page cache and send commands directly to the OSDs for the objects; in
the write case the object has a write lock with the MDS. So unlike NFS,
this seems to do the right thing.

I'm guessing that when you say XFS on rbd with O_DIRECT, you mean the
files are opened with O_DIRECT on the filesystem. That doesn't take into
account the readahead that the kernel does in the block device layer,
which is independent of file read-ahead (it sits at a much lower layer).
You can find out what it is set to using the "blockdev --getra /dev/XXX"
command.

Cheers,
- Milosz

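The block-layer read-ahead mentioned above can also be read from sysfs. The sketch below is not taken from the thread: it assumes a Linux client, where `blockdev --getra` reports the value in 512-byte sectors and `/sys/block/<dev>/queue/read_ahead_kb` holds the same setting in KiB. The device name `rbd0` is a placeholder.

```python
from pathlib import Path
from typing import Optional

SECTOR_BYTES = 512  # blockdev --getra reports read-ahead in 512-byte sectors


def ra_sectors_to_kib(sectors: int) -> float:
    """Convert a `blockdev --getra` value (sectors) to KiB."""
    return sectors * SECTOR_BYTES / 1024


def read_ahead_kib(device: str, sysfs_root: str = "/sys/block") -> Optional[int]:
    """Return read_ahead_kb for a block device, or None if it cannot be read."""
    path = Path(sysfs_root) / device / "queue" / "read_ahead_kb"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


print(ra_sectors_to_kib(256))   # 256 sectors -> 128.0 KiB
print(read_ahead_kib("rbd0"))   # "rbd0" is a placeholder device name
```

A read-ahead well above the 4K request size would be one possible reason for sequential reads through rbd.ko looking cached even when the file on XFS is opened with O_DIRECT.
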
* Re: O_DIRECT logic in CephFS, ceph-fuse / Performance
From: Sage Weil @ 2014-03-13 0:08 UTC
To: Milosz Tanski; +Cc: Kasper Dieter, ceph-devel@vger.kernel.org, mark.nelson

Hi Kasper,

In order to do what you want here, we need to make O_DIRECT-initiated
requests on the client carry a flag that tells the OSD to also bypass
its cache. That doesn't happen right now.

Assuming we do add that flag, we can either make the IO actually do
O_DIRECT, or we can make it do some fadvise after the call, or any
manner of things, depending on what makes the most sense for that
particular backend implementation. For FileStore, it seems pretty likely
that O_DIRECT is the right thing. It is somewhat complicated by the
presence of the FDCache, which avoids opening a new file descriptor for
each IO, so it is non-trivial, but doable.

There's nothing preventing us from identifying what these hints on
writes might be now. Other possibilities that have come up:

- the following write should be done O_DIRECT. Or perhaps more
  precisely, the write should not be cached (e.g., because the client
  is caching it, or doesn't expect to ever read it)
- the following write is on data that is expected to be immutable
- the following write is on data that is expected to have a short/long
  lifetime

etc.

sage

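To make the idea of per-request hints concrete, here is a purely illustrative sketch. None of these names exist in Ceph: `WriteHint` and `backend_write` are hypothetical, and the backend shown is a plain local file rather than FileStore. It takes the "some fadvise after the call" route and asks the kernel to drop the written pages when a no-cache hint is set.

```python
import enum
import os


class WriteHint(enum.Flag):
    """Hypothetical per-request hints; the names are illustrative, not Ceph's."""
    NONE = 0
    NOCACHE = enum.auto()      # data should not stay in the server-side cache
    IMMUTABLE = enum.auto()    # data is not expected to change
    SHORT_LIVED = enum.auto()  # data is expected to be deleted soon
    LONG_LIVED = enum.auto()   # data is expected to be kept for a long time


def backend_write(path: str, offset: int, data: bytes,
                  hints: WriteHint = WriteHint.NONE) -> int:
    """Write data at offset and, if NOCACHE is set, advise the kernel to drop it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        written = os.pwrite(fd, data, offset)
        if WriteHint.NOCACHE in hints:
            os.fsync(fd)  # dirty pages must be written back before they can be dropped
            try:
                os.posix_fadvise(fd, offset, written, os.POSIX_FADV_DONTNEED)
            except (AttributeError, OSError):
                pass  # the advice is best effort; ignore platforms that lack it
        return written
    finally:
        os.close(fd)
```

A caller would combine hints with `|`, for example `backend_write(path, 0, buf, WriteHint.NOCACHE | WriteHint.IMMUTABLE)`. The other flags are carried but not acted on here, since how a backend should use them is exactly the open question.
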