O_DIRECT logic in CephFS, ceph-fuse / Performance

* O_DIRECT logic in CephFS, ceph-fuse / Performance
From: Kasper Dieter @ 2014-03-12 20:27 UTC
  To: ceph-devel@vger.kernel.org; +Cc: Kasper Dieter, mark.nelson, Sage Weil

The open(2) man page states:
---snip---
The behaviour of O_DIRECT with NFS will differ from local file systems. (...)
The NFS protocol does not support passing the flag to the server,
so O_DIRECT I/O will bypass the page cache only on the client;
the server may still cache the I/O.
---snip---

Q1: How do CephFS and ceph-fuse handle the O_DIRECT flag?
	(Like NFS, Ceph is a network file system with a client/server split.)


Some test cases with O_DIRECT and io_submit() on 4K (65536, 262144, 1048576 and 4194304 are the different obj_size values):

out.rand.fuse.ssd2-r2-1-1-1048576:  Max. throughput read         : 7.22768MB/s
out.rand.fuse.ssd2-r2-1-1-262144:  Max. throughput read         : 7.18318MB/s
out.rand.fuse.ssd2-r2-1-1-65536:  Max. throughput read         : 7.25543MB/s
out.sequ.fuse.ssd2-r2-1-1-1048576:  Max. throughput read         : 118.092MB/s
out.sequ.fuse.ssd2-r2-1-1-262144:  Max. throughput read         : 111.073MB/s
out.sequ.fuse.ssd2-r2-1-1-65536:  Max. throughput read         : 95.4332MB/s

out.rand.cephfs.ssd2-r2-1-1-1048576:  Max. throughput read         : 11.2144MB/s
out.rand.cephfs.ssd2-r2-1-1-262144:  Max. throughput read         : 11.0371MB/s
out.rand.cephfs.ssd2-r2-1-1-65536:  Max. throughput read         : 11.017MB/s
out.sequ.cephfs.ssd2-r2-1-1-1048576:  Max. throughput read         : 11.2299MB/s
out.sequ.cephfs.ssd2-r2-1-1-262144:  Max. throughput read         : 10.9488MB/s
out.sequ.cephfs.ssd2-r2-1-1-65536:  Max. throughput read         : 10.5669MB/s

out.rand.t3-ssd2-v2-1-1048576-20:  Max. throughput read         : 81.9598MB/s
out.rand.t3-ssd2-v2-1-262144-18:  Max. throughput read         : 140.45MB/s
out.rand.t3-ssd2-v2-1-4194304-22:  Max. throughput read         : 55.8478MB/s
out.rand.t3-ssd2-v2-1-65536-16:  Max. throughput read         : 158.441MB/s
out.sequ.t3-ssd2-v2-1-1048576-20:  Max. throughput read         : 74.3693MB/s
out.sequ.t3-ssd2-v2-1-262144-18:  Max. throughput read         : 140.444MB/s
out.sequ.t3-ssd2-v2-1-4194304-22:  Max. throughput read         : 42.7327MB/s
out.sequ.t3-ssd2-v2-1-65536-16:  Max. throughput read         : 165.434MB/s

t3 = XFS on rbd.ko

CephFS and ceph-fuse seem to use no caching at all on random reads.
ceph-fuse seems to use some caching on sequential reads.
rbd.ko seems to use caching on all reads (because only XFS knows about O_DIRECT ;-))
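
To compare the 4K results above as request rates rather than bandwidth, the throughput figures can be converted to approximate IOPS. This is a minimal sketch, assuming every request is 4096 bytes and that "MB/s" in the tool output means 10^6 bytes per second (the output does not say which); the function name and regular expression are illustrative only.

```python
import re

# Assumptions: 4096-byte requests, and "MB/s" meaning 10**6 bytes per second.
IO_SIZE = 4096

LINE_RE = re.compile(
    r"^out\.(rand|sequ)\.(\S+?):\s+Max\. throughput read\s+:\s+([\d.]+)MB/s$"
)


def parse_result(line):
    """Return (pattern, test_name, mb_per_s, approx_iops) for one result line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised result line: {line!r}")
    pattern, name, mbps = m.group(1), m.group(2), float(m.group(3))
    iops = mbps * 1_000_000 / IO_SIZE
    return pattern, name, mbps, iops
```

Under those assumptions, the 7.22768MB/s ceph-fuse random-read line works out to roughly 1,765 requests per second, and the 11.2144MB/s CephFS line to roughly 2,738.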


Q2: How can the read-caching logic be enabled for ceph-fuse / CephFS?

BTW, I'm aware of the "O_DIRECT (...) designed by a deranged monkey" text in the open(2) man page ;-)


-Dieter


* Re: O_DIRECT logic in CephFS, ceph-fuse / Performance
From: Milosz Tanski @ 2014-03-12 22:38 UTC
  To: Kasper Dieter; +Cc: ceph-devel@vger.kernel.org, mark.nelson, Sage Weil

Kasper,

I only know about the kernel CephFS client, but it has special code
paths for O_DIRECT reads and writes. Both bypass the page cache and
send commands directly to the OSDs for the objects; in the write case,
the object has a write lock with the MDS. So unlike NFS, this seems
to do the right thing.
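
For reference, this is roughly what an application-side O_DIRECT read looks like on Linux. It is a sketch under assumptions: the 4096-byte alignment is a guess that satisfies most file systems, the function names are made up for illustration, and the code falls back to buffered I/O when the file system rejects O_DIRECT. It says nothing about what the Ceph client does internally.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on the file system


def _pread_aligned(fd, offset, length):
    """Read into an anonymous mmap, which is page-aligned as O_DIRECT requires."""
    buf = mmap.mmap(-1, length)
    try:
        n = os.preadv(fd, [buf], offset)
        return buf[:n]
    finally:
        buf.close()


def read_block(path, offset=0, length=BLOCK):
    """Read `length` bytes at `offset`; return (data, used_direct)."""
    if offset % BLOCK or length % BLOCK:
        raise ValueError("offset and length must be multiples of BLOCK")
    direct_flag = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if direct_flag:
        try:
            fd = os.open(path, os.O_RDONLY | direct_flag)
            try:
                return _pread_aligned(fd, offset, length), True
            finally:
                os.close(fd)
        except OSError:
            pass  # O_DIRECT was rejected; fall back to a buffered read
    fd = os.open(path, os.O_RDONLY)
    try:
        return _pread_aligned(fd, offset, length), False
    finally:
        os.close(fd)
```

The second return value reports whether the direct path was actually used, which is worth logging in a benchmark so that a silent fallback is not mistaken for an O_DIRECT result.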

I'm guessing that when you say XFS on rbd with O_DIRECT, you mean the
files are opened with O_DIRECT on the filesystem. That doesn't take into
account the readahead the kernel does in the block device layer, which
is independent of file readahead (it sits at a much lower layer). You
can find out what it is set to with the "blockdev --getra /dev/XXX"
command.
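
The value printed by "blockdev --getra" is in 512-byte sectors, while the same setting is exposed in KiB under /sys/block/<dev>/queue/read_ahead_kb. A small sketch for comparing the two; the helper names are illustrative, and any device name passed in (for example "rbd0") is an assumption about the local setup.

```python
from pathlib import Path

SECTOR_BYTES = 512  # blockdev --getra reports 512-byte sectors


def readahead_kib_from_sectors(sectors):
    """Convert a `blockdev --getra` value to KiB."""
    return sectors * SECTOR_BYTES // 1024


def sysfs_readahead_kib(device, sysfs_root="/sys/block"):
    """Return read_ahead_kb for a block device, or None if it is not present."""
    path = Path(sysfs_root) / device / "queue" / "read_ahead_kb"
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, NotADirectoryError):
        return None
```

For example, a --getra value of 256 sectors corresponds to 128 KiB of readahead.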

Cheers,
- Milosz

-- 
Milosz Tanski
CTO
10 East 53rd Street, 37th floor
New York, NY 10022

p: 646-253-9055
e: milosz@adfin.com


* Re: O_DIRECT logic in CephFS, ceph-fuse / Performance
From: Sage Weil @ 2014-03-13  0:08 UTC
  To: Milosz Tanski; +Cc: Kasper Dieter, ceph-devel@vger.kernel.org, mark.nelson

Hi Kasper,

In order to do what you want here, we need to make O_DIRECT-initiated 
requests on the client get a flag that tells the OSD to also bypass its 
cache.  That doesn't happen right now.

Assuming we do add that flag, we can either make the IO actually do 
O_DIRECT, or we can make it do some fadvise after the call, or any manner 
of things depending on what makes the most sense for that particular 
backend implementation.  For FileStore, it seems pretty likely that 
O_DIRECT is the right thing.  It is somewhat complicated by the presence 
of the FDCache which avoids opening a new file descriptor for each IO, so 
it is non-trivial, but doable.
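
The "fadvise after the call" option can be illustrated in generic POSIX terms. This is only a sketch of the technique, not FileStore code: the function name is made up, and it assumes a platform that provides posix_fadvise (the hint is skipped otherwise).

```python
import os


def write_uncached(path, data):
    """Write data, flush it, then hint that its cached pages are not needed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # dirty pages cannot be dropped, so flush them first
        if hasattr(os, "posix_fadvise"):
            # offset 0 with length 0 means "the whole file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Unlike O_DIRECT, this approach keeps using an ordinary buffered file descriptor, so it places no alignment requirements on the buffer; the cost is that the data still passes through the page cache once before being dropped.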

There's nothing preventing us from identifying what these hints on writes 
might be now.  Other possibilities that have come up:

- the following write should be done O_DIRECT, or perhaps more precisely,
  the write should not be cached (e.g., because the client is caching it,
  or doesn't expect to ever read it)
- the following write is on data that is expected to be immutable
- the following write is on data that is expected to have a short/long
  lifetime

etc.
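
One way to picture the hints listed above is as a set of flags attached to each write, which each backend then interprets as it sees fit. Everything in this sketch is hypothetical: the flag names, the values, and the backend_action helper are invented for illustration and do not correspond to any existing Ceph interface.

```python
import enum


class WriteHint(enum.Flag):
    """Hypothetical per-write hints; names and values are illustrative only."""
    NOCACHE = 1      # do not cache this write on the OSD
    IMMUTABLE = 2    # data is expected never to change
    SHORT_LIVED = 4  # data is expected to be deleted soon
    LONG_LIVED = 8   # data is expected to be kept for a long time


def validate(hints):
    """Reject contradictory combinations of hints."""
    if WriteHint.SHORT_LIVED in hints and WriteHint.LONG_LIVED in hints:
        raise ValueError("a write cannot be both short-lived and long-lived")
    return hints


def backend_action(hints, backend_prefers_direct=True):
    """Pick how a backend might perform the write: 'direct', 'fadvise' or 'buffered'."""
    validate(hints)
    if WriteHint.NOCACHE not in hints:
        return "buffered"
    return "direct" if backend_prefers_direct else "fadvise"
```

Keeping the hint ("do not cache") separate from the mechanism (O_DIRECT, fadvise, or something else) matches the point made above that the right action depends on the particular backend implementation.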

sage


