Re: CephFS use cases + MDS limitations

* Re: CephFS use cases + MDS limitations
       [not found] <1404675857.58.1383757057746.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2013-11-06 16:59 ` Matt W. Benjamin
From: Matt W. Benjamin @ 2013-11-06 16:59 UTC
  To: Michael Sevilla; +Cc: ceph-devel

Hi Michael,

Thanks for posting this.

We don't have specific workload information, but we did want to mention some of
the experimental Cephfs development we (Cohortfs) have been doing, in case it
might be of interest to others in the community.

One of the projects we've undertaken is to implement pnfs-metastripe
(a proposal for scale-out metadata in NFSv4) on Cephfs.  In doing that we've
essentially been evolving a metastripe-flavored version of Cephfs, building on
previous work to provide first-class lookup-by-inode# support (more below).

Our current codebase has a number of changes.

In support of metastripe, we've augmented directory fragmentation with the
concept of stripes, each of which can be locked and modified independently.
In order to permit parallel updates on stripes, clients take "stripe caps" in
place of a single capset on directories.
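
To make the stripe idea concrete, here is a minimal Python sketch. It is an
illustration only, not Cohortfs code: the class name, the CRC32-based choice
of stripe, and the per-stripe lock standing in for a "stripe cap" are all
assumptions made for the example.

```python
import threading
import zlib


class StripedDirectory:
    """Toy directory whose entries are spread over independently locked stripes."""

    def __init__(self, stripe_count):
        # One dentry map and one lock per stripe (the lock is a stand-in
        # for a per-stripe capability; this is an assumption).
        self.stripes = [{} for _ in range(stripe_count)]
        self.locks = [threading.Lock() for _ in range(stripe_count)]

    def stripe_of(self, name):
        # Hypothetical placement: a stable hash of the dentry name.
        return zlib.crc32(name.encode("utf-8")) % len(self.stripes)

    def link(self, name, ino):
        # Only the owning stripe is locked, so updates that land on
        # different stripes do not serialize against each other.
        idx = self.stripe_of(name)
        with self.locks[idx]:
            self.stripes[idx][name] = ino
        return idx

    def lookup(self, name):
        idx = self.stripe_of(name)
        with self.locks[idx]:
            return self.stripes[idx].get(name)
```

Two creates in the same directory contend only when their names hash to the
same stripe, which is the property a single directory-wide capset cannot give.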

We've also extended the Ceph cap model to support in-place state updates, as
well as invalidates.

We have a group of changes intended to increase mds workload independence,
including more independent caching.

There are many cases where a ceph mds needs to get a cache replica of an object
from its auth mds. Most are needed in order to satisfy a client request (like a
rename from one mds to another).  Many others, however, are necessitated by the
reliance on full paths to locate objects.  This means that every cache object
must then have cache replicas of all parent objects in order to make these
traversals possible.  These extra cache replicas have a cost in terms of memory,
lock latency, and messaging overhead that will have an effect on scalability.
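
A small Python sketch of why path-based location drags parent replicas along.
The function name and the example path are hypothetical; the point is only
that the number of parent objects a traversal needs grows with path depth.

```python
def ancestors_needed(path):
    """Return the parent directories a full-path traversal must have cached."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        # The root itself has no parents.
        return []
    # Root first, then every proper prefix of the path.
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


# Every entry in this list is an object the mds must hold (possibly as a
# cache replica from another mds) before it can reach the target by path.
replicas = ancestors_needed("/home/alice/data/file.txt")
```

Here `replicas` has four entries for a file four levels deep, whereas a
lookup keyed directly on the inode number would need none of them.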

All of these overheads are essentially side effects of Ceph's method of storing
inodes with their primary dentry.  We're attacking this by storing inodes in a
separate container, which is also striped across MDS nodes to enable
lookup-by-ino with a simple placement function. Obviously, this design
change is a big one, which trades away some of Cephfs's inlining properties
for parallel performance and better NFS tuning.
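
As a rough illustration of what a "simple placement function" could look
like, here is a Python sketch. The modulo mapping, the function names, and
the stripe and mds counts are assumptions chosen for the example, not the
actual Cohortfs placement.

```python
def stripe_for_ino(ino, stripe_count):
    # Assumed placement: inode number modulo the number of stripes.
    return ino % stripe_count


def mds_for_stripe(stripe, mds_count):
    # Assumed static assignment of stripes to mds ranks.
    return stripe % mds_count


def locate(ino, stripe_count, mds_count):
    """Return (stripe, mds rank) for an inode without walking any path."""
    stripe = stripe_for_ino(ino, stripe_count)
    return stripe, mds_for_stripe(stripe, mds_count)
```

Any client or mds that knows the two counts can compute the location of an
inode locally, so no parent directories have to be cached to find it.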

We have other client and MDS work planned and/or in progress, including client
concurrency work (in progress), MDS concurrency work (planned), MDS cache management
changes (planned), and client cache management changes (in progress).

We're looking to add the ability to journal inode updates as deltas, in order
to compress the journals and speed up replay. Further down the line, we'd like
to create a journal for each stripe of the inode container (where stripe
count >> mds count), rather than tying them to an individual mds. This would
facilitate load balancing and failover, by allowing any mds to become
authoritative for a stripe of inodes by replaying its journal.
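
The delta-journal idea can be sketched in a few lines of Python. This is a
minimal sketch under stated assumptions: the class, its methods, and the
inode fields are hypothetical, and a real journal would be persistent and
ordered against other metadata operations.

```python
from collections import defaultdict


class StripeJournal:
    """Toy journal for one stripe that records only the fields that changed."""

    def __init__(self):
        self.entries = []  # list of (ino, {field: new_value}) deltas

    def log_update(self, ino, **changed):
        # Record a delta rather than a full copy of the inode.
        self.entries.append((ino, dict(changed)))

    def replay(self):
        # Rebuild inode state by applying the deltas in order; a later
        # delta overwrites earlier values for the same field.
        inodes = defaultdict(dict)
        for ino, delta in self.entries:
            inodes[ino].update(delta)
        return dict(inodes)


# One journal per stripe (not per mds): whichever mds replays a stripe's
# journal ends up with the state it needs to serve that stripe.
journals = [StripeJournal() for _ in range(8)]
journals[3].log_update(1003, mode=0o644, size=0)
journals[3].log_update(1003, size=4096)
recovered = journals[3].replay()
```

Because each journal covers a single stripe, taking over a stripe means
replaying only that stripe's deltas rather than another mds's whole log.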

One of our main goals is for a plurality of Cohortfs (and Cephfs) file systems
to coexist in a Ceph cluster, in separate or unified namespaces.

So, in fine, our Cohortfs version of Cephfs makes some tradeoffs that we expect
to perform better on some workloads, and perhaps worse on others, but some of the
work we've performed may also be useful to traditional Cephfs.

We've been working entirely on our own so far, but we're doing open source work.
We welcome feedback, and if there are others in the community interested in
collaborating in these or related areas, you're welcome to join in.

Matt, Casey, Adam, Marcus

----- "Michael Sevilla" <mikesevilla3@gmail.com> wrote:

> Hi Ceph community,
> 
> I’d like to get a feel for some of the problems that CephFS users are
> encountering with single MDS deployments. There were requests for
> stable distributed metadata/MDS services [1] and I’m guessing it’s
> because your workloads exhibit many, many metadata operations. Some of
> you mentioned opening many files in a directory for checkpointing,
> recursive stats on a directory, etc. [2] and I’d like more details,
> such as:
> - workloads/applications that stress the MDS service that would cause
> you to call for multi-MDS support
> - use cases for the Ceph file system (I’m not really too interested in
> users using CephFS to host VMs, since many of these use cases are
> migrating to RBD)
> 
> I’m just trying to get an idea of what’s out there and the problems
> CephFS users encounter as a result of a bottlenecked MDS (single node
> or cluster).
> 
> Thanks!
> 
> Michael
> 
> [1] CephFS MDS Status Discussion,
> http://ceph.com/dev-notes/cephfs-mds-status-discussion/
> [2] CephFS First Product Release Discussion,
> http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13524
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 


* CephFS use cases + MDS limitations
@ 2013-11-03 23:53 Michael Sevilla
From: Michael Sevilla @ 2013-11-03 23:53 UTC
  To: ceph-devel

Hi Ceph community,

I’d like to get a feel for some of the problems that CephFS users are
encountering with single MDS deployments. There were requests for
stable distributed metadata/MDS services [1] and I’m guessing it’s
because your workloads exhibit many, many metadata operations. Some of
you mentioned opening many files in a directory for checkpointing,
recursive stats on a directory, etc. [2] and I’d like more details,
such as:
- workloads/applications that stress the MDS service that would cause
you to call for multi-MDS support
- use cases for the Ceph file system (I’m not really too interested in
users using CephFS to host VMs, since many of these use cases are
migrating to RBD)

I’m just trying to get an idea of what’s out there and the problems
CephFS users encounter as a result of a bottlenecked MDS (single node
or cluster).

Thanks!

Michael

[1] CephFS MDS Status Discussion,
http://ceph.com/dev-notes/cephfs-mds-status-discussion/
[2] CephFS First Product Release Discussion,
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13524


* Re: CephFS use cases + MDS limitations
  2013-11-03 23:53 Michael Sevilla
@ 2013-11-06  5:40 ` Malcolm Haak
From: Malcolm Haak @ 2013-11-06  5:40 UTC
  To: Michael Sevilla, ceph-devel

Michael,

I haven't seen any on-list replies yet, so I wasn't sure if this was the 
right place. But I'll just reply and somebody will let me know if I am 
wrong.

The use cases I have encountered, in my clustered computing universe, 
were implemented with a different proprietary clustered file system. 
These file-systems were being used as home folders or "shared scratch" 
space. And the specific issues occur when you have users who 'misbehave' 
or have code that, by way of its function, creates (and destroys) large 
numbers of files and in the process bogs down file-system access for everybody. 
I have not yet implemented ceph in production in this role but base 
testing shows it will encounter the same issues.

While it is ideal to not do such things to a clustered file system, it 
would be nice to be able to dedicate an MDS to specific sub folders 
without having to create a whole separate sub-file-system/mount-point 
(as is the current procedure with other solutions).
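
One way to picture "dedicating an MDS to specific sub folders" is a
longest-prefix lookup from directory subtrees to MDS ranks. The Python below
is only a sketch: the function, the assignment table, and the rank numbers
are hypothetical and are not an existing Ceph interface.

```python
def mds_for_path(path, assignments, default_mds=0):
    """Pick the MDS rank whose assigned subtree is the deepest ancestor of path."""
    best_rank, best_len = default_mds, -1
    for subtree, rank in assignments.items():
        prefix = subtree.rstrip("/")
        # Match the subtree root itself or anything strictly beneath it,
        # so that "/home" does not accidentally capture "/homework".
        if path == prefix or path.startswith(prefix + "/"):
            if len(prefix) > best_len:
                best_rank, best_len = rank, len(prefix)
    return best_rank


# Hypothetical layout: a noisy scratch job area gets its own MDS.
assignments = {"/home": 1, "/scratch": 2, "/scratch/jobs": 3}
rank = mds_for_path("/scratch/jobs/run1/out.dat", assignments)
```

With a table like this, a user hammering `/scratch/jobs` would load only the
MDS assigned to that subtree, and changing the table would be the 'on the
fly' part.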

It would be really AWESOME to do this 'on the fly'. Having more than one 
MDS look after the whole file-system in an ACTIVE/ACTIVE fashion would 
be nice/ideal (as long as latency is not too negatively impacted), but 
really just being able to 'shard' the file-system up would be more than 
sufficient to solve most of the issues I usually encounter. Having this 
kind of functionality would be a 'killer feature' for this kind of workload.

I hope my wall of text makes sense. Please feel free to ping me with 
questions.

Regards

Malcolm Haak

