* Re: CephFS use cases + MDS limitations
From: Matt W. Benjamin @ 2013-11-06 16:59 UTC
To: Michael Sevilla; +Cc: ceph-devel

Hi Michael,

Thanks for posting this. We don't have specific workload information, but we did want to mention some of the experimental Cephfs development we (Cohortfs) have been doing, in case it might be of interest to others in the community.

One of the projects we've undertaken is to implement pnfs-metastripe (a proposal for scale-out metadata in NFSv4) on Cephfs. In doing that we've essentially been evolving a metastripe-flavored version of Cephfs, building on previous work to provide first-class lookup-by-inode# support (more below).

Our current codebase has a number of changes. In support of metastripe, we've augmented directory fragmentation with the concept of stripes, each of which can be locked and modified independently. In order to permit parallel updates on stripes, clients take "stripe caps" in place of a single capset on directories. We've also extended the Ceph cap model to support in-place state updates, as well as invalidates.

We have a group of changes intended to increase mds workload independence, including more independent caching. There are many cases where a ceph mds needs to get a cache replica of an object from its auth mds. Most are needed in order to satisfy a client request (like a rename from one mds to another). Many others, however, are necessitated by the reliance on full paths to locate objects: every cache object must then have cache replicas of all its parent objects in order to make these traversals possible. These extra cache replicas have a cost in memory, lock latency, and messaging overhead that affects scalability.
All of these overheads are essentially side effects of Ceph's method of storing inodes with their primary dentry. We're attacking this by storing inodes in a separate container, which is also striped across MDS nodes to enable lookup-by-ino with a simple placement function. Obviously, this design change is a big one, which trades away some of Cephfs' inlining properties for parallel performance and better NFS tuning.

We have other client and MDS work planned and/or in progress, including client concurrency work (in progress), MDS concurrency work (planned), MDS cache management changes (planned), and client cache management changes (in progress).

We're looking to add the ability to journal inode updates as deltas, in order to compress the journals and speed up replay. Further down the line, we'd like to create a journal for each stripe of the inode container (where stripe count >> mds count), rather than tying them to an individual mds. This would facilitate load balancing and failover, by allowing any mds to become authoritative for a stripe of inodes by replaying its journal.

One of our main goals is for a plurality of Cohortfs (and Cephfs) file systems to coexist in a Ceph cluster, in separate or unified namespaces.

So, in fine, our Cohortfs version of Cephfs makes some tradeoffs that we expect to perform better on some workloads, and perhaps worse on others, but some of the work we've performed may also be useful to traditional Cephfs.

We've been working entirely on our own so far, but we're doing open source work. We welcome feedback, and if there are others in the community interested in collaborating in these or related areas, you're welcome to join in.

Matt, Casey, Adam, Marcus

-- 
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104

http://cohortfs.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
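As an illustration of the lookup-by-ino placement and per-stripe ownership described in the message above, here is a minimal Python sketch. It is not taken from the Cohortfs or Cephfs code: the modulo placement function, the `StripeMap` class, the round-robin stripe assignment, and every name and parameter are assumptions made for illustration only. The message says only that placement is "simple" and that stripe count >> mds count.

```python
# Hypothetical sketch; no name here comes from Cephfs or Cohortfs.

def stripe_for_ino(ino: int, stripe_count: int) -> int:
    """Map an inode number to a stripe of the inode container.

    Assumption: a plain modulo stands in for the "simple placement
    function" mentioned in the message.
    """
    return ino % stripe_count


class StripeMap:
    """Tracks which MDS rank is authoritative for each inode stripe."""

    def __init__(self, stripe_count: int, mds_ranks: list[int]):
        self.stripe_count = stripe_count
        # Assumption: stripes start out spread round-robin over the ranks.
        self.owner = {
            s: mds_ranks[s % len(mds_ranks)] for s in range(stripe_count)
        }

    def auth_mds(self, ino: int) -> int:
        """Find the authoritative MDS for an inode without any path
        traversal, so no parent directories need to be cached."""
        return self.owner[stripe_for_ino(ino, self.stripe_count)]

    def fail_over(self, failed: int, survivors: list[int]) -> list[int]:
        """Hand the failed rank's stripes to the survivors.

        In the scheme described, each new owner would replay that
        stripe's journal before serving it; replay is not modeled here.
        Returns the stripes that moved.
        """
        orphaned = sorted(s for s, r in self.owner.items() if r == failed)
        for i, stripe in enumerate(orphaned):
            self.owner[stripe] = survivors[i % len(survivors)]
        return orphaned
```

With 16 stripes over three MDS ranks, `StripeMap(16, [0, 1, 2]).auth_mds(ino)` resolves an inode in two dictionary-free arithmetic steps plus one table lookup, and `fail_over(1, [0, 2])` moves only the stripes that rank 1 owned. Because stripes are much smaller than a whole MDS's share, ownership can be rebalanced in small units.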
* CephFS use cases + MDS limitations
From: Michael Sevilla @ 2013-11-03 23:53 UTC
To: ceph-devel

Hi Ceph community,

I’d like to get a feel for some of the problems that CephFS users are encountering with single MDS deployments. There were requests for stable distributed metadata/MDS services [1] and I’m guessing it’s because your workloads exhibit many, many metadata operations. Some of you mentioned opening many files in a directory for checkpointing, recursive stats on a directory, etc. [2] and I’d like more details, such as:
- workloads/applications that stress the MDS service that would cause you to call for multi-MDS support
- use cases for the Ceph file system (I’m not really too interested in users using CephFS to host VMs, since many of these use cases are migrating to RBD)

I’m just trying to get an idea of what’s out there and the problems CephFS users encounter as a result of a bottlenecked MDS (single node or cluster).

Thanks!

Michael

[1] CephFS MDS Status Discussion, http://ceph.com/dev-notes/cephfs-mds-status-discussion/
[2] CephFS First Product Release Discussion, http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13524
* Re: CephFS use cases + MDS limitations
From: Malcolm Haak @ 2013-11-06 5:40 UTC
To: Michael Sevilla, ceph-devel

Michael,

I haven't seen any on-list replies yet, so I wasn't sure if this was the right place. But I'll just reply and somebody will let me know if I am wrong.

The use cases I have encountered, in my clustered computing universe, were implemented with a different proprietary clustered file system. These file systems were being used as home folders or "shared scratch" space. The specific issues occur when you have users who 'misbehave' or have code that, by the nature of its function, creates (and destroys) large numbers of files, and in the process bogs down file-system access for everybody.

I have not yet implemented ceph in production in this role, but base testing shows it will encounter the same issues.

While it is ideal to not do such things to a clustered file system, it would be nice to be able to dedicate an MDS to specific sub folders without having to create a whole separate sub-file-system/mount-point (as is the current procedure with other solutions). It would be really AWESOME to do this 'on the fly'.

Having more than one MDS look after the whole file system in an ACTIVE/ACTIVE fashion would be nice/ideal (as long as latency is not too negatively impacted), but really just being able to 'shard' the file system up would be more than sufficient to solve most of the issues I usually encounter.

Having this kind of functionality would be a 'killer feature' for this kind of workload.

I hope my wall of text makes sense. Please feel free to ping me with questions.

Regards

Malcolm Haak
Thread overview: 3+ messages
[not found] <1404675857.58.1383757057746.JavaMail.root@thunderbeast.private.linuxbox.com>
2013-11-06 16:59 ` CephFS use cases + MDS limitations Matt W. Benjamin
2013-11-03 23:53 Michael Sevilla
2013-11-06 5:40 ` Malcolm Haak