From: "Matt W. Benjamin"
Subject: Re: CephFS use cases + MDS limitations
Date: Wed, 6 Nov 2013 11:59:14 -0500 (EST)
Message-ID: <1831866980.60.1383757154859.JavaMail.root@thunderbeast.private.linuxbox.com>
References: <1404675857.58.1383757057746.JavaMail.root@thunderbeast.private.linuxbox.com>
In-Reply-To: <1404675857.58.1383757057746.JavaMail.root@thunderbeast.private.linuxbox.com>
Reply-To: "Matt W. Benjamin"
To: Michael Sevilla
Cc: ceph-devel@vger.kernel.org

Hi Michael,

Thanks for posting this.

We don't have specific workload information, but we did want to mention
some of the experimental Cephfs development we (Cohortfs) have been
doing, in case it might be of interest to others in the community.

One of the projects we've undertaken is to implement pnfs-metastripe
(a proposal for scale-out metadata in NFSv4) on Cephfs. In doing that
we've essentially been evolving a metastripe-flavored version of Cephfs,
building on previous work to provide first-class lookup-by-inode#
support (more below). Our current codebase has a number of changes.

In support of metastripe, we've augmented directory fragmentation with
the concept of stripes, each of which can be locked and modified
independently. In order to permit parallel updates on stripes, clients
take "stripe caps" in place of a single capset on directories.

We've also extended the Ceph cap model to support in-place state
updates, as well as invalidates.

We have a group of changes intended to increase mds workload
independence, including more independent caching.
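
To make the stripe idea above concrete, here is a minimal sketch in Python. It is purely illustrative and is not Cohortfs or Ceph code: the class name `StripedDirectory`, the CRC32-modulo placement, and the use of a plain mutex in place of a real "stripe cap" are all assumptions. It only shows how hashing directory entries into stripes lets updates to different stripes proceed without sharing one directory-wide lock.

```python
import threading
import zlib


class StripedDirectory:
    """Toy directory whose entries are split into independently locked stripes."""

    def __init__(self, stripe_count=4):
        self.stripe_count = stripe_count
        # One entry table and one lock per stripe (hypothetical layout).
        self._stripes = [{} for _ in range(stripe_count)]
        self._locks = [threading.Lock() for _ in range(stripe_count)]

    def stripe_of(self, name):
        # Assumed placement: stable hash of the entry name modulo stripe count.
        return zlib.crc32(name.encode("utf-8")) % self.stripe_count

    def link(self, name, ino):
        # Only the stripe holding this name is locked; other stripes stay free.
        idx = self.stripe_of(name)
        with self._locks[idx]:
            self._stripes[idx][name] = ino

    def lookup(self, name):
        idx = self.stripe_of(name)
        with self._locks[idx]:
            return self._stripes[idx].get(name)
```

Two writers creating entries that hash to different stripes never contend on the same lock here, which is the property a per-stripe cap is meant to give clients in place of a single capset on the whole directory.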

There are many cases where a ceph mds needs to get a cache replica of
an object from its auth mds. Most are needed in order to satisfy a
client request (like a rename from one mds to another). Many others,
however, are necessitated by the reliance on full paths to locate
objects. This means that every cache object must then have cache
replicas of all parent objects in order to make these traversals
possible. These extra cache replicas have a cost in terms of memory,
lock latency, and messaging overhead that will have an effect on
scalability.

All of these overheads are essentially side effects of Ceph's method of
storing inodes with their primary dentry. We're attacking this by
storing inodes in a separate container, which is also striped across
MDS nodes to enable lookup-by-ino with a simple placement function.
Obviously, the former design change is a big one, which trades away
some of Cephfs' inlining properties for parallel performance and
better NFS tuning.

We have other client and MDS work planned and/or in progress, including
client concurrency work (in progress), MDS concurrency work (planned),
MDS cache management changes (planned), and client cache management
changes (in progress).

We're looking to add the ability to journal inode updates as deltas, in
order to compress the journals and speed up replay. Further down the
line, we'd like to create a journal for each stripe of the inode
container (where stripe count >> mds count), rather than tying them to
an individual mds. This would facilitate load balancing and failover,
by allowing any mds to become authoritative for a stripe of inodes by
replaying its journal.

One of our main goals is for a plurality of Cohortfs (and Cephfs) file
systems to coexist in a Ceph cluster, in separate or unified namespaces.
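
The striped inode container and per-stripe delta journals described above can be sketched roughly as follows. This is a hypothetical Python model, not Cohortfs code: the modulo placement function, the names `stripe_for_ino`, `InodeStripe`, and `Cluster`, the stripe count of 64, and the in-memory list standing in for a journal are all assumptions made for illustration.

```python
from collections import defaultdict

STRIPE_COUNT = 64  # hypothetical value, chosen so stripe count >> mds count


def stripe_for_ino(ino, stripe_count=STRIPE_COUNT):
    # Assumed "simple placement function": inode number modulo stripe count.
    return ino % stripe_count


class InodeStripe:
    """One stripe of the inode container, with its own journal of deltas."""

    def __init__(self):
        self.journal = []  # ordered (ino, {field: new_value}) records

    def log_update(self, ino, **changed):
        # Journal only the fields that changed, not the whole inode.
        self.journal.append((ino, dict(changed)))

    def replay(self):
        # Rebuild inode state by applying the deltas in journal order.
        inodes = defaultdict(dict)
        for ino, delta in self.journal:
            inodes[ino].update(delta)
        return dict(inodes)


class Cluster:
    def __init__(self, mds_names, stripe_count=STRIPE_COUNT):
        self.stripe_count = stripe_count
        self.stripes = [InodeStripe() for _ in range(stripe_count)]
        # Initial stripe-to-mds assignment: simple round robin (an assumption).
        self.owner = {s: mds_names[s % len(mds_names)]
                      for s in range(stripe_count)}

    def auth_mds(self, ino):
        # lookup-by-ino: no path traversal, just placement plus ownership map.
        return self.owner[stripe_for_ino(ino, self.stripe_count)]

    def update(self, ino, **changed):
        stripe = stripe_for_ino(ino, self.stripe_count)
        self.stripes[stripe].log_update(ino, **changed)

    def take_over(self, stripe, new_mds):
        # Failover or rebalancing: the new mds becomes authoritative for the
        # stripe and recovers its inode state by replaying the stripe journal.
        self.owner[stripe] = new_mds
        return self.stripes[stripe].replay()
```

Because the journal belongs to the stripe rather than to an mds, moving a stripe in this model only means changing one entry in the ownership map and replaying one small journal.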

So, in fine, our Cohortfs version of Cephfs makes some tradeoffs that we
expect to perform better on some workloads, and perhaps worse on
others, but some of the work we've performed may also be useful to
traditional Cephfs.

We've been working entirely on our own so far, but we're doing open
source work. We welcome feedback, and if there are others in the
community interested in collaborating in these or related areas, you're
welcome to join in.

Matt, Casey, Adam, Marcus

----- "Michael Sevilla" wrote:

> Hi Ceph community,
>
> I’d like to get a feel for some of the problems that CephFS users are
> encountering with single MDS deployments. There were requests for
> stable distributed metadata/MDS services [1] and I’m guessing it's
> because your workloads exhibit many, many metadata operations. Some of
> you mentioned opening many files in a directory for checkpointing,
> recursive stats on a directory, etc. [2] and I’d like more details,
> such as:
> - workloads/applications that stress the MDS service that would cause
>   you to call for multi-MDS support
> - use cases for the Ceph file system (I’m not really too interested in
>   users using CephFS to host VMs, since many of these use cases are
>   migrating to RBD)
>
> I’m just trying to get an idea of what’s out there and the problems
> CephFS users encounter as a result of a bottlenecked MDS (single node
> or cluster).
>
> Thanks!
>
> Michael
>
> [1] CephFS MDS Status Discussion,
> http://ceph.com/dev-notes/cephfs-mds-status-discussion/
> [2] CephFS First Product Release Discussion,
> http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13524
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave.
Suite 150
Ann Arbor, MI 48104

http://cohortfs.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309