From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: NewStore update
Date: Fri, 20 Feb 2015 10:35:59 -0600
Message-ID: <54E7626F.2040205@redhat.com>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:54831 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754560AbbBTQgF (ORCPT ); Fri, 20 Feb 2015 11:36:05 -0500
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Sage Weil, Haomai Wang
Cc: "ceph-devel@vger.kernel.org"

On 02/20/2015 09:00 AM, Sage Weil wrote:
> On Fri, 20 Feb 2015, Haomai Wang wrote:
>> So cool!
>>
>> A few notes:
>>
>> 1. What about the sync thread in NewStore?
>
> My thought right now is that there will be a WAL thread and (maybe) a
> transaction commit completion thread. What do you mean by sync thread?
>
> One thing I want to avoid is the current 'op' thread in FileStore.
> Instead of queueing a transaction we will start all of the aio operations
> synchronously. This has the nice (?) side-effect that if there is memory
> backpressure it will block at submit time so we don't need to do our own
> throttling. (...though we may want to do it ourselves later anyway.)
>
>> 2. Could we consider skipping the WAL for large overwrites (backfill, RGW)?
>
> We do (or will)... if there is a truncate to 0 it doesn't need to do WAL
> at all. The onode stores the size so we'll ignore any stray bytes after
> that in the file; that lets us do the truncate async after the txn
> commits. (Slightly sloppy but the space leakage window is so small I
> don't think it's worth worrying about.)
>
>> 3. Sorry, what does [aio_]fsync mean?
>
> aio_fsync is just an fsync that's submitted as an aio operation.
> It'll make fsync fit into the same bucket as the aio writes we queue up,
> and it also means that if/when the experimental batched fsync stuff goes
> into XFS we'll take advantage of it (lots of fsyncs will be merged into a
> single XFS transaction and be much more efficient).

Looks like I need to reacquaint myself with aio.c again and figure out why
it was breaking. :)

>
> sage
>
>
>>
>>
>> On Fri, Feb 20, 2015 at 7:50 AM, Sage Weil wrote:
>>> Hi everyone,
>>>
>>> We talked a bit about the proposed "KeyFile" backend a couple months back.
>>> I've started putting together a basic implementation and wanted to give
>>> people an update about what things are currently looking like. We're
>>> calling it NewStore for now unless/until someone comes up with a better
>>> name (KeyFileStore is way too confusing). (*)
>>>
>>> You can peruse the incomplete code at
>>>
>>> https://github.com/liewegas/ceph/tree/wip-newstore/src/os/newstore
>>>
>>> This is a bit of a brain dump. Please ask questions if anything isn't
>>> clear. Also keep in mind I'm still at the stage where I'm trying to get
>>> it into a semi-working state as quickly as possible so the implementation
>>> is pretty rough.
>>>
>>> Basic design:
>>>
>>> We use a KeyValueDB (leveldb, rocksdb, ...) for all of our metadata.
>>> Object data is stored in files with simple names (%d) in a simple
>>> directory structure (one level deep, default 1M files per dir). The main
>>> piece of metadata we store is a mapping from object name (ghobject_t) to
>>> onode_t, which looks like this:
>>>
>>> struct onode_t {
>>>   uint64_t size;   ///< object size
>>>   map attrs;       ///< attrs
>>>   map data_map;    ///< data (offset to fragment mapping)
>>>
>>> i.e., it's what we used to rely on xattrs on the inode for. Here, we'll
>>> only lean on the file system for file data and its block management.
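
To make the role of the onode's size field concrete, here is a minimal sketch of how a read could be clamped to the committed size so that stray file bytes are simply never seen. It is only an illustration under my own assumptions: onode_sketch_t, readable_length and stray_bytes are made-up names, not anything from the wip-newstore tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, cut-down stand-in for the quoted onode_t; only `size`
// matters here. The name onode_sketch_t is not from the NewStore tree.
struct onode_sketch_t {
  uint64_t size;  // logical object size as committed in the kvdb
};

// How many bytes a read of [offset, offset + length) may return.
// Bytes at or beyond o.size are treated as absent, even when the
// backing file happens to be physically longer.
inline uint64_t readable_length(const onode_sketch_t& o,
                                uint64_t offset, uint64_t length) {
  if (offset >= o.size)
    return 0;
  uint64_t n = std::min(length, o.size - offset);
  assert(offset + n <= o.size);  // never reads past the committed size
  return n;
}

// How many trailing file bytes are "stray": present on disk but past
// the committed size (for example after a crash in the middle of an
// append, or before an async truncate has run).
inline uint64_t stray_bytes(const onode_sketch_t& o, uint64_t file_length) {
  return file_length > o.size ? file_length - o.size : 0;
}
```

With a committed size of 4096, a read of 8192 bytes at offset 0 is cut to 4096 bytes, and a 5000-byte backing file has 904 stray bytes that nothing ever reads.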
>>>
>>> fragment_t looks like
>>>
>>> struct fragment_t {
>>>   uint32_t offset;   ///< offset in file to first byte of this fragment
>>>   uint32_t length;   ///< length of fragment/extent
>>>   fid_t fid;         ///< file backing this fragment
>>>
>>> and fid_t is
>>>
>>> struct fid_t {
>>>   uint32_t fset, fno;   // identify the file name: fragments/%d/%d
>>>
>>> To start we'll keep the mapping pretty simple (just one fragment_t) but
>>> later we can go for varying degrees of complexity.
>>>
>>> We lean on the kvdb for our transactions.
>>>
>>> If we are creating new objects, we write data into a new file/fid,
>>> [aio_]fsync, and then commit the transaction.
>>>
>>> If we are doing an overwrite, we include a write-ahead log (wal)
>>> item in our transaction, and then apply it afterwards. For example, a 4k
>>> overwrite would include whatever metadata changes are needed, plus a wal
>>> item that says "then overwrite this 4k in this fid with this data". i.e.,
>>> the worst case is more or less what FileStore is doing now with its
>>> journal, except here we're using the kvdb (and its journal) for that. On
>>> restart we can queue up and apply any unapplied wal items.
>>>
>>> An alternative approach here that we discussed a bit yesterday would be to
>>> write the small overwrites into the kvdb adjacent to the onode. Actually
>>> writing them back to the file could be deferred until later, maybe when
>>> there are many small writes to be done together.
>>>
>>> But right now the write behavior is very simple, and handles just 3 cases:
>>>
>>> https://github.com/liewegas/ceph/blob/wip-newstore/src/os/newstore/NewStore.cc#L1339
>>>
>>> 1. New object: create a new file and write there.
>>>
>>> 2. Append: append to an existing fid. We store the size in the onode so
>>>    we can be a bit sloppy and in the failure case (where we write some
>>>    extra data to the file but don't commit the onode) just ignore any
>>>    trailing file data.
>>>
>>> 3. Anything else: generate a WAL item.
>>>
>>> 4. Maybe later, for some small [over]writes, we instead put the new data
>>>    next to the onode.
>>>
>>> There is no omap yet. I think we should do basically what DBObjectMap did
>>> (with a layer of indirection to allow clone etc), but we need to rejigger
>>> it so that the initial pointer into that structure is embedded in the
>>> onode. We may want to do some other optimization to avoid extra
>>> indirection in the common case. Leaving this for later, though...
>>>
>>> We are designing for the case where the workload is already sharded across
>>> collections. Each collection gets an in-memory Collection, which has its
>>> own RWLock and its own onode_map (SharedLRU cache). A split will
>>> basically amount to registering the new collection in the kvdb and
>>> clearing the in-memory onode cache.
>>>
>>> There is a TransContext structure that is used to track the progress of a
>>> transaction. It'll list which fd's need to get synced pre-commit, which
>>> onodes need to get written back in the transaction, and any WAL items to
>>> include and queue up after the transaction commits. Right now the
>>> queue_transaction path does most of the work synchronously just to get
>>> things working. Looking ahead I think what it needs to do is:
>>>
>>> - assemble the transaction
>>> - start any aio writes (we could use O_DIRECT here if the new hints
>>>   include WONTNEED?)
>>> - start any aio fsync's
>>> - queue kvdb transaction
>>> - fire onreadable[_sync] notifications (I suspect we'll want to do this
>>>   unconditionally; maybe we avoid using them entirely?)
>>>
>>> On transaction commit,
>>> - fire commit notifications
>>> - queue WAL operations to a finisher
>>>
>>> The WAL ops will be linked to the TransContext so that if you want to do a
>>> read on the onode you can block until it completes. If we keep the
>>> (currently simple) locking then we can use the Collection rwlock to block
>>> new writes while we wait for previous ones to apply.
>>> Or we can get more granular with the read vs write locks, but I'm not
>>> sure it'll be any use until we make major changes in the OSD (like
>>> dispatching parallel reads within a PG).
>>>
>>> Clone is annoying; if the FS doesn't support it natively (anything not
>>> btrfs) I think we should just do a sync read and then write for
>>> simplicity.
>>>
>>> A few other thoughts:
>>>
>>> - For a fast kvdb, we may want to do the transaction commit synchronously.
>>>   For disk backends I think we'll want it async, though, to avoid blocking
>>>   the caller.
>>>
>>> - The fid_t has an inode number stashed in it. The idea is to use
>>>   open_by_handle to avoid traversing the (shallow) directory and go
>>>   straight to the inode. On XFS this means we traverse the inode btree to
>>>   verify it is in fact a valid ino, which isn't totally ideal but probably
>>>   what we have to live with. Note that open_by_handle will work on any
>>>   other (NFS-exportable) filesystem as well so this is in no way
>>>   XFS-specific. This isn't implemented yet, but when we do, we'll probably
>>>   want to verify we got the right file by putting some id in an xattr;
>>>   that way you could safely copy the whole thing to another filesystem
>>>   and it could gracefully fall back to opening using the file names.
>>>
>>> - I think we could build a variation on this implementation on top of an
>>>   NVMe device instead of a file system. It could pretty trivially lay out
>>>   writes in the address space as a linear sweep across the virtual address
>>>   space. If the NVMe address space is big enough, maybe we could even
>>>   avoid thinking about reusing addresses for deleted objects? We'd just
>>>   send a discard and then forget about it. Not sure if the address space
>>>   is really that big, though... If not, we'd need to make a simple
>>>   allocator (blah).
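
The "linear sweep" idea in the last bullet can be sketched as a bump allocator that only moves forward and never reuses a range. This is just an illustration under my own assumptions: LinearSweepAllocator and its methods are hypothetical names, and release() merely counts bytes where a real backend would presumably send a discard to the device.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bump ("linear sweep") allocator over a flat address space.
// Offsets are handed out in increasing order and are never reused.
class LinearSweepAllocator {
public:
  explicit LinearSweepAllocator(uint64_t capacity)
    : capacity_(capacity), next_(0), discarded_(0) {}

  // Reserve the next `len` bytes. On success, stores the start offset in
  // *out and returns true. Returns false (leaving *out untouched) once
  // the address space is exhausted.
  bool allocate(uint64_t len, uint64_t* out) {
    assert(len > 0 && out != nullptr);
    if (len > capacity_ - next_)
      return false;
    *out = next_;
    next_ += len;
    return true;
  }

  // Deleting an object only records how much was discarded; the range
  // is forgotten rather than returned to a free list.
  void release(uint64_t /*offset*/, uint64_t len) { discarded_ += len; }

  uint64_t discarded() const { return discarded_; }
  uint64_t remaining() const { return capacity_ - next_; }

private:
  uint64_t capacity_;   // total size of the address space
  uint64_t next_;       // next unallocated offset
  uint64_t discarded_;  // bytes released so far (never reused)
};
```

The sketch also shows the catch Sage mentions: after release() the remaining space does not grow, so if the address space is not effectively unbounded, something smarter than this is needed.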
>>>
>>> sage
>>>
>>>
>>> * This follows in the Messenger's naming footsteps, which went like this:
>>> MPIMessenger, NewMessenger, NewerMessenger, SimpleMessenger (which ended
>>> up being anything but simple).
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
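
For reference, the three write cases from Sage's mail (new object, append, anything else) can be sketched as a small classifier. This is a guess at the shape of the decision, not the logic at the linked NewStore.cc line: WriteKind, classify_write and applied_after_commit are hypothetical names, and I am assuming a write counts as an append only when it starts exactly at the committed size.

```cpp
#include <cassert>
#include <cstdint>

// The three cases the write path handles today, per the quoted mail.
enum class WriteKind { NewObject, Append, Wal };

// Hypothetical classifier. `exists` says whether the object already has
// an onode; `onode_size` is its committed size; `offset` is where the
// incoming write starts.
inline WriteKind classify_write(bool exists, uint64_t onode_size,
                                uint64_t offset) {
  assert(exists || onode_size == 0);  // no onode means no committed size
  if (!exists)
    return WriteKind::NewObject;  // case 1: create a new file and write there
  if (offset == onode_size)
    return WriteKind::Append;     // case 2: extend the existing fid
  return WriteKind::Wal;          // case 3: anything else gets a WAL item
}

// Only WAL items are applied to the file after the kv transaction
// commits; the other two cases write file data before the commit.
inline bool applied_after_commit(WriteKind k) {
  return k == WriteKind::Wal;
}
```

Under these assumptions a write at offset 4096 to a 4096-byte object is an append, while a write at offset 0 or at offset 8192 (leaving a gap) falls through to the WAL case.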