From: Kent Overstreet <kent.overstreet@gmail.com>
To: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org
Subject: [LSF/MM TOPIC] bcachefs - status update, upstreaming (!?)
Date: Wed, 7 Feb 2018 05:26:22 -0500
Message-ID: <20180207102622.GA13600@moria.home.lan>

Hi, I'd like to talk a bit about what I've sunk the past few years of my life
into :)

For those who haven't heard, bcachefs started out as an extended version of
bcache, and eventually grew into a full posix filesystem. It's a long weird
story.

Today, it's a real filesystem with a small community of users and testers, and
the main focus has been on making it production quality and rock solid - it's
not a research project or a toy, it's meant to be used.

What's done:
 - pretty much all the normal posix fs functionality - xattrs, acls, fallocate,
   quotas (a quick userspace smoke test of a couple of these is sketched after
   this list).
 - fsck
 - full data checksumming
 - compression
 - encryption
 - multiple devices (raid1 is done minus exposing a way to re-replicate
   degraded data after device failure)
 - caching (right now only writeback caching is exposed; a new more flexible
   interface is being worked on for caching and other allocation policy stuff)
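
Since the posix bits above are all reachable through the usual syscalls, here's
a tiny userspace smoke test of fallocate and xattrs - nothing bcachefs-specific,
just plain Linux syscalls, and the path is only an example:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "./testfile";
	int fd = open(path, O_CREAT|O_RDWR, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Preallocate 16MB without writing any data: */
	if (fallocate(fd, 0, 0, 16 << 20))
		perror("fallocate");

	/* Set a user xattr and read it back: */
	if (fsetxattr(fd, "user.demo", "hello", 5, 0))
		perror("fsetxattr");

	char buf[64];
	ssize_t r = fgetxattr(fd, "user.demo", buf, sizeof(buf));
	if (r > 0)
		printf("user.demo = %.*s\n", (int) r, buf);

	close(fd);
	return 0;
}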

What's _not_ done:
 - persistent allocation information; we still have to walk all our metadata on
   every mount to see what disk space is in use (and for a few other relatively
   minor reasons). There's a toy sketch of what that walk amounts to at the end
   of this item.

   This is less of an issue than you'd think: bcachefs walks metadata _really_
   fast, fast enough that nobody's complaining (even on multi-terabyte
   filesystems; erasure coding is the most asked-for feature, "faster mounts"
   never comes up). But of the remaining features to implement/things to deal
   with, this is going to be one of the most complex.

   One of the upsides, though: because I've had to make walking metadata as fast
   as possible, bcachefs fsck is also really, really fast (it's run by default
   on every mount).
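
For the curious, here's the toy sketch mentioned above of what that mount-time
walk amounts to - purely illustrative, with made-up types and constants rather
than actual bcachefs code: visit every extent key, credit the bucket it lives in
with the live sectors it covers, and whatever ends up at zero is free space.

#include <stdint.h>
#include <stdio.h>

#define BUCKET_SIZE	1024	/* sectors per bucket - made-up number */
#define NR_BUCKETS	8

struct extent_key {		/* hypothetical, heavily simplified extent */
	uint64_t offset;	/* start sector on disk */
	uint32_t size;		/* length in sectors (kept within one bucket here) */
};

int main(void)
{
	/* Stand-in for the extent btree we'd actually be walking: */
	struct extent_key keys[] = {
		{ .offset = 0,    .size = 512  },
		{ .offset = 1024, .size = 1024 },
		{ .offset = 3000, .size = 64   },
	};
	uint32_t sectors_used[NR_BUCKETS] = { 0 };

	/* The "walk": every extent marks its bucket as holding live data. */
	for (unsigned i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
		sectors_used[keys[i].offset / BUCKET_SIZE] += keys[i].size;

	for (unsigned b = 0; b < NR_BUCKETS; b++)
		printf("bucket %u: %4u sectors used%s\n", b,
		       (unsigned) sectors_used[b],
		       sectors_used[b] ? "" : " (free)");
	return 0;
}

Persistent allocation info would mean writing those per-bucket counts out, so a
clean mount could read them back instead of redoing the walk.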

Planned features:
 - erasure coding (i.e. raid5/6)
 - snapshots

I also want to come up with a plan for eventually upstreaming this damned thing :)

One of the reasons I haven't even talked about upstreaming before is that I
_really_ haven't wanted to freeze the on disk format before I was ready. This is
still a concern w.r.t. persistent allocation information and snapshots, but
overall there have been fewer and fewer reasons for on disk format changes;
things seem to be naturally stabilizing.

And I know there are going to be plenty of other people at LSF with recent
experience upstreaming new filesystems; right now I don't have any strong ideas
of my own and would welcome any input :)

Not sure what else I should talk about; I've been quiet for _way_ too long. I'd
welcome any questions or suggestions.

One other cool thing I've been doing lately is that I finally rigged up some pure
btree performance/torture tests: I am _exceedingly_ proud of bcachefs's btree
(bcache's btree code is at best a prototype or a toy compared to bcachefs's).
The numbers are, I think, well worth showing off; I'd be curious if anyone knows
how other competing btree implementations (xfs's?) do in comparison:

These benchmarks are with 64 bit keys and 64 bit values: sequentially create,
iterate over, and delete 100M keys:

seq_insert:  100M with 1 threads in   104 sec,   998 nsec per iter,  978k per sec
seq_lookup:  100M with 1 threads in     1 sec,    10 nsec per iter, 90.8M per sec
seq_delete:  100M with 1 threads in    41 sec,   392 nsec per iter,  2.4M per sec

create 100M keys at random (64 bit random ints for the keys)

rand_insert: 100M with 1 threads in   227 sec,  2166 nsec per iter,  450k per sec
rand_insert: 100M with 6 threads in   106 sec,  6086 nsec per iter,  962k per sec

random lookups, over the 100M random keys we just created:

rand_lookup: 10M  with 1 threads in    10 sec,   995 nsec per iter,  981k per sec
rand_lookup: 10M  with 6 threads in     2 sec,  1223 nsec per iter,  4.6M per sec

mixed lookup/update: 75% lookup, 25% update:

rand_mixed:  10M  with 1 threads in    16 sec,  1615 nsec per iter,  604k per sec
rand_mixed:  10M  with 6 threads in     8 sec,  4614 nsec per iter,  1.2M per sec

This is on my ancient i7 gulftown, using a micron p320h (it's not a pure
in-memory test, we're actually writing out those random inserts!). Numbers are
slightly better on my haswell :)
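
For context, the harness behind those lines is roughly this shape - a simplified
sketch, not the actual test code; btree_insert() here is just a no-op stand-in
for the btree under test, and only the timing and the "nsec per iter / per sec"
reporting mirror the output above:

#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* No-op stand-in for the real insert of a 64 bit key/64 bit value: */
static void btree_insert(uint64_t key, uint64_t val)
{
	(void) key;
	(void) val;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	const uint64_t nr = 100 * 1000000ULL;	/* 100M keys */
	uint64_t start = now_ns();

	for (uint64_t i = 0; i < nr; i++)
		btree_insert(i, i);		/* sequential 64 bit keys */

	uint64_t elapsed = now_ns() - start;
	if (!elapsed)
		elapsed = 1;

	printf("seq_insert: %lluM with 1 threads in %5llu sec, %5llu nsec per iter, %lluk per sec\n",
	       (unsigned long long) (nr / 1000000),
	       (unsigned long long) (elapsed / 1000000000ULL),
	       (unsigned long long) (elapsed / nr),
	       (unsigned long long) (nr * 1000000ULL / elapsed));
	return 0;
}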
