From: Kent Overstreet <kent.overstreet@gmail.com>
To: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org
Subject: [LSF/MM TOPIC] bcachefs - status update, upstreaming (!?)
Date: Wed, 7 Feb 2018 05:26:22 -0500
Message-ID: <20180207102622.GA13600@moria.home.lan>

Hi, I'd like to talk a bit about what I've sunk the past few years of my life
into :)
For those who haven't heard, bcachefs started out as an extended version of
bcache, and eventually grew into a full posix filesystem. It's a long weird
story.
Today it's a real, working filesystem with a small community of users and
testers, and the main focus has been on making it production quality and rock
solid - it's not a research project or a toy, it's meant to be used.
What's done:
- pretty much all the normal posix fs functionality - xattrs, acls, fallocate,
quotas.
- fsck
- full data checksumming
- compression
- encryption
- multiple devices (raid1 is done, minus exposing a way to re-replicate
degraded data after a device failure)
- caching (right now only writeback caching is exposed; a new, more flexible
interface is being worked on for caching and other allocation policy stuff)
What's _not_ done:
- persistent allocation information; we still have to walk all our metadata on
every mount to see what disk space is in use (and for a few other relatively
minor reasons).
This is less of an issue than you'd think: bcachefs walks metadata _really_
fast, fast enough that nobody's complaining (even on multi-terabyte
filesystems; erasure coding is the most asked-for feature, "faster mounts"
never comes up). But of the remaining features to implement/things to deal
with, this is going to be one of the most complex.
One of the upsides though - because I've had to make walking metadata as fast
as possible, bcachefs fsck is also really, really fast (it's run by default
on every mount).
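To make that concrete, here's a toy illustration of the idea (this is not the
actual bcachefs code and every name in it is made up; the real walk iterates
the btrees themselves, but the principle is the same - mark every bucket
referenced by live data/metadata as allocated):

  /*
   * Toy sketch only - not the real bcachefs code; all names are made up.
   * With no persistent allocation info, mount walks every extent in the
   * metadata and marks each bucket it references as allocated in an
   * in-memory table.
   */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define NR_BUCKETS 8

  struct extent {
          uint64_t bucket;        /* bucket holding this extent's data */
  };

  int main(void)
  {
          /* pretend these came from walking the extents btree: */
          const struct extent extents[] = { {1}, {3}, {3}, {6} };
          uint8_t bucket_used[NR_BUCKETS];
          size_t i;

          memset(bucket_used, 0, sizeof(bucket_used));

          /* every bucket referenced by a live extent is in use */
          for (i = 0; i < sizeof(extents) / sizeof(extents[0]); i++)
                  bucket_used[extents[i].bucket] = 1;

          for (i = 0; i < NR_BUCKETS; i++)
                  printf("bucket %zu: %s\n", i,
                         bucket_used[i] ? "allocated" : "free");
          return 0;
  }

The expensive part in real life is of course the walk itself, which is why
making it fast has mattered so much.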
Planned features:
- erasure coding (i.e. raid5/6)
- snapshots
I also want to come up with a plan for eventually upstreaming this damned thing :)
One of the reasons I haven't even talked about upstreaming before is that I
_really_ haven't wanted to freeze the on disk format before I was ready. This
is still a concern w.r.t. persistent allocation information and snapshots, but
overall there have been fewer and fewer reasons for on disk format changes;
things seem to be naturally stabilizing.
And I know there are going to be plenty of other people at LSF with recent
experience upstreaming new filesystems; right now I don't have any strong
ideas of my own and would welcome any input :)
Not sure what else I should talk about; I've been quiet for _way_ too long. I'd
welcome any questions or suggestions.
One other cool thing I've been doing lately is I finally rigged up some pure
btree performance/torture tests: I am _exceedingly_ proud of bcachefs's btree
(bcache's btree code is at best a prototype or a toy compared to bcachefs's).
The numbers are, I think, well worth showing off; I'd be curious if anyone knows
how other competing btree implementations (xfs's?) do in comparison:
These benchmarks are with 64 bit keys and 64 bit values: sequentially create,
iterate over, and delete 100M keys:
seq_insert: 100M with 1 threads in 104 sec, 998 nsec per iter, 978k per sec
seq_lookup: 100M with 1 threads in 1 sec, 10 nsec per iter, 90.8M per sec
seq_delete: 100M with 1 threads in 41 sec, 392 nsec per iter, 2.4M per sec
create 100M keys at random (64 bit random ints for the keys)
rand_insert: 100M with 1 threads in 227 sec, 2166 nsec per iter, 450k per sec
rand_insert: 100M with 6 threads in 106 sec, 6086 nsec per iter, 962k per sec
random lookups, over the 100M random keys we just created:
rand_lookup: 10M with 1 threads in 10 sec, 995 nsec per iter, 981k per sec
rand_lookup: 10M with 6 threads in 2 sec, 1223 nsec per iter, 4.6M per sec
mixed lookup/update: 75% lookup, 25% update:
rand_mixed: 10M with 1 threads in 16 sec, 1615 nsec per iter, 604k per sec
rand_mixed: 10M with 6 threads in 8 sec, 4614 nsec per iter, 1.2M per sec
This is on my ancient i7 gulftown, using a micron p320h (it's not a purely
in-memory test, we're actually writing out those random inserts!). Numbers
are slightly better on my haswell :)
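For anyone who wants to run the same comparison, the harness is nothing
fancy. Below is a standalone sketch of that kind of loop (made-up names, not
the actual bcachefs test code, with a stand-in workload where the real tests
do btree inserts/lookups); judging from the numbers above, "nsec per iter" is
per-thread time, i.e. elapsed * threads / nr_ops:

  /*
   * Microbenchmark harness sketch - made-up names, not the actual
   * bcachefs test code.  Times nr operations and reports them in the
   * same format as the results above.
   */
  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  static uint64_t now_ns(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
  }

  static void report(const char *test, uint64_t nr, unsigned threads,
                     uint64_t elapsed_ns)
  {
          printf("%s: %" PRIu64 "M with %u threads in %" PRIu64 " sec, "
                 "%" PRIu64 " nsec per iter, %" PRIu64 "k per sec\n",
                 test, nr / 1000000, threads,
                 elapsed_ns / 1000000000ULL,
                 elapsed_ns * threads / nr,      /* per-thread ns/op */
                 nr * 1000000ULL / elapsed_ns);  /* kops/sec overall */
  }

  int main(void)
  {
          uint64_t i, nr = 10000000, start = now_ns();
          volatile uint64_t sink = 0;

          /* stand-in workload; the real tests do btree ops here */
          for (i = 0; i < nr; i++)
                  sink += i;
          (void)sink;

          report("stand_in", nr, 1, now_ns() - start);
          return 0;
  }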