From: Ravishankar N <ravishankar@redhat.com>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
BTRFS ML <linux-btrfs@vger.kernel.org>,
"gluster-users@gluster.org List" <Gluster-users@gluster.org>
Subject: Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
Date: Wed, 12 Apr 2017 07:13:41 +0530
Message-ID: <f21a5903-19a6-d6ef-ad1e-6f6e5900b6ff@redhat.com>
In-Reply-To: <7e63733a-6ab5-d92d-f9b2-f129ebd81f36@gmail.com>
Adding gluster-users list. I think there are a few users out there
running gluster on top of btrfs, so this might benefit a broader audience.
On 04/11/2017 09:10 PM, Austin S. Hemmelgarn wrote:
> About a year ago now, I decided to set up a small storage cluster to
> store backups (and partially replace Dropbox for my usage, but that's
> a separate story). I ended up using GlusterFS as the clustering
> software itself, and BTRFS as the back-end storage.
>
> GlusterFS itself is actually a pretty easy workload as far as cluster
> software goes. It does some processing prior to actually storing the
> data (a significant amount in fact), but the actual on-device storage
> on any given node is pretty simple. You have the full directory
> structure for the whole volume, and whatever files happen to be on
> that node are located within that tree exactly like they are in the
> GlusterFS volume. Beyond the basic data, gluster only stores 2-4
> xattrs per file (which are used to track synchronization, and also for
> its internal data scrubbing), plus a directory called .glusterfs at
> the top of the back-end storage location for the volume, which contains
> the data required to figure out which node a file is on. Overall, the
> access patterns mostly mirror whatever is using the Gluster volume, or
> are reduced to slow streaming writes (when writing files and the
> back-end nodes are computationally limited instead of I/O limited),
> with the addition of some serious metadata operations in the
> .glusterfs directory (lots of stat calls there, together with large
> numbers of small files).
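>
> As a rough illustration, here's how you can see that per-file
> metadata straight from a brick (the brick path is a placeholder,
> and the exact xattr names depend on the volume configuration):
>
>     #!/usr/bin/env python3
>     # List the GlusterFS-related xattrs on one file in a brick.
>     # /bricks/vol0/some/file is a hypothetical path; the trusted.*
>     # namespace is only readable as root.
>     import os
>
>     path = "/bricks/vol0/some/file"
>     for name in os.listxattr(path, follow_symlinks=False):
>         if name.startswith("trusted."):
>             value = os.getxattr(path, name, follow_symlinks=False)
>             print(f"{name} = {value.hex()}")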
>
> As far as overall performance goes, BTRFS is on par with both ext4 and
> XFS for this usage (at least, it is on my hardware), and I actually
> see more SSD-friendly access patterns when using BTRFS in this case
> than with any other FS I tried.
>
> After some serious experimentation with various configurations for
> this during the past few months, I've noticed a handful of other things:
>
> 1. The 'ssd' mount option does not actually improve performance on
> these SSDs. To a certain extent this surprised me at first, but
> having seen Hans' e-mail and what he found about this option, it
> makes sense: the erase blocks on these devices are 4MB, not 2MB, and
> the drives have a very good FTL (so they will aggregate all the
> little writes properly).
>
> Given this, I'm beginning to wonder if it actually makes sense not to
> automatically enable this on mount when dealing with certain types of
> storage (for example, most SATA and SAS SSDs have reasonably good
> FTLs, so I would expect them to behave similarly). Extrapolating
> further, it might instead make sense to never automatically enable
> this, and to expose the value this option manipulates as a tunable
> mount option, since there are other circumstances where setting
> specific values could improve performance (for example, if you're on
> hardware RAID6, setting it to the stripe size would probably improve
> performance on many cheaper controllers).
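>
> As far as I know the auto-detection just keys off the block device's
> rotational flag in sysfs, and 'nossd' already lets you override it;
> a quick sketch of checking the flag (the device name is a
> placeholder):
>
>     #!/usr/bin/env python3
>     # Check the flag btrfs uses to auto-detect SSDs: the block
>     # device's rotational attribute in sysfs. "sda" is just an
>     # example device name.
>     dev = "sda"
>     with open(f"/sys/block/{dev}/queue/rotational") as f:
>         rotational = f.read().strip() == "1"
>     print(dev, "rotational" if rotational else "non-rotational")
>     # Non-rotational devices currently get 'ssd' enabled
>     # automatically at mount time; 'nossd' overrides that.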
>
> 2. Up to a certain point, running a single larger BTRFS volume with
> multiple sub-volumes is more computationally efficient than running
> multiple smaller BTRFS volumes. More specifically, there is lower
> load on the system and lower CPU utilization by BTRFS itself without
> much noticeable difference in performance (in my tests it was about
> 0.5-1% performance difference, YMMV). To a certain extent this makes
> some sense, but the turnover point was actually a lot higher than I
> expected (with this workload, the turnover point was around half a
> terabyte).
>
> I believe this to be a side-effect of how we use per-filesystem
> worker pools. In essence, we can schedule parallel access better
> when it all goes through the same worker pool than we can when using
> multiple worker pools. Having realized this, I think it might be
> interesting to see whether using a worker pool per physical device
> (or at least what the system sees as a physical device) would make
> more sense in terms of performance than our current method of using
> a pool per filesystem.
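>
> For what it's worth, the comparison here is one filesystem carved
> into a subvolume per brick versus a separate filesystem per brick;
> a minimal sketch of the single-filesystem layout (the mount point
> and brick names are placeholders):
>
>     #!/usr/bin/env python3
>     # One btrfs filesystem with one subvolume per Gluster brick,
>     # instead of a separate filesystem per brick. The mount point
>     # and brick names are placeholders.
>     import subprocess
>
>     MOUNT_POINT = "/data/btrfs"
>     BRICKS = ["brick1", "brick2", "brick3"]
>
>     for brick in BRICKS:
>         subprocess.run(
>             ["btrfs", "subvolume", "create",
>              f"{MOUNT_POINT}/{brick}"],
>             check=True,
>         )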
>
> 3. On these SSDs, running a single partition in dup mode is actually
> marginally more efficient than running 2 partitions in raid1 mode.
> This somewhat surprised me, and I haven't been able to find a clear
> explanation as to why (I suspect caching may have something to do
> with it, but I'm not 100% certain of that). Some limited testing
> with other SSDs suggests it's the case for most SSDs, with the
> difference shrinking on smaller and faster devices. On a traditional
> hard disk, dup mode is significantly more efficient than the
> two-partition raid1 setup, but that's generally to be expected.
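>
> For reference, the two layouts being compared boil down to roughly
> these mkfs invocations (the device names are placeholders):
>
>     #!/usr/bin/env python3
>     # The two layouts being compared; device names are placeholders.
>     # (a) one partition, data and metadata both in dup mode:
>     dup_layout = ["mkfs.btrfs", "-d", "dup", "-m", "dup",
>                   "/dev/sdX1"]
>     # (b) two partitions on the same device, data and metadata raid1:
>     raid1_layout = ["mkfs.btrfs", "-d", "raid1", "-m", "raid1",
>                     "/dev/sdX1", "/dev/sdX2"]
>     print(" ".join(dup_layout))
>     print(" ".join(raid1_layout))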
>
> 4. Depending on other factors, compression can actually slow you down
> pretty significantly. In the particular case I saw this happen (all
> cores completely utilized by userspace software), LZO compression
> actually caused around 5-10% performance degradation compared to no
> compression. This is somewhat obvious once it's explained, but it's
> not exactly intuitive, so it's probably worth documenting in the man
> pages that compression won't always make things better. I may send a
> patch to add this at some point in the near future.
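>
> The CPU cost is easy to see in miniature even outside the
> filesystem; here's a rough sketch using zlib from the Python
> standard library as a stand-in for LZO (the buffer contents and
> sizes are arbitrary):
>
>     #!/usr/bin/env python3
>     # Rough illustration of the CPU cost of compressing data before
>     # it gets written out, using zlib as a stand-in for LZO.
>     import os
>     import time
>     import zlib
>
>     # Half random (incompressible), half zeros (very compressible).
>     data = os.urandom(4 * 1024 * 1024) + b"\0" * (4 * 1024 * 1024)
>     start = time.perf_counter()
>     for _ in range(50):
>         zlib.compress(data, 1)  # level 1, the fastest setting
>     elapsed = time.perf_counter() - start
>     print(f"compressed 50 x 8 MiB in {elapsed:.2f}s")
>     # When every core is already busy with the real workload, that
>     # extra CPU time comes straight out of write throughput.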