* Understanding metadata efficiency of btrfs
@ 2012-03-06 2:16 Kai Ren
2012-03-06 2:32 ` Kai Ren
2012-03-06 5:30 ` Duncan
0 siblings, 2 replies; 5+ messages in thread
From: Kai Ren @ 2012-03-06 2:16 UTC (permalink / raw)
To: linux-btrfs
I've run a small benchmark comparing Btrfs v0.19 and XFS:
There are 2000 directories and each directory contains 1000 files.
The workload randomly stats or chmods a file, 2,000,000 operations in total,
with stat and chmod each accounting for 50%.
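A minimal sketch of how a workload like this might be driven (hypothetical
script, not the actual benchmark tool; the mount point and naming pattern
are placeholders):

    # Hypothetical driver for the workload described above: 2000 directories
    # of 1000 empty files, then 2,000,000 operations, each a stat or a chmod
    # (50/50) on a randomly chosen file.
    import os
    import random

    ROOT = "/mnt/test"                  # placeholder mount point
    NDIRS, NFILES, NOPS = 2000, 1000, 2000000

    def setup():
        for d in range(NDIRS):
            dpath = os.path.join(ROOT, "dir%04d" % d)
            os.makedirs(dpath)
            for f in range(NFILES):
                open(os.path.join(dpath, "file%04d" % f), "w").close()  # zero-byte file

    def run():
        for _ in range(NOPS):
            path = os.path.join(ROOT, "dir%04d" % random.randrange(NDIRS),
                                "file%04d" % random.randrange(NFILES))
            if random.random() < 0.5:
                os.stat(path)
            else:
                os.chmod(path, 0o644)

    if __name__ == "__main__":
        setup()
        run()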
I monitored the disk request and sector counters:
        #Write Requests  #Read Requests  #Write Sectors  #Read Sectors
Btrfs           2403520         1571183        29249216       13512248
XFS              625493          396080        10302718        4932800
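(One way to collect counters like these is from the block device's line in
/proc/diskstats, snapshotted before and after the run; "sdb" below is a
placeholder for whatever device holds the test filesystem, and this is only
a sketch, not necessarily how the numbers above were gathered.)

    # Read the per-device I/O counters from /proc/diskstats.  After the
    # device name, fields 1/3/5/7 are reads completed, sectors read,
    # writes completed and sectors written.
    def diskstats(device="sdb"):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    return {"read_requests":   int(fields[3]),
                            "sectors_read":    int(fields[5]),
                            "write_requests":  int(fields[7]),
                            "sectors_written": int(fields[9])}
        raise ValueError("device not found: %s" % device)

    before = diskstats()
    # ... run the workload ...
    after = diskstats()
    print({k: after[k] - before[k] for k in before})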
I found that the number of write requests from Btrfs is significantly larger
than from XFS.
I am not familiar with how btrfs commits metadata changes to disk.
From the website, it is said that btrfs uses a COW B-tree that never
overwrites previous disk pages.
I assume that Btrfs also keeps an in-memory buffer for metadata changes.
But it is unclear to me how often Btrfs commits these changes and what the
underlying mechanism is.
Could anyone please comment on the experiment results and give a brief explanation of Btrfs's metadata committing mechanism?
Sincerely,
Kai Ren
* Re: Understanding metadata efficiency of btrfs
2012-03-06 2:16 Understanding metadata efficiency of btrfs Kai Ren
@ 2012-03-06 2:32 ` Kai Ren
2012-03-06 5:30 ` Duncan
1 sibling, 0 replies; 5+ messages in thread
From: Kai Ren @ 2012-03-06 2:32 UTC (permalink / raw)
To: linux-btrfs
Forgot to mention:
Each file is zero bytes.
And the available memory is limited to 512 MB.
Best,
Kai
On Mar 5, 2012, at 9:16 PM, Kai Ren wrote:
> I've run a small benchmark comparing Btrfs v0.19 and XFS:
>
> There are 2000 directories and each directory contains 1000 files.
> The workload randomly stats or chmods a file, 2,000,000 operations in total,
> with stat and chmod each accounting for 50%.
>
> I monitored the disk request and sector counters:
>
>         #Write Requests  #Read Requests  #Write Sectors  #Read Sectors
> Btrfs           2403520         1571183        29249216       13512248
> XFS              625493          396080        10302718        4932800
>
> I found that the number of write requests from Btrfs is significantly larger
> than from XFS.
> I am not familiar with how btrfs commits metadata changes to disk.
> From the website, it is said that btrfs uses a COW B-tree that never
> overwrites previous disk pages.
> I assume that Btrfs also keeps an in-memory buffer for metadata changes.
> But it is unclear to me how often Btrfs commits these changes and what the
> underlying mechanism is.
>
> Could anyone please comment on the experiment results and give a brief explanation of Btrfs's metadata committing mechanism?
>
> Sincerely,
>
> Kai Ren
* Re: Understanding metadata efficiency of btrfs
2012-03-06 2:16 Understanding metadata efficiency of btrfs Kai Ren
2012-03-06 2:32 ` Kai Ren
@ 2012-03-06 5:30 ` Duncan
2012-03-06 11:29 ` Hugo Mills
1 sibling, 1 reply; 5+ messages in thread
From: Duncan @ 2012-03-06 5:30 UTC (permalink / raw)
To: linux-btrfs
Kai Ren posted on Mon, 05 Mar 2012 21:16:34 -0500 as excerpted:
> I've run a small benchmark comparing Btrfs v0.19 and XFS:
[snip description of test]
>
> I monitored the disk request and sector counters:
>
> #WriteRq #ReadRq #WriteSect #ReadSect
> Btrfs 2403520 1571183 29249216 13512248
> XFS 625493 396080 10302718 4932800
>
> I found that the number of write requests from Btrfs is significantly
> larger than from XFS.
> I am not familiar with how btrfs commits metadata changes to disk.
> From the website, it is said that btrfs uses a COW B-tree that never
> overwrites previous disk pages. I assume that Btrfs also keeps an
> in-memory buffer for metadata changes. But it is unclear to me how often
> Btrfs commits these changes and what the underlying mechanism is.
>
> Could anyone please comment on the experiment results and give a brief
> explanation of Btrfs's metadata committing mechanism?
First...
You mentioned "the web site", but didn't specify which one. FWIW, the
kernel.org breakin of some months ago threw a monkey wrench in a lot of
things, one of them being the btrfs wiki. The official
btrfs.wiki.kernel.org site is currently a static copy of the wiki from
before the breakin, so while the general btrfs ideas there haven't changed
since then, anything about current status is now rather stale.
But there's a "temporary" (that could end up being permanent, it's been
months...) btrfs wiki that's MUCH more current, at:
http://btrfs.ipv5.de/index.php?title=Main_Page
So before going further, catch up with things on the current
(temporary?) wiki. From your post, I'd suggest you read up a bit more
than you have, because you didn't mention at all the most important
metadata differences between the two filesystems. I'm not deep enough
into filesystem internals to know whether these factors explain the whole
difference above; in fact, the wiki's where I got most of my btrfs-specific
info myself, but they certainly explain a good portion of it!
The #1 biggest difference between btrfs and most other filesystems is
that btrfs, by default, duplicates all metadata -- two copies of all
metadata, one copy of data, by default. On a single disk/partition,
that's called DUP mode, else it's referred to (not entirely correctly) as
raid1 or raid10 mode depending on layout. (The not entirely correctly
bit is because a true raid1 will have as many copies as there are active
disks, while btrfs presently only does two-way mirroring. As such, with
three plus disks, it's not proper raid1, only two-way-mirroring. 3-way
and possibly N-way mirroring is on the roadmap for after raid5/6 support,
which is roadmapped for kernels 3.4 or 3.5, so multi-way-mirroring is
presumably 3.5 or 3.6.)
It IS possible to set up single-copy metadata (SINGLE mode), or to
mirror data as well, but by default, btrfs keeps two copies of metadata
and only one of data.
So that doubles the btrfs metadata writes, right there, since by default,
btrfs double-copies all metadata.
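(The metadata profile is chosen at mkfs time. A hedged sketch of creating a
test filesystem without the duplication, for comparison -- this assumes
mkfs.btrfs and its -m metadata-profile option accept these values in the
btrfs-progs version at hand, and /dev/sdX is a placeholder scratch device:)

    # Sketch: format a scratch device with single-copy metadata to see how
    # much of the extra write traffic comes from the duplication alone.
    # WARNING: destroys the contents of DEV.
    import subprocess

    DEV = "/dev/sdX"   # placeholder device

    # Default single-disk layout is metadata DUP (two copies), data single:
    #   subprocess.check_call(["mkfs.btrfs", DEV])
    # Single-copy metadata instead:
    subprocess.check_call(["mkfs.btrfs", "-m", "single", DEV])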
The #2 big factor is that btrfs (again, by default, but this is a major
feature of btrfs, otherwise, you might as well run something else) does
full checksumming for both data and metadata. Unlike most filesystems,
if cosmic rays or whatever start flipping bits on your data, btrfs will
catch that, and if possible, retrieve a correct copy from elsewhere.
This is actually one of the reasons for dual-copy metadata... and data
too, if you configure btrfs for it -- if one copy is bad (fails the
checksum validation) and another copy exists, btrfs will use that copy
instead.
And of course all these checksums must be written somewhere as well, so
that's another huge increase in written metadata, even for 0-length
files, since the metadata itself is checksummed!
And the checksumming goes some way toward explaining all those extra
reads as well, as any sysadmin who has run raid5/6 against raid1 can
tell you: in order to write out the new checksums, unchanged (meta)data
must be read in, and on btrfs, the existing checksums are read in and
verified as well, to make sure the existing version is valid, before
the change is made and written back out.
As I said, I don't know if this explains /all/ the difference that you're
seeing, but it should be quite plain that the btrfs double-metadata and
integrity checking is going to be MULTIPLE TIMES more work and I/O than
what more traditional filesystems such as the xfs you're comparing
against must do.
That's all covered in the wiki, actually, both of them, since those are
btrfs basics that haven't changed (except the multi-way-mirroring roadmap)
in some time. That they're such big factors and that you didn't mention
them at all indicates to me that you've quite some reading to do about
btrfs, since they're so very basic to what makes it what it is.
Otherwise, you might as well just be using some other filesystem instead,
especially since btrfs is still quite experimental, while there's many
more traditional filesystems out there that are fully production ready.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Understanding metadata efficiency of btrfs
2012-03-06 5:30 ` Duncan
@ 2012-03-06 11:29 ` Hugo Mills
2012-03-06 21:25 ` Duncan
0 siblings, 1 reply; 5+ messages in thread
From: Hugo Mills @ 2012-03-06 11:29 UTC (permalink / raw)
To: Duncan, Kai Ren; +Cc: linux-btrfs
On Tue, Mar 06, 2012 at 05:30:23AM +0000, Duncan wrote:
> Kai Ren posted on Mon, 05 Mar 2012 21:16:34 -0500 as excerpted:
>
> > I've run a small benchmark comparing Btrfs v0.19 and XFS:
>
> [snip description of test]
> >
> > I monitored the disk request and sector counters:
> >
> > #WriteRq #ReadRq #WriteSect #ReadSect
> > Btrfs 2403520 1571183 29249216 13512248
> > XFS 625493 396080 10302718 4932800
> >
> > I found that the number of write requests from Btrfs is significantly
> > larger than from XFS.
>
> > I am not familiar with how btrfs commits metadata changes to disk.
> > From the website, it is said that btrfs uses a COW B-tree that never
> > overwrites previous disk pages. I assume that Btrfs also keeps an
> > in-memory buffer for metadata changes. But it is unclear to me how
> > often Btrfs commits these changes and what the underlying mechanism is.
By default, btrfs will commit a transaction every 30 seconds.
(Some of this is probably playing a bit fast and loose with
terminology such as "block cache". I'm sure if I've made any major
errors, I'll be corrected.)
The "in-memory buffer" is simply the standard Linux block layer and
FS cache: When a piece of metadata is searched for, btrfs walks down
the relevant tree, loading each tree node (a 4k page) in turn, until
it finds the metadata. Unless there is a huge amount of memory
pressure, Linux's block cache will hang on those blocks in RAM.
btrfs can then modify those blocks as much as it likes, in RAM, as
userspace tools request those changes to be made (e.g. writes,
deletes, etc). By the CoW nature of the FS, modifying a metadata block
will also require modification of the block above it in the tree, and
so on up to the top of the tree. If it's all kept in RAM, this is a
fast operation, since the trees aren't usually very deep(*).
At regular intervals (30s), the btrfs code will ensure that it has
a consistent in-memory set of blocks, and flushes those dirty blocks
to disk, ensuring that they're moved from the original location. It
does so by first writing all of the tree data, sending down disk flush
commands to ensure that the data gets to disk reliably, and then
writing out new copies of the superblocks so that they point to the
new trees.
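(A toy model of that path copying and commit -- purely illustrative; the
dict-based "disk", block numbering and node layout here are invented for
the sketch and bear no relation to the real on-disk format:)

    # Copy-on-write in miniature: modifying a leaf writes a new copy of the
    # leaf and of every node above it, and a commit then repoints the
    # superblock at the new root.  Old blocks are never overwritten.
    import itertools

    disk = {}                      # block number -> node (stands in for the device)
    superblock = {"root": None}    # what the last commit points at
    next_block = itertools.count(1)

    def write_node(children=None, item=None):
        blockno = next(next_block)             # CoW: always a fresh location
        disk[blockno] = {"children": children or [], "item": item}
        return blockno

    def cow_update(path, new_item):
        """path: block numbers from root to leaf; returns the new root block."""
        old, new = path[-1], write_node(item=new_item)     # new copy of the leaf
        for blockno in reversed(path[:-1]):                # then of each ancestor
            children = [new if c == old else c for c in disk[blockno]["children"]]
            old, new = blockno, write_node(children=children)
        return new

    def commit(new_root):
        # In btrfs this happens roughly every 30 s: flush the dirty tree
        # blocks, issue a disk flush, then rewrite the superblock.
        superblock["root"] = new_root

    leaf = write_node(item="inode 42")
    commit(write_node(children=[leaf]))

    # An in-memory change copies the whole root-to-leaf path; the old blocks
    # stay valid until the next commit swaps the root.
    commit(cow_update([superblock["root"], leaf], "inode 42, chmod 0600"))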
[snip]
> The #1 biggest difference between btrfs and most other filesystems is
> that btrfs, by default, duplicates all metadata -- two copies of all
> metadata, one copy of data, by default.
[snip]
> So that doubles the btrfs metadata writes, right there, since by default,
> btrfs double-copies all metadata.
>
> The #2 big factor is that btrfs (again, by default, but this is a major
> feature of btrfs, otherwise, you might as well run something else) does
> full checksumming for both data and metadata. Unlike most filesystems,
[snip]
> And of course all these checksums must be written somewhere as well, so
> that's another huge increase in written metadata, even for 0-length
> files, since the metadata itself is checksummed!
This isn't quite true. btrfs checksums everything at the rate of a
single (currently 4-byte) checksum per (4096-byte) page. In the case
of data blocks, those go in the checksum tree -- thus increasing the
size of the metadata, as you suggest. However, for metadata blocks,
the checksum is written into each metadata block itself. Thus the
checksum overhead of metadata is effectively zero.
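(Back-of-the-envelope, with the figures above:)

    # Checksum overhead with one 4-byte checksum per 4096-byte block.
    CSUM_BYTES, BLOCK_BYTES = 4, 4096

    # Data: checksums live in the checksum tree, so they add metadata...
    print("%.3f%%" % (100.0 * CSUM_BYTES / BLOCK_BYTES))   # ~0.098% of the data size

    # ...but a metadata block's checksum sits in that block's own header,
    # so it costs no extra blocks at all.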
Further, if the files are all zero length, I'd expect the file
"data" to be held inline in the extent tree, which will increase the
per-file size of the metadata a little, but not much. As a result of
this, there won't be any checksums in the checksum tree, because all
of the extents are stored inline elsewhere, and checksummed by the
normal embedded metadata checksums.
Hugo.
(*) An internal tree node holds up to 121 tree-node references, so the
depth of a tree with n items in it is approximately log(n)/log(121). A
tree with a billion items in it would have a maximum depth of 5.
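(Spelling the footnote's arithmetic out:)

    # Depth of a tree with n items and a fanout of 121 references per node.
    import math

    def depth(n, fanout=121):
        return int(math.ceil(math.log(n) / math.log(fanout)))

    print(depth(10**9))   # 5 -- a billion items still fit in five levels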
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I believe that it's closely correlated with ---
the aeroswine coefficient.
* Re: Understanding metadata efficiency of btrfs
2012-03-06 11:29 ` Hugo Mills
@ 2012-03-06 21:25 ` Duncan
0 siblings, 0 replies; 5+ messages in thread
From: Duncan @ 2012-03-06 21:25 UTC (permalink / raw)
To: linux-btrfs
Hugo Mills posted on Tue, 06 Mar 2012 11:29:58 +0000 as excerpted:
> The "in-memory buffer" is simply the standard Linux block layer and FS
> cache: When a piece of metadata is searched for, btrfs walks down the
> relevant tree, loading each tree node (a 4k page) in turn, until it
> finds the metadata. Unless there is a huge amount of memory pressure,
> Linux's block cache will hang on those blocks in RAM.
>
> btrfs can then modify those blocks as much as it likes, in RAM, as
> userspace tools request those changes to be made (e.g. writes, deletes,
> etc). By the CoW nature of the FS, modifying a metadata block will also
> require modification of the block above it in the tree, and so on up to
> the top of the tree. If it's all kept in RAM, this is a fast operation,
> since the trees aren't usually very deep(*).
>
> At regular intervals (30s), the btrfs code will ensure that it has
> a consistent in-memory set of blocks, and flushes those dirty blocks to
> disk, ensuring that they're moved from the original location. It does so
> by first writing all of the tree data, sending down disk flush commands
> to ensure that the data gets to disk reliably, and then writing out new
> copies of the superblocks so that they point to the new trees.
Thanks for this and the (snipped) rest. It's nice to know a bit of
detail at a level below where I was, tho while it makes sense, I suspect
it may take reading it more than once to sink in. =:^( But I expect by
sticking around I'll get that chance. =:^)
Thanks for the truer checksumming picture as well. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
end of thread, newest: 2012-03-06 21:25 UTC
Thread overview: 5+ messages
2012-03-06 2:16 Understanding metadata efficiency of btrfs Kai Ren
2012-03-06 2:32 ` Kai Ren
2012-03-06 5:30 ` Duncan
2012-03-06 11:29 ` Hugo Mills
2012-03-06 21:25 ` Duncan