From: Ric Wheeler <ricwheeler@gmail.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: New data=ordered code pushed out to btrfs-unstable
Date: Fri, 18 Jul 2008 16:09:31 -0400
Message-ID: <4880F87B.7020908@gmail.com>
In-Reply-To: <1216398992.6932.36.camel@think.oraclecorp.com>

Chris Mason wrote:
> Hello everyone,
>
> It took me much longer to chase down races in my new data=ordered code,
> but I think I've finally got it, and have pushed it out to the unstable
> trees.
>
> There are no disk format changes included. I need to make minor mods to
> the resizing and balancing code, but I wanted to get this stuff out the
> door.
>
> In general, I'll call data=ordered any system that prevents seeing stale
> data on the disk after a crash. This would include null bytes from
> areas not yet written when we crashed and the contents of old blocks the
> filesystem had freed in the past.
>
> The old data=ordered code worked something like this:
>
> file_write:
> * modify pages in page cache
> * set delayed allocation bits
> * Update in memory and on-disk i_size
>
> writepage:
> * collect a large delalloc region
> * allocate new extent
> * drop existing extents from the metadata
> * insert new extent
> * start the page io
>
> transaction commit:
> * write and wait on any dirty file data to finish
> * commit the new btree pointers
>
> The end result was very large latencies during transaction commit
> because it had to wait on all the file data. An fsync of a single file
> was forced to write out all the dirty metadata and dirty data on the FS.
> This is how ext3 works today; xfs does something smarter. ext4 is
> moving to something similar to xfs.
>
> With the new code, metadata is not modified in the btree until new
> extents are fully on disk. It now looks something like this:
>
> file write (start, len):
> * wait on pending ordered extents for the start, len range
> * modify pages in the page cache
> * set delayed allocation bits
> * Update in memory only i_size
>
> writepage:
> * collect a large delalloc extent
> * reserve an extent on disk in the allocation tree
> * create an ordered extent record
> * start the page io
>
> At IO completion (done in a kthread):
> * find the corresponding ordered extent record
> * if fully written, remove old extents from the tree,
> add new extents to the tree, update on disk i_size
>
> At commit time:
> * Do only metadata IO
>
> The end result of all of this is lower commit latencies and a smoother
> system.
>
> -chris
>
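
The new write path quoted above can be modeled roughly as follows. This is a toy Python sketch, not btrfs code: the names ToyInode, OrderedExtent, file_write, writepage and io_complete are hypothetical, the "btree" is a plain dict, and overlapping old extents are dropped whole rather than split. It only illustrates the ordering rule that metadata and the on-disk i_size change after the data IO for an extent has fully completed.

```python
from dataclasses import dataclass


@dataclass
class OrderedExtent:
    """Hypothetical record for one extent whose data IO is in flight."""
    start: int         # file offset of the extent
    length: int        # bytes covered
    disk_bytenr: int   # location reserved on disk
    bytes_left: int    # bytes whose IO has not completed yet


class ToyInode:
    def __init__(self):
        self.mem_i_size = 0    # in-memory i_size, updated at write time
        self.disk_i_size = 0   # on-disk i_size, updated at IO completion
        self.extents = {}      # start -> (length, disk_bytenr); stands in for the btree
        self.ordered = []      # pending ordered extent records

    def _pending(self, start, length):
        end = start + length
        return [oe for oe in self.ordered
                if oe.start < end and start < oe.start + oe.length]

    def file_write(self, start, length):
        # The real code would wait here; the toy model just refuses.
        if self._pending(start, length):
            raise RuntimeError("would wait on pending ordered extents")
        self.mem_i_size = max(self.mem_i_size, start + length)

    def writepage(self, start, length, disk_bytenr):
        # Reserve space and record the ordered extent; no metadata change yet.
        oe = OrderedExtent(start, length, disk_bytenr, bytes_left=length)
        self.ordered.append(oe)
        return oe

    def io_complete(self, offset, nbytes):
        # Find the ordered extent record covering this completed IO.
        oe = next(o for o in self.ordered
                  if o.start <= offset < o.start + o.length)
        oe.bytes_left -= nbytes
        if oe.bytes_left > 0:
            return False       # extent not fully written yet
        end = oe.start + oe.length
        # Drop old extents that overlap, then insert the new one.
        for s in [s for s, (l, _) in self.extents.items()
                  if s < end and oe.start < s + l]:
            del self.extents[s]
        self.extents[oe.start] = (oe.length, oe.disk_bytenr)
        self.disk_i_size = max(self.disk_i_size, min(self.mem_i_size, end))
        self.ordered.remove(oe)
        return True
```

In this model a commit never has to wait on file data: anything still listed in `ordered` has not touched `extents` or `disk_i_size`, so a crash at that point exposes neither stale blocks nor a size covering unwritten data.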
Just to kick the tires, I tried the same test that I ran last week on
ext4. Everything was going great; I decided to kill it after 6 million
files or so and restart.

The unmount has taken a very, very long time. It seems like we are
cleaning up the pending transactions at a very slow rate:

Jul 18 16:06:04 localhost kernel: cleaner awake
Jul 18 16:06:04 localhost kernel: cleaner done
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:06:35 localhost kernel: trans 188 done in commit
Jul 18 16:06:35 localhost kernel: cleaner awake
Jul 18 16:06:35 localhost kernel: cleaner done
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:06 localhost kernel: trans 189 done in commit
Jul 18 16:07:06 localhost kernel: cleaner awake
Jul 18 16:07:06 localhost kernel: cleaner done
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:07:37 localhost kernel: trans 190 done in commit
Jul 18 16:07:37 localhost kernel: cleaner awake
Jul 18 16:07:37 localhost kernel: cleaner done
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:09 localhost kernel: trans 191 done in commit
Jul 18 16:08:09 localhost kernel: cleaner awake
Jul 18 16:08:09 localhost kernel: cleaner done
Jul 18 16:08:39 localhost kernel: trans 192 in commit
Jul 18 16:08:39 localhost kernel: trans 192 done in commit
Jul 18 16:08:39 localhost kernel: cleaner awake
Jul 18 16:08:39 localhost kernel: cleaner done
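
The commit cadence in the log above can be measured with a short Python sketch. The regular expression and the function name commit_stats are assumptions made for illustration; the year 2008 comes from the Date header, since syslog timestamps carry no year.

```python
import re
from datetime import datetime

LOG = """\
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:06:35 localhost kernel: trans 188 done in commit
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:06 localhost kernel: trans 189 done in commit
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:07:37 localhost kernel: trans 190 done in commit
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:09 localhost kernel: trans 191 done in commit
Jul 18 16:08:39 localhost kernel: trans 192 in commit
Jul 18 16:08:39 localhost kernel: trans 192 done in commit
"""

# Matches only the "trans N in commit" / "trans N done in commit" lines.
PATTERN = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: trans (\d+) (done )?in commit$"
)


def commit_stats(log_text, year=2008):
    """Return (seconds per commit by transaction id, seconds between commit starts)."""
    starts, ends = {}, {}
    for line in log_text.splitlines():
        m = PATTERN.match(line.strip())
        if not m:
            continue  # e.g. "cleaner awake" lines
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        (ends if m.group(3) else starts)[int(m.group(2))] = ts
    durations = {t: int((ends[t] - starts[t]).total_seconds())
                 for t in sorted(starts) if t in ends}
    ordered = [starts[t] for t in sorted(starts)]
    gaps = [int((b - a).total_seconds()) for a, b in zip(ordered, ordered[1:])]
    return durations, gaps


durations, gaps = commit_stats(LOG)
print(durations)  # {188: 1, 189: 1, 190: 1, 191: 2, 192: 0}
print(gaps)       # [31, 31, 31, 32]
```

Each commit finishes within about two seconds, but a new one starts only every 31 to 32 seconds, so most of the elapsed time falls between commits.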

The command I ran was:

fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 -l btrfs_new.txt

(No fsyncs involved here)

ric