From: Ric Wheeler <ricwheeler@gmail.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: New data=ordered code pushed out to btrfs-unstable
Date: Fri, 18 Jul 2008 16:09:31 -0400
Message-ID: <4880F87B.7020908@gmail.com>
In-Reply-To: <1216398992.6932.36.camel@think.oraclecorp.com>
Chris Mason wrote:
> Hello everyone,
>
> It took me much longer to chase down races in my new data=ordered code,
> but I think I've finally got it, and have pushed it out to the unstable
> trees.
>
> There are no disk format changes included. I need to make minor mods to
> the resizing and balancing code, but I wanted to get this stuff out the
> door.
>
> In general, I'll call data=ordered any system that prevents seeing stale
> data on the disk after a crash. This would include null bytes from
> areas not yet written when we crashed and the contents of old blocks the
> filesystem had freed in the past.
>
> The old data=ordered code worked something like this:
>
> file_write:
> * modify pages in page cache
> * set delayed allocation bits
> * Update in memory and on-disk i_size
>
> writepage:
> * collect a large delalloc region
> * allocate new extent
> * drop existing extents from the metadata
> * insert new extent
> * start the page io
>
> transaction commit:
> * write and wait on any dirty file data to finish
> * commit the new btree pointers
>
> The end result was very large latencies during transaction commit
> because it had to wait on all the file data. An fsync of a single file
> was forced to write out all the dirty metadata and dirty data on the FS.
> This is how ext3 works today; xfs does something smarter. ext4 is
> moving to something similar to xfs.
>
> With the new code, metadata is not modified in the btree until new
> extents are fully on disk. It now looks something like this:
>
> file write (start, len):
> * wait on pending ordered extents for the start, len range
> * modify pages in the page cache
> * set delayed allocation bits
> * Update in memory only i_size
>
> writepage:
> * collect a large delalloc extent
> * reserve an extent on disk in the allocation tree
> * create an ordered extent record
> * start the page io
>
> At IO completion (done in a kthread):
> * find the corresponding ordered extent record
> * if fully written, remove old extents from the tree,
>   add new extents to the tree, update on-disk i_size
>
> At commit time:
> * Do only metadata IO
>
> The end result of all of this is lower commit latencies and a smoother
> system.
>
> -chris
>
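To make the quoted flow concrete, here is a toy Python model of the new ordering. It is a sketch under stated assumptions, not the actual btrfs code: the names ToyInode, OrderedExtent and disk_bytenr, and the dicts standing in for the trees, are made up for illustration. The real code waits on overlapping ordered extents rather than raising, and handles overlapping extents and i_size updates far more carefully.

```python
from dataclasses import dataclass


@dataclass
class OrderedExtent:
    """Hypothetical record of data IO in flight for one file range."""
    start: int        # file offset
    length: int       # bytes covered by the extent
    disk_bytenr: int  # reserved on-disk location (illustrative name)
    bytes_left: int   # bytes whose IO has not completed yet


class ToyInode:
    def __init__(self):
        self.mem_i_size = 0    # updated at write time
        self.disk_i_size = 0   # updated only after data IO completes
        self.extent_tree = {}  # start -> (length, disk_bytenr): metadata
        self.ordered = {}      # start -> OrderedExtent: pending data IO

    def pending_in_range(self, start, length):
        # Ordered extents overlapping [start, start + length).
        return [oe for oe in self.ordered.values()
                if oe.start < start + length and start < oe.start + oe.length]

    def file_write(self, start, length):
        # The real code would sleep here; the toy just refuses.
        if self.pending_in_range(start, length):
            raise RuntimeError("would wait on pending ordered extents")
        self.mem_i_size = max(self.mem_i_size, start + length)

    def writepage(self, start, length, disk_bytenr):
        # Reserve space and record the ordered extent; metadata is untouched.
        self.ordered[start] = OrderedExtent(start, length, disk_bytenr, length)

    def io_complete(self, start, nbytes):
        # Called as data IO finishes; returns True once metadata is updated.
        oe = self.ordered[start]
        oe.bytes_left -= nbytes
        if oe.bytes_left > 0:
            return False
        del self.ordered[start]
        # Simplification: replace by exact start key instead of dropping
        # every overlapping old extent.
        self.extent_tree[start] = (oe.length, oe.disk_bytenr)
        self.disk_i_size = max(self.disk_i_size, oe.start + oe.length)
        return True
```

In this model a crash before io_complete() returns True leaves extent_tree and disk_i_size unchanged, so the metadata never points at data that has not reached disk, and a commit only has metadata to write.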
Just to kick the tires, I tried the same test that I ran last week on
ext4. Everything was going great; I decided to kill it after 6 million
files or so and restart.

The unmount has taken a very, very long time. It seems like we are
cleaning up the pending transactions at a very slow rate:
Jul 18 16:06:04 localhost kernel: cleaner awake
Jul 18 16:06:04 localhost kernel: cleaner done
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:06:35 localhost kernel: trans 188 done in commit
Jul 18 16:06:35 localhost kernel: cleaner awake
Jul 18 16:06:35 localhost kernel: cleaner done
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:06 localhost kernel: trans 189 done in commit
Jul 18 16:07:06 localhost kernel: cleaner awake
Jul 18 16:07:06 localhost kernel: cleaner done
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:07:37 localhost kernel: trans 190 done in commit
Jul 18 16:07:37 localhost kernel: cleaner awake
Jul 18 16:07:37 localhost kernel: cleaner done
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:09 localhost kernel: trans 191 done in commit
Jul 18 16:08:09 localhost kernel: cleaner awake
Jul 18 16:08:09 localhost kernel: cleaner done
Jul 18 16:08:39 localhost kernel: trans 192 in commit
Jul 18 16:08:39 localhost kernel: trans 192 done in commit
Jul 18 16:08:39 localhost kernel: cleaner awake
Jul 18 16:08:39 localhost kernel: cleaner done
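The spacing between commits in the log above can be checked with a few lines of Python. This is only a convenience sketch: the function name commit_intervals is made up, the LOG string copies just the "in commit" lines from above, and it assumes all timestamps fall on the same day.

```python
LOG = """\
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:39 localhost kernel: trans 192 in commit
"""


def commit_intervals(log):
    """Seconds between consecutive 'trans N in commit' lines."""
    starts = []
    for line in log.splitlines():
        parts = line.split()
        # Expected shape: Mon DD HH:MM:SS host kernel: trans N in commit
        if len(parts) == 9 and parts[5] == "trans" and parts[7:] == ["in", "commit"]:
            hours, minutes, seconds = (int(x) for x in parts[2].split(":"))
            starts.append(hours * 3600 + minutes * 60 + seconds)
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


print(commit_intervals(LOG))  # [31, 31, 31, 32]
```

So the log shows one transaction commit roughly every 31 seconds while the unmount is in progress.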
The command I ran was:
fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 -l btrfs_new.txt
(No fsyncs involved here)
ric