From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ric Wheeler
Subject: Re: New data=ordered code pushed out to btrfs-unstable
Date: Fri, 18 Jul 2008 16:09:31 -0400
Message-ID: <4880F87B.7020908@gmail.com>
References: <1216398992.6932.36.camel@think.oraclecorp.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Cc: linux-btrfs
To: Chris Mason
Return-path:
In-Reply-To: <1216398992.6932.36.camel@think.oraclecorp.com>
List-ID:

Chris Mason wrote:
> Hello everyone,
>
> It took me much longer to chase down races in my new data=ordered code,
> but I think I've finally got it, and have pushed it out to the unstable
> trees.
>
> There are no disk format changes included. I need to make minor mods to
> the resizing and balancing code, but I wanted to get this stuff out the
> door.
>
> In general, I'll call data=ordered any system that prevents seeing stale
> data on the disk after a crash. This would include null bytes from
> areas not yet written when we crashed and the contents of old blocks the
> filesystem had freed in the past.
>
> The old data=ordered code worked something like this:
>
> file_write:
> * modify pages in page cache
> * set delayed allocation bits
> * Update in memory and on-disk i_size
>
> writepage:
> * collect a large delalloc region
> * allocate new extent
> * drop existing extents from the metadata
> * insert new extent
> * start the page io
>
> transaction commit:
> * write and wait on any dirty file data to finish
> * commit the new btree pointers
>
> The end result was very large latencies during transaction commit
> because it had to wait on all the file data. A fsync of a single file
> was forced to write out all the dirty metadata and dirty data on the FS.
> This is how ext3 works today, xfs does something smarter. ext4 is
> moving to something similar to xfs.
>
> With the new code, metadata is not modified in the btree until new
> extents are fully on disk.
> It now looks something like this:
>
> file write (start, len):
> * wait on pending ordered extents for the start, len range
> * modify pages in the page cache
> * set delayed allocation bits
> * Update in memory only i_size
>
> writepage:
> * collect a large delalloc extent
> * reserve a extent on disk in the allocation tree
> * create an ordered extent record
> * start the page io
>
> At IO completion (done in a kthread):
> * find the corresponding ordered extent record
> * if fully written, remove old extents from the tree,
> add new extents to the tree, update on disk i_size
>
> At commit time:
> * Just do only metadata IO
>
> The end result of all of this is lower commit latencies and a smoother
> system.
>
> -chris
>

Just to kick the tires, I tried the same test that I ran last week on ext4.

Everything was going great; I decided to kill it after 6 million files or
so and restart.

The unmount has taken a very, very long time. It seems like we are cleaning
up the pending transactions at a very slow rate:

Jul 18 16:06:04 localhost kernel: cleaner awake
Jul 18 16:06:04 localhost kernel: cleaner done
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:06:35 localhost kernel: trans 188 done in commit
Jul 18 16:06:35 localhost kernel: cleaner awake
Jul 18 16:06:35 localhost kernel: cleaner done
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:06 localhost kernel: trans 189 done in commit
Jul 18 16:07:06 localhost kernel: cleaner awake
Jul 18 16:07:06 localhost kernel: cleaner done
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:07:37 localhost kernel: trans 190 done in commit
Jul 18 16:07:37 localhost kernel: cleaner awake
Jul 18 16:07:37 localhost kernel: cleaner done
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:09 localhost kernel: trans 191 done in commit
Jul 18 16:08:09 localhost kernel: cleaner awake
Jul 18 16:08:09 localhost kernel: cleaner done
Jul 18 16:08:39 localhost kernel: trans 192 in commit
Jul 18 16:08:39 localhost kernel: trans 192 done in commit
Jul 18 16:08:39 localhost kernel: cleaner awake
Jul 18 16:08:39 localhost kernel: cleaner done

The command I ran was:

fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 -l btrfs_new.txt

(No fsyncs involved here)

ric
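
For anyone who wants to quantify "very slow rate" from the log excerpt above, here is a minimal sketch that pairs each "trans N in commit" line with its "trans N done in commit" line and reports how long each commit took and how far apart the commits started. The function names (parse_commits, summarize) and the year argument are made up for illustration; syslog timestamps carry no year, so 2008 is assumed from the mail's Date header.

```python
import re
from datetime import datetime

# Kernel log lines copied from the excerpt above.
LOG = """\
Jul 18 16:06:04 localhost kernel: cleaner awake
Jul 18 16:06:04 localhost kernel: cleaner done
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:06:35 localhost kernel: trans 188 done in commit
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:06 localhost kernel: trans 189 done in commit
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:07:37 localhost kernel: trans 190 done in commit
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:09 localhost kernel: trans 191 done in commit
Jul 18 16:08:39 localhost kernel: trans 192 in commit
Jul 18 16:08:39 localhost kernel: trans 192 done in commit
"""

# Matches only the transaction start/finish messages; cleaner lines are skipped.
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: trans (\d+) (done in|in) commit$"
)


def parse_commits(log, year=2008):
    """Return {transid: (start, end)} for every commit with both log lines."""
    starts, commits = {}, {}
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        transid = int(m.group(2))
        if m.group(3) == "in":
            starts[transid] = stamp
        elif transid in starts:
            commits[transid] = (starts[transid], stamp)
    return commits


def summarize(commits):
    """Return (seconds spent in each commit, seconds between commit starts)."""
    ids = sorted(commits)
    durations = [(commits[i][1] - commits[i][0]).total_seconds() for i in ids]
    gaps = [
        (commits[b][0] - commits[a][0]).total_seconds()
        for a, b in zip(ids, ids[1:])
    ]
    return durations, gaps


if __name__ == "__main__":
    durations, gaps = summarize(parse_commits(LOG))
    print("seconds per commit:", durations)
    print("seconds between commit starts:", gaps)
```

On the excerpt above this reports commits that take 0 to 2 seconds each and start 31 to 32 seconds apart, which suggests most of the wall-clock time is spent between commits rather than inside them. That reading is only as good as the one-second syslog resolution and the five transactions shown.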