Re: [RFC PATCH] ext4: fix 50% disk write performance regression

From: Eric Sandeen <sandeen@redhat.com>
To: Bill Fink <billfink@mindspring.com>
Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org,
	bill.fink@nasa.gov
Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression
Date: Mon, 30 Aug 2010 12:05:33 -0500
Message-ID: <4C7BE4DD.1060208@redhat.com>
In-Reply-To: <20100829231126.8d8b2086.billfink@mindspring.com>

Bill Fink wrote:
> A 50% ext4 disk write performance regression was introduced
> in 2.6.32 and still exists in 2.6.35, although somewhat improved
> from 2.6.32.  (Read performance was not affected.)
> 
> 2.6.31 disk write performance (RAID5 with 8 disks):
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 49.7106 s, 691 MB/s
> 
> 2.6.32 disk write performance (RAID5 with 8 disks):
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 100.395 s, 342 MB/s
> 
> 2.6.35 disk write performance (RAID5 with 8 disks):
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 75.7265 s, 454 MB/s
> 
> A git bisect targeted commit 55138e0bc29c0751e2152df9ad35deea542f29b3
> (ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks).
> Specifically the performance issue is caused by the use of the function
> ext4_num_dirty_pages.
> 
> The included patch avoids calling ext4_num_dirty_pages
> (and removes its definition) by unconditionally setting
> desired_nr_to_write to wbc->nr_to_write * 8.
> 
> With the patch, the disk write performance is back to
> approximately 2.6.31 performance levels.

Firstly, thanks very much for tracking that down.  I've had various &
sundry reports of slowdowns, but I never really got to the bottom
of them with a simple testcase.

When I get some time (soon I hope) I'll look into the ramifications
of this change (i.e. what if wbc->nr_to_write * 8 is more than the dirty
pages, do things work out ok?) but it seems pretty reasonable.
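
For reference, the sizing logic that remains after the patch can be
modelled in userspace roughly as below.  This is only a sketch: the
helper name is made up for illustration, 4 KiB pages
(PAGE_CACHE_SHIFT == 12) are assumed, and it does not model what the
writeback loop does when fewer pages are actually dirty than the
budget allows, which is the open question above.

```c
/* Assumption: 4 KiB pages, i.e. PAGE_CACHE_SHIFT == 12. */
#define PAGE_CACHE_SHIFT 12

/*
 * Hypothetical userspace model of the patched computation in
 * ext4_da_writepages(): take wbc->nr_to_write * 8 and cap it at
 * max_pages, which is derived from s_max_writeback_mb_bump (in MiB).
 */
static long desired_nr_to_write(long nr_to_write,
				unsigned int max_writeback_mb_bump)
{
	/* MiB -> pages: 2^20 bytes per MiB / 2^PAGE_CACHE_SHIFT bytes per page */
	long max_pages = (long)max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
	long desired = nr_to_write * 8;

	if (desired > max_pages)
		desired = max_pages;
	return desired;
}
```

With illustrative inputs of nr_to_write = 1024 and a bump of 128 MiB,
the budget is 8192 pages (32 MiB); the cap of 32768 pages only takes
effect once nr_to_write exceeds 4096.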

Since the commit was Ted's originally, perhaps he has some more
immediate comments.

Thanks a ton!

-Eric

> 2.6.35+patch disk write performance (RAID5 with 8 disks):
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 50.7234 s, 677 MB/s
> 
> Since I'm no expert in this area, I'm submitting this
> RFC patch against 2.6.35.  I'm not sure what all the
> ramifications of my suggested change would be.  However,
> to my admittedly novice eyes, it doesn't seem to be an
> unreasonable change.  Also, subjectively from building
> kernels on a RAID5 ext4 filesystem using the patched
> 2.6.35 kernel (via make -j 8), I didn't notice any issues,
> and it actually seemed more responsive than when using
> the unpatched 2.6.35 kernel.
> 
> 					-Bill
> 
> P.S.  I am not subscribed to the linux-ext4 e-mail list,
>       plus this is my very first attempted linux kernel
>       patch submission.
> 
> 
> 
> Partially revert 55138e0bc29c0751e2152df9ad35deea542f29b3
> (ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks)
> to fix a 50% ext4 disk write performance regression introduced
> between 2.6.31 and 2.6.32.
> 
> Signed-off-by: Bill Fink <bill.fink@nasa.gov>
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 42272d6..f6e639b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1143,64 +1143,6 @@ static int check_block_validity(struct inode *inode, const char *func,
>  }
>  
>  /*
> - * Return the number of contiguous dirty pages in a given inode
> - * starting at page frame idx.
> - */
> -static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> -				    unsigned int max_pages)
> -{
> -	struct address_space *mapping = inode->i_mapping;
> -	pgoff_t	index;
> -	struct pagevec pvec;
> -	pgoff_t num = 0;
> -	int i, nr_pages, done = 0;
> -
> -	if (max_pages == 0)
> -		return 0;
> -	pagevec_init(&pvec, 0);
> -	while (!done) {
> -		index = idx;
> -		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -					      PAGECACHE_TAG_DIRTY,
> -					      (pgoff_t)PAGEVEC_SIZE);
> -		if (nr_pages == 0)
> -			break;
> -		for (i = 0; i < nr_pages; i++) {
> -			struct page *page = pvec.pages[i];
> -			struct buffer_head *bh, *head;
> -
> -			lock_page(page);
> -			if (unlikely(page->mapping != mapping) ||
> -			    !PageDirty(page) ||
> -			    PageWriteback(page) ||
> -			    page->index != idx) {
> -				done = 1;
> -				unlock_page(page);
> -				break;
> -			}
> -			if (page_has_buffers(page)) {
> -				bh = head = page_buffers(page);
> -				do {
> -					if (!buffer_delay(bh) &&
> -					    !buffer_unwritten(bh))
> -						done = 1;
> -					bh = bh->b_this_page;
> -				} while (!done && (bh != head));
> -			}
> -			unlock_page(page);
> -			if (done)
> -				break;
> -			idx++;
> -			num++;
> -			if (num >= max_pages)
> -				break;
> -		}
> -		pagevec_release(&pvec);
> -	}
> -	return num;
> -}
> -
> -/*
>   * The ext4_map_blocks() function tries to look up the requested blocks,
>   * and returns if the blocks are already mapped.
>   *
> @@ -2972,15 +2914,10 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	 * contiguous.  Unfortunately this brings us to the second
>  	 * stupidity, which is that ext4's mballoc code only allocates
>  	 * at most 2048 blocks.  So we force contiguous writes up to
> -	 * the number of dirty blocks in the inode, or
> -	 * sbi->max_writeback_mb_bump whichever is smaller.
> +	 * sbi->max_writeback_mb_bump
>  	 */
>  	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> -	if (!range_cyclic && range_whole)
> -		desired_nr_to_write = wbc->nr_to_write * 8;
> -	else
> -		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> -							   max_pages);
> +	desired_nr_to_write = wbc->nr_to_write * 8;
>  	if (desired_nr_to_write > max_pages)
>  		desired_nr_to_write = max_pages;
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
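
The dd figures quoted above can be cross-checked with a little
arithmetic.  The helpers below are illustrative only (their names are
not from the thread) and assume dd reports decimal megabytes, which is
consistent with 34359738368 bytes being shown as "34 GB".

```c
/* Throughput in MB/s, using decimal megabytes (10^6 bytes). */
static double mb_per_sec(double bytes, double seconds)
{
	return bytes / seconds / 1e6;
}

/* Throughput lost relative to a baseline, as a percentage. */
static double regression_pct(double baseline, double measured)
{
	return (1.0 - measured / baseline) * 100.0;
}
```

Fed the reported byte count and elapsed times, mb_per_sec() rounds to
the quoted 691, 342, 454 and 677 MB/s.  regression_pct(691, 342) is a
little over 50%, matching the subject line, while 2.6.35 at 454 MB/s
is still roughly 34% below 2.6.31.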

