Re: [RFC PATCH] ext4: fix 50% disk write performance regression

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Bill Fink <billfink@mindspring.com>
To: Eric Sandeen <sandeen@redhat.com>
Cc: tytso@mit.edu, bill.fink@nasa.gov, linux-ext4@vger.kernel.org
Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression
Date: Mon, 30 Aug 2010 23:27:04 -0400	[thread overview]
Message-ID: <20100830232704.39d2e9f4.billfink@mindspring.com> (raw)
In-Reply-To: <4C7C62E9.4090707@redhat.com>

[adding linux-ext4 back in]

On Mon, 30 Aug, Eric Sandeen wrote:

> Bill Fink wrote:
> > On Mon, 30 Aug 2010, Eric Sandeen wrote:
> > 
> >> Bill Fink wrote:
> >>> On Mon, 30 Aug 2010, Eric Sandeen wrote:
> >>>
> >>>> Bill Fink wrote:
> >>>>> On Mon, 30 Aug 2010, Ted Ts'o wrote:
> >>>>>
> >>>>>> On Sun, Aug 29, 2010 at 11:11:26PM -0400, Bill Fink wrote:
> >>>>>>> A 50% ext4 disk write performance regression was introduced
> >>>>>>> in 2.6.32 and still exists in 2.6.35, although somewhat improved
> >>>>>>> from 2.6.32.  Read performance was not affected).
> >>>>>> Thanks for reporting it.  I'm going to have to take a closer look at
> >>>>>> why this makes a difference.  I'm going to guess though that what's
> >>>>>> going on is that we're posting writes in such a way that they're no
> >>>>>> longer aligned or ending at the end of a RAID5 stripe, causing a
> >>>>>> read-modify-write pass.  That would easily explain the write
> >>>>>> performance regression.
> >>>>> I'm not sure I understand.  How could calling or not calling
> >>>>> ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
> >>>>> affect the write alignment?
> >>>>>
> >>>>> I was wondering if the locking being done in ext4_num_dirty_pages()
> >>>>> could somehow be affecting the performance.  I did notice from top
> >>>>> that in the patched 2.6.35 kernel, the I/O wait time was generally
> >>>>> in the 60-65% range, while in the unpatched 2.6.35 kernel, it was
> >>>>> at a higher 75-80% range.  However, I don't know if that's just a
> >>>>> result of the lower performance, or a possible clue to its cause.
> >>>> Using oprofile might also show you how much time is getting spent there..
> >>>>
> >>>>>> The interesting thing is that we don't actually do anything in
> >>>>>> ext4_da_writepages() to assure that we are making our writes are
> >>>>>> appropriate aligned and sized.  We do pay attention to make sure they
> >>>>>> are alligned correctly in the allocator, but _not_ in the writepages
> >>>>>> code.  So the fact that apparently things were well aligned in 2.6.32
> >>>>>> seems to be luck... (or maybe the writes are perfectly aligned in
> >>>>>> 2.6.32; they're just much worse with 2.6.35, and with explicit
> >>>>>> attention paid to the RAID stripe size, we could do even better :-)
> >>>>> It was 2.6.31 that was good.  The regression was in 2.6.32.  And again
> >>>>> how does the write alignment get modified simply by whether or not
> >>>>> ext4_num_dirty_pages() is called?
> >>>> writeback is full of deep mysteries ... :)
> >>>>
> >>>>>> If you could run blktraces on 2.6.32, 2.6.35 stock, and 2.6.35 with
> >>>>>> your patch, that would be really helpful to confirm my hypothesis.  Is
> >>>>>> that something that wouldn't be too much trouble?
> >>>>> I'd be glad to if you explain how one runs blktraces.
> >>>> Probably the easiest thing to do is to use seekwatcher to invoke blktrace,
> >>>> if it's easily available for your distro.  Then it's just mount debugfs on
> >>>> /sys/kernel/debug, and:
> >>>>
> >>>> # seekwatcher -d /dev/whatever -t tracename -o tracename.png -p "your dd command"
> >>>>
> >>>> It'll leave tracename.* blktrace files, and generate a graph of the IO
> >>>> in the PNG file.
> >>>>
> >>>> (this causes an abbreviated trace, but it's probably enough to see what
> >>>> boundaries the IO was issued on)
> >>> Thanks for the info.  How would you like me to send the blktraces?
> >>> Even using bzip2 they're 2.6 MB.  I could send them to you and Ted
> >>> via private e-mail or I can hunt around and try and find somewhere
> >>> I can post them.  I'm attaching the PNG files (2.6.35 is unpatched
> >>> and 2.6.35+ is patched).
> >> Private email is fine I think, I don't mind a 2.6MB attachment and doubt
> >> Ted would either.  :)
> > 
> > OK.  It's two such attachments (2.6.35 is unpatched and 2.6.35+
> > is patched).
> 
> (fwiw I had to create *.0 and *.1 files to make blktrace happy, it didn't
> like starting with *.2 ...?  oh well)

The .0 and .1 files were 0 bytes.  There was also a 56 byte .3 file
in each case.  Did you also want to see those?

> [sandeen@sandeen tmp]$ blkparse ext4dd-2.6.35-trace | tail -n 13
> CPU2 (ext4dd-2.6.35-trace):
>  Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>  Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>  Reads Requeued:         0		 Writes Requeued:         0
>  Reads Completed:      249,      996KiB	 Writes Completed:  270,621,   30,544MiB
>  Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>  Read depth:             0        	 Write depth:             0
>  IO unplugs:             0        	 Timer unplugs:           0
> 
> Throughput (R/W): 13KiB/s / 409,788KiB/s
> Events (ext4dd-2.6.35-trace): 270,870 entries
> Skips: 0 forward (0 -   0.0%)
> Input file ext4dd-2.6.35-trace.blktrace.2 added
> 
> 
> [sandeen@sandeen tmp]$ blkparse ext4dd-2.6.35+-trace | tail -n 13
> CPU2 (ext4dd-2.6.35+-trace):
>  Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>  Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>  Reads Requeued:         0		 Writes Requeued:         0
>  Reads Completed:      504,    2,016KiB	 Writes Completed:  246,500,   30,610MiB
>  Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>  Read depth:             0        	 Write depth:             0
>  IO unplugs:             0        	 Timer unplugs:           0
> 
> Throughput (R/W): 38KiB/s / 590,917KiB/s
> Events (ext4dd-2.6.35+-trace): 247,004 entries
> Skips: 0 forward (0 -   0.0%)
> Input file ext4dd-2.6.35+-trace.blktrace.2 added
> 
> 
> Ok not a -huge- difference in the overall stats, though the unpatched version did
> fdo 24,000 more writes, which is 10% ...
> 
> At the extremes in both cases there are 8-block and 256-block writes.
> 
> 2.6.35:
> 
>  nr wrt size
>   25256 8
>    1701 16
>    1646 24
>     297 248
> ...
>  232657 256
> 
> 
> 2.6.35+:
> 
>  nr wrt size
>    4785 8
>    1732 16
>     357 24
> ...
>      50 248
>  237907 256
> 
> So not a huge difference in the distribution really, though there were
> 20,000 more 1-block (8-sector) writes in the unpatched version.
> 
> So a lot of 32-block writes turned into 1-block writes and smaller, I guess.
> 
> To know for sure about alignment, what is your raid stripe unit, and is this
> filesystem on a partition?  If so at what offset?

Would the stripe unit be the Block size in the following?

[root@i7test7 linux-git]# hptrc -s1 query arrays
ID    Capacity(GB)    Type        Status   Block  Sector   Cache            Name
-------------------------------------------------------------------------------
1           500.00   RAID5        NORMAL    256k    512B      WB        RAID50_1
2           250.00   RAID5        NORMAL    256k    512B      WB       RAID500_1
3          2350.00   RAID5        NORMAL    256k    512B      WB         RAID5_1

The testing was done on the first 500 GB array.

And here's how the ext4 filesystem was created (using the
entire device):

[root@i7test7 linux-git]# mkfs.ext4 /dev/sde
mke2fs 1.41.10 (10-Feb-2009)
/dev/sde is entire device, not just one partition!
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
30523392 inodes, 122070144 blocks
6103507 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3726 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

						-Thanks

						-Bill



> -Eric
> 
> >> I keep meaning to patch seekwatcher to color unaligned IOs differently,
> >> but without that we need the blktrace data to know if that's what's going
> >> on.
> >>
> >> It's interesting that the patched run is starting at block 0 while
> >> unpatched is starting futher in (which would be a little slower at least)
> >>
> >> was there a fresh mkfs in between?
> > 
> > No mkfs in between, and the original mkfs.ext4 was done without
> > any special options.  I am using the noauto_da_alloc option on the
> > mount to workaround another 9% performance hit between 2.6.31 and
> > 2.6.32, introduced by 5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10
> > (ext4: Fix the alloc on close after a truncate hueristic).  That
> > one only affected already existing files.
> > 
> >> Thanks!
> >>
> >> -Eric

next prev parent reply	other threads:[~2010-08-31  3:27 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-08-30  3:11 [RFC PATCH] ext4: fix 50% disk write performance regression Bill Fink
2010-08-30 17:05 ` Eric Sandeen
2010-08-30 19:30   ` Bill Fink
2010-08-30 19:35     ` Eric Sandeen
2010-08-30 17:40 ` Ted Ts'o
2010-08-30 20:49   ` Bill Fink
2010-08-30 21:05     ` Eric Sandeen
     [not found]       ` <20100830194533.6d09c38b.bill@wizard.sci.gsfc.nasa.gov>
2010-08-30 23:53         ` Eric Sandeen
     [not found]           ` <20100830210541.8b248a14.billfink@mindspring.com>
     [not found]             ` <4C7C62E9.4090707@redhat.com>
2010-08-31  3:27               ` Bill Fink [this message]
2010-08-31  3:29                 ` Eric Sandeen
2010-08-31  0:37     ` Ted Ts'o
2010-08-31  0:51       ` Justin Maggard
2010-08-31  1:44         ` Bill Fink
2010-08-31  1:14       ` Bill Fink
2010-08-31  3:43 ` [PATCH] " Eric Sandeen
2010-08-31  4:26   ` Eric Sandeen
2010-08-31  4:53   ` Bill Fink
2010-08-31  5:05     ` Eric Sandeen
2010-08-31  5:31       ` Bill Fink
2010-09-09  0:23       ` Daniel Taylor
2010-09-09  3:29         ` Eric Sandeen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100830232704.39d2e9f4.billfink@mindspring.com \
    --to=billfink@mindspring.com \
    --cc=bill.fink@nasa.gov \
    --cc=linux-ext4@vger.kernel.org \
    --cc=sandeen@redhat.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.