public inbox for linux-kernel@vger.kernel.org
* [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
@ 2010-04-15  4:40 Anton Blanchard
  2010-04-15  8:47 ` Jan Kara
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Anton Blanchard @ 2010-04-15  4:40 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig, Alexander Viro, Jens Axboe,
	Andrew Morton
  Cc: linux-kernel


We are seeing a large regression in database performance on recent kernels.
The database opens a block device with O_DIRECT|O_SYNC and a number of threads
write to different regions of the file at the same time.

A simple test case is below. I haven't defined DEVICE to anything since getting
it wrong will destroy your data :) On a 3-disk LVM with a 64k chunk size we
see about 17MB/sec and only a few threads in IO wait:

procs  -----io---- -system-- -----cpu------
 r  b     bi    bo   in   cs us sy id wa st
 0  3      0 16170  656 2259  0  0 86 14  0
 0  2      0 16704  695 2408  0  0 92  8  0
 0  2      0 17308  744 2653  0  0 86 14  0
 0  2      0 17933  759 2777  0  0 89 10  0

Most threads are blocking in vfs_fsync_range, which has:

        mutex_lock(&mapping->host->i_mutex);
        err = fop->fsync(file, dentry, datasync);
        if (!ret)
                ret = err;
        mutex_unlock(&mapping->host->i_mutex);

Commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new helpers for
syncing after writing to O_SYNC file or IS_SYNC inode) offers some explanation
of what is going on:

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

Thanks Jan for such a good commit message! The patch below drops the i_mutex
in blkdev_fsync as suggested. With it the testcase improves from 17MB/sec to
68MB/sec:

procs  -----io---- -system-- -----cpu------
 r  b     bi    bo   in   cs us sy id wa st
 0  7      0 65536 1000 3878  0  0 70 30  0
 0 34      0 69632 1016 3921  0  1 46 53  0
 0 57      0 69632 1000 3921  0  0 55 45  0
 0 53      0 69640  754 4111  0  0 81 19  0

I'd appreciate any comments from the I/O guys on whether this is the right approach.


Testcase:

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NR_THREADS 64
#define BUFSIZE (64 * 1024)

#define DEVICE "/dev/mapper/XXXXXX"

#define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))

static int fd;

static void *doit(void *arg)
{
	unsigned long offset = (long)arg;
	char *b, *buf;

	b = malloc(BUFSIZE + 1024);
	if (!b) {
		perror("malloc");
		exit(1);
	}
	buf = (char *)ALIGN((unsigned long)b, 1024);
	memset(buf, 0, BUFSIZE);

	while (1)
		pwrite(fd, buf, BUFSIZE, offset);
}

int main(int argc, char *argv[])
{
	int flags = O_RDWR|O_DIRECT;
	int i;
	unsigned long offset = 0;

	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
		flags |= O_SYNC;

	fd = open(DEVICE, flags);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < NR_THREADS-1; i++) {
		pthread_t tid;
		pthread_create(&tid, NULL, doit, (void *)offset);
		offset += BUFSIZE;
	}
	doit((void *)offset);

	return 0;
}


Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2010-04-14 12:55:50.000000000 +1000
+++ linux-2.6/fs/block_dev.c	2010-04-14 13:17:45.000000000 +1000
@@ -406,16 +406,24 @@ static loff_t block_llseek(struct file *
  
 int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
 {
-	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
+	struct inode *bd_inode = filp->f_mapping->host;
+	struct block_device *bdev = I_BDEV(bd_inode);
 	int error;
 
+	mutex_unlock(&bd_inode->i_mutex);
+
 	error = sync_blockdev(bdev);
-	if (error)
+	if (error) {
+		mutex_lock(&bd_inode->i_mutex);
 		return error;
+	}
 	
 	error = blkdev_issue_flush(bdev, NULL);
 	if (error == -EOPNOTSUPP)
 		error = 0;
+
+	mutex_lock(&bd_inode->i_mutex);
+
 	return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
  2010-04-15  4:40 [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices Anton Blanchard
@ 2010-04-15  8:47 ` Jan Kara
  2010-04-15 10:04 ` Jens Axboe
  2010-04-15 10:42 ` Christoph Hellwig
  2 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2010-04-15  8:47 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Jan Kara, Christoph Hellwig, Alexander Viro, Jens Axboe,
	Andrew Morton, linux-kernel

On Thu 15-04-10 14:40:39, Anton Blanchard wrote:
> 
> We are seeing a large regression in database performance on recent kernels.
> The database opens a block device with O_DIRECT|O_SYNC and a number of threads
> write to different regions of the file at the same time.
> 
> A simple test case is below. I haven't defined DEVICE to anything since getting
> it wrong will destroy your data :) On a 3-disk LVM with a 64k chunk size we
> see about 17MB/sec and only a few threads in IO wait:
> 
> procs  -----io---- -system-- -----cpu------
>  r  b     bi    bo   in   cs us sy id wa st
>  0  3      0 16170  656 2259  0  0 86 14  0
>  0  2      0 16704  695 2408  0  0 92  8  0
>  0  2      0 17308  744 2653  0  0 86 14  0
>  0  2      0 17933  759 2777  0  0 89 10  0
> 
> Most threads are blocking in vfs_fsync_range, which has:
> 
>         mutex_lock(&mapping->host->i_mutex);
>         err = fop->fsync(file, dentry, datasync);
>         if (!ret)
>                 ret = err;
>         mutex_unlock(&mapping->host->i_mutex);
  ...
  Just a few style nitpicks:

> Index: linux-2.6/fs/block_dev.c
> ===================================================================
> --- linux-2.6.orig/fs/block_dev.c	2010-04-14 12:55:50.000000000 +1000
> +++ linux-2.6/fs/block_dev.c	2010-04-14 13:17:45.000000000 +1000
> @@ -406,16 +406,24 @@ static loff_t block_llseek(struct file *
>   
>  int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
>  {
> -	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
> +	struct inode *bd_inode = filp->f_mapping->host;
> +	struct block_device *bdev = I_BDEV(bd_inode);
>  	int error;
>  
  Could you please add a comment here? Like "There is no need to
protect syncing of the block device by i_mutex and it unnecessarily
serializes workloads with several O_SYNC writers to the block device"

> +	mutex_unlock(&bd_inode->i_mutex);
> +
>  	error = sync_blockdev(bdev);
> -	if (error)
> +	if (error) {
> +		mutex_lock(&bd_inode->i_mutex);
>  		return error;
  Usually, "goto out" is preferred instead of the above.

> +	}
>  	
>  	error = blkdev_issue_flush(bdev, NULL);
>  	if (error == -EOPNOTSUPP)
>  		error = 0;
> +
And define out: here.

> +	mutex_lock(&bd_inode->i_mutex);
> +
>  	return error;
>  }
>  EXPORT_SYMBOL(blkdev_fsync);
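
Putting both style suggestions together, the error path would look roughly
like this (an untested sketch, not a tested patch):

	mutex_unlock(&bd_inode->i_mutex);

	error = sync_blockdev(bdev);
	if (error)
		goto out;

	error = blkdev_issue_flush(bdev, NULL);
	if (error == -EOPNOTSUPP)
		error = 0;
out:
	mutex_lock(&bd_inode->i_mutex);
	return error;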

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
  2010-04-15  4:40 [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices Anton Blanchard
  2010-04-15  8:47 ` Jan Kara
@ 2010-04-15 10:04 ` Jens Axboe
  2010-04-15 10:42 ` Christoph Hellwig
  2 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2010-04-15 10:04 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Jan Kara, Christoph Hellwig, Alexander Viro, Andrew Morton,
	linux-kernel

On Thu, Apr 15 2010, Anton Blanchard wrote:
> 
> We are seeing a large regression in database performance on recent kernels.
> The database opens a block device with O_DIRECT|O_SYNC and a number of threads
> write to different regions of the file at the same time.
> 
> A simple test case is below. I haven't defined DEVICE to anything since getting
> it wrong will destroy your data :) On a 3-disk LVM with a 64k chunk size we
> see about 17MB/sec and only a few threads in IO wait:
> 
> procs  -----io---- -system-- -----cpu------
>  r  b     bi    bo   in   cs us sy id wa st
>  0  3      0 16170  656 2259  0  0 86 14  0
>  0  2      0 16704  695 2408  0  0 92  8  0
>  0  2      0 17308  744 2653  0  0 86 14  0
>  0  2      0 17933  759 2777  0  0 89 10  0
> 
> Most threads are blocking in vfs_fsync_range, which has:
> 
>         mutex_lock(&mapping->host->i_mutex);
>         err = fop->fsync(file, dentry, datasync);
>         if (!ret)
>                 ret = err;
>         mutex_unlock(&mapping->host->i_mutex);
> 
> Commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new helpers for
> syncing after writing to O_SYNC file or IS_SYNC inode) offers some explanation
> of what is going on:
> 
>     Use these new helpers for syncing from generic VFS functions. This makes
>     O_SYNC writes to block devices acquire i_mutex for syncing. If we really
>     care about this, we can make block_fsync() drop the i_mutex and reacquire
>     it before it returns.
> 
> Thanks Jan for such a good commit message! The patch below drops the i_mutex
> in blkdev_fsync as suggested. With it the testcase improves from 17MB/sec to
> 68MB/sec:
> 
> procs  -----io---- -system-- -----cpu------
>  r  b     bi    bo   in   cs us sy id wa st
>  0  7      0 65536 1000 3878  0  0 70 30  0
>  0 34      0 69632 1016 3921  0  1 46 53  0
>  0 57      0 69632 1000 3921  0  0 55 45  0
>  0 53      0 69640  754 4111  0  0 81 19  0
> 
> I'd appreciate any comments from the I/O guys on whether this is the right approach.

Looks good to me, I see Jan already made a few style suggestions.

-- 
Jens Axboe



* Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
  2010-04-15  4:40 [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices Anton Blanchard
  2010-04-15  8:47 ` Jan Kara
  2010-04-15 10:04 ` Jens Axboe
@ 2010-04-15 10:42 ` Christoph Hellwig
  2010-04-15 13:34   ` Jan Kara
                     ` (2 more replies)
  2 siblings, 3 replies; 8+ messages in thread
From: Christoph Hellwig @ 2010-04-15 10:42 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Jan Kara, Christoph Hellwig, Alexander Viro, Jens Axboe,
	Andrew Morton, linux-kernel

On Thu, Apr 15, 2010 at 02:40:39PM +1000, Anton Blanchard wrote:
>  int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
>  {
> -	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
> +	struct inode *bd_inode = filp->f_mapping->host;
> +	struct block_device *bdev = I_BDEV(bd_inode);
>  	int error;
>  
> +	mutex_unlock(&bd_inode->i_mutex);
> +
>  	error = sync_blockdev(bdev);

Actually you can just drop this call entirely.  sync_blockdev is an
overcomplicated alias for filemap_write_and_wait on the block device
inode, which is exactly what we did just before calling into ->fsync.

It might still be worth dropping i_mutex for the cache flush, though.
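
For reference, sync_blockdev() in kernels of this era reduces to roughly
the following (a from-memory sketch of fs/buffer.c, not verbatim):

	/* Sketch: sync_blockdev() just writes back and waits on the
	 * block device's page cache mapping. */
	int sync_blockdev(struct block_device *bdev)
	{
		int ret = 0;

		if (bdev)
			ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
		return ret;
	}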


* Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
  2010-04-15 10:42 ` Christoph Hellwig
@ 2010-04-15 13:34   ` Jan Kara
  2010-04-20  2:26   ` Anton Blanchard
  2010-04-20  2:30   ` Anton Blanchard
  2 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2010-04-15 13:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anton Blanchard, Jan Kara, Alexander Viro, Jens Axboe,
	Andrew Morton, linux-kernel

On Thu 15-04-10 12:42:24, Christoph Hellwig wrote:
> On Thu, Apr 15, 2010 at 02:40:39PM +1000, Anton Blanchard wrote:
> >  int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
> >  {
> > -	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
> > +	struct inode *bd_inode = filp->f_mapping->host;
> > +	struct block_device *bdev = I_BDEV(bd_inode);
> >  	int error;
> >  
> > +	mutex_unlock(&bd_inode->i_mutex);
> > +
> >  	error = sync_blockdev(bdev);
> 
> Actually you can just drop this call entirely.  sync_blockdev is an
> overcomplicated alias for filemap_write_and_wait on the block device
> inode, which is exactly what we did just before calling into ->fsync.
> 
> It might still be worth dropping i_mutex for the cache flush, though.
  Yeah, that's a good point... Anton, just remove sync_blockdev() from
blkdev_fsync completely please.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
  2010-04-15 10:42 ` Christoph Hellwig
  2010-04-15 13:34   ` Jan Kara
@ 2010-04-20  2:26   ` Anton Blanchard
  2010-04-20  2:30   ` Anton Blanchard
  2 siblings, 0 replies; 8+ messages in thread
From: Anton Blanchard @ 2010-04-20  2:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Alexander Viro, Jens Axboe, Andrew Morton, linux-kernel

 
Hi,

> Actually you can just drop this call entirely.  sync_blockdev is an
> overcomplicated alias for filemap_write_and_wait on the block device
> inode, which is exactly what we did just before calling into ->fsync.
> 
> It might still be worth dropping i_mutex for the cache flush, though.

Thanks for the feedback Jan + Christoph. New patch on the way.

Anton


* [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
  2010-04-15 10:42 ` Christoph Hellwig
  2010-04-15 13:34   ` Jan Kara
  2010-04-20  2:26   ` Anton Blanchard
@ 2010-04-20  2:30   ` Anton Blanchard
  2010-04-22 19:25     ` Jan Kara
  2 siblings, 1 reply; 8+ messages in thread
From: Anton Blanchard @ 2010-04-20  2:30 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig
  Cc: Alexander Viro, Jens Axboe, Andrew Morton, linux-kernel


We are seeing a large regression in database performance on recent kernels.
The database opens a block device with O_DIRECT|O_SYNC and a number of threads
write to different regions of the file at the same time.

A simple test case is below. I haven't defined DEVICE since getting it wrong
will destroy your data :) On a 3-disk LVM with a 64k chunk size we see about
17MB/sec and only a few threads in IO wait:

procs  -----io---- -system-- -----cpu------
 r  b     bi    bo   in   cs us sy id wa st
 0  3      0 16170  656 2259  0  0 86 14  0
 0  2      0 16704  695 2408  0  0 92  8  0
 0  2      0 17308  744 2653  0  0 86 14  0
 0  2      0 17933  759 2777  0  0 89 10  0

Most threads are blocking in vfs_fsync_range, which has:

        mutex_lock(&mapping->host->i_mutex);
        err = fop->fsync(file, dentry, datasync);
        if (!ret)
                ret = err;
        mutex_unlock(&mapping->host->i_mutex);

commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new helpers for
syncing after writing to O_SYNC file or IS_SYNC inode) offers some explanation
of what is going on:

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

Thanks Jan for such a good commit message! As well as dropping i_mutex,
Christoph suggests we should remove the call to sync_blockdev():

> sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
> the block device inode, which is exactly what we did just before calling
> into ->fsync

The patch below incorporates both suggestions. With it the testcase improves
from 17MB/sec to 68MB/sec:

procs  -----io---- -system-- -----cpu------
 r  b     bi    bo   in   cs us sy id wa st
 0  7      0 65536 1000 3878  0  0 70 30  0
 0 34      0 69632 1016 3921  0  1 46 53  0
 0 57      0 69632 1000 3921  0  0 55 45  0
 0 53      0 69640  754 4111  0  0 81 19  0


Testcase:

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NR_THREADS 64
#define BUFSIZE (64 * 1024)

#define DEVICE "/dev/mapper/XXXXXX"

#define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))

static int fd;

static void *doit(void *arg)
{
	unsigned long offset = (long)arg;
	char *b, *buf;

	b = malloc(BUFSIZE + 1024);
	if (!b) {
		perror("malloc");
		exit(1);
	}
	buf = (char *)ALIGN((unsigned long)b, 1024);
	memset(buf, 0, BUFSIZE);

	while (1)
		pwrite(fd, buf, BUFSIZE, offset);
}

int main(int argc, char *argv[])
{
	int flags = O_RDWR|O_DIRECT;
	int i;
	unsigned long offset = 0;

	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
		flags |= O_SYNC;

	fd = open(DEVICE, flags);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < NR_THREADS-1; i++) {
		pthread_t tid;
		pthread_create(&tid, NULL, doit, (void *)offset);
		offset += BUFSIZE;
	}
	doit((void *)offset);

	return 0;
}


Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2010-04-20 11:28:32.000000000 +1000
+++ linux-2.6/fs/block_dev.c	2010-04-20 11:36:46.000000000 +1000
@@ -406,16 +406,23 @@ static loff_t block_llseek(struct file *
  
 int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
 {
-	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
+	struct inode *bd_inode = filp->f_mapping->host;
+	struct block_device *bdev = I_BDEV(bd_inode);
 	int error;
 
-	error = sync_blockdev(bdev);
-	if (error)
-		return error;
-	
+	/*
+	 * There is no need to serialise calls to blkdev_issue_flush with
+	 * i_mutex and doing so causes performance issues with concurrent
+	 * O_SYNC writers to a block device.
+	 */
+	mutex_unlock(&bd_inode->i_mutex);
+
 	error = blkdev_issue_flush(bdev, NULL);
 	if (error == -EOPNOTSUPP)
 		error = 0;
+
+	mutex_lock(&bd_inode->i_mutex);
+
 	return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);


* Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
  2010-04-20  2:30   ` Anton Blanchard
@ 2010-04-22 19:25     ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2010-04-22 19:25 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Jan Kara, Christoph Hellwig, Alexander Viro, Jens Axboe,
	Andrew Morton, linux-kernel

On Tue 20-04-10 12:30:47, Anton Blanchard wrote:
<snip>
> Signed-off-by: Anton Blanchard <anton@samba.org>
  The patch looks good to me now.

Acked-by: Jan Kara <jack@suse.cz>

> Index: linux-2.6/fs/block_dev.c
> ===================================================================
> --- linux-2.6.orig/fs/block_dev.c	2010-04-20 11:28:32.000000000 +1000
> +++ linux-2.6/fs/block_dev.c	2010-04-20 11:36:46.000000000 +1000
> @@ -406,16 +406,23 @@ static loff_t block_llseek(struct file *
>   
>  int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
>  {
> -	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
> +	struct inode *bd_inode = filp->f_mapping->host;
> +	struct block_device *bdev = I_BDEV(bd_inode);
>  	int error;
>  
> -	error = sync_blockdev(bdev);
> -	if (error)
> -		return error;
> -	
> +	/*
> +	 * There is no need to serialise calls to blkdev_issue_flush with
> +	 * i_mutex and doing so causes performance issues with concurrent
> +	 * O_SYNC writers to a block device.
> +	 */
> +	mutex_unlock(&bd_inode->i_mutex);
> +
>  	error = blkdev_issue_flush(bdev, NULL);
>  	if (error == -EOPNOTSUPP)
>  		error = 0;
> +
> +	mutex_lock(&bd_inode->i_mutex);
> +
>  	return error;
>  }
>  EXPORT_SYMBOL(blkdev_fsync);
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


end of thread, other threads:[~2010-04-22 19:25 UTC | newest]

Thread overview: 8+ messages
2010-04-15  4:40 [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices Anton Blanchard
2010-04-15  8:47 ` Jan Kara
2010-04-15 10:04 ` Jens Axboe
2010-04-15 10:42 ` Christoph Hellwig
2010-04-15 13:34   ` Jan Kara
2010-04-20  2:26   ` Anton Blanchard
2010-04-20  2:30   ` Anton Blanchard
2010-04-22 19:25     ` Jan Kara
