From: Anton Blanchard <anton@samba.org>
To: Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
Jens Axboe <jens.axboe@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org
Subject: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices
Date: Tue, 20 Apr 2010 12:30:47 +1000 [thread overview]
Message-ID: <20100420023047.GN11751@kryten> (raw)
In-Reply-To: <20100415104224.GA27482@lst.de>
We are seeing a large regression in database performance on recent kernels.
The database opens a block device with O_DIRECT|O_SYNC and a number of threads
write to different regions of the file at the same time.
A simple test case is below. I haven't defined DEVICE since getting it wrong
will destroy your data :) On an 3 disk LVM with a 64k chunk size we see about
17MB/sec and only a few threads in IO wait:
procs -----io---- -system-- -----cpu------
r b bi bo in cs us sy id wa st
0 3 0 16170 656 2259 0 0 86 14 0
0 2 0 16704 695 2408 0 0 92 8 0
0 2 0 17308 744 2653 0 0 86 14 0
0 2 0 17933 759 2777 0 0 89 10 0
Most threads are blocking in vfs_fsync_range, which has:
mutex_lock(&mapping->host->i_mutex);
err = fop->fsync(file, dentry, datasync);
if (!ret)
ret = err;
mutex_unlock(&mapping->host->i_mutex);
commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new helpers for
syncing after writing to O_SYNC file or IS_SYNC inode) offers some explanation
of what is going on:
Use these new helpers for syncing from generic VFS functions. This makes
O_SYNC writes to block devices acquire i_mutex for syncing. If we really
care about this, we can make block_fsync() drop the i_mutex and reacquire
it before it returns.
Thanks Jan for such a good commit message! As well as dropping i_mutex,
Christoph suggests we should remove the call to sync_blockdev():
> sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
> the block device inode, which is exactly what we did just before calling
> into ->fsync
The patch below incorporates both suggestions. With it the testcase improves
from 17MB/s to 68M/sec:
procs -----io---- -system-- -----cpu------
r b bi bo in cs us sy id wa st
0 7 0 65536 1000 3878 0 0 70 30 0
0 34 0 69632 1016 3921 0 1 46 53 0
0 57 0 69632 1000 3921 0 0 55 45 0
0 53 0 69640 754 4111 0 0 81 19 0
Testcase:
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define NR_THREADS 64
#define BUFSIZE (64 * 1024)
#define DEVICE "/dev/mapper/XXXXXX"
#define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))
static int fd;
static void *doit(void *arg)
{
unsigned long offset = (long)arg;
char *b, *buf;
b = malloc(BUFSIZE + 1024);
buf = (char *)ALIGN((unsigned long)b, 1024);
memset(buf, 0, BUFSIZE);
while (1)
pwrite(fd, buf, BUFSIZE, offset);
}
int main(int argc, char *argv[])
{
int flags = O_RDWR|O_DIRECT;
int i;
unsigned long offset = 0;
if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
flags |= O_SYNC;
fd = open(DEVICE, flags);
if (fd == -1) {
perror("open");
exit(1);
}
for (i = 0; i < NR_THREADS-1; i++) {
pthread_t tid;
pthread_create(&tid, NULL, doit, (void *)offset);
offset += BUFSIZE;
}
doit((void *)offset);
return 0;
}
Signed-off-by: Anton Blanchard <anton@samba.org>
---
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c 2010-04-20 11:28:32.000000000 +1000
+++ linux-2.6/fs/block_dev.c 2010-04-20 11:36:46.000000000 +1000
@@ -406,16 +406,23 @@ static loff_t block_llseek(struct file *
int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
- struct block_device *bdev = I_BDEV(filp->f_mapping->host);
+ struct inode *bd_inode = filp->f_mapping->host;
+ struct block_device *bdev = I_BDEV(bd_inode);
int error;
- error = sync_blockdev(bdev);
- if (error)
- return error;
-
+ /*
+ * There is no need to serialise calls to blkdev_issue_flush with
+ * i_mutex and doing so causes performance issues with concurrent
+ * O_SYNC writers to a block device.
+ */
+ mutex_unlock(&bd_inode->i_mutex);
+
error = blkdev_issue_flush(bdev, NULL);
if (error == -EOPNOTSUPP)
error = 0;
+
+ mutex_lock(&bd_inode->i_mutex);
+
return error;
}
EXPORT_SYMBOL(blkdev_fsync);
next prev parent reply other threads:[~2010-04-20 2:35 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-04-15 4:40 [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices Anton Blanchard
2010-04-15 8:47 ` Jan Kara
2010-04-15 10:04 ` Jens Axboe
2010-04-15 10:42 ` Christoph Hellwig
2010-04-15 13:34 ` Jan Kara
2010-04-20 2:26 ` Anton Blanchard
2010-04-20 2:30 ` Anton Blanchard [this message]
2010-04-22 19:25 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100420023047.GN11751@kryten \
--to=anton@samba.org \
--cc=akpm@linux-foundation.org \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=jens.axboe@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.