* Re: Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage.
[not found] ` <20140108140307.GA588@infradead.org>
@ 2014-01-14 13:30 ` Sergey Meirovich
2014-01-15 22:07 ` Dave Chinner
0 siblings, 1 reply; 4+ messages in thread
From: Sergey Meirovich @ 2014-01-14 13:30 UTC (permalink / raw)
To: Christoph Hellwig, xfs
Cc: Gluk, Jan Kara, Linux Kernel Mailing List, linux-scsi
Hi Christoph,
On 8 January 2014 16:03, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 07, 2014 at 08:37:23PM +0200, Sergey Meirovich wrote:
>> Actually my initial report (14.67Mb/sec 3755.41 Requests/sec) was about ext4
>> However I have tried XFS as well. It was a bit slower than ext4 on all
>> occasions.
>
> I wasn't trying to say XFS fixes your problem, but that we could
> implement appending AIO writes in XFS fairly easily.
>
> To verify Jan's theory, can you try to preallocate the file to the full
> size and then run the benchmark by doing a:
>
> # fallocate -l <size> <filename>
>
> and then run it? If that's indeed the issue I'd be happy to implement
> the "real aio" append support for you as well.
>
I've resorted to writing a simple wrapper around io_submit() and ran it
against a preallocated file (precisely to avoid the append AIO scenario).
Random data was used to avoid XtremIO online deduplication, but the
results were still wonderful for 4k sequential AIO writes:
744.77 MB/s 190660.17 Req/sec
Clearly Linux lacks "real aio" append support for any FS. It seems you
think it would be relatively easy to implement for XFS on Linux?
If so, I will really appreciate your effort.
[root@dca-poc-gtsxdb3 mnt]# dd if=/dev/zero of=4k.data bs=4096 count=524288
524288+0 records in
524288+0 records out
2147483648 bytes (2.1 GB) copied, 5.75357 s, 373 MB/s
[root@dca-poc-gtsxdb3 mnt]# /root/4k
rnd generation (sec.): 195.63
io_submit() accepted 524288 IOs
io_getevents() returned 524288 events
time elapsed (sec.): 2.75
bandwidth (MiB/s): 744.77
IOps: 190660.17
[root@dca-poc-gtsxdb3 mnt]#
========================== io_submit() wrapper =============================
#define _GNU_SOURCE
#include <errno.h>
#include <libaio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define FNAME "4k.data"
#define IOSIZE 4096
#define REQUESTS 524288

/* gcc 4k.c -std=gnu99 -laio -o 4k */
int main(void) {
	io_context_t ctx;
	int ret;
	int flag = O_RDWR | O_DIRECT;
	int fd = open(FNAME, flag);
	struct timeval start, end;

	if (fd == -1) {
		int err = errno;	/* save errno before printf() can clobber it */
		printf("open(%s, %d) - failed!\nExiting.\n"
		       "If file doesn't exist please precreate it "
		       "with dd if=/dev/zero of=%s bs=%d count=%d\n",
		       FNAME, flag, FNAME, IOSIZE, REQUESTS);
		return err;
	}

	memset(&ctx, 0, sizeof(io_context_t));
	ret = io_setup(REQUESTS, &ctx);
	if (ret) {
		printf("io_setup(%d, &ctx) failed\n", REQUESTS);
		return -ret;	/* io_setup() returns a negative errno */
	}

	void *mem = NULL;
	if (posix_memalign(&mem, 4096, (size_t) IOSIZE * REQUESTS)) {
		printf("posix_memalign() failed\n");
		return 1;
	}

	/* Fill the buffers with random data to defeat online deduplication. */
	int urnd = open("/dev/urandom", O_RDONLY);
	if (urnd == -1) {
		perror("open(/dev/urandom)");
		return 1;
	}
	void *cur = mem;
	gettimeofday(&start, NULL);
	for (int i = 0; i < REQUESTS; i++, cur += IOSIZE) {
		if (read(urnd, cur, IOSIZE) != IOSIZE) {
			printf("short read from /dev/urandom\n");
			return 1;
		}
	}
	gettimeofday(&end, NULL);
	close(urnd);

	double elapsed = (end.tv_sec - start.tv_sec) +
			 ((end.tv_usec - start.tv_usec) / 1000000.0);
	printf("rnd generation (sec.):\t%.2f\n", elapsed);

	struct iocb *aio = calloc(REQUESTS, sizeof(struct iocb));
	struct iocb **lio = calloc(REQUESTS, sizeof(void *));
	struct io_event *event = calloc(REQUESTS, sizeof(struct io_event));

	cur = mem;
	for (int i = 0; i < REQUESTS; i++, cur += IOSIZE) {
		io_prep_pwrite(&aio[i], fd, cur, IOSIZE,
			       (long long) i * IOSIZE);
		lio[i] = &aio[i];
	}

	gettimeofday(&start, NULL);
	ret = io_submit(ctx, REQUESTS, lio);
	printf("io_submit() accepted %d IOs\n", ret);
	fdatasync(fd);
	ret = io_getevents(ctx, REQUESTS, REQUESTS, event, NULL);
	printf("io_getevents() returned %d events\n", ret);
	gettimeofday(&end, NULL);

	elapsed = (end.tv_sec - start.tv_sec) +
		  ((end.tv_usec - start.tv_usec) / 1000000.0);
	printf("time elapsed (sec.):\t%.2f\n", elapsed);
	printf("bandwidth (MiB/s):\t%.2f\n",
	       (double) (((long long) IOSIZE * REQUESTS) / (1024 * 1024))
	       / elapsed);
	printf("IOps:\t\t\t%.2f\n", (double) REQUESTS / elapsed);

	if (io_destroy(ctx)) {
		perror("io_destroy");
		return -1;
	}
	close(fd);
	free(mem);
	free(aio);
	free(lio);
	free(event);
	return 0;
}
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage.
2014-01-14 13:30 ` Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage Sergey Meirovich
@ 2014-01-15 22:07 ` Dave Chinner
2014-01-20 13:58 ` Christoph Hellwig
0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2014-01-15 22:07 UTC (permalink / raw)
To: Sergey Meirovich
Cc: Jan Kara, linux-scsi, Gluk, Linux Kernel Mailing List, xfs,
Christoph Hellwig
On Tue, Jan 14, 2014 at 03:30:11PM +0200, Sergey Meirovich wrote:
> Hi Christoph,
>
> On 8 January 2014 16:03, Christoph Hellwig <hch@infradead.org> wrote:
> > On Tue, Jan 07, 2014 at 08:37:23PM +0200, Sergey Meirovich wrote:
> >> Actually my initial report (14.67Mb/sec 3755.41 Requests/sec) was about ext4
> >> However I have tried XFS as well. It was a bit slower than ext4 on all
> >> occasions.
> >
> > I wasn't trying to say XFS fixes your problem, but that we could
> > implement appending AIO writes in XFS fairly easily.
> >
> > To verify Jan's theory, can you try to preallocate the file to the full
> > size and then run the benchmark by doing a:
> >
> > # fallocate -l <size> <filename>
> >
> > and then run it? If that's indeed the issue I'd be happy to implement
> > the "real aio" append support for you as well.
> >
>
> I've resorted to writing a simple wrapper around io_submit() and ran it
> against a preallocated file (precisely to avoid the append AIO scenario).
> Random data was used to avoid XtremIO online deduplication, but the
> results were still wonderful for 4k sequential AIO writes:
>
> 744.77 MB/s 190660.17 Req/sec
>
> Clearly Linux lacks "real aio" append support for any FS. It seems you
> think it would be relatively easy to implement for XFS on Linux?
> If so, I will really appreciate your effort.
Yes, I think it can be done relatively simply. We'd have to change
the code in xfs_file_aio_write_checks() to check whether EOF zeroing
was required rather than always taking an exclusive lock (for block
aligned IO at EOF sub-block zeroing isn't required), and then we'd
have to modify the direct IO code to set the is_async flag
appropriately. We'd probably need a new flag to tell the DIO
code that AIO beyond EOF is OK, but that isn't hard to do....
And for those that are wondering about the stale data exposure problem
documented in the aio code:
/*
* For file extending writes updating i_size before data
* writeouts complete can expose uninitialized blocks. So
* even for AIO, we need to wait for i/o to complete before
* returning in this case.
*/
This is fixed in XFS by removing a single if() check in
xfs_iomap_write_direct(). We already use unwritten extents for DIO
within EOF to avoid races that could expose uninitialised blocks, so
we just need to make that behaviour unconditional. Hence racing IO
on concurrent appending i_size updates will only ever see a hole
(zeros), an unwritten region (zeros) or the written data.
Christoph, are you going to get any time to look at doing this in
the next few days?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage.
2014-01-15 22:07 ` Dave Chinner
@ 2014-01-20 13:58 ` Christoph Hellwig
2014-01-20 22:18 ` Dave Chinner
0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2014-01-20 13:58 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, linux-scsi, Gluk, Linux Kernel Mailing List, xfs,
Christoph Hellwig, Sergey Meirovich
On Thu, Jan 16, 2014 at 09:07:21AM +1100, Dave Chinner wrote:
> Yes, I think it can be done relatively simply. We'd have to change
> the code in xfs_file_aio_write_checks() to check whether EOF zeroing
> was required rather than always taking an exclusive lock (for block
> aligned IO at EOF sub-block zeroing isn't required),
That's not even required for supporting aio appends, just a further
optimization for it.
> and then we'd
> have to modify the direct IO code to set the is_async flag
> appropriately. We'd probably need a new flag to tell the DIO
> code that AIO beyond EOF is OK, but that isn't hard to do....
Yep, need a flag to allow appending writes and then defer them.
> Christoph, are you going to get any time to look at doing this in
> the next few days?
I'll probably need at least another week before I can get to it. If you
wanna pick it up before then, feel free.
* Re: Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage.
2014-01-20 13:58 ` Christoph Hellwig
@ 2014-01-20 22:18 ` Dave Chinner
0 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2014-01-20 22:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jan Kara, linux-scsi, Gluk, Linux Kernel Mailing List, xfs,
Sergey Meirovich
On Mon, Jan 20, 2014 at 05:58:55AM -0800, Christoph Hellwig wrote:
> On Thu, Jan 16, 2014 at 09:07:21AM +1100, Dave Chinner wrote:
> > Yes, I think it can be done relatively simply. We'd have to change
> > the code in xfs_file_aio_write_checks() to check whether EOF zeroing
> > was required rather than always taking an exclusive lock (for block
> > aligned IO at EOF sub-block zeroing isn't required),
>
> That's not even required for supporting aio appends, just a further
> optimization for it.
Oh, right, I got an off-by-one when reading the code - the EOF
zeroing only occurs when the offset is beyond EOF, not at or beyond
EOF...
> > and then we'd
> > have to modify the direct IO code to set the is_async flag
> > appropriately. We'd probably need a new flag to tell the DIO
> > code that AIO beyond EOF is OK, but that isn't hard to do....
>
> Yep, need a flag to allow appending writes and then defer them.
>
> > Christoph, are you going to get any time to look at doing this in
> > the next few days?
>
> I'll probably need at least another week before I can get to it. If you
> wanna pick it up before then, feel free.
I'm probably not going to get to it before then, either, so check
back in a week?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2014-01-20 22:18 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CA+QCeVQRrqx=CrxyuAe7k0e0y4Nqo7x_8jtkuD99VM8L9Dxp+g@mail.gmail.com>
[not found] ` <20140106201032.GA13491@quack.suse.cz>
[not found] ` <20140107155830.GA28395@infradead.org>
[not found] ` <CA+QCeVRiwHU+C5utaLQXf_MpjoYMYEF4LKRyDPaqcd=H6n-RRw@mail.gmail.com>
[not found] ` <20140108140307.GA588@infradead.org>
2014-01-14 13:30 ` Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage Sergey Meirovich
2014-01-15 22:07 ` Dave Chinner
2014-01-20 13:58 ` Christoph Hellwig
2014-01-20 22:18 ` Dave Chinner