public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Corey Hickey <bugfood-ml@fatooh.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: read-modify-write occurring for direct I/O on RAID-5
Date: Sun, 6 Aug 2023 11:54:34 -0700	[thread overview]
Message-ID: <159bf7ff-fe2e-4aad-9866-e7d82e037986@fatooh.org> (raw)
In-Reply-To: <0f21f5eb-803f-c8d1-503a-bb0addeef01f@fatooh.org>

On 2023-08-04 18:44, Corey Hickey wrote:
>> Buffered writes won't guarantee you alignment, either, In fact, it's
>> much more likely to do weird stuff than direct IO. If your
>> filesystem is empty, then buffered writes can look *really good*,
>> but once the filesystem starts being used and has lots of
>> discontiguous free space or the system is busy enough that writeback
>> can't lock contiguous ranges of pages, writeback IO will look a
>> whole lot less pretty and you have little control over what
>> it does....
> 
> I'll keep that in mind. This filesystem doesn't get extensive writes
> except when restoring from backup. That is why I started looking at
> alignment, though--restoring from backup onto a new array with new
> disks was incurring lots of RMW, reads were very delayed, and the
> kernel was warning about hung tasks.

I tested and learned further. The root cause does not seem to be
excessive RMW--the root cause seems to be that the drives in my new
array do not handle the RMW nearly as well as the drives I had used
before.

Under different usage, I had previously noticed reduced performance on
"parallel" reads of the new drives as compared to my older drives,
though I didn't investigate further at the time.

I don't know a great way to test this--there's probably a better way
with fio or something. I wrote a small program to _roughly_ simulate the
non-sequential activity of a RAID-5 RMW. Mostly I just wanted to induce
lots of seeks over small intervals.

I see consistent results across different drives attached via different
cables to different SATA controllers. It's not just that I have one
malfunctioning component.

Differences in performance between runs are negligible, so I'm only
reporting one run of each test.

For 512 KB chunks, the Toshiba performs 11.5% worse.
----------------------------------------------------------------------
$ sudo ./rmw /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L "$((512 * 1024))" "$((2 * 1024))"
testing path: /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L  buffer_size: 524288  count: 2048
1073741824 bytes in 34.633402 seconds: 29.6 MiB/sec

$ sudo ./rmw /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG "$((512 * 1024))" "$((2 * 1024))"
testing path: /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG  buffer_size: 524288  count: 2048
1073741824 bytes in 39.147649 seconds: 26.2 MiB/sec
----------------------------------------------------------------------

For 128 KB chunks, the Toshiba performs 29.4% worse.
----------------------------------------------------------------------
$ sudo ./rmw /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L "$((128 * 1024))" "$((8 * 1024))"
testing path: /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L  buffer_size: 131072  count: 8192
1073741824 bytes in 100.036280 seconds: 10.2 MiB/sec

$ sudo ./rmw /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG "$((128 * 1024))" "$((8 * 1024))"
testing path: /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG  buffer_size: 131072  count: 8192
1073741824 bytes in 142.250680 seconds: 7.2 MiB/sec
----------------------------------------------------------------------

I don't know if the MD behavior tends toward better or worse as
compared to my synthetic testing, but there's definitely a difference
in performance between drives--apparently higher latency on the
Toshiba.

The RAID-5 write-back journal feature seems interesting, but I
hit a reproducible bug early on:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1043078

Making RAID-5 work well under these circumstances doesn't seem
worth it. I'm probably going to use RAID-10 instead.

The test program follows.

-Corey

----------------------------------------------------------------------
#define _GNU_SOURCE

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

long parse_positive_long(char *);
int rmw(char *, long, long);
size_t rmw_once(int, char *, long);

int main(int argc, char **argv) {
     char *path;
     long buffer_size, count;

     if (argc != 4) {
         printf("usage: %s path buffer_size count\n", argv[0]);
         printf("WARNING: this overwrites the target file/device\n");
     }

     path = argv[1];
     if (! (buffer_size = parse_positive_long(argv[2]))) {
         exit(1);
     }
     if (buffer_size % 512) {
         printf("buffer size must be a multiple of 512\n");
         exit(1);
     }

     if (! (count = parse_positive_long(argv[3]))) {
         exit(1);
     }

     printf("testing path: %s  buffer_size: %ld  count: %ld\n", path, buffer_size, count);
     if (! rmw(path, buffer_size, count)) {
         exit(1);
     }
     exit(0);
}

/* returns 0 on failure */
long parse_positive_long(char *str) {
     long ret;
     char *endptr;
     ret = strtol(str, &endptr, 0);
     if (str[0] != 0 && endptr[0] == 0) {
         if (ret <= 0) {
             printf("expected positive number instead of: %ld\n", ret);
             return 0;
         }
         return ret;
     } else {
         printf("error parsing number: %s\n", str);
         return 0;
     }
}

int rmw(char *path, long buffer_size, long count) {
     int fd, i;
     size_t bytes_handled, bytes_total;
     char *buffer;
     struct timespec start_time, end_time;
     double elapsed;

     buffer = aligned_alloc(512, buffer_size);
     if (! buffer) {
         printf("error allocating buffer: %s\n", strerror(errno));
     }

     fd = open(path, O_RDWR|O_DIRECT|O_SYNC);
     if (fd == -1) {
         printf("error opening %s: %s\n", path, strerror(errno));
         return 0;
     }

     bytes_total = 0;
     clock_gettime(CLOCK_MONOTONIC, &start_time);
     for (i = 0; i < count; ++i) {
         bytes_handled = rmw_once(fd, buffer, buffer_size);
         if (! bytes_handled) {
             return 0;
         }
         bytes_total += bytes_handled;
         if (bytes_handled != buffer_size) {
             printf("warning: encountered EOF\n");
             break;
         }
     }
     clock_gettime(CLOCK_MONOTONIC, &end_time);

     if (close(fd)) {
         printf("error closing %s: %s\n", path, strerror(errno));
     }

     free(buffer);

     if (! bytes_total) {
         return 0;
     }

     elapsed = (double)(end_time.tv_sec - start_time.tv_sec) +
               (double)(end_time.tv_nsec - start_time.tv_nsec) / 1.0e9;
     if (elapsed == 0.0) {
         printf("no time elapsed???\n");
         return 0;
     }

     printf("%ld bytes in %lf seconds: %.1lf MiB/sec\n", bytes_total, elapsed, (double)bytes_total/elapsed/1024/1024);
     return 1;
}


size_t rmw_once(int fd, char *buffer, long buffer_size) {
     size_t bytes_read, bytes_written;
     ssize_t last_size;
     long i;
     int attempts;

     /* ----- READ ----- */
     bytes_read = 0;
     attempts = 0;
     do {
         ++attempts;
         last_size = read(fd, buffer + bytes_read, buffer_size - bytes_read);
         bytes_read += last_size;
     } while (bytes_read < buffer_size && last_size > 0);

     if (attempts > 1) {
         printf("warning: took %d attempts to read into buffer\n", attempts);
     }

     if (last_size < 0) {
         printf("error reading: %s\n", strerror(errno));
         return 0;
     }

     /* ----- MODIFY ----- */
     for (i = 0; i < bytes_read; ++i) {
         /* do something... doesn't matter what */
         buffer[i] = ~buffer[i];
     }

     /* ----- WRITE ----- */
     if (lseek(fd, -bytes_read, SEEK_CUR) == -1) {
         printf("error seeking: %s\n", strerror(errno));
         return 0;
     }

     bytes_written = 0;
     attempts = 0;
     do {
         attempts += 1;
         last_size = write(fd, buffer, bytes_read - bytes_written);
         bytes_written += last_size;
     } while (bytes_written < bytes_read && last_size > 0);

     if (attempts > 1) {
         printf("warning: took %d attempts to write from buffer\n");
     }

     if (last_size < 0) {
         printf("error writing: %s\n", strerror(errno));
         return 0;
     }

     assert(bytes_read == bytes_written);

     return bytes_read;
}
----------------------------------------------------------------------

      parent reply	other threads:[~2023-08-06 18:54 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-04  5:44 read-modify-write occurring for direct I/O on RAID-5 Corey Hickey
2023-08-04  8:07 ` Dave Chinner
2023-08-04 19:26   ` Corey Hickey
2023-08-04 21:52     ` Dave Chinner
2023-08-05  1:44       ` Corey Hickey
2023-08-05 22:37         ` Dave Chinner
2023-08-06 18:21           ` Corey Hickey
2023-08-06 22:38             ` Dave Chinner
2023-08-06 18:54         ` Corey Hickey [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=159bf7ff-fe2e-4aad-9866-e7d82e037986@fatooh.org \
    --to=bugfood-ml@fatooh.org \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox