From: Chris Mason <chris.mason@fusionio.com>
To: Jeff Moyer <jmoyer@redhat.com>, Matthew Wilcox <willy@linux.intel.com>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>,
Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH 1/2] block: Add support for atomic writes
Date: Thu, 7 Nov 2013 08:52:20 -0500
Message-ID: <20131107135220.3802.91392@localhost.localdomain>
In-Reply-To: <x49eh6uhhh8.fsf@segfault.boston.devel.redhat.com>
Quoting Jeff Moyer (2013-11-05 12:43:31)
> Chris Mason <chris.mason@fusionio.com> writes:
>
> > This allows filesystems and O_DIRECT to send down a list of bios
> > flagged for atomic completion. If the hardware supports atomic
> > IO, it is given the whole list in a single make_request_fn
> > call.
> >
> > In order to limit corner cases, there are a few restrictions in the
> > current code:
> >
> > * Every bio in the list must be for the same queue
> >
> > * Every bio must be a simple write. No trims or reads may be mixed in
> >
> > A new blk_queue_set_atomic_write() sets the number of atomic segments a
> > given driver can accept.
> >
> > Any number greater than one is allowed, but the driver is expected to
> > do final checks on the bio list to make sure a given list fits inside
> > its atomic capabilities.
>
> Hi, Chris,
>
> This is great stuff. I have a couple of high level questions that I'm
> hoping you can answer, given that you're closer to the hardware than
> most. What constraints can we expect hardware to impose on atomic
> writes in terms of size and, um, contiguousness (is that a word)? How
> do we communicate those constraints to the application? (I'm not
> convinced a sysfs file is adequate.)
>
> For example, looking at NVMe, it appears that devices may guarantee that
> a set of /sequential/ logical blocks may be completed atomically, but I
> don't see a provision for disjoint regions. That spec also
> differentiates between power fail write atomicity and "normal" write
> atomicity.

Unfortunately, it's hard to say.  I think the fusionio cards are the
only shipping devices that support this, but I've definitely heard that
others plan to support it as well.  MariaDB/Percona already support the
atomics via fusionio-specific ioctls, and turning that into a real
O_ATOMIC is a priority so other hardware can just hop on the train.

This feature in general is pretty natural for the log-structured squirrels
they stuff inside flash, so I'd expect everyone to support it.  Matthew,
how do you feel about all of this?

With the fusionio drivers, we've recently increased the max atomic size.
It's basically 1MB, disjoint or contiguous doesn't matter.  We're powercut
safe at 1MB.

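To make the 1MB figure concrete, here is a minimal sketch of an
application-side pre-check.  It is not part of the patch set:
fits_atomic_limit() and ASSUMED_MAX_ATOMIC_BYTES are hypothetical names,
and the limit is hard-coded only because no interface for querying it is
described in this thread.  Since disjoint regions are fine, only the total
byte count is checked.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Assumption: the 1MB limit quoted above for the fusionio drivers. */
#define ASSUMED_MAX_ATOMIC_BYTES (1024 * 1024)

/*
 * Hypothetical helper: return 1 if the segments together fit within
 * max_bytes, 0 otherwise.  Segments may be disjoint; only the total
 * size counts.  The comparison is arranged so the sum cannot overflow.
 */
static int fits_atomic_limit(const struct iovec *iov, int iovcnt,
			     size_t max_bytes)
{
	size_t total = 0;

	assert(iovcnt >= 0);
	for (int i = 0; i < iovcnt; i++) {
		if (iov[i].iov_len > max_bytes - total)
			return 0;
		total += iov[i].iov_len;
	}
	return 1;
}
```

An application would call this before submitting the write and fall back
to its own journaling when it returns 0.  Other devices may impose
different or additional constraints (contiguity, alignment) that this
sketch does not model.
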
>
> Basically, I'd like to avoid requiring a trial and error programming
> model to determine what an application can expect to work (like we have
> with O_DIRECT right now).

I'm really interested in ideas on how to provide that.  But, with dm,
md, and a healthy assortment of flash vendors, I don't know how...

I've attached my current test program.  The basic idea is to fill
buffers (1MB in size) with a random pattern.  Each buffer has a
different random pattern.

You let it run for a while and then pull the plug.  After the box comes
back up, run the program again and it looks for consistent patterns
filling each 1MB-aligned region in the file.

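The detection idea can be tried in memory without any special hardware.
The sketch below is illustrative only: stamp_region() and region_is_torn()
are hypothetical names, not functions from the attached program, and a
torn write is simulated by restamping half of a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill every sizeof(seq)-sized word of buf with the same stamp. */
static void stamp_region(unsigned char *buf, size_t len, unsigned long seq)
{
	assert(len % sizeof(seq) == 0);
	for (size_t off = 0; off < len; off += sizeof(seq))
		memcpy(buf + off, &seq, sizeof(seq));
}

/* Return 1 if the region holds a mix of stamps (torn), 0 otherwise. */
static int region_is_torn(const unsigned char *buf, size_t len)
{
	unsigned long first, cur;

	assert(len >= sizeof(first) && len % sizeof(first) == 0);
	memcpy(&first, buf, sizeof(first));
	for (size_t off = sizeof(first); off < len; off += sizeof(cur)) {
		memcpy(&cur, buf + off, sizeof(cur));
		if (cur != first)
			return 1;
	}
	return 0;
}
```

Stamping a whole buffer and then restamping only its first half with a
different value leaves two patterns in one region, which is what a write
interrupted by a power cut would look like on disk.
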
Usage:

	gcc -Wall -o atomic-pattern atomic-pattern.c

	create a heavily fragmented file (exercise for the user, I need
	to make a mode for this)

	atomic-pattern file_name init

	<wait for init done printf to appear>
	<let it run for a while>
	<cut power to the box>
	<box comes back to life>

	atomic-pattern file_name check

In order to reliably find torn blocks without O_ATOMIC, I had to bump
the write size to 1MB and run 24 instances in parallel.
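
One way to start the parallel instances is a small launcher.  This is a
sketch under assumptions, not something from the original test setup:
spawn_instances() and wait_instances() are hypothetical helpers, and the
message does not say whether the 24 instances shared one file or each
used its own.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start `count` copies of argv[0]; return how many were started. */
static int spawn_instances(char *const argv[], int count, pid_t *pids)
{
	int started = 0;

	assert(count > 0);
	for (int i = 0; i < count; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			break;
		}
		if (pid == 0) {
			/* child: replace ourselves with the test program */
			execvp(argv[0], argv);
			perror("execvp");
			_exit(127);
		}
		pids[started++] = pid;
	}
	return started;
}

/* Wait for every child; return the number that exited with status 0. */
static int wait_instances(const pid_t *pids, int count)
{
	int ok = 0;

	for (int i = 0; i < count; i++) {
		int status;

		if (waitpid(pids[i], &status, 0) == pids[i] &&
		    WIFEXITED(status) && WEXITSTATUS(status) == 0)
			ok++;
	}
	return ok;
}
```

For the test above, argv would be something like
{ "./atomic-pattern", "file_name", NULL } with a count of 24.  The write
loop never exits on its own, so in practice the wait only returns when an
instance fails (or never, if the power is cut first).
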
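
/*
 * Copyright 2013 Fusion-io
 * GPLv2 or higher license
 */
#define _GNU_SOURCE	/* makes <fcntl.h> define O_DIRECT */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <errno.h>

#define FILE_SIZE (300 * 1024 * 1024)

/* fallback only; the value is architecture specific (this one is x86) */
#ifndef O_DIRECT
#define O_DIRECT 00040000
#endif

/* flag value from this patch series; needs a kernel carrying the patches */
#ifndef O_ATOMIC
#define O_ATOMIC 040000000
#endif
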
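/* stamp every sizeof(seq)-sized word in the buffer with the same value */
void set_block_headers(unsigned char *buf, int buffer_size, unsigned long seq)
{
	/* >= so the last word of the buffer is stamped too */
	while (buffer_size >= (int)sizeof(seq)) {
		memcpy(buf, &seq, sizeof(seq));
		buffer_size -= sizeof(seq);
		buf += sizeof(seq);
	}
}

/* return 0 if every word in the buffer matches the first one, -EIO if not */
int check_block_headers(unsigned char *buf, int buffer_size)
{
	unsigned long seq = 0;
	unsigned long check = 0;

	/* the first word is the reference; skip past it before comparing */
	memcpy(&seq, buf, sizeof(seq));
	buffer_size -= sizeof(seq);
	buf += sizeof(seq);

	while (buffer_size >= (int)sizeof(seq)) {
		memcpy(&check, buf, sizeof(check));
		if (check != seq) {
			fprintf(stderr, "check failed %lx %lx\n", seq, check);
			return -EIO;
		}
		buffer_size -= sizeof(seq);
		buf += sizeof(seq);
	}
	return 0;
}
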
int main(int ac, char **av)
{
	unsigned char *file_buf;
	struct timeval tv;
	off_t pos;
	int ret;
	int fd;
	int write_size = 1024 * 1024;
	char *filename = av[1];
	int check = 0;
	int init = 0;

	if (ac < 2) {
		fprintf(stderr, "usage: atomic-pattern filename [check | init]\n");
		exit(1);
	}

	if (ac > 2) {
		if (!strcmp(av[2], "check")) {
			check = 1;
			fprintf(stderr, "checking %s\n", filename);
		} else if (!strcmp(av[2], "init")) {
			init = 1;
			fprintf(stderr, "init %s\n", filename);
		} else {
			fprintf(stderr, "usage: atomic-pattern filename [check | init]\n");
			exit(1);
		}
	}

	/* posix_memalign() returns the error number; it does not set errno */
	ret = posix_memalign((void **)&file_buf, 4096, write_size);
	if (ret) {
		fprintf(stderr, "cannot allocate memory: %s\n", strerror(ret));
		exit(1);
	}

	/* seed rand() so parallel instances do not all use the same sequence */
	gettimeofday(&tv, NULL);
	srand(tv.tv_usec ^ getpid());

	/* the file must already exist; no O_CREAT here */
	fd = open(filename, O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	ret = fcntl(fd, F_SETFL, O_DIRECT | O_ATOMIC);
	if (ret) {
		perror("fcntl");
		exit(1);
	}

	pos = 0;
	if (!init && !check)
		goto runit;

	/* walk the file in write_size steps, either verifying or stamping */
	while (pos < FILE_SIZE) {
		if (check) {
			ret = pread(fd, file_buf, write_size, pos);
			if (ret != write_size) {
				perror("read");
				exit(1);
			}
			ret = check_block_headers(file_buf, write_size);
			if (ret) {
				fprintf(stderr, "Failed check on buffer %llu\n",
					(unsigned long long)pos);
				exit(1);
			}
		} else {
			set_block_headers(file_buf, write_size, rand());
			ret = pwrite(fd, file_buf, write_size, pos);
			if (ret != write_size) {
				perror("write");
				exit(1);
			}
		}
		pos += write_size;
	}
	if (check)
		exit(0);

	if (fsync(fd)) {
		perror("fsync");
		exit(1);
	}
runit:
	fprintf(stderr, "File init done, running random writes\n");
	/* overwrite random write_size-aligned regions until power is cut */
	while (1) {
		pos = rand() % FILE_SIZE;
		pos = pos / write_size;
		pos = pos * write_size;
		if (pos + write_size > FILE_SIZE)
			pos = 0;

		set_block_headers(file_buf, write_size, rand());
		ret = pwrite(fd, file_buf, write_size, pos);
		if (ret != write_size) {
			perror("write");
			exit(1);
		}
	}
	return 0;
}