[PATCH 0/1] Fix false-negative error reporting from fsync/fdatasync

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/1] Fix false-negative error reporting from fsync/fdatasync
@ 2015-09-15  9:46 Junichi Nomura
  2015-09-15  9:49 ` Test program: check if fsync() can detect I/O error (1/2) Junichi Nomura
  2015-09-15  9:54 ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Junichi Nomura
  0 siblings, 2 replies; 13+ messages in thread
From: Junichi Nomura @ 2015-09-15  9:46 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
  Cc: akpm@linux-foundation.org, andi@firstfloor.org,
	fengguang.wu@intel.com, tony.luck@intel.com,
	liwanp@linux.vnet.ibm.com, david@fromorbit.com, Tejun Heo,
	Naoya Horiguchi

Applications use fsync/fdatasync to make sure data is written back to
storage. It is expected that those system calls return error if
writeback has failed (e.g. disk/transport failure, memory failure..)

However if admins run a command such as sync or fsfreeze along side,
fsync/fdatasync may return success even if writeback has failed.
That could lead to data corruption.

This patch is a minimal fix for the problem.
-- 
Jun'ichi Nomura, NEC Corporation
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Test program: check if fsync() can detect I/O error (1/2)
  2015-09-15  9:46 [PATCH 0/1] Fix false-negative error reporting from fsync/fdatasync Junichi Nomura
@ 2015-09-15  9:49 ` Junichi Nomura
  2015-09-15  9:52   ` Test program: check if fsync() can detect I/O error (2/2) Junichi Nomura
  2015-09-15  9:54 ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Junichi Nomura
  1 sibling, 1 reply; 13+ messages in thread
From: Junichi Nomura @ 2015-09-15  9:49 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
  Cc: akpm@linux-foundation.org, andi@firstfloor.org,
	fengguang.wu@intel.com, tony.luck@intel.com,
	liwanp@linux.vnet.ibm.com, david@fromorbit.com, Tejun Heo,
	Naoya Horiguchi

> However if admins run a command such as sync or fsfreeze along side,
> fsync/fdatasync may return success even if writeback has failed.
> That could lead to data corruption.

For reproducing the problem, compile the attached C program (iogen.c)
and run with 'runtest.sh' script in the next mail:
  # gcc -o iogen iogen.c
  # bash ./runtest.sh

"iogen" does write(), fsync() and checks if on-disk data is same
as application's buffer after successful fsync.
"runtest.sh" injects failure for the file being written by "iogen".
(You need to enable CONFIG_HWPOISON_INJECT=m for the memory error
 injection to work.)

Without the patch, fsync returns success even though data is not on
disk.

  TEST: ext4 / ioerr / sync-command
  (iogen): inject
  (admin): Injecting I/O error
  (admin): Calling sync(2)
  (iogen): remove
  FAIL: corruption!
  DIFF 00000200: de de de de de de de de  | 00 00 00 00 00 00 00 00
  ...

With the patch, fsync detects error correctly.

  TEST: ext4 / ioerr / sync-command
  (iogen): inject
  (admin): Injecting I/O error
  (admin): Calling sync(2)
  INFO: App fsync: Input/output error
  (iogen): remove
  PASS: detected error right
  (iogen): end

-- cut here --
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

unsigned char *app_buf;
unsigned char *ondisk_data;
char *testfile;
size_t buflen;
int fd;
int rfd;

void dumpdiff(unsigned char *buf1, unsigned char *buf2, int len)
{
	int i, j;
	for(i = 0; i < len; i += 8) {
		if (!memcmp(&buf1[i], &buf2[i], 8))
			continue;
		fprintf(stderr, "DIFF %08x: ", i);
		for(j = 0; j < 8 && i + j < len; j++)
			fprintf(stderr, "%02x ", buf1[i]);
                fprintf(stderr, " | ");
		for(j = 0; j < 8 && i + j < len; j++)
			fprintf(stderr, "%02x ", buf2[i]);
                fprintf(stderr, "\n");
        }
}

void notify_injector(char *str)
{
        if (str)
                fprintf(stderr, "(iogen): %s\n", str);
        write(1, "\n", 2);
        sleep(1);
}

void open_fds(void)
{
	fd = open(testfile, O_RDWR);
	if (fd < 0) {
		perror("????: App open");
		exit(1);
	}
	rfd = open(testfile, O_RDONLY|O_DIRECT); /* for verification */
	if (rfd < 0) {
		perror("????: App open rfd");
		exit(1);
	}
}

void init_fd_status(void)
{
	int r;

	r = fsync(fd); /* flush and clean */
	if (r) {
		perror("????: App fsync0");
		exit(1);
	}
	r = pread(fd, app_buf, buflen, 0); /* stage onto cache */
	if (r != buflen) {
		perror("????: App read1");
		exit(1);
	}
}

void close_fds(void)
{
	int r;

	r = close(rfd);
	if (r)
		perror("????: App close read fd");
	r = close(fd);
	if (r)
		perror("????: App close write fd");
}

void write_data(int cnt)
{
	int r;

	memset(app_buf, cnt, buflen);
	r = pwrite(fd, app_buf, buflen, 0);
	if (r != buflen)
		perror("????: App write1");
}

int sync_data(void)
{
	int r, r2;

	r = fsync(fd);
	if (r)
		perror("INFO: App fsync");
	r2 = fsync(fd);
	if (r2)
		perror("????: App fsync (redo)");

	return r;
}

void read_data_direct(void)
{
	int r;

	r = pread(rfd, ondisk_data, buflen, 0);
	if (r != buflen) {
		perror("????: App direct read");
		r = pread(rfd, ondisk_data, buflen, 0);
		if (r != buflen)
			perror("FAIL: App direct read (retry)");
	}
}

void check_data(int fsync_result)
{
	int r;

	r = memcmp(app_buf, ondisk_data, buflen);
	if (r) {
		/* data is different */
		if (fsync_result == 0) {
			fprintf(stderr, "FAIL: corruption!\n");
			dumpdiff(app_buf, ondisk_data, buflen);
		} else
			fprintf(stderr, "PASS: detected error right\n");
	} else {
		/* data is same */
		if (fsync_result == 0)
			fprintf(stderr, "PASS: no error, data is ok\n");
		else
			fprintf(stderr, "????: sync failed, data is ok\n");
	}
}

void cleanup_data(int cnt)
{
	int r;

	/* write-fsync-read without error injection */
	memset(app_buf, cnt, buflen);
	r = pwrite(fd, app_buf, buflen, 0);
	if (r != buflen)
		perror("BUG : App write (w/o failure)");
	r = fsync(fd);
	if (r)
		perror("BUG : App fsync (w/o failure)");
	r = pread(rfd, ondisk_data, buflen, 0);
	if (r != buflen)
		perror("BUG : App read (w/o failure)");
	r = memcmp(app_buf, ondisk_data, buflen);
	if (r)
		fprintf(stderr, "BUG : memcmp failed\n");
}

/*
 * Do this:
 *   1) write
 *   2) inject failure
 *   3) fsync (should return error)
 *   4) remove failure
 *   5) check on-disk data (using direct read)
 */
void runtest(void)
{
	int fsync_result;

	notify_injector("start");
	open_fds();
	init_fd_status();

	write_data(0xde);
	notify_injector("inject");
	fsync_result = sync_data();
	notify_injector("remove");

	/* re-read and compare */
	read_data_direct();
	check_data(fsync_result);

	cleanup_data(0);
	close_fds();
	notify_injector("end");
}

int main(int argc, char **argv)
{
	testfile = argv[1];
	buflen = atoi(argv[2]);

	app_buf = malloc(buflen);
	if (!app_buf)
		exit(1);
	if (posix_memalign((void **) &ondisk_data, 4096, buflen))
		exit(1);

	runtest();

	free(app_buf);
	free(ondisk_data);

	return 0;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Test program: check if fsync() can detect I/O error (2/2)
  2015-09-15  9:49 ` Test program: check if fsync() can detect I/O error (1/2) Junichi Nomura
@ 2015-09-15  9:52   ` Junichi Nomura
  0 siblings, 0 replies; 13+ messages in thread
From: Junichi Nomura @ 2015-09-15  9:52 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
  Cc: akpm@linux-foundation.org, andi@firstfloor.org,
	fengguang.wu@intel.com, tony.luck@intel.com,
	liwanp@linux.vnet.ibm.com, david@fromorbit.com, Tejun Heo,
	Naoya Horiguchi

On 09/15/15 17:39, Jun'ichi Nomura wrote:
>> However if admins run a command such as sync or fsfreeze along side,
>> fsync/fdatasync may return success even if writeback has failed.
>> That could lead to data corruption.
>
> For reproducing the problem, compile the attached C program (iogen.c)
> and run with 'runtest.sh' script in the next mail:
>   # gcc -o iogen iogen.c
>   # bash ./runtest.sh

-- cut here --
#!/bin/bash

# preparation for hwpoison injection
export KERNEL_SRC=/lib/modules/$(uname -r)/build
[ -d "$KERNEL_SRC" ] || exit 1 # no kernel source given
make vm -C $KERNEL_SRC/tools || exit 1 # tools/vm failed to build
pagetypes=$KERNEL_SRC/tools/vm/page-types
[ -x $pagetypes ] || exit 1
modprobe hwpoison-inject

# -------------------------------------------------------------------
fstype=ext4

# file name of loopback image
loopfile=test.img
imgsize=16M
lodev=/dev/loop0

# filesystem to use
mkfs=mkfs.$fstype

# device-mapper map name
testmap=testmap

# file name to store device-mapper table data
mapok=testmap.ok
maperr=testmap.err

# mount point and file name used for testing
testdir=/mnt/test
testfile=$testdir/x

# test file size
filesize=16384

# -------------------------------------------------------------------
# Set up
#

endtest() {
	sleep 3
	umount $testdir
	dmsetup remove $testmap
	losetup -d $lodev
	exit
}

# Create loopback device for testing
dd if=/dev/zero of=$loopfile bs=$imgsize count=1
losetup $lodev $loopfile || endtest
if [ ! -b $lodev ]; then
	endtest
fi


# Layer DM device for error injection
echo "0 $(blockdev --getsz $lodev) linear $lodev 0" | dmsetup create $testmap
dmsetup table $testmap > $mapok || endtest
if [ ! -b /dev/mapper/$testmap ]; then
	endtest
fi

# Mount and create target file
mkdir -p $testdir
$mkfs /dev/mapper/$testmap
mount /dev/mapper/$testmap $testdir || endtest
dd if=/dev/zero of=$testfile bs=$filesize count=1 oflag=direct || endtest

# Find physical location of the target file
find_location() {
	# pick up physical block number of file offset 0
	filefrag -v $1 | \
		awk '$1 == "0" {print $3} $1 == "0:" {print $4}' | \
		sed 's/\.//g'
}
filefrag -v $testfile
block=$(find_location $testfile)
if [ -z "$block" ]; then
	endtest
fi
blocksize=$(stat -c %s -f $testfile)
secsize=512
sector=$((block * blocksize / secsize + 1))

# Create error mapping: inject error at $sector
next=$((sector + 1))
total=$(blockdev --getsz $lodev)
remainder=$((total - next))
cat <<EOF > $maperr
0 $sector linear $lodev 0
$sector 1 error
$next $remainder linear $lodev $next
EOF

map_replace() {
	cat $1 | dmsetup load $testmap
	dmsetup suspend --nolockfs $testmap
	dmsetup resume $testmap
}

inject_memory_error() {
	local pfn=0x$($pagetypes -f $testfile -Nl | grep ^1$'\t' | cut -f2)
	[ "$pfn" = 0x ] && return 1 # target pfn not found
	$pagetypes -a $pfn -X -N
}

# -------------------------------------------------------------------
# Test
#

msg() {
	echo $* > /dev/kmsg
	echo $*
}

injector_ioerr_nop() {
	# start
	read x
	msg "TEST: $fstype / ioerr / (no admin action)"

	# inject
	read x
	msg "(admin): Injecting I/O error"
	map_replace $maperr
	msg "(admin): Do nothing"

	# remove
	read x
	map_replace $mapok

	# end
	read x
	umount /dev/mapper/$testmap || endtest
	mount /dev/mapper/$testmap $testdir || endtest
}

injector_ioerr_synccmd() {
	# start
	read x
	msg "TEST: $fstype / ioerr / sync-command"

	# inject
	read x
	msg "(admin): Injecting I/O error"
	map_replace $maperr
	msg "(admin): Calling sync(2)"
	sync

	# remove
	read x
	map_replace $mapok

	# end
	read x
	umount /dev/mapper/$testmap || endtest
	mount /dev/mapper/$testmap $testdir || endtest
}

injector_hwpoison_synccmd() {
	# start
	read x
	msg "TEST: $fstype / memory-error / sync-command"

	# inject
	read x
	msg "(admin): Injecting memory error"
	inject_memory_error
	msg "(admin): Calling sync(2)"
	sync

	# remove
	read x

	# end
	read x
	umount /dev/mapper/$testmap || endtest
	mount /dev/mapper/$testmap $testdir || endtest
}

msg '============'
./iogen $testfile $filesize | injector_ioerr_nop
msg '============'
./iogen $testfile $filesize | injector_ioerr_synccmd
msg '============'
./iogen $testfile $filesize | injector_hwpoison_synccmd

# -------------------------------------------------------------------
# Clean up
#
endtest

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/1] fs: global sync to not clear error status of individual inodes
  2015-09-15  9:46 [PATCH 0/1] Fix false-negative error reporting from fsync/fdatasync Junichi Nomura
  2015-09-15  9:49 ` Test program: check if fsync() can detect I/O error (1/2) Junichi Nomura
@ 2015-09-15  9:54 ` Junichi Nomura
  2015-09-15 14:37   ` Andi Kleen
  2015-09-15 15:20   ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Tejun Heo
  1 sibling, 2 replies; 13+ messages in thread
From: Junichi Nomura @ 2015-09-15  9:54 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
  Cc: akpm@linux-foundation.org, andi@firstfloor.org,
	fengguang.wu@intel.com, tony.luck@intel.com,
	liwanp@linux.vnet.ibm.com, david@fromorbit.com, Tejun Heo,
	Naoya Horiguchi

filemap_fdatawait() is a function to wait for on-going writeback
to complete but also consume and clear error status of the mapping
set during writeback.
The latter functionality is critical for applications to detect
writeback error with system calls like fsync(2)/fdatasync(2).

However filemap_fdatawait() is also used by sync(2) or FIFREEZE
ioctl, which don't check error status of individual mappings.

As a result, fsync() may not be able to detect writeback error
if events happen in the following order:

   Application                    System admin
   ----------------------------------------------------------
   write data on page cache
                                  Run sync command
                                  writeback completes with error
                                  filemap_fdatawait() clears error
   fsync returns success
   (but the data is not on disk)

This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 fs/fs-writeback.c  |  8 +++++++-
 fs/sync.c          |  2 +-
 include/linux/fs.h |  1 +
 mm/filemap.c       | 35 ++++++++++++++++++++++++++++++++---
 4 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..df52aad 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2121,7 +2121,13 @@ static void wait_sb_inodes(struct super_block *sb)
 		iput(old_inode);
 		old_inode = inode;
 
-		filemap_fdatawait(mapping);
+		/*
+		 * Wait for on-going writeback to complete
+		 * but not consume error status on this mapping.
+		 * Otherwise application may fail to catch writeback error
+		 * using fsync(2).
+		 */
+		filemap_fdatawait_keep_errors(mapping);
 
 		cond_resched();
 
diff --git a/fs/sync.c b/fs/sync.c
index fbc98ee..e2b7a77 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -86,7 +86,7 @@ static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
 
 static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
 {
-	filemap_fdatawait(bdev->bd_inode->i_mapping);
+	filemap_fdatawait_keep_errors(bdev->bd_inode->i_mapping);
 }
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..9355f37 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2422,6 +2422,7 @@ extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
+extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
diff --git a/mm/filemap.c b/mm/filemap.c
index 72940fb..059050a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -340,14 +340,14 @@ EXPORT_SYMBOL(filemap_flush);
  * Walk the list of under-writeback pages of the given address space
  * in the given range and wait for all of them.
  */
-int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
-			    loff_t end_byte)
+static int __filemap_fdatawait_range(struct address_space *mapping,
+				     loff_t start_byte, loff_t end_byte)
 {
 	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
 	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
 	struct pagevec pvec;
 	int nr_pages;
-	int ret2, ret = 0;
+	int ret = 0;
 
 	if (end_byte < start_byte)
 		goto out;
@@ -374,6 +374,15 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 		cond_resched();
 	}
 out:
+	return ret;
+}
+
+int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
+			    loff_t end_byte)
+{
+	int ret, ret2;
+
+	ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
 	ret2 = filemap_check_errors(mapping);
 	if (!ret)
 		ret = ret2;
@@ -382,6 +391,26 @@ out:
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
+/*
+ * As filemap_check_errors() consumes and clears error status of mapping,
+ * filemap_fdatawait() should be used only when the caller is responsible
+ * for handling the error.
+ *
+ * Use filemap_fdatawait_keep_errors() if callers just want to wait for
+ * witeback and don't handle errors themselves.
+ * Expected call sites are system-wide / filesystem-wide data flushers:
+ * e.g. sync(2), fsfreeze(8)
+ */
+void filemap_fdatawait_keep_errors(struct address_space *mapping)
+{
+	loff_t i_size = i_size_read(mapping->host);
+
+	if (i_size == 0)
+		return;
+
+	__filemap_fdatawait_range(mapping, 0, i_size - 1);
+}
+
 /**
  * filemap_fdatawait - wait for all under-writeback pages to complete
  * @mapping: address space structure to wait for
-- 
2.1.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] fs: global sync to not clear error status of individual inodes
  2015-09-15  9:54 ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Junichi Nomura
@ 2015-09-15 14:37   ` Andi Kleen
  2015-09-15 22:02     ` Andrew Morton
  2015-09-15 15:20   ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Tejun Heo
  1 sibling, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2015-09-15 14:37 UTC (permalink / raw)
  To: Junichi Nomura
  Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	andi@firstfloor.org, fengguang.wu@intel.com, tony.luck@intel.com,
	liwanp@linux.vnet.ibm.com, david@fromorbit.com, Tejun Heo,
	Naoya Horiguchi

> This patch adds filemap_fdatawait_keep_errors() for call sites where
> writeback error is not handled so that they don't clear error status.

Patch looks good to me. 

Acked-by: Andi Kleen <ak@linux.intel.com>

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] fs: global sync to not clear error status of individual inodes
  2015-09-15 14:37   ` Andi Kleen
@ 2015-09-15 22:02     ` Andrew Morton
  2015-09-16  0:45       ` Junichi Nomura
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2015-09-15 22:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Junichi Nomura, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	fengguang.wu@intel.com, tony.luck@intel.com,
	liwanp@linux.vnet.ibm.com, david@fromorbit.com, Tejun Heo,
	Naoya Horiguchi

On Tue, 15 Sep 2015 16:37:24 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > This patch adds filemap_fdatawait_keep_errors() for call sites where
> > writeback error is not handled so that they don't clear error status.
> 
> Patch looks good to me. 
> 

Me too.

It would be nice to capture the test case(s) somewhere permanent. 
Possibly in tools/testing/selftests, but selftests is more for peculiar
linux-specific things.  LTP or xfstests would be a better place.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] fs: global sync to not clear error status of individual inodes
  2015-09-15 22:02     ` Andrew Morton
@ 2015-09-16  0:45       ` Junichi Nomura
  2015-09-16 10:45         ` xfstests: test data-writeback error detection with fsync Junichi Nomura
  0 siblings, 1 reply; 13+ messages in thread
From: Junichi Nomura @ 2015-09-16  0:45 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen
  Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, fengguang.wu@intel.com, tony.luck@intel.com,
	david@fromorbit.com, Tejun Heo, Naoya Horiguchi

On 09/16/15 07:02, Andrew Morton wrote:
> It would be nice to capture the test case(s) somewhere permanent. 
> Possibly in tools/testing/selftests, but selftests is more for peculiar
> linux-specific things.  LTP or xfstests would be a better place.

I'll check xfstests if I can adapt the test case for its framework.

-- 
Jun'ichi Nomura, NEC Corporation

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* xfstests: test data-writeback error detection with fsync
  2015-09-16  0:45       ` Junichi Nomura
@ 2015-09-16 10:45         ` Junichi Nomura
  0 siblings, 0 replies; 13+ messages in thread
From: Junichi Nomura @ 2015-09-16 10:45 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
  Cc: Andrew Morton, Andi Kleen, fengguang.wu@intel.com,
	tony.luck@intel.com, david@fromorbit.com, Tejun Heo,
	Naoya Horiguchi

On 09/16/15 07:02, Andrew Morton wrote:
> It would be nice to capture the test case(s) somewhere permanent. 
> Possibly in tools/testing/selftests, but selftests is more for
> peculiar
> linux-specific things.  LTP or xfstests would be a better place.

This is a xfstests version of my test case.
(Device failure portion only. Memory failure will need additional code.)

I used '9999' in this proposal temporarily but if I should other number,
I'll fix that.

---
 common/dm_error       |   96 ++++++++++++++++++++++++++++++++++++++++++
 common/rc             |   16 +++++++
 tests/shared/9999     |  113 ++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/shared/9999.out |   18 +++++++
 tests/shared/group    |    1 
 5 files changed, 244 insertions(+)

diff --git a/common/dm_error b/common/dm_error
new file mode 100644
index 0000000..f6c926f
--- /dev/null
+++ b/common/dm_error
@@ -0,0 +1,96 @@
+##/bin/bash
+#
+# Copyright (c) 2015 NEC Corporation.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#
+# common functions for setting up and tearing down a dm error device
+
+# device-mapper map name
+DM_ERR_MAPNAME=xfstests-dm-error
+
+# temporary file names for storing device-mapper table data
+DM_ERR_NORMAL_MAP=$RESULT_DIR/$DM_ERR_MAPNAME.ok
+DM_ERR_ERROR_MAP=$RESULT_DIR/$DM_ERR_MAPNAME.err
+
+_init_dm_error() {
+	# Layer DM device for error injection
+	echo "0 $(blockdev --getsz $SCRATCH_DEV) linear $SCRATCH_DEV 0" | \
+		$DMSETUP_PROG create $DM_ERR_MAPNAME || \
+		_fatal "failed to create dm linear device"
+	$DMSETUP_PROG table $DM_ERR_MAPNAME > $DM_ERR_NORMAL_MAP
+}
+
+_prepare_dm_error_table_for_file() {
+	local file=$1
+	local offset=$2
+	local len=$3
+
+	# Find physical location of the target file
+	find_location() {
+		# pick up physical block number of file offset 0
+		$FILEFRAG_PROG -v $1 | \
+			awk '$1 == "0" {print $3} $1 == "0:" {print $4}' | \
+			sed 's/\.//g'
+	}
+	local block=$(find_location $file)
+	if [ -z "$block" ]; then
+		_fatal "failed to find physical block for $file"
+	fi
+	local blocksize=$(stat -c %s -f $file)
+	local secsize=512
+	local sector=$((block * blocksize / secsize + offset))
+
+	# Create error mapping: inject error at $sector
+	local next=$((sector + len))
+	local total=$(blockdev --getsz $SCRATCH_DEV)
+	local remainder=$((total - next))
+
+	# Generate error mapping
+	echo "0 $sector linear $SCRATCH_DEV 0" > $DM_ERR_ERROR_MAP
+	echo "$sector $len error" >> $DM_ERR_ERROR_MAP
+	echo "$next $remainder linear $SCRATCH_DEV $next" >> $DM_ERR_ERROR_MAP
+}
+
+_load_dm_error_table() {
+	cat $DM_ERR_ERROR_MAP | $DMSETUP_PROG load $DM_ERR_MAPNAME || \
+		_fatal "failed to load dm error table"
+	$DMSETUP_PROG suspend --nolockfs $DM_ERR_MAPNAME || \
+		_fatal "failed to suspend dm device"
+	$DMSETUP_PROG resume $DM_ERR_MAPNAME || \
+		_fatal "failed to suspend dm device"
+}
+_unload_dm_error_table() {
+	cat $DM_ERR_NORMAL_MAP | $DMSETUP_PROG load $DM_ERR_MAPNAME || \
+		_fatal "failed to re-load normal dm table"
+	$DMSETUP_PROG suspend --nolockfs $DM_ERR_MAPNAME || \
+		_fatal "failed to suspend dm device"
+	$DMSETUP_PROG resume $DM_ERR_MAPNAME || \
+		_fatal "failed to suspend dm device"
+}
+
+_mount_dm_error() {
+	mount -t $FSTYP $MOUNT_OPTIONS /dev/mapper/$DM_ERR_MAPNAME $SCRATCH_MNT
+}
+
+_unmount_dm_error() {
+	$UMOUNT_PROG $SCRATCH_MNT
+}
+
+_cleanup_dm_error() {
+	_unmount_dm_error
+	$DMSETUP_PROG remove $DM_ERR_MAPNAME
+	rm -f $DM_ERR_NORMAL_MAP $DM_ERR_ERROR_MAP
+}
diff --git a/common/rc b/common/rc
index 70d2fa8..a4478f6 100644
--- a/common/rc
+++ b/common/rc
@@ -1337,6 +1337,22 @@ _require_sane_bdev_flush()
 	fi
 }
 
+# this test requires the device mapper error target
+#
+_require_dm_error()
+{
+	_require_block_device $SCRATCH_DEV
+	_require_command "$DMSETUP_PROG" dmsetup
+	# Use filefrag to find location to inject failure
+	_require_command "$FILEFRAG_PROG" filefrag
+
+	modprobe dm-mod >/dev/null 2>&1
+	$DMSETUP_PROG targets | grep error >/dev/null 2>&1
+	if [ $? -ne 0 ]; then
+		_notrun "This test requires dm error support"
+	fi
+}
+
 # this test requires the device mapper flakey target
 #
 _require_dm_flakey()
diff --git a/tests/shared/9999 b/tests/shared/9999
new file mode 100755
index 0000000..9e66f77
--- /dev/null
+++ b/tests/shared/9999
@@ -0,0 +1,113 @@
+#!/bin/bash
+# FS QA Test No. 9999
+#
+# Overwrite blocks on buffer, inject sector error using device-mapper,
+# run sync, and then fsync the file.
+# Verify if fsync could detect the error.
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2015 NEC Corporation. All Rights Reserved.
+# Author: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1        # failure is the default!
+
+_cleanup() {
+	_cleanup_dm_error
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dm_error
+
+# real QA test starts here
+_supported_fs ext4 ext3 ext2 xfs
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_error
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+
+# test file name, size and filling patterns
+testfile=$SCRATCH_MNT/x
+filesize=16384
+pat1=0x11
+pat2=0xaa
+
+_init_dm_error
+_mount_dm_error
+
+echo "Create testfile"
+$XFS_IO_PROG -f \
+	-c "pwrite -S $pat1 0 $filesize" -c "fsync" \
+	$testfile | _filter_xfs_io
+$XFS_IO_PROG -f \
+	-c "pwrite -S $pat2 0 $filesize" -c "fsync" \
+	$testfile.expected | _filter_xfs_io
+_prepare_dm_error_table_for_file $testfile 0 1
+
+echo "Buffered write on the file"
+$XFS_IO_PROG -c "pwrite -S $pat2 0 $filesize" $testfile | _filter_xfs_io
+
+echo "Inject device error"
+_load_dm_error_table
+
+# Running 'sync' while written data is on buffer. This should start
+# writeback and wait for completion.  Beause of the injected failure,
+# the file is marked with AS_EIO.
+echo "Execute sync command"
+sync
+
+# fsync() should get error return.
+echo "Do fsync on the file (should fail)"
+$XFS_IO_PROG -c "fsync" $testfile | _filter_xfs_io
+
+echo "Remove injected device error"
+_unload_dm_error_table
+_unmount_dm_error
+_mount_dm_error
+
+cmp $testfile $testfile.expected >> $seqres.full 2>&1
+if [ $? -ne 0 ]; then
+	echo "Data was not written to disk"
+fi
+echo "Expected contents of the file if error was not injected:" >> $seqres.full
+od -t x1 $testfile.expected >> $seqres.full
+echo "Actual contents of the file:" >> $seqres.full
+od -t x1 $testfile >> $seqres.full
+
+echo "Retry write and fsync"
+$XFS_IO_PROG -f \
+	-c "pwrite -S $pat2 0 $filesize" -c "fsync" \
+	$testfile | _filter_xfs_io
+
+cmp $testfile $testfile.expected
+echo "Contents of the file after retry:" >> $seqres.full
+od -t x1 $testfile >> $seqres.full
+
+status=0
+exit
diff --git a/tests/shared/9999.out b/tests/shared/9999.out
new file mode 100644
index 0000000..236e913
--- /dev/null
+++ b/tests/shared/9999.out
@@ -0,0 +1,18 @@
+QA output created by 9999
+Create testfile
+wrote 16384/16384 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Buffered write on the file
+wrote 16384/16384 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Inject device error
+Execute sync command
+Do fsync on the file (should fail)
+fsync: Input/output error
+Remove injected device error
+Data was not written to disk
+Retry write and fsync
+wrote 16384/16384 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/shared/group b/tests/shared/group
index 00d42c8..f196b71 100644
--- a/tests/shared/group
+++ b/tests/shared/group
@@ -11,3 +11,4 @@
 272 auto enospc rw
 289 auto quick
 298 auto trim
+9999 auto quick data
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] fs: global sync to not clear error status of individual inodes
  2015-09-15  9:54 ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Junichi Nomura
  2015-09-15 14:37   ` Andi Kleen
@ 2015-09-15 15:20   ` Tejun Heo
  2015-09-15 16:52     ` Andi Kleen
  2015-09-16  0:59     ` Junichi Nomura
  1 sibling, 2 replies; 13+ messages in thread
From: Tejun Heo @ 2015-09-15 15:20 UTC (permalink / raw)
  To: Junichi Nomura
  Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	andi@firstfloor.org, fengguang.wu@intel.com, tony.luck@intel.com,
	liwanp@linux.vnet.ibm.com, david@fromorbit.com, Naoya Horiguchi

Hello, Junichi.

On Tue, Sep 15, 2015 at 09:54:13AM +0000, Junichi Nomura wrote:
> filemap_fdatawait() is a function to wait for on-going writeback
> to complete but also consume and clear error status of the mapping
> set during writeback.
> The latter functionality is critical for applications to detect
> writeback error with system calls like fsync(2)/fdatasync(2).
> 
> However filemap_fdatawait() is also used by sync(2) or FIFREEZE
> ioctl, which don't check error status of individual mappings.
> 
> As a result, fsync() may not be able to detect writeback error
> if events happen in the following order:
> 
>    Application                    System admin
>    ----------------------------------------------------------
>    write data on page cache
>                                   Run sync command
>                                   writeback completes with error
>                                   filemap_fdatawait() clears error
>    fsync returns success
>    (but the data is not on disk)
> 
> This patch adds filemap_fdatawait_keep_errors() for call sites where
> writeback error is not handled so that they don't clear error status.

Is this an actual problem?  Write errors usually indicate that the
underlying device is completely hosed and the kernel tends to make a
lot of noise throughout the different layers and it often pretty
quickly leads to failures of metadata IOs which which results in
damage-control actions like RO remounts, so in most cases the
specifics of failure handling don't end up mattering all that much.

That said, no reason to not improve upon it.

> @@ -2121,7 +2121,13 @@ static void wait_sb_inodes(struct super_block *sb)
>  		iput(old_inode);
>  		old_inode = inode;
>  
> -		filemap_fdatawait(mapping);
> +		/*
> +		 * Wait for on-going writeback to complete
> +		 * but not consume error status on this mapping.
                       ^don't

> +		 * Otherwise application may fail to catch writeback error

                   mapping; otherwise,

> +		 * using fsync(2).
> +		 */

Can you please re-flow the comment so that it's filling up to, say, 72
or 76 or whatever column?

> +		filemap_fdatawait_keep_errors(mapping);
>  
>  		cond_resched();
>  
> diff --git a/fs/sync.c b/fs/sync.c
> index fbc98ee..e2b7a77 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -86,7 +86,7 @@ static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
>  
>  static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
>  {
> -	filemap_fdatawait(bdev->bd_inode->i_mapping);
> +	filemap_fdatawait_keep_errors(bdev->bd_inode->i_mapping);

Maybe it'd be better to describe what's going on in detail in the
function comment of filemat_fdatawait_keep_errors() and refer to that
from its callers?

> @@ -382,6 +391,26 @@ out:
>  }
>  EXPORT_SYMBOL(filemap_fdatawait_range);
>  
> +/*
> + * As filemap_check_errors() consumes and clears error status of mapping,
> + * filemap_fdatawait() should be used only when the caller is responsible
> + * for handling the error.

Please make this a proper function comment.

> + *
> + * Use filemap_fdatawait_keep_errors() if callers just want to wait for
> + * witeback and don't handle errors themselves.
      writeback

> + * Expected call sites are system-wide / filesystem-wide data flushers:
> + * e.g. sync(2), fsfreeze(8)

Ditto, please either break up paragraphs or reflow to column
consistently.

> + */
> +void filemap_fdatawait_keep_errors(struct address_space *mapping)
> +{
> +	loff_t i_size = i_size_read(mapping->host);
> +
> +	if (i_size == 0)
> +		return;
> +
> +	__filemap_fdatawait_range(mapping, 0, i_size - 1);
> +}

Generally looks good to me.  We may end up deferring writeback error
from way earlier to later fsync callers but given that sync(2) doesn't
even have a return value, it looks like that's the best we can do.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] fs: global sync to not clear error status of individual inodes
  2015-09-15 15:20   ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Tejun Heo
@ 2015-09-15 16:52     ` Andi Kleen
  2015-09-16  0:59     ` Junichi Nomura
  1 sibling, 0 replies; 13+ messages in thread
From: Andi Kleen @ 2015-09-15 16:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Junichi Nomura, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, andi@firstfloor.org,
	fengguang.wu@intel.com, tony.luck@intel.com,
	liwanp@linux.vnet.ibm.com, david@fromorbit.com, Naoya Horiguchi

> Is this an actual problem?  Write errors usually indicate that the

There are transaction systems that need to track errors at the level
of the individual IO.  So yes we should guarantee that if a particular
sync failed an error is always returned.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] fs: global sync to not clear error status of individual inodes
  2015-09-15 15:20   ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Tejun Heo
  2015-09-15 16:52     ` Andi Kleen
@ 2015-09-16  0:59     ` Junichi Nomura
  2015-09-16  8:39       ` [PATCH v2] " Junichi Nomura
  1 sibling, 1 reply; 13+ messages in thread
From: Junichi Nomura @ 2015-09-16  0:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	andi@firstfloor.org, fengguang.wu@intel.com, tony.luck@intel.com,
	david@fromorbit.com, Naoya Horiguchi

On 09/16/15 00:20, Tejun Heo wrote:
>> @@ -2121,7 +2121,13 @@ static void wait_sb_inodes(struct super_block *sb)
>>  		iput(old_inode);
>>  		old_inode = inode;
>>  
>> -		filemap_fdatawait(mapping);
>> +		/*
>> +		 * Wait for on-going writeback to complete
>> +		 * but not consume error status on this mapping.
>                        ^don't
> 
>> +		 * Otherwise application may fail to catch writeback error
> 
>                    mapping; otherwise,
> 
>> +		 * using fsync(2).
>> +		 */
> 
> Can you please re-flow the comment so that it's filling up to, say, 72
> or 76 or whatever column?

I'll fix them.

>> -	filemap_fdatawait(bdev->bd_inode->i_mapping);
>> +	filemap_fdatawait_keep_errors(bdev->bd_inode->i_mapping);
> 
> Maybe it'd be better to describe what's going on in detail in the
> function comment of filemat_fdatawait_keep_errors() and refer to that
> from its callers?

Thanks, that seems better.
I'll also extend function comments of filemap_fdatawait so that the
difference becomes clear.

-- 
Jun'ichi Nomura, NEC Corporation
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2] fs: global sync to not clear error status of individual inodes
  2015-09-16  0:59     ` Junichi Nomura
@ 2015-09-16  8:39       ` Junichi Nomura
  2015-09-16 17:47         ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Junichi Nomura @ 2015-09-16  8:39 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
  Cc: Tejun Heo, akpm@linux-foundation.org, andi@firstfloor.org,
	fengguang.wu@intel.com, tony.luck@intel.com, david@fromorbit.com,
	Naoya Horiguchi

filemap_fdatawait() is a function to wait for on-going writeback
to complete but also consume and clear error status of the mapping
set during writeback.
The latter functionality is critical for applications to detect
writeback error with system calls like fsync(2)/fdatasync(2).

However filemap_fdatawait() is also used by sync(2) or FIFREEZE
ioctl, which don't check error status of individual mappings.

As a result, fsync() may not be able to detect writeback error
if events happen in the following order:

   Application                    System admin
   ----------------------------------------------------------
   write data on page cache
                                  Run sync command
                                  writeback completes with error
                                  filemap_fdatawait() clears error
   fsync returns success
   (but the data is not on disk)

This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
---
v2:
  - Fixed comments based on Tejun's suggestions. No code changes.

v1 and test cases:
  - https://lkml.org/lkml/2015/9/15/335

---
 fs/fs-writeback.c  |  7 +++++-
 fs/sync.c          |  7 +++++-
 include/linux/fs.h |  1 +
 mm/filemap.c       | 67 +++++++++++++++++++++++++++++++++++++++++++-----------
 4 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..f252a00 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2121,7 +2121,12 @@ static void wait_sb_inodes(struct super_block *sb)
 		iput(old_inode);
 		old_inode = inode;
 
-		filemap_fdatawait(mapping);
+		/*
+		 * We keep the error status of individual mapping so that
+		 * applications can catch the writeback error using fsync(2).
+		 * See filemap_fdatawait_keep_errors() for details.
+		 */
+		filemap_fdatawait_keep_errors(mapping);
 
 		cond_resched();
 
diff --git a/fs/sync.c b/fs/sync.c
index fbc98ee..4ec430a 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -86,7 +86,12 @@ static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
 
 static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
 {
-	filemap_fdatawait(bdev->bd_inode->i_mapping);
+	/*
+	 * We keep the error status of individual mapping so that
+	 * applications can catch the writeback error using fsync(2).
+	 * See filemap_fdatawait_keep_errors() for details.
+	 */
+	filemap_fdatawait_keep_errors(bdev->bd_inode->i_mapping);
 }
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..9355f37 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2422,6 +2422,7 @@ extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
+extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
diff --git a/mm/filemap.c b/mm/filemap.c
index 72940fb..40e0af9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -331,23 +331,14 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
-/**
- * filemap_fdatawait_range - wait for writeback to complete
- * @mapping:		address space structure to wait for
- * @start_byte:		offset in bytes where the range starts
- * @end_byte:		offset in bytes where the range ends (inclusive)
- *
- * Walk the list of under-writeback pages of the given address space
- * in the given range and wait for all of them.
- */
-int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
-			    loff_t end_byte)
+static int __filemap_fdatawait_range(struct address_space *mapping,
+				     loff_t start_byte, loff_t end_byte)
 {
 	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
 	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
 	struct pagevec pvec;
 	int nr_pages;
-	int ret2, ret = 0;
+	int ret = 0;
 
 	if (end_byte < start_byte)
 		goto out;
@@ -374,6 +365,29 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 		cond_resched();
 	}
 out:
+	return ret;
+}
+
+/**
+ * filemap_fdatawait_range - wait for writeback to complete
+ * @mapping:		address space structure to wait for
+ * @start_byte:		offset in bytes where the range starts
+ * @end_byte:		offset in bytes where the range ends (inclusive)
+ *
+ * Walk the list of under-writeback pages of the given address space
+ * in the given range and wait for all of them.  Check error status of
+ * the address space and return it.
+ *
+ * Since the error status of the address space is cleared by this function,
+ * callers are responsible for checking the return value and handling and/or
+ * reporting the error.
+ */
+int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
+			    loff_t end_byte)
+{
+	int ret, ret2;
+
+	ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
 	ret2 = filemap_check_errors(mapping);
 	if (!ret)
 		ret = ret2;
@@ -383,11 +397,38 @@ out:
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
 /**
+ * filemap_fdatawait_keep_errors - wait for writeback without clearing errors
+ * @mapping: address space structure to wait for
+ *
+ * Walk the list of under-writeback pages of the given address space
+ * and wait for all of them.  Unlike filemap_fdatawait(), this function
+ * does not clear error status of the address space.
+ *
+ * Use this function if callers don't handle errors themselves.  Expected
+ * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
+ * fsfreeze(8)
+ */
+void filemap_fdatawait_keep_errors(struct address_space *mapping)
+{
+	loff_t i_size = i_size_read(mapping->host);
+
+	if (i_size == 0)
+		return;
+
+	__filemap_fdatawait_range(mapping, 0, i_size - 1);
+}
+
+/**
  * filemap_fdatawait - wait for all under-writeback pages to complete
  * @mapping: address space structure to wait for
  *
  * Walk the list of under-writeback pages of the given address space
- * and wait for all of them.
+ * and wait for all of them.  Check error status of the address space
+ * and return it.
+ *
+ * Since the error status of the address space is cleared by this function,
+ * callers are responsible for checking the return value and handling and/or
+ * reporting the error.
  */
 int filemap_fdatawait(struct address_space *mapping)
 {
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] fs: global sync to not clear error status of individual inodes
  2015-09-16  8:39       ` [PATCH v2] " Junichi Nomura
@ 2015-09-16 17:47         ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2015-09-16 17:47 UTC (permalink / raw)
  To: Junichi Nomura
  Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	andi@firstfloor.org, fengguang.wu@intel.com, tony.luck@intel.com,
	david@fromorbit.com, Naoya Horiguchi

On Wed, Sep 16, 2015 at 08:39:09AM +0000, Junichi Nomura wrote:
> filemap_fdatawait() is a function to wait for on-going writeback
> to complete but also consume and clear error status of the mapping
> set during writeback.
> The latter functionality is critical for applications to detect
> writeback error with system calls like fsync(2)/fdatasync(2).
> 
> However filemap_fdatawait() is also used by sync(2) or FIFREEZE
> ioctl, which don't check error status of individual mappings.
> 
> As a result, fsync() may not be able to detect writeback error
> if events happen in the following order:
> 
>    Application                    System admin
>    ----------------------------------------------------------
>    write data on page cache
>                                   Run sync command
>                                   writeback completes with error
>                                   filemap_fdatawait() clears error
>    fsync returns success
>    (but the data is not on disk)
> 
> This patch adds filemap_fdatawait_keep_errors() for call sites where
> writeback error is not handled so that they don't clear error status.
> 
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> Acked-by: Andi Kleen <ak@linux.intel.com>

Reviewed-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-09-16 17:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-15  9:46 [PATCH 0/1] Fix false-negative error reporting from fsync/fdatasync Junichi Nomura
2015-09-15  9:49 ` Test program: check if fsync() can detect I/O error (1/2) Junichi Nomura
2015-09-15  9:52   ` Test program: check if fsync() can detect I/O error (2/2) Junichi Nomura
2015-09-15  9:54 ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Junichi Nomura
2015-09-15 14:37   ` Andi Kleen
2015-09-15 22:02     ` Andrew Morton
2015-09-16  0:45       ` Junichi Nomura
2015-09-16 10:45         ` xfstests: test data-writeback error detection with fsync Junichi Nomura
2015-09-15 15:20   ` [PATCH 1/1] fs: global sync to not clear error status of individual inodes Tejun Heo
2015-09-15 16:52     ` Andi Kleen
2015-09-16  0:59     ` Junichi Nomura
2015-09-16  8:39       ` [PATCH v2] " Junichi Nomura
2015-09-16 17:47         ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).