From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Dingwall
Subject: ext[234] data corruption (Linux 3.8, 3.9 / Xen)
Date: Thu, 26 Sep 2013 08:22:40 +0100
Message-ID: <5243E0C0.2090304@zynstra.com>
References: <524314B3.3090000@zynstra.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit
To:
Return-path:
Received: from mail-am1lp0011.outbound.protection.outlook.com ([213.199.154.11]:31472 "EHLO emea01-am1-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1755851Ab3IZHWu (ORCPT ); Thu, 26 Sep 2013 03:22:50 -0400
In-Reply-To: <524314B3.3090000@zynstra.com>
Sender: linux-ext4-owner@vger.kernel.org
List-ID:

> Hi,
>
> We have observed a data corruption bug in a database created by the
> postmap command (BDB file) under the following conditions:
>
>   Xen domU guest kernel 3.8, 3.9 (3.5, 3.10 and 3.11 don't show the
>     behaviour; 3.6 and 3.7 are unknown)
>   dom0 Xen 4.2.1 / kernel 3.8 or Xen 4.3.0 / kernel 3.11
>   The guest has a passed through block device (phy:/ or file:/)
>   The filesystem on the passed through device is ext2/3/4 with a 1k
>     block size
>
> By examining a strace of the postmap command we produced a short piece
> of code (at the bottom) which demonstrates the problem.
> If this is executed in a loop such as:
>
> #!/bin/bash
> for i in $(seq 1 5) ; do
>     mount /dev/xvde1 /mnt
>     pushd /mnt > /dev/null
>     echo "checksums after mount"
>     md5sum testcase.bin
>     [ "${i}" = "1" ] && ./a.out
>     echo "checksums before umount"
>     md5sum testcase.bin
>     popd > /dev/null
>     umount /mnt
> done
>
>
> The output is
>
> checksums after mount
> md5sum: testcase.bin: No such file or directory
> checksums before umount
> 719f20c98b69457ce0247d6bf4474cf9  testcase.bin  # the correct checksum for the file
> checksums after mount
> a90804e64bcc1c0c98dd2cb23d0e4c10  testcase.bin
> checksums before umount
> a90804e64bcc1c0c98dd2cb23d0e4c10  testcase.bin
> checksums after mount
> 14bb035eca1ec516ce3865700536fc0c  testcase.bin
> checksums before umount
> 14bb035eca1ec516ce3865700536fc0c  testcase.bin
> checksums after mount
> 124d3d3ea8e421925825ff94a815630b  testcase.bin
> checksums before umount
> 124d3d3ea8e421925825ff94a815630b  testcase.bin
> checksums after mount
> 7c05f36ffdd6b8217a27c0bd4d9cb531  testcase.bin
> checksums before umount
> 7c05f36ffdd6b8217a27c0bd4d9cb531  testcase.bin
>
> If we dd out the block device and then loop mount the resulting file
> we do not see this problem, which suggests that communication between
> the Xen block back/front ends is ok and that the problem only appears
> when the mount takes place. The default libdb behaviour seems to be to
> create a database with a block size matching that of the filesystem;
> if we override this and set it to 4k we do not see this issue. The
> same effect is observed by changing the bs value in our test program:
> once bs is > 3072 we no longer observe the problem. We can also avoid
> the issue in our test program by filling in the hole while
> __testcase.bin is being generated. A similar test on xfs with a 1k
> block size did not demonstrate this problem. If we make a cp of the
> file before the umount then the copied version is and remains correct.
>
> Our searching has not revealed any similar reports or an explicitly
> identified fix that was introduced for 3.10. Our concern therefore is
> that this is an unrecognised failure that has been inadvertently fixed
> and could equally inadvertently be reintroduced by some other change.
> If this problem sounds familiar or there are suggestions on how to
> narrow this down further we would greatly appreciate the advice.
>
> Thanks,
> James
>
>
>
> /* header names were stripped from the archived copy; these cover the
>    calls used below */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/stat.h>
> #include <sys/file.h>
>
> extern
> int main(int argc, char *argv[])
> {
>     struct stat *sbuf;
>     char *buf, *zero, *null;
>     int fd5, fd6, fd7;
>     int i;
>     int bs = 1024; /* lte 3072 = corruption */
>
>
>     buf = malloc(3*bs);
>     zero = malloc(3*bs);
>     null = malloc(bs);
>     memset(zero, 0, 3*bs);
>     sbuf = malloc(sizeof(struct stat));
>     memset(sbuf, 0, sizeof(struct stat));
>
>     for(i = 0; i < 3*bs; i++) {
>         buf[i] = i & 0x000f;
>     }
>
>     fd5 = open("__testcase.bin", O_RDWR|O_CREAT|O_EXCL, 0644);
>     //fcntl(fd5, F_GETFD);
>     //fcntl(fd5, F_SETFD, FD_CLOEXEC);
>     //stat("__testcase.bin", sbuf);
>     fstat(fd5, sbuf);
>     /* this only writes the first and last blocks */
>     lseek(fd5, 0*bs, SEEK_SET);
>     write(fd5, zero, bs);
>     //lseek(fd5, 1*bs, SEEK_SET); /* filling in this hole is a fix! */
>     //write(fd5, zero, bs);
>     lseek(fd5, 2*bs, SEEK_SET);
>     write(fd5, zero, bs);
>     fdatasync(fd5);
>     rename("__testcase.bin", "testcase.bin");
>
>     //stat("testcase.bin", sbuf);
>     fd6 = open("testcase.bin", O_RDWR|O_CREAT, 0);
>     //fcntl(fd6, F_GETFD);
>     //fcntl(fd6, F_SETFD, FD_CLOEXEC);
>     //fstat(fd6, sbuf);
>     pread(fd6, null, bs, 0);
>     //fstat(fd6, sbuf);
>     //fcntl(fd6, F_GETFD);
>     //fcntl(fd6, F_SETFD, FD_CLOEXEC);
>     //fcntl(fd6, F_GETFD);
>     //fcntl(fd6, F_SETFD, FD_CLOEXEC);
>     fd7 = open("testcase.bin", O_RDWR);
>     flock(fd7, LOCK_EX);
>     umask(022);
>     pread(fd6, null, bs, 1*bs);
>     pread(fd6, null, bs, 2*bs);
>     pwrite(fd6, buf, bs, 0*bs);
>     pwrite(fd6, buf, bs, 1*bs);
>     pwrite(fd6, buf, bs, 2*bs);
>     fdatasync(fd6);
>     fdatasync(fd6);
>     close(fd5);
>     close(fd6);
>
>     fd5 = open("testcase.bin", O_RDWR, 0);
>     //fcntl(fd5, F_GETFD);
>     //fcntl(fd5, F_SETFD, FD_CLOEXEC);
>     fdatasync(fd5);
>     close(fd5);
>
>     close(fd7);
>
>     free(buf);
>     free(sbuf);
>     free(zero);
>     free(null);
> }