* ext[234] data corruption (Linux 3.8, 3.9 / Xen)
[not found] <524314B3.3090000@zynstra.com>
@ 2013-09-26 7:22 ` James Dingwall
2013-09-26 19:14 ` Jan Kara
0 siblings, 1 reply; 3+ messages in thread
From: James Dingwall @ 2013-09-26 7:22 UTC (permalink / raw)
To: linux-ext4
> Hi,
>
> We have observed a data corruption bug in a database created by the
> postmap command (BDB file) under the following conditions:
>
>   - Xen domU guest kernel 3.8 or 3.9 (3.5, 3.10 and 3.11 don't show
>     the behaviour; 3.6 and 3.7 are unknown)
>   - dom0 running Xen 4.2.1 / kernel 3.8 or Xen 4.3.0 / kernel 3.11
>   - The guest has a passed-through block device (phy:/ or file:/)
>   - The filesystem on the passed-through device is ext2/3/4 with a 1k
>     block size
>
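Since the 1k block size is one of the conditions listed above, it may help to confirm it from inside the guest before running a reproducer. The following is a minimal sketch, not part of the original report: the helper names `fs_block_size` and `stat_io_size` are made up for illustration, and the thread does not say which of the two values libdb actually consults.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

/* Hypothetical helper: return the block size of the filesystem that
 * holds `path` as reported by statvfs(3), or 0 on error. */
unsigned long fs_block_size(const char *path)
{
    struct statvfs vfs;

    assert(path != NULL);
    if (statvfs(path, &vfs) != 0)
        return 0;
    return vfs.f_bsize;
}

/* Hypothetical helper: return the preferred I/O size that stat(2)
 * reports for `path` (st_blksize), or -1 on error. A library that
 * sizes its pages "to match the filesystem" may be reading this
 * value; that is an assumption, not something the thread confirms. */
long stat_io_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_blksize;
}
```

Calling `fs_block_size("/mnt")` while the test device is mounted would be expected to report 1024 for the configuration described here.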
> By examining a strace of the postmap command we produced a short piece
> of code (at the bottom) which demonstrates the problem. If this is
> executed in a loop such as:
>
> #!/bin/bash
> for i in $(seq 1 5) ; do
>     mount /dev/xvde1 /mnt
>     pushd /mnt > /dev/null
>     echo "checksums after mount"
>     md5sum testcase.bin
>     # only the first iteration creates the file
>     [ "${i}" = "1" ] && ./a.out
>     echo "checksums before umount"
>     md5sum testcase.bin
>     popd > /dev/null
>     umount /mnt
> done
>
>
> The output is
>
> checksums after mount
> md5sum: testcase.bin: No such file or directory
> checksums before umount
> 719f20c98b69457ce0247d6bf4474cf9 testcase.bin   # the correct checksum for the file
> checksums after mount
> a90804e64bcc1c0c98dd2cb23d0e4c10 testcase.bin
> checksums before umount
> a90804e64bcc1c0c98dd2cb23d0e4c10 testcase.bin
> checksums after mount
> 14bb035eca1ec516ce3865700536fc0c testcase.bin
> checksums before umount
> 14bb035eca1ec516ce3865700536fc0c testcase.bin
> checksums after mount
> 124d3d3ea8e421925825ff94a815630b testcase.bin
> checksums before umount
> 124d3d3ea8e421925825ff94a815630b testcase.bin
> checksums after mount
> 7c05f36ffdd6b8217a27c0bd4d9cb531 testcase.bin
> checksums before umount
> 7c05f36ffdd6b8217a27c0bd4d9cb531 testcase.bin
>
> If we dd out the block device and then loop mount the resulting file
> we do not see this problem, which suggests that communication between
> the Xen block back and front ends is OK and that the problem only
> occurs when the mount takes place. The default libdb behaviour seems
> to be to create a database with a block size matching that of the
> filesystem; if we override this and set it to 4k we do not see the
> issue. The same is observed by changing the bs value in our test
> program: once bs is > 3072 we no longer observe the problem. We can
> also avoid the issue in our test program by filling in the hole while
> __testcase.bin is being generated. A similar test on xfs with a 1k
> block size did not demonstrate this problem. If we make a cp of the
> file before the umount then the copied version is, and remains,
> correct.
>
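The hole mentioned above can be made visible by comparing a file's apparent size with the space actually allocated to it. The sketch below is only an illustration under stated assumptions: the helper names are invented, the data blocks are filled with 0x5a rather than the zeros used by the test program, and `st_blocks` is taken to count 512-byte units as it does on Linux.

```c
#define _XOPEN_SOURCE 700   /* expose pwrite() and st_blocks in strict modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create `path` with data in blocks 0 and 2 and
 * nothing written to block 1. Returns 0 on success, -1 on error. */
int make_holed_file(const char *path, size_t bs)
{
    char *data;
    int fd, rc = -1;

    assert(bs > 0);
    data = malloc(bs);
    if (data == NULL)
        return -1;
    memset(data, 0x5a, bs);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        if (pwrite(fd, data, bs, 0) == (ssize_t)bs &&
            pwrite(fd, data, bs, (off_t)(2 * bs)) == (ssize_t)bs &&
            fdatasync(fd) == 0)
            rc = 0;
        close(fd);
    }
    free(data);
    return rc;
}

/* Apparent size of `path` in bytes, or -1 on error. */
long long file_size(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 ? (long long)st.st_size : -1;
}

/* Bytes allocated to `path`, or -1 on error. Assumes st_blocks is in
 * 512-byte units, which holds on Linux. */
long long allocated_bytes(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 ? (long long)st.st_blocks * 512 : -1;
}
```

On a filesystem with a 1k block size one would expect `allocated_bytes()` to come out smaller than `file_size()` for a file made this way, and the two to agree once the middle block is written as well. With larger filesystem blocks all three 1k writes can land in a single block, so no hole shows up.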
> Our searching does not seem to have revealed any similar reports or an
> explicitly identified fix that was introduced for 3.10. Our concern
> therefore is that this is an unrecognised failure that has been
> inadvertently fixed and could equally inadvertently be reintroduced by
> some other change. If this problem sounds familiar or there are
> suggestions on how to narrow this down further we would greatly
> appreciate the advice.
>
> Thanks,
> James
>
>
>
> #include <string.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <stdlib.h>
> #include <unistd.h>    /* lseek, write, pread, pwrite, fdatasync, close */
> #include <sys/file.h>  /* flock */
> #include <sys/stat.h>
>
> int main(int argc, char *argv[])
> {
>     struct stat *sbuf;
>     char *buf, *zero, *null;
>     int fd5, fd6, fd7;   /* named after the descriptors seen in the strace */
>     int i;
>     int bs = 1024;       /* bs <= 3072 shows the corruption */
>
>     buf = malloc(3*bs);
>     zero = malloc(3*bs);
>     null = malloc(bs);
>     memset(zero, 0, 3*bs);
>     sbuf = malloc(sizeof(struct stat));
>     memset(sbuf, 0, sizeof(struct stat));
>
>     /* repeating 0x00..0x0f pattern that is written later */
>     for (i = 0; i < 3*bs; i++) {
>         buf[i] = i & 0x000f;
>     }
>
>     fd5 = open("__testcase.bin", O_RDWR|O_CREAT|O_EXCL, 0644);
>     //fcntl(fd5, F_GETFD);
>     //fcntl(fd5, F_SETFD, FD_CLOEXEC);
>     //stat("__testcase.bin", sbuf);
>     fstat(fd5, sbuf);
>     /* this only writes the first and last blocks */
>     lseek(fd5, 0*bs, SEEK_SET);
>     write(fd5, zero, bs);
>     //lseek(fd5, 1*bs, SEEK_SET); /* filling in this hole is a fix! */
>     //write(fd5, zero, bs);
>     lseek(fd5, 2*bs, SEEK_SET);
>     write(fd5, zero, bs);
>     fdatasync(fd5);
>     rename("__testcase.bin", "testcase.bin");
>
>     //stat("testcase.bin", sbuf);
>     fd6 = open("testcase.bin", O_RDWR|O_CREAT, 0);
>     //fcntl(fd6, F_GETFD);
>     //fcntl(fd6, F_SETFD, FD_CLOEXEC);
>     //fstat(fd6, sbuf);
>     pread(fd6, null, bs, 0);
>     //fstat(fd6, sbuf);
>     //fcntl(fd6, F_GETFD);
>     //fcntl(fd6, F_SETFD, FD_CLOEXEC);
>     //fcntl(fd6, F_GETFD);
>     //fcntl(fd6, F_SETFD, FD_CLOEXEC);
>     fd7 = open("testcase.bin", O_RDWR);
>     flock(fd7, LOCK_EX);
>     umask(022);
>     /* read the hole and the last block, then overwrite all three blocks */
>     pread(fd6, null, bs, 1*bs);
>     pread(fd6, null, bs, 2*bs);
>     pwrite(fd6, buf, bs, 0*bs);
>     pwrite(fd6, buf, bs, 1*bs);
>     pwrite(fd6, buf, bs, 2*bs);
>     fdatasync(fd6);
>     fdatasync(fd6);
>     close(fd5);
>     close(fd6);
>
>     fd5 = open("testcase.bin", O_RDWR, 0);
>     //fcntl(fd5, F_GETFD);
>     //fcntl(fd5, F_SETFD, FD_CLOEXEC);
>     fdatasync(fd5);
>     close(fd5);
>
>     close(fd7);
>
>     free(buf);
>     free(sbuf);
>     free(zero);
>     free(null);
>
>     return 0;
> }
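The md5sum comparison shows that the file changed but not where. As a rough sketch of how one might narrow that down, the helper below reports the first block that no longer holds the pattern the test program writes (each block is a copy of the first `bs` bytes of `buf`, i.e. byte `i` of a block is `i & 0x0f`). The function names, the `hole` parameter and the return codes are illustrative assumptions and not part of the original test case.

```c
#define _XOPEN_SOURCE 700   /* expose pread()/pwrite() in strict modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write `nblocks` blocks of the 0x00..0x0f pattern
 * to `path`, skipping block index `hole` (pass -1 to write them all).
 * Returns 0 on success, -1 on error. */
int write_pattern(const char *path, size_t bs, int nblocks, int hole)
{
    unsigned char *blk;
    int fd, b, rc = 0;
    size_t i;

    assert(bs > 0 && nblocks > 0);
    blk = malloc(bs);
    if (blk == NULL)
        return -1;
    for (i = 0; i < bs; i++)
        blk[i] = (unsigned char)(i & 0x0f);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        free(blk);
        return -1;
    }
    for (b = 0; b < nblocks; b++) {
        if (b == hole)
            continue;
        if (pwrite(fd, blk, bs, (off_t)b * (off_t)bs) != (ssize_t)bs) {
            rc = -1;
            break;
        }
    }
    if (close(fd) != 0)
        rc = -1;
    free(blk);
    return rc;
}

/* Hypothetical helper: return the index of the first block that does
 * not match the pattern, -1 if all `nblocks` blocks match, or -2 on an
 * I/O error or a file shorter than nblocks * bs. */
int first_bad_block(const char *path, size_t bs, int nblocks)
{
    unsigned char *blk;
    int fd, b, bad = -1;
    size_t i;

    assert(bs > 0 && nblocks > 0);
    blk = malloc(bs);
    if (blk == NULL)
        return -2;
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(blk);
        return -2;
    }
    for (b = 0; b < nblocks && bad == -1; b++) {
        if (pread(fd, blk, bs, (off_t)b * (off_t)bs) != (ssize_t)bs) {
            bad = -2;
            break;
        }
        for (i = 0; i < bs; i++) {
            if (blk[i] != (unsigned char)(i & 0x0f)) {
                bad = b;
                break;
            }
        }
    }
    close(fd);
    free(blk);
    return bad;
}
```

Running `first_bad_block("testcase.bin", 1024, 3)` after each mount in the loop above would indicate which of the three blocks differs, which may be more informative than a changing checksum.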
* Re: ext[234] data corruption (Linux 3.8, 3.9 / Xen)
From: Jan Kara @ 2013-09-26 19:14 UTC (permalink / raw)
To: James Dingwall; +Cc: linux-ext4
Hello,
On Thu 26-09-13 08:22:40, James Dingwall wrote:
> >Hi,
> >
> >We have observed a data corruption bug in a database created by
> >the postmap command (BDB file) under the following conditions:
> >
> >Xen domU guest kernel 3.8 or 3.9 (3.5, 3.10 and 3.11 don't show
> >the behaviour; 3.6 and 3.7 are unknown)
> >dom0 running Xen 4.2.1 / kernel 3.8 or Xen 4.3.0 / kernel 3.11
> >The guest has a passed-through block device (phy:/ or file:/)
> >The filesystem on the passed-through device is ext2/3/4 with a 1k
> >block size
Thanks for the report! So have you really tried with all three filesystems?
And don't you have EXT4_USE_FOR_EXT23 set by any chance? There were some
changes to the ext4 writeback path and the extent status tree, so for ext4 I
could understand that the problem got introduced and fixed. But ext2/3
haven't seen any significant changes for a long time...
> >By examining a strace of the postmap command we produced a short
> >piece of code (at the bottom) which demonstrates the problem. If
> >this is executed in a loop such as:
> >
> >#!/bin/bash
> >for i in $(seq 1 5) ; do
> >    mount /dev/xvde1 /mnt
> >    pushd /mnt > /dev/null
> >    echo "checksums after mount"
> >    md5sum testcase.bin
> >    [ "${i}" = "1" ] && ./a.out
> >    echo "checksums before umount"
> >    md5sum testcase.bin
> >    popd > /dev/null
> >    umount /mnt
> >done
I'll see if I can reproduce this to investigate.
> >The output is
> >
> >checksums after mount
> >md5sum: testcase.bin: No such file or directory
> >checksums before umount
> >719f20c98b69457ce0247d6bf4474cf9 testcase.bin   # the correct checksum for the file
> >checksums after mount
> >a90804e64bcc1c0c98dd2cb23d0e4c10 testcase.bin
> >checksums before umount
> >a90804e64bcc1c0c98dd2cb23d0e4c10 testcase.bin
> >checksums after mount
> >14bb035eca1ec516ce3865700536fc0c testcase.bin
> >checksums before umount
> >14bb035eca1ec516ce3865700536fc0c testcase.bin
> >checksums after mount
> >124d3d3ea8e421925825ff94a815630b testcase.bin
> >checksums before umount
> >124d3d3ea8e421925825ff94a815630b testcase.bin
> >checksums after mount
> >7c05f36ffdd6b8217a27c0bd4d9cb531 testcase.bin
> >checksums before umount
> >7c05f36ffdd6b8217a27c0bd4d9cb531 testcase.bin
> >
> >If we dd out the block device and then loop mount the resulting
> >file we do not see this problem, which suggests that communication
> >between the Xen block back and front ends is OK and that the
> >problem only occurs when the mount takes place. The default libdb
> >behaviour seems to be to create a database with a block size
> >matching that of the filesystem; if we override this and set it to
> >4k we do not see the issue. The same is observed by changing the
> >bs value in our test program: once bs is > 3072 we no longer
> >observe the problem. We can also avoid the issue in our test
> >program by filling in the hole while __testcase.bin is being
> >generated. A similar test on xfs with a 1k block size did not
> >demonstrate this problem. If we make a cp of the file before the
> >umount then the copied version is, and remains, correct.
> >
> >Our searching does not seem to have revealed any similar reports
> >or an explicitly identified fix that was introduced for 3.10. Our
> >concern therefore is that this is an unrecognised failure that has
> >been inadvertently fixed and could equally inadvertently be
> >reintroduced by some other change. If this problem sounds
> >familiar or there are suggestions on how to narrow this down
> >further we would greatly appreciate the advice.
Well, you can always use 'git bisect' to find the commit that fixed this.
Honza
> ><snip test program>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: ext[234] data corruption (Linux 3.8, 3.9 / Xen)
From: James Dingwall @ 2013-10-02 16:51 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4
Jan Kara wrote:
> Hello,
>
> On Thu 26-09-13 08:22:40, James Dingwall wrote:
>>> Hi,
>>>
>>> We have observed a data corruption bug in a database created by
>>> the postmap command (BDB file) under the following conditions:
>>>
>>> Xen domU guest kernel 3.8 or 3.9 (3.5, 3.10 and 3.11 don't show
>>> the behaviour; 3.6 and 3.7 are unknown)
>>> dom0 running Xen 4.2.1 / kernel 3.8 or Xen 4.3.0 / kernel 3.11
>>> The guest has a passed-through block device (phy:/ or file:/)
>>> The filesystem on the passed-through device is ext2/3/4 with a 1k
>>> block size
> Thanks for the report! So have you really tried with all three filesystems?
> And don't you have EXT4_USE_FOR_EXT23 set by any chance? There were some
> changes to the ext4 writeback path and the extent status tree, so for ext4 I
> could understand that the problem got introduced and fixed. But ext2/3
> haven't seen any significant changes for a long time...
EXT4_USE_FOR_EXT23 doesn't seem to be in use and it isn't listed in
/proc/config.gz. Perhaps a parent option that would make it available
isn't set?
# zgrep EXT[234] /proc/config.gz
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=m
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
>>> By examining a strace of the postmap command we produced a short
>>> piece of code (at the bottom) which demonstrates the problem. If
>>> this is executed in a loop such as:
>>>
>>> #!/bin/bash
>>> for i in $(seq 1 5) ; do
>>>     mount /dev/xvde1 /mnt
>>>     pushd /mnt > /dev/null
>>>     echo "checksums after mount"
>>>     md5sum testcase.bin
>>>     [ "${i}" = "1" ] && ./a.out
>>>     echo "checksums before umount"
>>>     md5sum testcase.bin
>>>     popd > /dev/null
>>>     umount /mnt
>>> done
> I'll see if I can reproduce this to investigate.
>
> <snip output of script>
> Well, you can always use 'git bisect' to find the commit that fixed this.
In the end it seems that the problem was fixed in a 3.10 stable release
(I had originally tested with 3.10.10, which would have been helpful to
mention :) as it does still happen with a kernel built from tag v3.10.
Bisecting gives commit 7b2b160da7661bb2ade3f924b1bd3e3084e53341 (in
xen-blkfront.c), which solves the observed issue, although the commit
message seems to indicate that it resolves a problem of a different
nature. My knowledge of kernel internals isn't enough to understand why
this is the fix. Since the commit applies cleanly to the Ubuntu 3.8 LTS
tree (where we found the original problem) we'll do some testing and
then open it as an Ubuntu bug if it looks good.
<snip test program>