* [linux-lvm] Contents of read-only LVM snapshot change
From: Roger Lucas @ 2006-11-11 13:05 UTC (permalink / raw)
  To: linux-lvm

Hi,

I am seeing the (very) unexpected behaviour where the contents of a read-only snapshot change.

I am running a Debian-based system, mostly Sarge, but with an updated Kernel and LVM as below:
 
dromedary:~# lvm version
  LVM version:     2.02.07 (2006-07-17)
  Library version: 1.02.08 (2006-07-17)
  Driver version:  4.5.0
dromedary:~# uname -a
Linux dromedary 2.6.16.20.rwl2 #1 Wed Jul 26 12:52:43 BST 2006 i686 GNU/Linux
dromedary:~#

I am taking a snapshot of an LV (/dev/store/backupimage) on which dmcrypt (/dev/mapper/cryptdisk) and then a ReiserFS (v3) filesystem are stacked.
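
((For illustration only, a rough sketch of how a stack like this might be put together.  The mail does not say how the
dm-crypt mapping or the filesystem were created, so everything below apart from the device names is an assumption.))

        # Hypothetical reconstruction of the LV -> dm-crypt -> ReiserFS stack
        lvcreate -L20G -n backupimage store                     # sized to match the 20GB image mentioned below
        cryptsetup create cryptdisk /dev/store/backupimage      # plain dm-crypt mapping (prompts for a passphrase)
        mkreiserfs /dev/mapper/cryptdisk
        mount -o noatime,acl /dev/mapper/cryptdisk /mnt/cryptdisk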

dromedary:~# mount
<snip>
/dev/mapper/cryptdisk on /mnt/cryptdisk type reiserfs (rw,noatime,acl)
dromedary:~#

I make a whole load of writes to the mounted ReiserFS partition, then I do the following:

        mount -o remount,ro /mnt/cryptdisk
        sync
        dmsetup suspend /dev/mapper/cryptdisk
        dmsetup resume /dev/mapper/cryptdisk
	lvchange -p r /dev/store/backupimage
	lvcreate -L1G -p r -s -n snapdisk /dev/store/backupimage

If I understand this correctly, this should remount the filesystem read-only so it cannot change any more, sync the ReiserFS contents
to the device below, flush any changes through the dmcrypt device, make the LV itself read-only and finally create a read-only
snapshot.
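
((A minimal sanity-check sketch, not part of the original test run: each layer can be asked whether it really is read-only
before the snapshot is taken.))

        mount | grep cryptdisk                     # the filesystem should now show (ro,...)
        blockdev --getro /dev/store/backupimage    # should print 1 after 'lvchange -p r'
        lvs -o lv_name,lv_attr store               # 'r' in the attribute field marks a read-only LV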

At this point, everything should be locked down hard....

BUT:

I am seeing changes in the snapshot LV within a 64KB block at offset 0x313CA0000 (64KB block number 201674).

After running the above commands, I then do:

dromedary# dd if=/dev/store/backupimage bs=65536 skip=201674 count=1 | openssl sha1
1+0 records in
1+0 records out
65536 bytes transferred in 0.074199 seconds (883248 bytes/sec)
cc6f4e68d6f28513f22efcb4012af8165d0d9e2f

dromedary# dd if=/dev/store/snapdisk bs=65536 skip=201674 count=1 | openssl sha1
1+0 records in
1+0 records out
65536 bytes transferred in 0.017019 seconds (3850756 bytes/sec)
2cb7aa5a75e823027bd2151951091e27f9b65e17

dromedary# run_a_script_that_copies_the_disk_image_elsewhere.sh

((We now copy the 20GB disk image elsewhere.  This is a lot of disk activity, so I expect it will flush any existing cached disk
data, as there will be >20GB of new disk read traffic.))

dromedary# dd if=/dev/store/backupimage bs=65536 skip=201674 count=1 | openssl sha1
1+0 records in
1+0 records out
65536 bytes transferred in 0.015872 seconds (4129033 bytes/sec)
cc6f4e68d6f28513f22efcb4012af8165d0d9e2f

dromedary# dd if=/dev/store/snapdisk bs=65536 skip=201674 count=1 | openssl sha1
1+0 records in
1+0 records out
65536 bytes transferred in 0.005131 seconds (12772596 bytes/sec)
cc6f4e68d6f28513f22efcb4012af8165d0d9e2f


?HUH?

I don't mind if there is a slight difference between the disk and its snapshot images.  I don't know how it could be, given that I
make the backupimage read-only *before* I take the snapshot, but I will let that pass for the moment...

What I do *not* understand is how the contents of a read-only snapshot can change.  The first time I check the snapshot SHA1
checksum I get 2cb7....., but when I re-run it after lots of data has been copied elsewhere, I get cc6f...., which matches the image
underneath it.
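
((A minimal repro loop along the same lines, for anyone wanting to watch the value flip; it simply repeats the dd/sha1
commands above every few seconds.))

        while true; do
            O=$(dd if=/dev/store/backupimage bs=65536 skip=201674 count=1 2>/dev/null | openssl sha1)
            S=$(dd if=/dev/store/snapdisk    bs=65536 skip=201674 count=1 2>/dev/null | openssl sha1)
            echo "$(date '+%H:%M:%S')  origin=$O  snapshot=$S"
            sleep 10
        done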

Can anyone offer any help on this?

At the moment, I have it consistently repeatable at this block offset on one machine, so I can run some more tests.  I haven't
rebooted the machine to "try to make it go away" as I want to get to the bottom of the problem.  I haven't seen it on a different
machine (yet), but this is a very small amount of data to see changing and it seems time-dependent, so I am not surprised by that.

Thanks,

Roger


* RE: [linux-lvm] Contents of read-only LVM snapshot change
From: Roger Lucas @ 2006-11-13 15:01 UTC (permalink / raw)
  To: 'LVM general discussion and development'

More information...

> -----Original Message-----
> From: linux-lvm-bounces@redhat.com [mailto:linux-lvm-bounces@redhat.com] On Behalf Of Roger Lucas
> Sent: 11 November 2006 13:06
> To: linux-lvm@redhat.com
> Subject: [linux-lvm] Contents of read-only LVM snapshot change
> 
> Hi,
> 
> I am seeing the (very) unexpected behaviour where the contents of a read-only snapshot change.
> 
> I am running a Debian-based system, mostly Sarge, but with an updated Kernel and LVM as below:
> 
> dromedary:~# lvm version
>   LVM version:     2.02.07 (2006-07-17)
>   Library version: 1.02.08 (2006-07-17)
>   Driver version:  4.5.0
> dromedary:~# uname -a
> Linux dromedary 2.6.16.20.rwl2 #1 Wed Jul 26 12:52:43 BST 2006 i686 GNU/Linux
> dromedary:~#
> 

I have tried updating to the 2.02.14 LVM and the 1.02.12 device mapper (compiled from source from the FTP site).  The first couple
of tests have run correctly so it looks like this may have resolved the issue, but I am continuing to run tests to confirm this.
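
((Roughly how such an upgrade goes; the tarball names and build order below are assumptions, since the mail only says the
code was compiled from source from the FTP site.))

        # device-mapper first, then LVM2
        tar xzf device-mapper.1.02.12.tgz
        (cd device-mapper.1.02.12 && ./configure && make && make install)
        tar xzf LVM2.2.02.14.tgz
        (cd LVM2.2.02.14 && ./configure && make && make install)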

Can anyone shed some light on what may be going on?  I have checked the changelogs for the LVM/DM code and there doesn't seem to be
any change that is a strong candidate for fixing this problem.

Thanks,

Roger


* RE: [linux-lvm] Contents of read-only LVM snapshot change - still have the problem with latest LVM!
From: Roger Lucas @ 2006-11-14 11:53 UTC (permalink / raw)
  To: 'LVM general discussion and development'

> 
> I have tried updating to the 2.02.14 LVM and the 1.02.12 device mapper (compiled from source from
> the FTP site).  The first couple
> of tests have run correctly so it looks like this may have resolved the issue, but I am continuing
> to run tests to confirm this.
> 

It's still not working reliably.  The first few attempts worked but after running more tests, the problem still occurs. 

My test reads the read-only snapshot multiple times.  All the data is consistent across the 20GB snapshot except for a single byte
at offset 0x0313CAC6C0.  Sometimes it reads 0x48 and other times it reads 0x08.  I have run the tests multiple times over many days
and, although the tests sometimes show the correct behaviour, about half the time I get this problem.

I cannot understand why repeated reads of a read-only snapshot would return data that changes.

I am running the 2.6.16.20 kernel and have upgraded to the latest LVM 2.02.14 and DMapper library 1.02.12 but the problem still
exists.

Can anyone offer any help as to what may be going on or suggest how I can investigate the problem further?  I don't mind patching
the LVM source code to obtain more debug information if this helps.

Thanks in advance,

Roger


* RE: [linux-lvm] Contents of read-only LVM snapshot change when both LV and SNAP are Read-Only
From: Roger Lucas @ 2006-11-14 22:11 UTC (permalink / raw)
  To: 'LVM general discussion and development'

Hi,

I've sent a few mails on this subject but I have (finally) narrowed it down to a consistent test sequence that fails.

To re-summarise, this is on a Debian Sarge machine with updated 2.6.16.20 kernel and the latest LVM/DM libraries and tools.

dromedary:~# lvm version
  LVM version:     2.02.14 (2006-11-10)
  Library version: 1.02.12 (2006-10-13)
  Driver version:  4.5.0
dromedary:~# uname -a
Linux dromedary 2.6.16.20.rwl2 #1 Wed Jul 26 12:52:43 BST 2006 i686 GNU/Linux
dromedary:~#

I have an MD RAID-1 disk array with LVM on top of it.  I was backing up the LV "backupimage" when I noticed the problem.  Although
the LV is read-only and reads the same data consistently, if I take a snapshot of the stable LV, the contents of the snapshot change.
What I am seeing change is a single byte at offset 0x0313CAC6C7 in the snapshot LV, toggling between the values 0x08 and 0x48.
I detected this when I was verifying a copy of the snapshot against an SHA1 checksum of its contents on LVM.  It was at this point
that I found the contents of the snapshot LV to be changing.
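
((The verification that first exposed the problem, roughly; the destination path below is an assumption.))

        openssl sha1 /dev/clientstore/snapdisk               # checksum of the snapshot contents on LVM
        dd if=/dev/clientstore/snapdisk bs=1M of=/backup/snapdisk.img
        openssl sha1 /backup/snapdisk.img                    # checksum of the copy; the two should match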

My command sequence that reliably repeats this is:

dromedary:~# lvs
  LV          VG          Attr   LSize  Origin      Snap%  Move Log Copy%
  backupimage clientstore ori-ao 50.00G
  root        clientstore -wi-ao  5.00G
  userdisk    clientstore -wi-ao 50.00G
dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000097 seconds (10307 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000094 seconds (10628 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000095 seconds (10523 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~#

At this point, the LV in question is read-only and its contents are stable...

dromedary:~# lvcreate -L10G -p r -s -n snapdisk /dev/clientstore/backupimage
  Logical volume "snapdisk" created
dromedary:~# lvs
  LV          VG          Attr   LSize  Origin      Snap%  Move Log Copy%
  backupimage clientstore ori-ao 50.00G
  root        clientstore -wi-ao  5.00G
  snapdisk    clientstore sri-a- 10.00G backupimage   0.00
  userdisk    clientstore -wi-ao 50.00G
dromedary:~#

The snapshot has now been added; again, it is read-only.

dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000090 seconds (11106 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000096 seconds (10416 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000092 seconds (10871 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~#


** It looks like the main disk is still stable...


dromedary:~# dd if=/dev/clientstore/snapdisk bs=1 skip=13216958151 count=1 | hd
00000000  48                                                |H|
00000001
1+0 records in
1+0 records out
1 bytes transferred in 0.013386 seconds (75 bytes/sec)
dromedary:~# dd if=/dev/clientstore/snapdisk bs=1 skip=13216958151 count=1 | hd
00000000  08                                                |.|
00000001
1+0 records in
1+0 records out
1 bytes transferred in 0.013048 seconds (77 bytes/sec)
dromedary:~# dd if=/dev/clientstore/snapdisk bs=1 skip=13216958151 count=1 | hd
00000000  48                                                |H|
00000001
1+0 records in
1+0 records out
1 bytes transferred in 0.012758 seconds (78 bytes/sec)
dromedary:~# dd if=/dev/clientstore/snapdisk bs=1 skip=13216958151 count=1 | hd
00000000  08                                                |.|
00000001
1+0 records in
1+0 records out
1 bytes transferred in 0.001883 seconds (531 bytes/sec)
dromedary:~# dd if=/dev/clientstore/snapdisk bs=1 skip=13216958151 count=1 | hd
00000000  48                                                |H|
00000001
1+0 records in
1+0 records out
1 bytes transferred in 0.001794 seconds (557 bytes/sec)
dromedary:~# dd if=/dev/clientstore/snapdisk bs=1 skip=13216958151 count=1 | hd
00000000  08                                                |.|
00000001
1+0 records in
1+0 records out
1 bytes transferred in 0.001800 seconds (556 bytes/sec)
dromedary:~#

** The snapshot's value is toggling between two different values!!!!

dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000093 seconds (10739 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000088 seconds (11352 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~# dd if=/dev/clientstore/backupimage bs=1 skip=13216958151 count=1 | hd
1+0 records in
1+0 records out
1 bytes transferred in 0.000087 seconds (11482 bytes/sec)
00000000  08                                                |.|
00000001
dromedary:~#


** And it looks like the main disk is still stable...

Based on information found on the Internet, I put together a simple tool that reads back the contents of the COW structure to see if
the snapshot thinks anything has been written to it.

dromedary:~# ~/lvcowmap --hdr /dev/mapper/clientstore-snapdisk-cow
# Header Info
#  Magic:      0x70416e53
#  Valid:      0x00000001
#  Version:    0x00000001
#  Chunk Size: 16 sectors (8192 bytes)
# Reading exceptions...
SEEKING: 0x0000000000002000 EXCEPTION INFO: OLD=0x0000000000000000  NEW=0x0000000000000000 (END)
LV Name = /dev/mapper/clientstore-snapdisk-cow
ChunkSize = 16 sectors
SECTORSIZE = 512 bytes
TrueOffset       CowOffset       Num_of_ContigSectors    (all values in sectors)
CoW List :
0                0               0
dromedary:~#
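
((lvcowmap is a home-grown utility, so for reference: the header fields it prints live in the first 16 bytes of the -cow
device, stored as four little-endian 32-bit values: magic 0x70416e53 ("SnAp"), valid flag, version, and chunk size in
sectors.  A rough equivalent of the header dump:))

        dd if=/dev/mapper/clientstore-snapdisk-cow bs=16 count=1 2>/dev/null | od -A x -t x4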


So the snapshot thinks that it has no changes, the original LV's data is unchanging, but the snapshot has data that is changing.

This data error is extremely consistent.  It has existed for several days now, with in excess of 100 tests run against it, and has
stayed in exactly the same bit across reboots.  I don't see how it can be a hardware fault such as a RAM error given how repeatable
it is.  It doesn't make sense that it is a disk error either, since we are running Linux's software MD in RAID-1 and should be
protected against that kind of thing.  I have tried removing and re-creating the snapshot but the bit error still happens.

We are talking about a single bit on a 10GB image.

I am now getting very stuck.  Does anyone have any ideas?

For what it is worth, the motherboard is a Gigabyte 8I865GVMF-775 with a Celeron 2.8GHz processor and two Seagate 160GB SATA-150
(ST3160812AS) hard disks attached to the Intel 82801EB (ICH5) disk controller on the motherboard.

Thanks,

Roger


* RE: [linux-lvm] Contents of read-only LVM snapshot change when bothLV and SNAP are Read-Only - SOLVED
From: Roger Lucas @ 2006-11-17 14:34 UTC (permalink / raw)
  To: 'LVM general discussion and development'

Hi All,

The problem was nothing to do with LVM but was a RAID error.  I moved the two disks in the RAID-1 array between three different
machines - two were identical and one had a totally different motherboard.  The problem occurred on all of the machines, so I tried
removing one of the disks and bringing the array up in degraded mode.  When I re-ran the test, I consistently received the first data
sequence.  I powered off the machine and swapped the disks.  I then powered the machine up so that it was running off the other
disk, again in degraded mode.  When I re-ran the test, I consistently received the other data sequence.

What appears to have happened is that, at some point, a write to the RAID array failed and wrote slightly different data
to one of the disks, or one of the disks generated a bit error.  Since I was running RAID-1 and the disks were not failing with a
hardware error, the MD driver seemed happy to return data from whichever disk completed the read first.  Since the disks were
identical, it was pretty random which one would get the data back first, hence the apparent inconsistency.
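
((One way to confirm, and then correct, such a member mismatch, assuming the array is /dev/md0 and the kernel exposes the
md sysfs scrubbing interface:))

        echo check  > /sys/block/md0/md/sync_action    # read both members and count differing sectors
        cat /sys/block/md0/md/mismatch_cnt             # non-zero means the mirrors disagree
        echo repair > /sys/block/md0/md/sync_action    # rewrite mismatched blocks so both members agree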

I don't know why it seemed that the main disk was stable but the snapshot wasn't, but I suspect that this is again down to the race
condition or some caching somewhere.

Thank you to everyone in the LVM team who has helped investigate this for me - I apologise for wasting your time with a wild goose
chase.

BR,

Roger

