Random data corruption in VM, possibly caused by rbd

* Random data corruption in VM, possibly caused by rbd
@ 2012-06-07 18:04 Guido Winkelmann
  2012-06-07 18:18 ` Stefan Priebe
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-07 18:04 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2119 bytes --]

Hi,

I'm using Ceph with RBD to provide network-transparent disk images for 
KVM-based virtual servers. For the last two days, I've been hunting an elusive 
bug where data in the virtual machines gets corrupted in odd ways. It usually 
manifests as files having some random data - usually zeroes - at the start, 
before the actual contents that should be there.

To track this down, I wrote a simple io tester. It does the following:

- Create 1 Megabyte of random data
- Calculate the SHA256 hash of that data
- Write the data to a file on the hard disk, in a given directory, using the 
hash as the filename
- Repeat until the disk is full
- Delete the last file (because it is very likely to be incompletely written)
- Read and delete all the files just written while checking that their SHA256 
sums are equal to their filenames
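
The steps above can be sketched roughly as follows. This is a minimal Python 
illustration of the procedure, not the actual tester; the function name and the 
`max_files` safety limit are assumptions added so the sketch can be tried 
without filling a disk.

```python
import errno
import hashlib
import os


def run_io_test(directory, max_files=None, chunk_size=1024 * 1024):
    """Write random files named after their SHA256 hash, then verify them.

    Returns the names of files whose contents no longer match their name.
    max_files is a hypothetical safety limit; with None the loop runs
    until the filesystem reports that it is full.
    """
    written = []
    try:
        while max_files is None or len(written) < max_files:
            data = os.urandom(chunk_size)
            name = hashlib.sha256(data).hexdigest()
            written.append(name)  # recorded first so a partial file is tracked
            with open(os.path.join(directory, name), "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
    except OSError as e:
        if e.errno != errno.ENOSPC:
            raise
        # Disk full: the last file is probably incomplete, so drop it.
        last = written.pop()
        try:
            os.remove(os.path.join(directory, last))
        except FileNotFoundError:
            pass

    corrupted = []
    for name in written:
        path = os.path.join(directory, name)
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != name:
                corrupted.append(name)
        os.remove(path)
    return corrupted
```

Calling `run_io_test("/mnt/filesystem_to_test", max_files=100)` would write and 
check 100 files; an empty list as the result means no corruption was seen.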

When running this io tester in a VM that uses a qcow2 file on a local hard disk 
for its virtual disk, no errors are found. When the same VM is running on 
rbd, the io tester finds on average about one corruption every 200 megabytes, 
reproducibly.

(As an interesting aside, the io tester also prints how long it took to 
read or write 100 MB, and it turns out reading the data back in is about 
three times slower than writing it in the first place...)

Ceph is version 0.47.2. Qemu KVM is 1.0, compiled with the spec file from 
http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=summary
(And compiled after ceph 0.47.2 was installed on that machine, so it would use 
the correct headers...)
Both the Ceph cluster and the KVM host machines are running on Fedora 16, with 
a fairly recent 3.3.x kernel.
The ceph cluster uses btrfs for the osds' data dirs. The journal is on a tmpfs. 
(This is not a production setup - luckily.)
The virtual machine is using ext4 as its filesystem.
There were no obvious other problems with either the ceph cluster or the KVM 
host machines.

I have attached a copy of the ceph.conf in use, in case it might be helpful.

This is a huge problem, and any help in tracking it down would be much 
appreciated.

Regards,

	Guido

[-- Attachment #2: ceph.conf --]
[-- Type: application/octet-stream, Size: 1358 bytes --]

; global
[global]
	; enable secure authentication
	; auth supported = cephx
        max open files = 131072
        log file = /var/log/ceph/$name.log
        ; log_to_syslog = true        ; uncomment this line to log to syslog
        pid file = /var/run/ceph/$name.pid

; monitors
[mon]
        mon data = /mondata/$name

[mon.alpha]
	host = storage1
	mon addr = 10.6.224.129:6789

[mon.beta]
	host = storage2
	mon addr = 10.6.224.130:6789

[mon.gamma]
	host = storage3
	mon addr = 10.6.224.131:6789

; mds
[mds]
	; where the mds keeps its secret encryption keys
	keyring = /mdsdata/keyring.$name

[mds.alpha]
	host = storage1

[mds.beta]
	host = storage2

[mds.gamma]
	host = storage3

; osd
[osd]
	osd data = /osddata/$name

	osd journal = /journaldata/$name/journal
	osd journal size = 1000 ; journal size, in megabytes

        ; If you want to run the journal on a tmpfs, disable DirectIO
        journal dio = false

        osd recovery max active = 5

	btrfs devs = /dev/sda5 /dev/sdb5

	keyring = /osddata/$name/keyring

[osd.0]
	host = storage1
	cluster addr = 10.6.224.193
	public addr = 10.6.224.129

[osd.1]
	host = storage2
	cluster addr = 10.6.224.194
	public addr = 10.6.224.130

[osd.2]
	host = storage3
	cluster addr = 10.6.224.195
	public addr = 10.6.224.131

[client]   ; userspace client
;      debug ms = 1
;      debug client = 10

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 18:04 Random data corruption in VM, possibly caused by rbd Guido Winkelmann
@ 2012-06-07 18:18 ` Stefan Priebe
  2012-06-07 18:37   ` Guido Winkelmann
  2012-06-07 18:40 ` Oliver Francke
  2012-06-07 19:48 ` Josh Durgin
  2 siblings, 1 reply; 30+ messages in thread
From: Stefan Priebe @ 2012-06-07 18:18 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel@vger.kernel.org

I think the test script would help a lot so others can test too.


* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 18:18 ` Stefan Priebe
@ 2012-06-07 18:37   ` Guido Winkelmann
  2012-06-07 19:54     ` Andrey Korolyov
  2012-06-07 21:53     ` Marcus Sorensen
  0 siblings, 2 replies; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-07 18:37 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 356 bytes --]

On Thursday, 7 June 2012, 20:18:52, Stefan Priebe wrote:
> I think the test script would help a lot so others can test too.

Okay, I've attached the program. It's barely 2 KB. You need Boost 1.45+, CMake 
2.6+ and Crypto++ to compile it.

Warning: This will fill up your hard disk completely, which is not a good idea 
on production machines.

	Guido

[-- Attachment #2: iotester.tgz --]
[-- Type: application/x-compressed-tar, Size: 1864 bytes --]

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 18:04 Random data corruption in VM, possibly caused by rbd Guido Winkelmann
  2012-06-07 18:18 ` Stefan Priebe
@ 2012-06-07 18:40 ` Oliver Francke
  2012-06-07 19:48 ` Josh Durgin
  2 siblings, 0 replies; 30+ messages in thread
From: Oliver Francke @ 2012-06-07 18:40 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel@vger.kernel.org

Hi Guido,

Unfortunately, this sounds very familiar to me. We have been on a long road with similar "weird" errors.
Our setup is roughly: start a couple of VMs (qemu-*), let each of them create a 1 GB file, then randomly seek and write 4 MB blocks filled with md5sums of the block as payload, so that they can be verified once completely written.
Furthermore, create some 10000 files every now and then and try to remove them after the verify run.
This produced the same thing you are experiencing - zeroed blocks - with the main difference that my tests are now clean with 0.47-2 and friends, after a couple of hundred runs.
Our setup uses XFS for the OSD data partition, as we had too many errors with btrfs in the past.
My assumption now would be that this is somehow related to your filesystem...?!
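
A block test of the kind described above could be sketched as follows. This is only one possible reading of "blocks filled with md5sums as payload": each block carries random data followed by the MD5 digest of that data, so every block can be verified on its own. The function names, the file layout and the parameters are assumptions made for illustration, not the actual test setup.

```python
import hashlib
import os
import random

DIGEST_LEN = hashlib.md5().digest_size  # 16 bytes


def make_block(block_size):
    """Return random data followed by the MD5 digest of that data."""
    body = os.urandom(block_size - DIGEST_LEN)
    return body + hashlib.md5(body).digest()


def block_is_valid(block):
    """Check that the trailing digest matches the data in front of it."""
    body, digest = block[:-DIGEST_LEN], block[-DIGEST_LEN:]
    return hashlib.md5(body).digest() == digest


def write_random_blocks(path, file_size, block_size, writes, seed=None):
    """Write self-verifying blocks at random block-aligned offsets.

    Returns the set of block indexes that were written.
    """
    rng = random.Random(seed)
    blocks = file_size // block_size
    touched = set()
    with open(path, "wb") as f:
        f.truncate(file_size)
        for _ in range(writes):
            index = rng.randrange(blocks)
            f.seek(index * block_size)
            f.write(make_block(block_size))
            touched.add(index)
        f.flush()
        os.fsync(f.fileno())
    return touched


def find_bad_blocks(path, block_size, touched):
    """Re-read every written block and return the indexes that fail."""
    bad = []
    with open(path, "rb") as f:
        for index in sorted(touched):
            f.seek(index * block_size)
            if not block_is_valid(f.read(block_size)):
                bad.append(index)
    return bad
```

A zeroed block fails the check in this scheme, because an all-zero digest does not match the MD5 of all-zero data.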

It would be cool if you were able to change your setup to XFS. At least that would be a starting point for further investigation.

Regards,

Oliver.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 18:04 Random data corruption in VM, possibly caused by rbd Guido Winkelmann
  2012-06-07 18:18 ` Stefan Priebe
  2012-06-07 18:40 ` Oliver Francke
@ 2012-06-07 19:48 ` Josh Durgin
  2012-06-07 21:36   ` Guido Winkelmann
  2012-06-08 12:55   ` Guido Winkelmann
  2 siblings, 2 replies; 30+ messages in thread
From: Josh Durgin @ 2012-06-07 19:48 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel@vger.kernel.org

On 06/07/2012 11:04 AM, Guido Winkelmann wrote:
> Hi,
>
> I'm using Ceph with RBD to provide network-transparent disk images for KVM-
> based virtual servers. The last two days, I've been hunting some weird elusive
> bug where data in the virtual machines would be corrupted in weird ways. It
> usually manifests in files having some random data - usually zeroes - at the
> start before the actual contents that should be in there start.

I definitely want to figure out what's going on with this.
A few questions:

Are you using rbd caching? If so, what settings?

In either case, does the corruption still occur if you
switch caching on/off? There are different I/O paths here,
and this might tell us if the problem is on the client side.

Another thing to try is turning off sparse reads on the osd by setting
filestore fiemap threshold = 0

> To track this down, I wrote a simple io tester. It does the following:
>
> - Create 1 Megabyte of random data
> - Calculate the SHA256 hash of that data
> - Write the data to a file on the harddisk, in a given directory, using the
> hash as the filename
> - Repeat until the disk is full
> - Delete the last file (because it is very likely to be incompletely written)
> - Read and delete all the files just written while checking that their sha256
> sums are equal to their filenames
>
> When running this io tester in a VM that uses a qcow2 file on a local harddisk
> for its virtual disk, no errors are found. When the same VM is running using
> rbd, the io tester finds on average about one corruption every 200 Megabytes,
> reproducibly.
>
> (As an interesting aside, the io tester also prints how long it took to
> read or write 100 MB, and it turns out reading the data back in again is about
> three times slower than writing them in the first place...)
>
> Ceph is version 0.47.2. Qemu KVM is 1.0, compiled with the spec file from
> http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=summary
> (And compiled after ceph 0.47.2 was installed on that machine, so it would use
> the correct headers...)
> Both the Ceph cluster and the KVM host machines are running on Fedora 16, with
> a fairly recent 3.3.x kernel.

Those versions should all work.

> The ceph cluster uses btrfs for the osd's data dirs. The journal is on a tmpfs.
> (This is not a production setup - luckily.)
> The virtual machine is using ext4 as its filesystem.
> There were no obvious other problems with either the ceph cluster or the KVM
> host machines.

Were there any nodes with osds restarted during the test runs? I wonder
if it's a problem with losing the tmpfs journal.

As Oliver suggested, switching the osd data dir filesystem might help
too.

> I have attached a copy of the ceph.conf in use, in case it might be helpful.
>
> This is a huge problem, and any help in tracking it down would be much
> appreciated.

Agreed, and I'm happy to help.

Josh

> Regards,
>
> 	Guido

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 18:37   ` Guido Winkelmann
@ 2012-06-07 19:54     ` Andrey Korolyov
  2012-06-07 21:03       ` Guido Winkelmann
  2012-06-07 21:53     ` Marcus Sorensen
  1 sibling, 1 reply; 30+ messages in thread
From: Andrey Korolyov @ 2012-06-07 19:54 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel

Hmm, can't reproduce that (phew!). Qemu-1.1-release, 0.47.2, guest/host
mainly debian wheezy. The one main difference between my setup and
yours is the underlying fs - I got tired of btrfs's unpredictable load
issues and moved back to xfs.

BTW, you calculate sha1 in the test suite, not sha256 as you mentioned above.

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 19:54     ` Andrey Korolyov
@ 2012-06-07 21:03       ` Guido Winkelmann
  0 siblings, 0 replies; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-07 21:03 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

On Thursday 07 June 2012 23:54:04 Andrey Korolyov wrote:
> Hmm, can't reproduce that (phew!). Qemu-1.1-release, 0.47.2, guest/host
> mainly debian wheezy. The one main difference between my setup and
> yours is the underlying fs - I got tired of btrfs's unpredictable load
> issues and moved back to xfs.

I guess I'll try that tomorrow as well. Is there a guide somewhere for 
switching from btrfs to xfs?
 
> BTW, you calculate sha1 in the test suite, not sha256 as you mentioned above.

True. I must have mixed that up somehow. Not that it matters - for what I'm 
doing with it here, I might as well be using CRC32.

	Guido

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 19:48 ` Josh Durgin
@ 2012-06-07 21:36   ` Guido Winkelmann
  2012-06-07 22:13     ` Tommi Virtanen
  2012-06-08 12:55   ` Guido Winkelmann
  1 sibling, 1 reply; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-07 21:36 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel@vger.kernel.org

On Thursday 07 June 2012 12:48:05 Josh Durgin wrote:
> On 06/07/2012 11:04 AM, Guido Winkelmann wrote:
> > Hi,
> > 
> > I'm using Ceph with RBD to provide network-transparent disk images for
> > KVM-based virtual servers. The last two days, I've been hunting some weird
> > elusive bug where data in the virtual machines would be corrupted in
> > weird ways. It usually manifests in files having some random data -
> > usually zeroes - at the start before the actual contents that should be
> > in there start.
> 
> I definitely want to figure out what's going on with this.
> A few questions:
> 
> Are you using rbd caching? If so, what settings?

I'm not using rbd caching, and I wasn't planning on even trying before I have 
a much better understanding of how it affects VM migration.

> In either case, does the corruption still occur if you
> switch caching on/off? There are different I/O paths here,
> and this might tell us if the problem is on the client side.
> 
> Another thing to try is turning off sparse reads on the osd by setting
> filestore fiemap threshold = 0

Okay, I will try these things tomorrow.

[...]
> > The ceph cluster uses btrfs for the osd's data dirs. The journal is on a
> > tmpfs. (This is not a production setup - luckily.)
> > The virtual machine is using ext4 as its filesystem.
> > There were no obvious other problems with either the ceph cluster or the
> > KVM host machines.
> 
> Were there any nodes with osds restarted during the test runs? I wonder
> if it's a problem with losing the tmpfs journal.

No, from the point when the rbd volume was created, all nodes were online all 
the time. No nodes were added or removed.

> As Oliver suggested, switching the osd data dir filesystem might help
> too.

Again, I'll try that tomorrow. BTW, I could use some advice on how to go about 
that. Right now, I would stop one osd process (not the whole machine), reformat 
and remount its btrfs devices as XFS, delete the journal, restart the osd, wait 
until the cluster is healthy again, and repeat for all the osds in the cluster. 
Is that sufficient?

Oh, one other thing I just thought of:
The rbd volume in question was created as a copy, using the rbd cp command, 
from a template volume. I cannot recall seeing any corruption while using the 
original volume (which was created using rbd import). Maybe the bug only bites 
volumes that have been created as copies of other volumes? I'll have to do 
more tests along those lines as well...

Regards,
	Guido

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 18:37   ` Guido Winkelmann
  2012-06-07 19:54     ` Andrey Korolyov
@ 2012-06-07 21:53     ` Marcus Sorensen
  2012-06-07 22:12       ` Guido Winkelmann
  1 sibling, 1 reply; 30+ messages in thread
From: Marcus Sorensen @ 2012-06-07 21:53 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: Stefan Priebe, ceph-devel@vger.kernel.org

Maybe I did something wrong with your iotester, but I had to mkdir
./iotest to get it to run. I straced and found that it died on 'no
such file'.

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 21:53     ` Marcus Sorensen
@ 2012-06-07 22:12       ` Guido Winkelmann
  0 siblings, 0 replies; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-07 22:12 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: Stefan Priebe, ceph-devel@vger.kernel.org

On Thursday 07 June 2012 15:53:18 Marcus Sorensen wrote:
> Maybe I did something wrong with your iotester, but I had to mkdir
> ./iotest to get it to run. I straced and found that it died on 'no
> such file'.

It's a bit quick and dirty... You are supposed to pass the directory where it 
should put its test files on the command line, like

./iotester /mnt/filesystem_to_test

	Guido

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 21:36   ` Guido Winkelmann
@ 2012-06-07 22:13     ` Tommi Virtanen
  0 siblings, 0 replies; 30+ messages in thread
From: Tommi Virtanen @ 2012-06-07 22:13 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: Josh Durgin, ceph-devel@vger.kernel.org

On Thu, Jun 7, 2012 at 2:36 PM, Guido Winkelmann
<guido-ceph@thisisnotatest.de> wrote:
> Again, I'll try that tomorrow. BTW, I could use some advice on how to go about
> that. Right now, I would stop one osd process (not the whole machine), reformat
> and remount its btrfs devices as XFS, delete the journal, restart the osd, wait
> until the cluster is healthy again, and repeat for all the osds in the cluster.
> Is that sufficient?

Before restarting the osd, you need to do a ceph-osd --mkfs.
Otherwise, yeah that looks good.

> The rbd volume in question was created as a copy, using the rbd cp command,
> from a template volume. I cannot recall seeing any corruption while using the
> original volume (which was created using rbd import). Maybe the bug only bites
> volumes that have been created as copies of other volumes? I'll have to do
> more tests along those lines as well...

Hmm. There should be no difference between the end results of rbd cp
and rbd import.

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-07 19:48 ` Josh Durgin
  2012-06-07 21:36   ` Guido Winkelmann
@ 2012-06-08 12:55   ` Guido Winkelmann
  2012-06-08 13:08     ` Guido Winkelmann
  2012-06-08 13:36     ` Oliver Francke
  1 sibling, 2 replies; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-08 12:55 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On Thursday, 7 June 2012, 12:48:05, you wrote:
> On 06/07/2012 11:04 AM, Guido Winkelmann wrote:
> > Hi,
> > 
> > I'm using Ceph with RBD to provide network-transparent disk images for
> > KVM-based virtual servers. The last two days, I've been hunting some weird
> > elusive bug where data in the virtual machines would be corrupted in
> > weird ways. It usually manifests in files having some random data -
> > usually zeroes - at the start before the actual contents that should be
> > in there start.
> 
> I definitely want to figure out what's going on with this.
> A few questions:
> 
> Are you using rbd caching? If so, what settings?
> 
> In either case, does the corruption still occur if you
> switch caching on/off? There are different I/O paths here,
> and this might tell us if the problem is on the client side.

Okay, I've tried enabling rbd caching now, and so far, the problem appears to 
be gone.

I am using libvirt for starting and managing the virtual machines, and what I 
did was change the <source> element for the virtual disk from

<source protocol='rbd' name='rbd/name_of_image'>

to

<source protocol='rbd' name='rbd/name_of_image:rbd_cache=true'>

and then restart the VM.
(I found that in one of your mails on this list; there does not appear to be 
any proper documentation on this...)
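
The same edit can be scripted. A minimal sketch in Python, assuming the domain 
XML is available as a string (for example from `virsh dumpxml`); the function 
name is made up for illustration, and the only thing it does is append the 
`:rbd_cache=true` suffix shown above to rbd `<source>` elements:

```python
import xml.etree.ElementTree as ET


def enable_rbd_cache(domain_xml):
    """Append ':rbd_cache=true' to the name of every rbd <source> element.

    Sources that already carry an rbd_cache option are left untouched.
    Returns the modified XML as a string.
    """
    root = ET.fromstring(domain_xml)
    for source in root.iter("source"):
        if source.get("protocol") != "rbd":
            continue  # not an rbd-backed disk
        name = source.get("name", "")
        if "rbd_cache=" not in name:
            source.set("name", name + ":rbd_cache=true")
    return ET.tostring(root, encoding="unicode")
```

The VM still has to be restarted with the new definition for the option to 
take effect.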

The iotester does not find any corruptions with these settings.

The VM is still horribly broken, but that's probably lingering filesystem 
damage from yesterday. I'll try with a fresh image next.

I did not change anything else in the setup. In particular, the OSDs still use 
btrfs. One of the OSDs has been restarted, though. I will run another test with 
a VM without rbd caching, to make sure it wasn't restarting that one osd, by 
random chance, that made the real difference.

Enabling rbd caching did not appear to make any difference wrt performance, but 
that's probably because my tests mostly create sustained sequential IO, for 
which caches are generally not very helpful.

Enabling rbd caching is not a solution I particularly like, for two reasons:

1. In my setup, migrating VMs from one host to another is a normal part of 
operation, and I still don't know how to prevent data corruption (in the form 
of silently lost writes) when combining rbd caching and migration.

2. I'm not really looking into speeding up a single VM, I'm really more 
interested in just how many VMs I can run before performance starts degrading 
for everyone, and I don't think rbd caching will help with that.

Regards,
	Guido

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-08 12:55   ` Guido Winkelmann
@ 2012-06-08 13:08     ` Guido Winkelmann
  2012-06-08 13:36     ` Oliver Francke
  1 sibling, 0 replies; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-08 13:08 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On Friday, 8 June 2012, 14:55:44, Guido Winkelmann wrote:
> I did not change anything else in the setup. In particular, the OSDs still
> use btrfs. One of the OSDs has been restarted, though. I will run another
> test with a VM without rbd caching, to make sure it wasn't by random chance
> restarting that one osd that made the real difference.

Did that now, the problem is still reproducible:

Start a VM normally -> Lots and lots of data corruption (and it seems to be 
getting worse...)
Start a VM with rbd caching -> No data corruption

	Guido

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-08 12:55   ` Guido Winkelmann
  2012-06-08 13:08     ` Guido Winkelmann
@ 2012-06-08 13:36     ` Oliver Francke
  2012-06-08 13:55       ` Sage Weil
  1 sibling, 1 reply; 30+ messages in thread
From: Oliver Francke @ 2012-06-08 13:36 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: Josh Durgin, ceph-devel

Hi Guido,

Yeah, there is something weird going on. I just started to set up 
some test VMs, freshly imported from running *.qcow2 images.
Kernel panic with INIT, segfaults and other "funny" stuff.

I just added rbd_cache=true to my config, and voila: all is 
fast-n-up-n-running...
All my testing was done with the cache enabled... since our errors all came 
from rbd_writeback in former ceph versions...

Josh? Sage? Help?!

Oliver.


-- 

Oliver Francke

filoo GmbH
Moltkestraße 25a
33330 Gütersloh
HRB4355 AG Gütersloh

Managing directors: S.Grewing | J.Rehpöhler | C.Kunz

Follow us on Twitter: http://twitter.com/filoogmbh

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-08 13:36     ` Oliver Francke
@ 2012-06-08 13:55       ` Sage Weil
  2012-06-08 14:50         ` Josh Durgin
  2012-06-11 15:50         ` Guido Winkelmann
  0 siblings, 2 replies; 30+ messages in thread
From: Sage Weil @ 2012-06-08 13:55 UTC (permalink / raw)
  To: Oliver Francke; +Cc: Guido Winkelmann, Josh Durgin, ceph-devel

On Fri, 8 Jun 2012, Oliver Francke wrote:
> Hi Guido,
> 
> yeah, there is something weird going on. I just started to establish some
> test-VM's. Freshly imported from running *.qcow2 images.
> Kernel panic with INIT, seg-faults and other "funny" stuff.
> 
> Just added the rbd_cache=true in my config, voila. All is
> fast-n-up-n-running...
> All my testing was done with cache enabled... Since our errors all came from
> rbd_writeback from former ceph-versions...

Are you guys able to reproduce the corruption with 'debug osd = 20' and 
'debug ms = 1'?  Ideally we'd like to:

 - reproduce from a fresh vm, with osd logs
 - identify the bad file
 - map that file to a block offset (see 
   http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
 - use that to identify the badness in the log

I suspect the cache is just masking the problem because it submits fewer 
IOs...

sage




* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-08 13:55       ` Sage Weil
@ 2012-06-08 14:50         ` Josh Durgin
  2012-06-08 15:39           ` Oliver Francke
  2012-06-08 17:15           ` Guido Winkelmann
  2012-06-11 15:50         ` Guido Winkelmann
  1 sibling, 2 replies; 30+ messages in thread
From: Josh Durgin @ 2012-06-08 14:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: Oliver Francke, Guido Winkelmann, ceph-devel

On 06/08/2012 06:55 AM, Sage Weil wrote:
> On Fri, 8 Jun 2012, Oliver Francke wrote:
>> Hi Guido,
>>
>> yeah, there is something weird going on. I just started to establish some
>> test-VM's. Freshly imported from running *.qcow2 images.
>> Kernel panic with INIT, seg-faults and other "funny" stuff.
>>
>> Just added the rbd_cache=true in my config, voila. All is
>> fast-n-up-n-running...
>> All my testing was done with cache enabled... Since our errors all came from
>> rbd_writeback from former ceph-versions...
>
> Are you guys able to reproduce the corruption with 'debug osd = 20' and
> 'debug ms = 1'?  Ideally we'd like to:
>
>   - reproduce from a fresh vm, with osd logs
>   - identify the bad file
>   - map that file to a block offset (see
>     http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
>   - use that to identify the badness in the log
>
> I suspect the cache is just masking the problem because it submits fewer
> IOs...

The cache also doesn't do sparse reads. Is it still reproducible with
a fresh vm when you set filestore_fiemap_threshold = 0 for the osds,
and run without rbd caching?

Josh



* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-08 14:50         ` Josh Durgin
@ 2012-06-08 15:39           ` Oliver Francke
  2012-06-08 17:15           ` Guido Winkelmann
  1 sibling, 0 replies; 30+ messages in thread
From: Oliver Francke @ 2012-06-08 15:39 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, Guido Winkelmann, ceph-devel

Well then,

quite busy, too with some other stuff, but...


On 06/08/2012 04:50 PM, Josh Durgin wrote:
> On 06/08/2012 06:55 AM, Sage Weil wrote:
>> On Fri, 8 Jun 2012, Oliver Francke wrote:
>>> Hi Guido,
>>>
>>> yeah, there is something weird going on. I just started to establish 
>>> some
>>> test-VM's. Freshly imported from running *.qcow2 images.
>>> Kernel panic with INIT, seg-faults and other "funny" stuff.
>>>
>>> Just added the rbd_cache=true in my config, voila. All is
>>> fast-n-up-n-running...
>>> All my testing was done with cache enabled... Since our errors all 
>>> came from
>>> rbd_writeback from former ceph-versions...
>>
>> Are you guys able to reproduce the corruption with 'debug osd = 20' and
>> 'debug ms = 1'?  Ideally we'd like to:
>>
>>   - reproduce from a fresh vm, with osd logs
>>   - identify the bad file
>>   - map that file to a block offset (see
>>     http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
>>   - use that to identify the badness in the log

a logfile with debugging is available at our local store...

>>
>> I suspect the cache is just masking the problem because it submits fewer
>> IOs...
>
> The cache also doesn't do sparse reads. Is it still reproducible with
> a fresh vm when you set filestore_fiemap_threshold = 0 for the osds,
> and run without rbd caching?

restarted OSDs with this setting, but without rbd_cache I still get 
errors. *sigh*


Oliver.




* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-08 14:50         ` Josh Durgin
  2012-06-08 15:39           ` Oliver Francke
@ 2012-06-08 17:15           ` Guido Winkelmann
  2012-06-10  3:04             ` Sage Weil
  1 sibling, 1 reply; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-08 17:15 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, Oliver Francke, ceph-devel

On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote:
> On 06/08/2012 06:55 AM, Sage Weil wrote:
> > On Fri, 8 Jun 2012, Oliver Francke wrote:
> >> Hi Guido,
> >> 
> >> yeah, there is something weird going on. I just started to establish some
> >> test-VM's. Freshly imported from running *.qcow2 images.
> >> Kernel panic with INIT, seg-faults and other "funny" stuff.
> >> 
> >> Just added the rbd_cache=true in my config, voila. All is
> >> fast-n-up-n-running...
> >> All my testing was done with cache enabled... Since our errors all came
> >> from rbd_writeback from former ceph-versions...
> > 
> > Are you guys able to reproduce the corruption with 'debug osd = 20' and
> > 
> > 'debug ms = 1'?  Ideally we'd like to:
> >   - reproduce from a fresh vm, with osd logs
> >   - identify the bad file
> >   - map that file to a block offset (see
> >   
> >     http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> >   
> >   - use that to identify the badness in the log
> > 
> > I suspect the cache is just masking the problem because it submits fewer
> > IOs...
> 
> The cache also doesn't do sparse reads. Is it still reproducible with
> a fresh vm when you set filestore_fiemap_threshold = 0 for the osds,
> and run without rbd caching?

I have set filestore_fiemap_threshold = 0 on all osds and restarted them. The 
problem is still there, and so bad I cannot even run this fiemap utility that 
Sage posted. I guess I should have tried booting the VM from a livecd 
instead...

	Guido


* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-08 17:15           ` Guido Winkelmann
@ 2012-06-10  3:04             ` Sage Weil
  2012-06-10  3:07               ` Sage Weil
  2012-06-11 14:15               ` Guido Winkelmann
  0 siblings, 2 replies; 30+ messages in thread
From: Sage Weil @ 2012-06-10  3:04 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: Josh Durgin, Sage Weil, Oliver Francke, ceph-devel

On Fri, 8 Jun 2012, Guido Winkelmann wrote:
> On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote:
> > On 06/08/2012 06:55 AM, Sage Weil wrote:
> > > On Fri, 8 Jun 2012, Oliver Francke wrote:
> > >> Hi Guido,
> > >> 
> > >> yeah, there is something weird going on. I just started to establish some
> > >> test-VM's. Freshly imported from running *.qcow2 images.
> > >> Kernel panic with INIT, seg-faults and other "funny" stuff.
> > >> 
> > >> Just added the rbd_cache=true in my config, voila. All is
> > >> fast-n-up-n-running...
> > >> All my testing was done with cache enabled... Since our errors all came
> > >> from rbd_writeback from former ceph-versions...
> > > 
> > > Are you guys able to reproduce the corruption with 'debug osd = 20' and
> > > 
> > > 'debug ms = 1'?  Ideally we'd like to:
> > >   - reproduce from a fresh vm, with osd logs
> > >   - identify the bad file
> > >   - map that file to a block offset (see
> > >   
> > >     http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> > >   
> > >   - use that to identify the badness in the log
> > > 
> > > I suspect the cache is just masking the problem because it submits fewer
> > > IOs...
> > 
> > The cache also doesn't do sparse reads. Is it still reproducible with
> > a fresh vm when you set filestore_fiemap_threshold = 0 for the osds,
> > and run without rbd caching?
> 
> I have set filestore_fiemap_threshold = 0 on all osds and restarted them. The 
> problem is still there, and so bad I cannot even run this fiemap utility that 
> Sage posted. I guess I should have tried booting the VM from a livecd 
> instead...

Whoops,

	filestore fiemap threshold = 0

doesn't turn it off, but

	filestore fiemap = false

will.  (Or filestore fiemap threshold = 1000000000 would too.)  Can you 
try again with the above?

Thanks!
sage


* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-10  3:04             ` Sage Weil
@ 2012-06-10  3:07               ` Sage Weil
  2012-06-11 14:15               ` Guido Winkelmann
  1 sibling, 0 replies; 30+ messages in thread
From: Sage Weil @ 2012-06-10  3:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: Guido Winkelmann, Josh Durgin, Oliver Francke, ceph-devel

On Sat, 9 Jun 2012, Sage Weil wrote:
> On Fri, 8 Jun 2012, Guido Winkelmann wrote:
> > On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote:
> > > On 06/08/2012 06:55 AM, Sage Weil wrote:
> > > > On Fri, 8 Jun 2012, Oliver Francke wrote:
> > > >> Hi Guido,
> > > >> 
> > > >> yeah, there is something weird going on. I just started to establish some
> > > >> test-VM's. Freshly imported from running *.qcow2 images.
> > > >> Kernel panic with INIT, seg-faults and other "funny" stuff.
> > > >> 
> > > >> Just added the rbd_cache=true in my config, voila. All is
> > > >> fast-n-up-n-running...
> > > >> All my testing was done with cache enabled... Since our errors all came
> > > >> from rbd_writeback from former ceph-versions...
> > > > 
> > > > Are you guys able to reproduce the corruption with 'debug osd = 20' and
> > > > 
> > > > 'debug ms = 1'?  Ideally we'd like to:
> > > >   - reproduce from a fresh vm, with osd logs
> > > >   - identify the bad file
> > > >   - map that file to a block offset (see
> > > >   
> > > >     http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> > > >   
> > > >   - use that to identify the badness in the log
> > > > 
> > > > I suspect the cache is just masking the problem because it submits fewer
> > > > IOs...
> > > 
> > > The cache also doesn't do sparse reads. Is it still reproducible with
> > > a fresh vm when you set filestore_fiemap_threshold = 0 for the osds,
> > > and run without rbd caching?
> > 
> > I have set filestore_fiemap_threshold = 0 on all osds and restarted them. The 
> > problem is still there, and so bad I cannot even run this fiemap utility that 
> > Sage posted. I guess I should have tried booting the VM from a livecd 
> > instead...
> 
> Whoops,
> 
> 	filestore fiemap threshold = 0
> 
> doesn't turn it off, but
> 
> 	filestore fiemap = false
> 
> will.  (Or filestore fiemap threshold = 1000000000 would too.)  Can you 
> try again with the above?

BTW, I opened http://tracker.newdream.net/issues/2535 with (I think) all 
the relevant info so far.

sage


* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-10  3:04             ` Sage Weil
  2012-06-10  3:07               ` Sage Weil
@ 2012-06-11 14:15               ` Guido Winkelmann
  1 sibling, 0 replies; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-11 14:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, Oliver Francke, ceph-devel

On Saturday, 9 June 2012, 20:04:20, Sage Weil wrote:
> On Fri, 8 Jun 2012, Guido Winkelmann wrote:
> > On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote:
> > > On 06/08/2012 06:55 AM, Sage Weil wrote:
> > > > On Fri, 8 Jun 2012, Oliver Francke wrote:
> > > >> Hi Guido,
> > > >> 
> > > >> yeah, there is something weird going on. I just started to establish
> > > >> some
> > > >> test-VM's. Freshly imported from running *.qcow2 images.
> > > >> Kernel panic with INIT, seg-faults and other "funny" stuff.
> > > >> 
> > > >> Just added the rbd_cache=true in my config, voila. All is
> > > >> fast-n-up-n-running...
> > > >> All my testing was done with cache enabled... Since our errors all
> > > >> came
> > > >> from rbd_writeback from former ceph-versions...
> > > > 
> > > > Are you guys able to reproduce the corruption with 'debug osd = 20'
> > > > and
> > > > 
> > > > 'debug ms = 1'?  Ideally we'd like to:
> > > >   - reproduce from a fresh vm, with osd logs
> > > >   - identify the bad file
> > > >   - map that file to a block offset (see
> > > >   
> > > >     http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> > > >   
> > > >   - use that to identify the badness in the log
> > > > 
> > > > I suspect the cache is just masking the problem because it submits
> > > > fewer
> > > > IOs...
> > > 
> > > The cache also doesn't do sparse reads. Is it still reproducible with
> > > a fresh vm when you set filestore_fiemap_threshold = 0 for the osds,
> > > and run without rbd caching?
> > 
> > I have set filestore_fiemap_threshold = 0 on all osds and restarted them.
> > The problem is still there, and so bad I cannot even run this fiemap
> > utility that Sage posted. I guess I should have tried booting the VM from
> > a livecd instead...
> 
> Whoops,
> 
> 	filestore fiemap threshold = 0
> 
> doesn't turn it off, but
> 
> 	filestore fiemap = false

Okay, I changed "filestore fiemap threshold = 0" to "filestore fiemap = false" 
under [osd]. So far, the problem does not seem to resurface.
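
For anyone repeating this, a quick way to confirm which value the [osd] section actually carries (and that it is not just the threshold line) might look like the following. This is only a sketch: the function name osd_fiemap_setting is made up for illustration, and the path of the config file is left to the caller because it can differ between installations.

```shell
# Hypothetical helper: print the value of "filestore fiemap" from the
# [osd] section of a ceph.conf-style file. The option may be spelled
# with a space or an underscore; "filestore fiemap threshold" is a
# different option and is deliberately not matched.
osd_fiemap_setting() {
    awk -F= '
        # Any section header switches the flag on or off.
        /^[ \t]*\[/ { in_osd = ($0 ~ /^[ \t]*\[osd\]/); next }
        in_osd && $1 ~ /^[ \t]*filestore[ _]fiemap[ \t]*$/ {
            gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim the value
            print $2
        }
    ' "$1"
}
```

Calling it as osd_fiemap_setting /etc/ceph/ceph.conf (assuming that is where the config lives) should print false once the change is in place. The OSDs still have to be restarted for the setting to take effect, as was done above.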

	Guido



* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-08 13:55       ` Sage Weil
  2012-06-08 14:50         ` Josh Durgin
@ 2012-06-11 15:50         ` Guido Winkelmann
  2012-06-11 16:30           ` Sage Weil
  1 sibling, 1 reply; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-11 15:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: Oliver Francke, Josh Durgin, ceph-devel

On Friday, 8 June 2012, 06:55:19, Sage Weil wrote:
> On Fri, 8 Jun 2012, Oliver Francke wrote:
> > Hi Guido,
> > 
> > yeah, there is something weird going on. I just started to establish some
> > test-VM's. Freshly imported from running *.qcow2 images.
> > Kernel panic with INIT, seg-faults and other "funny" stuff.
> > 
> > Just added the rbd_cache=true in my config, voila. All is
> > fast-n-up-n-running...
> > All my testing was done with cache enabled... Since our errors all came
> > from rbd_writeback from former ceph-versions...
> 
> Are you guys able to reproduce the corruption with 'debug osd = 20' and
> 'debug ms = 1'?  Ideally we'd like to:
> 
>  - reproduce from a fresh vm, with osd logs
>  - identify the bad file
>  - map that file to a block offset (see
>    http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
>  - use that to identify the badness in the log
> 
> I suspect the cache is just masking the problem because it submits fewer
> IOs...

Okay, I added 'debug osd = 20' and 'debug ms = 1' under [global] and 
'filestore fiemap = false' under [osd] and started a new VM. That worked 
nicely, and the iotester found no corruptions. Then I removed 'filestore 
fiemap = false' from the config, restarted all osds and ran the iotester 
again. Output is as follows:

testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
Mon Jun 11 17:34:44 CEST 2012
Wrote 100 MiB of data in 1943 milliseconds
Wrote 100 MiB of data in 1858 milliseconds
Wrote 100 MiB of data in 2213 milliseconds
Wrote 100 MiB of data in 3441 milliseconds
Wrote 100 MiB of data in 2705 milliseconds
Wrote 100 MiB of data in 1778 milliseconds
Wrote 100 MiB of data in 1974 milliseconds
Wrote 100 MiB of data in 2780 milliseconds
Wrote 100 MiB of data in 1961 milliseconds
Wrote 100 MiB of data in 2366 milliseconds
Wrote 100 MiB of data in 1886 milliseconds
Wrote 100 MiB of data in 3589 milliseconds
Wrote 100 MiB of data in 1973 milliseconds
Wrote 100 MiB of data in 2506 milliseconds
Wrote 100 MiB of data in 1937 milliseconds
Wrote 100 MiB of data in 3404 milliseconds
Wrote 100 MiB of data in 1990 milliseconds
Wrote 100 MiB of data in 3713 milliseconds
Read 100 MiB of data in 4856 milliseconds
Digest wrong for file "/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa"
Mon Jun 11 17:35:34 CEST 2012
testserver-rbd11 iotester # ~/fiemap 
/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa
File /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa has 1 extents:
#       Logical          Physical         Length           Flags
0:      0000000000000000 00000000a8200000 0000000000100000 000

I looked into the file in question, and it started with zero bytes from the 
start until position 0xbff, even though it was supposed to be all random data.
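
The length of such a zero run can be measured without opening the file in a hex editor. The following is a minimal sketch using only od, tr and awk; the function name leading_zero_bytes is hypothetical and not part of the iotester.

```shell
# Hypothetical helper: count how many zero bytes a file starts with.
leading_zero_bytes() {
    # Dump every byte as two hex digits, one per line, then count
    # lines until the first one that is not "00".
    od -An -v -tx1 "$1" | tr -s ' ' '\n' | awk '
        NF { if ($1 != "00") exit; n++ }
        END { print n + 0 }
    '
}
```

If the zeros run through position 0xbff inclusive, this would report 3072 for the file above. It walks the file byte by byte, so it is slow on large files but adequate for 1 MiB test files.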

I have included timestamps in the hopes they might make it easier to find the 
related entries in the logs.

So what do I do now? The logs are very large and complex, and I don't 
understand most of what's in there. I don't even know which OSD served that 
particular block/object.

Regards,

	Guido



* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-11 15:50         ` Guido Winkelmann
@ 2012-06-11 16:30           ` Sage Weil
  2012-06-11 17:07             ` Guido Winkelmann
  2012-06-12 12:31             ` Guido Winkelmann
  0 siblings, 2 replies; 30+ messages in thread
From: Sage Weil @ 2012-06-11 16:30 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: Sage Weil, Oliver Francke, Josh Durgin, ceph-devel

On Mon, 11 Jun 2012, Guido Winkelmann wrote:
> On Friday, 8 June 2012, 06:55:19, Sage Weil wrote:
> > On Fri, 8 Jun 2012, Oliver Francke wrote:
> > > Hi Guido,
> > > 
> > > yeah, there is something weird going on. I just started to establish some
> > > test-VM's. Freshly imported from running *.qcow2 images.
> > > Kernel panic with INIT, seg-faults and other "funny" stuff.
> > > 
> > > Just added the rbd_cache=true in my config, voila. All is
> > > fast-n-up-n-running...
> > > All my testing was done with cache enabled... Since our errors all came
> > > from rbd_writeback from former ceph-versions...
> > 
> > Are you guys able to reproduce the corruption with 'debug osd = 20' and
> > 'debug ms = 1'?  Ideally we'd like to:
> > 
> >  - reproduce from a fresh vm, with osd logs
> >  - identify the bad file
> >  - map that file to a block offset (see
> >    http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> >  - use that to identify the badness in the log
> > 
> > I suspect the cache is just masking the problem because it submits fewer
> > IOs...
> 
> Okay, I added 'debug osd = 20' and 'debug ms = 1' under [global] and 
> 'filestore fiemap = false' under [osd] and started a new VM. That worked 
> nicely, and the iotester found no corruptions. Then I removed 'filestore 
> fiemap = false' from the config, restarted all osds and ran the iotester 
> again. Output is as follows:
> 
> testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
> Mon Jun 11 17:34:44 CEST 2012
> Wrote 100 MiB of data in 1943 milliseconds
> Wrote 100 MiB of data in 1858 milliseconds
> Wrote 100 MiB of data in 2213 milliseconds
> Wrote 100 MiB of data in 3441 milliseconds
> Wrote 100 MiB of data in 2705 milliseconds
> Wrote 100 MiB of data in 1778 milliseconds
> Wrote 100 MiB of data in 1974 milliseconds
> Wrote 100 MiB of data in 2780 milliseconds
> Wrote 100 MiB of data in 1961 milliseconds
> Wrote 100 MiB of data in 2366 milliseconds
> Wrote 100 MiB of data in 1886 milliseconds
> Wrote 100 MiB of data in 3589 milliseconds
> Wrote 100 MiB of data in 1973 milliseconds
> Wrote 100 MiB of data in 2506 milliseconds
> Wrote 100 MiB of data in 1937 milliseconds
> Wrote 100 MiB of data in 3404 milliseconds
> Wrote 100 MiB of data in 1990 milliseconds
> Wrote 100 MiB of data in 3713 milliseconds
> Read 100 MiB of data in 4856 milliseconds
> Digest wrong for file "/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa"
> Mon Jun 11 17:35:34 CEST 2012
> testserver-rbd11 iotester # ~/fiemap 
> /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa
> File /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa has 1 extents:
> #       Logical          Physical         Length           Flags
> 0:      0000000000000000 00000000a8200000 0000000000100000 000
> 
> I looked into the file in question, and it started with zero-bytes from the 
> start until position 0xbff, even though it was supposed to all random data.
> 
> I have included timestamps in the hopes they might make it easier to find the 
> related entries in the logs.
> 
> So what do I do now? The logs are very large and complex, and I don't 
> understand most of what's in there. I don't even know which OSD served that 
> particular block/object.

If you can reproduce it with 'debug filestore = 20' too, that will be 
better, as it will tell us what the FIEMAP ioctl is returning.  Also, if 
you can attach/post the contents of the object itself (rados -p rbd get 
rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right 
data (and the sparse-read operation that librbd is doing is the culprit).

As for the log:

First, map the offset to an rbd block.  For example, taking the 'Physical' 
value of 00000000a8200000 from above:

$ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) ))
0000000002a0

Then figure out what the object name prefix is:

$ rbd info <imagename> | grep prefix
        block_name_prefix: rb.0.1

Then append the block number, 0000000002a0, to that prefix, e.g. rb.0.1.0000000002a0.

Then map that back to an osd with

$ ceph osd map rbd rb.0.1.0000000002a0
osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65 
(2.5) -> up [0,2] acting [0,2]

You'll see the osd ids listed in brackets after 'acting'.  We want the 
first one, 0 in my example.  The log from that OSD is what we need.
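
The offset-to-object mapping above can also be written as a small helper. This is only a sketch: the function names and the partition_start parameter are made up for illustration, the 4 MiB object size and the 12-digit hex suffix are taken from the printf example, and it assumes the 'Physical' value counts from the start of the rbd image. If the filesystem lives on a partition inside the image, the partition's starting byte offset would presumably have to be added first.

```python
OBJECT_SIZE = 4096 * 1024  # 4 MiB, as in the printf example above


def rbd_object_name(prefix, physical, partition_start=0, object_size=OBJECT_SIZE):
    """Name of the rbd object holding image byte `partition_start + physical`.

    `partition_start` is a hypothetical knob for filesystems that sit on a
    partition inside the image; leave it at 0 to mirror the printf example.
    """
    index = (partition_start + physical) // object_size
    return "%s.%012x" % (prefix, index)


def offset_in_object(physical, partition_start=0, object_size=OBJECT_SIZE):
    """Byte offset of that position inside its object."""
    return (partition_start + physical) % object_size


print(rbd_object_name("rb.0.1", 0x00000000a8200000))  # rb.0.1.0000000002a0
print(hex(offset_in_object(0x00000000a8200000)))      # 0x200000
```

The second value says where inside the 4 MiB object the file's extent should start, which helps when comparing the object contents against the file.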

Thanks!
sage

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-11 16:30           ` Sage Weil
@ 2012-06-11 17:07             ` Guido Winkelmann
  2012-06-11 17:12               ` Sage Weil
  2012-06-11 17:29               ` Josh Durgin
  2012-06-12 12:31             ` Guido Winkelmann
  1 sibling, 2 replies; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-11 17:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: Oliver Francke, Josh Durgin, ceph-devel

On Monday, 11 June 2012, 09:30:42, Sage Weil wrote:
> If you can reproduce it with 'debug filestore = 20' too, that will be
> better, as it will tell us what the FIEMAP ioctl is returning.  Also, if
> you can attach/post the contents of the object itself (rados -p rbd get
> rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right
> data (and the sparse-read operation that librbd is doing is the culprit).

Um. Maybe... That's the problem with using random data, I can't just look at 
it and recognize it. I guess tomorrow I'll slap something together to see if I 
can find any 1 MiB range of data in there that matches the expected checksum.
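
A minimal sketch of such a search, under a few assumptions that are not confirmed anywhere in this thread: the function name is made up, the file contents are taken to start on a 512-byte boundary inside the object (hence the step), and the hash defaults to SHA-1 only because the file names in the output above are 40 hex digits long. Pass a different algorithm name if the tester uses another hash.

```python
import hashlib


def find_matching_window(blob, expected_hex, window=1024 * 1024,
                         step=512, algorithm="sha1"):
    """Return the first offset whose `window`-byte slice hashes to
    `expected_hex`, or None if no slice matches."""
    expected = expected_hex.lower()
    view = memoryview(blob)  # slicing a memoryview avoids copying each window
    for start in range(0, len(blob) - window + 1, step):
        digest = hashlib.new(algorithm, view[start:start + window]).hexdigest()
        if digest == expected:
            return start
    return None


# Hypothetical usage with an object fetched via `rados -p rbd get ... /tmp/foo`:
#
#   with open("/tmp/foo", "rb") as f:
#       blob = f.read()
#   print(find_matching_window(blob, "b45e1a59f26830fb98af1ba9ed1360106f5580aa"))
```

With a 4 MiB object and a 512-byte step this hashes a few thousand 1 MiB windows, so it is slow but should finish in well under a minute on ordinary hardware.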

> 
> As for the log:
> 
> First, map the offset to an rbd block.  For example, taking the 'Physical'
> value of 00000000a8200000 from above:
> 
> $ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) ))
> 0000000002a0
> 
> Then figure out what the object name prefix is:
> 
> $ rbd info <imagename> | grep prefix
>         block_name_prefix: rb.0.1
> 
> Then add the block number, 0000000002a0 to that, e.g. rb.0.1.0000000002a0.
> 
> Then map that back to an osd with
> 
> $ ceph osd map rbd rb.0.1.0000000002a0
> osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65
> (2.5) -> up [0,2] acting [0,2]
> 
> You'll see the osd ids listed in brackets after 'acting'.  We want the
> first one, 0 in my example.  The log from that OSD is what we need.

I'm getting

osdmap e89 pool 'rbd' (2) object 'rb.0.13.0000000002a0' -> pg 2.aca5eccb 
(2.4b) -> up [1,2] acting [1,2]

from that command, so I guess it's osd.1 then.
Do you have somewhere I can upload the log? It is 1.1 GiB in size. Bzip2 gets 
it down to 53 MiB, but that's still too large to be sent to a mailing list...

	Guido

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-11 17:07             ` Guido Winkelmann
@ 2012-06-11 17:12               ` Sage Weil
  2012-06-11 17:29               ` Josh Durgin
  1 sibling, 0 replies; 30+ messages in thread
From: Sage Weil @ 2012-06-11 17:12 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: Oliver Francke, Josh Durgin, ceph-devel

On Mon, 11 Jun 2012, Guido Winkelmann wrote:
> > If you can reproduce it with 'debug filestore = 20' too, that will be
> > better, as it will tell us what the FIEMAP ioctl is returning.  Also, if
> > you can attach/post the contents of the object itself (rados -p rbd get
> > rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right
> > data (and the sparse-read operation that librbd is doing is the culprit).
> 
> Um. Maybe... That's the problem with using random data, I can't just look at 
> it and recognize it. I guess tomorrow I'll slap something together to see if I 
> can find any 1 MiB range of data in there that matches the expected checksum.

The process below will identify the object in question.

> > 
> > As for the log:
> > 
> > First, map the offset to an rbd block.  For example, taking the 'Physical'
> > value of 00000000a8200000 from above:
> > 
> > $ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) ))
> > 0000000002a0
> > 
> > Then figure out what the object name prefix is:
> > 
> > $ rbd info <imagename> | grep prefix
> >         block_name_prefix: rb.0.1
> > 
> > Then add the block number, 0000000002a0 to that, e.g. rb.0.1.0000000002a0.
> > 
> > Then map that back to an osd with
> > 
> > $ ceph osd map rbd rb.0.1.0000000002a0
> > osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65
> > (2.5) -> up [0,2] acting [0,2]
> > 
> > You'll see the osd ids listed in brackets after 'acting'.  We want the
> > first one, 0 in my example.  The log from that OSD is what we need.
> 
> I'm getting
> 
> osdmap e89 pool 'rbd' (2) object 'rb.0.13.0000000002a0' -> pg 2.aca5eccb 
> (2.4b) -> up [1,2] acting [1,2]
> 
> from that command, so I guess it's osd.1 then.
> Do you have somewhere I can upload the log? It is 1.1 GiB in size. Bzip2 
> gets it down to 53 MiB, but that's still too large to be sent to a 
> mailing list...

Yeah, but it'll be more useful if it's generated with 'debug filestore = 
20'... :)

sage

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-11 17:07             ` Guido Winkelmann
  2012-06-11 17:12               ` Sage Weil
@ 2012-06-11 17:29               ` Josh Durgin
  1 sibling, 0 replies; 30+ messages in thread
From: Josh Durgin @ 2012-06-11 17:29 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: Sage Weil, Oliver Francke, ceph-devel

On 06/11/2012 10:07 AM, Guido Winkelmann wrote:
> I'm getting
>
> osdmap e89 pool 'rbd' (2) object 'rb.0.13.0000000002a0' ->  pg 2.aca5eccb
> (2.4b) ->  up [1,2] acting [1,2]
>
> from that command, so I guess it's osd.1 then.
> Do you have somewhere I can upload the log? It is 1.1 GiB in size. Bzip2 gets
> it down to 53 MiB, but that's still too large to be sent to a mailing list...

You can attach it to the tracker: http://tracker.newdream.net/issues/2535

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-11 16:30           ` Sage Weil
  2012-06-11 17:07             ` Guido Winkelmann
@ 2012-06-12 12:31             ` Guido Winkelmann
  2012-06-15 12:14               ` Stefan Majer
  1 sibling, 1 reply; 30+ messages in thread
From: Guido Winkelmann @ 2012-06-12 12:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: Oliver Francke, Josh Durgin, ceph-devel

On Monday, 11 June 2012, 09:30:42, Sage Weil wrote:
> If you can reproduce it with 'debug filestore = 20' too, that will be
> better, as it will tell us what the FIEMAP ioctl is returning.

I ran another test run with 'debug filestore = 20'.

> Also, if
> you can attach/post the contents of the object itself (rados -p rbd get
> rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right
> data (and the sparse-read operation that librbd is doing is the culprit).

I tried that, with the block name that the steps further below gave me:

rados -p rbd get rb.0.13.00000000045a block

When I looked into the block, it looked like a bunch of temp files from the 
portage system with padding in between, although it should be random data... I 
think I got the wrong block after all...

Here's what I did:
Run the iotester again:
testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
Tue Jun 12 13:51:58 CEST 2012
Wrote 100 MiB of data in 2004 milliseconds
[snip lots of irrelevant lines]
Wrote 100 MiB of data in 2537 milliseconds
Read 100 MiB of data in 3794 milliseconds
Read 100 MiB of data in 10150 milliseconds
Digest wrong for file "/var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e"
Tue Jun 12 13:55:00 CEST 2012

Run the fiemap tool on that file:

testserver-rbd11 ~ # ./fiemap 
/var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e
File /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e has 1 extents:
#       Logical          Physical         Length           Flags
0:      0000000000000000 0000000116900000 0000000000100000 0001

> As for the log:
> 
> First, map the offset to an rbd block.  For example, taking the 'Physical'
> value of 00000000a8200000 from above:
> 
> $ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) ))
> 0000000002a0

That gave me

 $ printf "%012x\n" $((0x0000000116900000 / (4096*1024) ))
00000000045a
 
> Then figure out what the object name prefix is:
> 
> $ rbd info <imagename> | grep prefix
>         block_name_prefix: rb.0.1

Result: block_name_prefix: rb.0.13

> Then add the block number, 0000000002a0 to that, e.g. rb.0.1.0000000002a0.

Result: rb.0.13.00000000045a

> Then map that back to an osd with
> 
> $ ceph osd map rbd rb.0.1.0000000002a0
> osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65
> (2.5) -> up [0,2] acting [0,2]

That gives me
[root@storage1 ~]# ceph osd map rbd rb.0.13.00000000045a 2> /dev/null 
osdmap e101 pool 'rbd' (2) object 'rb.0.13.00000000045a' -> pg 2.80b039fb 
(2.7b) -> up [2,1] acting [2,1]
 
> You'll see the osd ids listed in brackets after 'acting'.  We want the
> first one, 0 in my example.  The log from that OSD is what we need.

Okay, I'm attaching the compressed log for osd.2 and the compressed block to 
the issue report in Redmine.
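
For scripting the last step, the primary OSD can be pulled out of the `ceph osd map` line with a small parser. This is a sketch only: the function name is made up, and it assumes the output keeps the "acting [a,b,...]" form shown above.

```python
import re


def acting_primary(osd_map_output):
    """Return the first OSD id in the 'acting [...]' set of `ceph osd map` output."""
    match = re.search(r"acting \[(\d+(?:,\d+)*)\]", osd_map_output)
    if match is None:
        raise ValueError("no acting set found in: %r" % osd_map_output)
    return int(match.group(1).split(",")[0])


line = ("osdmap e101 pool 'rbd' (2) object 'rb.0.13.00000000045a' -> "
        "pg 2.80b039fb (2.7b) -> up [2,1] acting [2,1]")
print(acting_primary(line))  # 2
```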

	Guido

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-12 12:31             ` Guido Winkelmann
@ 2012-06-15 12:14               ` Stefan Majer
  2012-06-15 15:38                 ` Josh Durgin
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Majer @ 2012-06-15 12:14 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: Sage Weil, Oliver Francke, Josh Durgin, ceph-devel

Hi,

We had a catastrophic fs corruption in one of our virtual machines
today; after fsck, ~100MB ended up in lost+found :-(
So I think we hit the same bug (ceph-0.45.2, sparse rbd images).

Is there any progress on this topic? Any hint on how we could help
with this would be appreciated.

Greetings
Stefan Majer

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-15 12:14               ` Stefan Majer
@ 2012-06-15 15:38                 ` Josh Durgin
  2012-06-15 18:50                   ` Josh Durgin
  0 siblings, 1 reply; 30+ messages in thread
From: Josh Durgin @ 2012-06-15 15:38 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Guido Winkelmann, Sage Weil, Oliver Francke, ceph-devel

Short version: you should set 'filestore fiemap = false' for your osds.

I was able to reproduce the crash with all the debugging I needed 
yesterday via test_librbd_fsx, and the problem looks like a bug in 
fiemap. Even though we call fsync before each fiemap call, we were
getting different results (one bad result, which resulted in the
corruption, and the correct result later, with no writes to the file
in between).
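
To illustrate why a wrong extent map would show up as zeroes, here is a toy model of a sparse read. It is not the OSD's code: the function name and the data are invented, and the 0xc00 boundary is borrowed from Guido's report of a file that was zero up to position 0xbff.

```python
def sparse_read(stored, extents):
    """Rebuild a buffer from (offset, length) extents; everything the extent
    map does not cover is treated as a hole and comes back as zeroes."""
    out = bytearray(len(stored))
    for offset, length in extents:
        out[offset:offset + length] = stored[offset:offset + length]
    return bytes(out)


# Non-zero test pattern standing in for the data that is really on disk.
stored = bytes((i * 7 + 1) % 251 + 1 for i in range(0x2000))

good = sparse_read(stored, [(0, len(stored))])
# Same data, but the extent map wrongly reports the first 0xc00 bytes as a hole.
bad = sparse_read(stored, [(0xc00, len(stored) - 0xc00)])

print(good == stored)                   # True
print(bad[:0xc00] == bytes(0xc00))      # True: leading zeroes
print(bad[0xc00:] == stored[0xc00:])    # True: the rest is intact
```

The data on disk is the same in both cases; only the reported extents differ, which would match an object that looks fine when fetched whole but reads back with zeroes through a sparse read.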

This was on XFS with kernel 3.3.1, so I'll be sending a report to the xfs
list with the log when I get to the office. I don't know which
other versions are affected yet.

In the meantime, you should turn fiemap usage off on the osd by setting
'filestore fiemap = false' in your ceph.conf [osd] section. I think
we should make that the default in 0.48 as well.
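
For reference, the fragment would look roughly like this. Only the one option is shown, the indentation is a matter of taste, and the rest of the [osd] section stays whatever you already have. The osds presumably need a restart to pick it up, as Guido did when he changed the setting earlier in the thread.

```ini
[osd]
        filestore fiemap = false
```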

Josh


* Re: Random data corruption in VM, possibly caused by rbd
  2012-06-15 15:38                 ` Josh Durgin
@ 2012-06-15 18:50                   ` Josh Durgin
  0 siblings, 0 replies; 30+ messages in thread
From: Josh Durgin @ 2012-06-15 18:50 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Guido Winkelmann, Sage Weil, Oliver Francke, ceph-devel

Since Guido was seeing this problem on btrfs as well, I'm going to try
tracking down more precisely where it was introduced.

Josh

On 06/15/2012 08:38 AM, Josh Durgin wrote:
> Short version: you should set 'filestore fiemap = false' for your osds.
>
> I was able to reproduce the crash with all the debugging I needed
> yesterday via test_librbd_fsx, and the problem looks like a bug in
> fiemap. Even though we call fsync before each fiemap call, we were
> getting different results (one bad result, which resulted in the
> corruption, and the correct result later, with no writes to the file
> in between).
>
> This was on XFS kernel 3.3.1, so I'll be sending a report to the xfs
> list with the log when I get to the office. I don't know which
> other versions are affected yet.
>
> In the meantime, you should turn fiemap usage off on the osd by setting
> 'filestore fiemap = false' in your ceph.conf [osd] section. I think
> we should make that the default in 0.48 as well.
>
> Josh
>
> On 06/15/2012 05:14 AM, Stefan Majer wrote:
>> Hi,
>>
>> We had a catastrophic fs corruption in one of our virtual machines
>> today; after fsck, ~100MB ended up in lost+found :-(
>> So I think we hit the same bug (ceph-0.45.2, sparse rbd images).
>>
>> Is there any progress on this topic? Any hint on how we could help
>> with this would be appreciated.
>>
>> Greetings
>> Stefan Majer
>>
>> On Tue, Jun 12, 2012 at 2:31 PM, Guido Winkelmann
>> <guido-ceph@thisisnotatest.de> wrote:
>>> On Monday, 11 June 2012, 09:30:42, Sage Weil wrote:
>>>> If you can reproduce it with 'debug filestore = 20' too, that will be
>>>> better, as it will tell us what the FIEMAP ioctl is returning.
>>>
>>> I ran another testrun with 'debug filestore = 20'.
>>>
>>>> Also, if
>>>> you can attach/post the contents of the object itself (rados -p rbd get
>>>> rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right
>>>> data (and the sparse-read operation that librbd is doing is the
>>>> culprit).
>>>
>>> I tried that, with the block name that the steps further below gave me:
>>>
>>> rados -p rbd get rb.0.13.00000000045a block
>>>
>>> When I looked into the block, it looked like a bunch of temp files from
>>> the portage system with padding in between, although it should be random
>>> data... I think I got the wrong block after all...
>>>
>>> Here's what I did:
>>> Run the iotester again:
>>> testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
>>> Tue Jun 12 13:51:58 CEST 2012
>>> Wrote 100 MiB of data in 2004 milliseconds
>>> [snip lots of irrelevant lines]
>>> Wrote 100 MiB of data in 2537 milliseconds
>>> Read 100 MiB of data in 3794 milliseconds
>>> Read 100 MiB of data in 10150 milliseconds
>>> Digest wrong for file "/var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e"
>>> Tue Jun 12 13:55:00 CEST 2012
>>>
>>> Run the fiemap tool on that file:
>>>
>>> testserver-rbd11 ~ # ./fiemap /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e
>>> File /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e has 1 extents:
>>> #  Logical          Physical         Length           Flags
>>> 0: 0000000000000000 0000000116900000 0000000000100000 0001
>>>
>>>> As for the log:
>>>>
>>>> First, map the offset to an rbd block. For example, taking the
>>>> 'Physical' value of 00000000a8200000 from above:
>>>>
>>>> $ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) ))
>>>> 0000000002a0
>>>
>>> That gave me
>>>
>>> $ printf "%012x\n" $((0x0000000116900000 / (4096*1024) ))
>>> 00000000045a
>>>
>>>> Then figure out what the object name prefix is:
>>>>
>>>> $ rbd info <imagename> | grep prefix
>>>> block_name_prefix: rb.0.1
>>>
>>> Result: block_name_prefix: rb.0.13
>>>
>>>> Then add the block number, 0000000002a0 to that, e.g.
>>>> rb.0.1.0000000002a0.
>>>
>>> Result: rb.0.13.00000000045a
>>>
>>>> Then map that back to an osd with
>>>>
>>>> $ ceph osd map rbd rb.0.1.0000000002a0
>>>> osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65
>>>> (2.5) -> up [0,2] acting [0,2]
>>>
>>> That gives me
>>> [root@storage1 ~]# ceph osd map rbd rb.0.13.00000000045a 2> /dev/null
>>> osdmap e101 pool 'rbd' (2) object 'rb.0.13.00000000045a' -> pg 2.80b039fb
>>> (2.7b) -> up [2,1] acting [2,1]
>>>
>>>> You'll see the osd ids listed in brackets after 'acting'. We want the
>>>> first one, 0 in my example. The log from that OSD is what we need.
>>>
>>> Okay, I'm attaching the compressed log for osd.2 and the compressed
>>> block to the issue report in the redmine.
>>>
>>> Guido
>
>
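
The offset-to-object mapping walked through in the quoted message can be
sketched as a small shell helper. This is only an illustration, not part of
Ceph: the function names rbd_object_name and rbd_object_offset are
hypothetical, the 4 MiB object size is taken from the (4096*1024) divisor used
in the thread, and the sketch assumes the 'Physical' value reported by the
fiemap tool is relative to the start of the rbd image. If the guest filesystem
sits on a partition rather than on the whole disk, the partition's start offset
would presumably have to be added first.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Ceph): map a byte offset inside an rbd
# image to the RADOS object that holds it.
# Assumption: 4 MiB objects, named "<block_name_prefix>.<12 hex digits>",
# as in the commands quoted in the thread.
OBJECT_SIZE=$((4096 * 1024))

# Print the object name for a given prefix and byte offset.
rbd_object_name() {
    local prefix=$1 offset=$2
    printf '%s.%012x\n' "$prefix" $(( offset / OBJECT_SIZE ))
}

# Print the byte offset of the same position inside that object.
rbd_object_offset() {
    local offset=$1
    printf '0x%x\n' $(( offset % OBJECT_SIZE ))
}

# Values from the thread: prefix rb.0.13, Physical 0000000116900000.
rbd_object_name rb.0.13 0x0000000116900000
rbd_object_offset 0x0000000116900000
```

The resulting name can then be passed to 'ceph osd map rbd <name>' and
'rados -p rbd get <name> <file>' as shown in the thread. The second helper
gives the position inside the fetched object where the extent would start
under the assumptions above.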


Thread overview: 30+ messages
2012-06-07 18:04 Random data corruption in VM, possibly caused by rbd Guido Winkelmann
2012-06-07 18:18 ` Stefan Priebe
2012-06-07 18:37   ` Guido Winkelmann
2012-06-07 19:54     ` Andrey Korolyov
2012-06-07 21:03       ` Guido Winkelmann
2012-06-07 21:53     ` Marcus Sorensen
2012-06-07 22:12       ` Guido Winkelmann
2012-06-07 18:40 ` Oliver Francke
2012-06-07 19:48 ` Josh Durgin
2012-06-07 21:36   ` Guido Winkelmann
2012-06-07 22:13     ` Tommi Virtanen
2012-06-08 12:55   ` Guido Winkelmann
2012-06-08 13:08     ` Guido Winkelmann
2012-06-08 13:36     ` Oliver Francke
2012-06-08 13:55       ` Sage Weil
2012-06-08 14:50         ` Josh Durgin
2012-06-08 15:39           ` Oliver Francke
2012-06-08 17:15           ` Guido Winkelmann
2012-06-10  3:04             ` Sage Weil
2012-06-10  3:07               ` Sage Weil
2012-06-11 14:15               ` Guido Winkelmann
2012-06-11 15:50         ` Guido Winkelmann
2012-06-11 16:30           ` Sage Weil
2012-06-11 17:07             ` Guido Winkelmann
2012-06-11 17:12               ` Sage Weil
2012-06-11 17:29               ` Josh Durgin
2012-06-12 12:31             ` Guido Winkelmann
2012-06-15 12:14               ` Stefan Majer
2012-06-15 15:38                 ` Josh Durgin
2012-06-15 18:50                   ` Josh Durgin
