From mboxrd@z Thu Jan 1 00:00:00 1970 From: Josh Durgin Subject: Re: Random data corruption in VM, possibly caused by rbd Date: Fri, 15 Jun 2012 11:50:01 -0700 Message-ID: <4FDB83D9.6010209@inktank.com> References: <21601270.dfB0BsVfyn@pc10> <25761303.PmWYZprmzc@pc10> <1815615.qFFlAnfNF8@pc10> <4FDB56F5.3060407@inktank.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from mail-pb0-f46.google.com ([209.85.160.46]:48143 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757231Ab2FOSuG (ORCPT ); Fri, 15 Jun 2012 14:50:06 -0400 Received: by pbbrp8 with SMTP id rp8so5636937pbb.19 for ; Fri, 15 Jun 2012 11:50:05 -0700 (PDT) In-Reply-To: <4FDB56F5.3060407@inktank.com> Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Stefan Majer Cc: Guido Winkelmann , Sage Weil , Oliver Francke , ceph-devel@vger.kernel.org Since Guido was seeing this problem on btrfs as well, I'm going to try tracking down more precisely where it was introduced. Josh On 06/15/2012 08:38 AM, Josh Durgin wrote: > Short version: you should set 'filestore fiemap = false' for your osds. > > I was able to reproduce the crash with all the debugging I needed > yesterday via test_librbd_fsx, and the problem looks like a bug in > fiemap. Even though we call fsync before each fiemap call, we were > getting different results (one bad result, which resulted in the > corruption, and the correct result later, with no writes to the file > in between). > > This was on XFS kernel 3.3.1, so I'll be sending a report to the xfs > list with the log when I get to the office. I don't know which > other versions are affected yet. > > In the meantime, you should turn fiemap usage off on the osd by setting > 'filestore fiemap = false' in your ceph.conf [osd] section. I think > we should make that the default in 0.48 as well. > > Josh > > On 06/15/2012 05:14 AM, Stefan Majer wrote: >> Hi, >> >> We had today a catastrophic fs corruption in one of our virtual >> machines, after fsck ~100MB was inside lost+found :-( >> So is think we hit the same bug (ceph-0.45.2, sparse rbd images) >> >> Is there any progress on this topic, or any hint how to help on this >> would be helpful. >> >> Greetings >> Stefan Majer >> >> On Tue, Jun 12, 2012 at 2:31 PM, Guido Winkelmann >> wrote: >>> Am Montag, 11. Juni 2012, 09:30:42 schrieb Sage Weil: >>>> If you can reproduce it with 'debug filestore = 20' too, that will be >>>> better, as it will tell us what the FIEMAP ioctl is returning. >>> >>> I ran another testrun with 'debug filestore = 20'. >>> >>>> Also, if >>>> you can attach/post the contents of the object itself (rados -p rbd get >>>> rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right >>>> data (and the sparse-read operation that librbd is doing is the >>>> culprit). >>> >>> I tried that, with the block name that the steps further below gave me: >>> >>> rados -p rbd get rb.0.13.00000000045a block >>> >>> When I looked into the block, it looked like a bunch of temp files >>> from the >>> portage system with padding in between, although it should be random >>> data... I >>> think I got the wrong block after all... >>> >>> Here's what I did: >>> Run the iotester again: >>> testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date >>> Tue Jun 12 13:51:58 CEST 2012 >>> Wrote 100 MiB of data in 2004 milliseconds >>> [snip lots of irrelevant lines] >>> Wrote 100 MiB of data in 2537 milliseconds >>> Read 100 MiB of data in 3794 milliseconds >>> Read 100 MiB of data in 10150 milliseconds >>> Digest wrong for file >>> "/var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e" >>> Tue Jun 12 13:55:00 CEST 2012 >>> >>> Run the fiemap tool on that file: >>> >>> testserver-rbd11 ~ # ./fiemap >>> /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e >>> File /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e has 1 extents: >>> # Logical Physical Length Flags >>> 0: 0000000000000000 0000000116900000 0000000000100000 0001 >>> >>>> As for the log: >>>> >>>> First, map the offset to an rbd block. For example, taking the >>>> 'Physical' >>>> value of 00000000a8200000 from above: >>>> >>>> $ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) )) >>>> 0000000002a0 >>> >>> That gave me >>> >>> $ printf "%012x\n" $((0x0000000116900000 / (4096*1024) )) >>> 00000000045a >>> >>>> Then figure out what the object name prefix is: >>>> >>>> $ rbd info | grep prefix >>>> block_name_prefix: rb.0.1 >>> >>> Result: block_name_prefix: rb.0.13 >>> >>>> Then add the block number, 0000000002a0 to that, e.g. >>>> rb.0.1.0000000002a0. >>> >>> Result: rb.0.13.00000000045a >>> >>>> Then map that back to an osd with >>>> >>>> $ ceph osd map rbd rb.0.1.0000000002a0 >>>> osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65 >>>> (2.5) -> up [0,2] acting [0,2] >>> >>> That gives me >>> [root@storage1 ~]# ceph osd map rbd rb.0.13.00000000045a 2> /dev/null >>> osdmap e101 pool 'rbd' (2) object 'rb.0.13.00000000045a' -> pg >>> 2.80b039fb >>> (2.7b) -> up [2,1] acting [2,1] >>> >>>> You'll see the osd ids listed in brackets after 'active'. We want the >>>> first one, 0 in my example. The log from that OSD is what we need. >>> >>> Okay, i'm attaching the compressed log for osd.2 and the compressed >>> block to >>> the issue report in the redmine. >>> >>> Guido > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html