Date: Fri, 29 Feb 2008 11:29:20 -0800
From: malahal@us.ibm.com
To: linux-lvm@redhat.com
Subject: Re: [linux-lvm] Page cache corruption when creating a snapshot
Message-ID: <20080229192920.GB18264@us.ibm.com>
References: <200802291732.m1THWfD7013248@outgoing.mit.edu> <20080229183148.GI1788@agk.fab.redhat.com> <1204312265.5850.7.camel@error-messages.mit.edu>
In-Reply-To: <1204312265.5850.7.camel@error-messages.mit.edu>
Reply-To: LVM general discussion and development

Greg Hudson [ghudson@MIT.EDU] wrote:
> On Fri, 2008-02-29 at 18:31 +0000, Alasdair G Kergon wrote:
> > On Fri, Feb 29, 2008 at 12:32:41PM -0500, ghudson@MIT.EDU wrote:
> > > The reproduction recipe looks like:
> > > rm -rf /tmp/test
> > > mkdir /tmp/test
> > > # Put around 60MB of files into /tmp/test
> > > find /tmp/test -type f | xargs md5sum > /tmp/sum.pre
> > > lvcreate --size 2G --snapshot /dev/dink/gutsy-i386-sbuild --name testsnapshot
> > > find /tmp/test -type f | xargs md5sum > /tmp/sum.post
> >
> > Can you do that twice?
> > find /tmp/test -type f | xargs md5sum > /tmp/sum.post2
> > and check the two post files are the same?
>
> In three reproductions of the page cache corruption, sum.post2 was
> always the same as sum.post.
>
> In my experiences with this problem in general, the page cache
> corruption is not particularly transient; once it happens, the file
> continues to appear modified (with the same incorrect contents) for the
> indefinite future, until the machine is rebooted.
>
> > And add some syncs/blockdev --flushbufs at different places
> > in the script to see if you can make it go away.
>
> Nope, that never made it go away. I'm not sure in what situations
> flushing write buffers would have any effect. If I had a way to throw
> away the read-only page cache and force a file reload from disk, I would
> expect that to eliminate the visible effect of the corruption; at the
> moment the only reliable way I know how to do that is to reboot.

I'm not an expert on O_DIRECT, but it is supposed to read from the disk
without populating the page cache. I don't really know what it does if
the pages are already cached! The "dd" command supports O_DIRECT, so
try reading the corrupted file with "dd" both with and without O_DIRECT
and see whether the contents differ.

--Malahal.
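
A minimal sketch of the "dd" comparison suggested above, not a tested diagnosis for this bug. It assumes GNU dd (whose iflag=direct opens the input with O_DIRECT) and md5sum from coreutils; the function name compare_reads is made up for illustration. It checksums one file twice, once through an ordinary read and once with O_DIRECT, and reports whether the two agree.

```shell
#!/bin/sh
# Sketch only: compare a file's contents as seen through an ordinary
# (page-cache) read versus an O_DIRECT read.
# Assumes GNU dd (iflag=direct) and md5sum; compare_reads is a
# hypothetical helper name, not part of any tool.

compare_reads() {
    file=$1

    # Probe first: some filesystems reject O_DIRECT, and a failed dd
    # inside a pipeline would otherwise go unnoticed.
    if ! dd if="$file" of=/dev/null bs=1M count=1 iflag=direct 2>/dev/null; then
        echo "unsupported"
        return 2
    fi

    # Ordinary read: served from the page cache when the pages are cached.
    cached=$(dd if="$file" bs=1M 2>/dev/null | md5sum | cut -d' ' -f1)

    # O_DIRECT read: asks the kernel to bypass the page cache.
    direct=$(dd if="$file" bs=1M iflag=direct 2>/dev/null | md5sum | cut -d' ' -f1)

    if [ "$cached" = "$direct" ]; then
        echo "same"
    else
        echo "differ"
        return 1
    fi
}

# When run as a script with a file argument, compare that file.
[ $# -eq 0 ] || compare_reads "$1"
```

Running it on one of the files whose checksum changed between sum.pre and sum.post would print "differ" if the O_DIRECT read returns different bytes from the cached copy, and "same" otherwise. Separately, regarding the wish quoted above for discarding cached pages without a reboot: kernels from 2.6.16 onward expose /proc/sys/vm/drop_caches, and as root "sync; echo 1 > /proc/sys/vm/drop_caches" drops the clean page cache. Whether that clears this particular corruption is untested here.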