All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>, "Martin J. Bligh" <mbligh@mbligh.org>,
	linux-ext4@vger.kernel.org, Ying Han <yinghan@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"linux-kernel" <linux-kernel@vger.kernel.org>,
	"linux-mm" <linux-mm@kvack.org>,
	guichaz@gmail.com, Alex Khesin <alexk@google.com>,
	Mike Waychison <mikew@google.com>,
	Rohit Seth <rohitseth@google.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.
Date: Wed, 25 Mar 2009 02:35:01 +1100	[thread overview]
Message-ID: <200903250235.02816.nickpiggin@yahoo.com.au> (raw)
In-Reply-To: <20090324033204.64f3da9d.akpm@linux-foundation.org>

On Tuesday 24 March 2009 21:32:04 Andrew Morton wrote:
> On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > fix it, although I guess you'd want to release it before calling
> > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > have a patch for this if it sounds reasonable.
> > >
> > >   Yes, that seems to be a bug - the function actually looked suspitious
> > > to me yesterday but I somehow convinced myself that it's fine. Probably
> > > because fsx-linux is single-threaded.
> >
> > After a whole lot of chasing my own tail in the VM and buffer layers,
> > I think it is a problem in ext2 (and I haven't been able to reproduce
> > with ext3 yet, which might lend weight to that, although as we have
> > seen, it is very timing dependent).
> >
> > That would be slightly unfortunate because we still have Jan's ext3
> > problem, and also another reported problem of corruption on ext3 (on
> > brd driver).
> >
> > Anyway, when I have reproduced the problem with the test case, the
> > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > point straight to the filesystem, because ext2 allocates blocks in this
> > case at writeout time, so if dirty bits are getting lost, then it would
> > be normal to see holes.
> >
> > I then put in a whole lot of extra infrastructure to track metadata about
> > each struct page (when it was last written out, when it last had the
> > number of writable ptes reach 0, when the dirty bits were last cleared
> > etc). And none of the normal asertions were triggering: eg. when any page
> > is removed from pagecache (except truncates), it has always had all its
> > buffers written out *after* all ptes were made readonly or unmapped. Lots
> > of other tests and crap like that.
> >
> > So I tried what I should have done to start with and did an e2fsck after
> > seeing corruption. Yes, it comes up with errors.
>
> Do you recall what the errors were?

OK, after running several tests in parallel and having 3 of them
blow up, I unmounted the fs (so error-case files are still intact).

# e2fsck -fn /dev/ram0
e2fsck 1.41.3 (12-Oct-2008)
Pass 1: Checking inodes, blocks, and sizes
Inode 16, i_blocks is 131594, should be 131566.  Fix? no

Inode 18, i_blocks is 131588, should be 131576.  Fix? no

Inode 21, i_blocks is 131594, should be 131552.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(628209--628220) -628231 -628233 -(638751--638755) 
-638765 -(646271--646295) -(646301--646304) -647609 -(651501--651505) -651509 
-(651719--651726) -(651732--651733) -(665666--665670)
Fix? no


/dev/ram0: ********** WARNING: Filesystem still has errors **********

/dev/ram0: 21/229376 files (4.8% non-contiguous), 407105/3670016 blocks

ino 16, 18, 21 of course are the files with errors.


inode 18 is the simplest case with just one hole, so let's look at that:

#hexdump file9
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
3c8c000 0000 0000 0000 0000 0000 0000 0000 0000
*
3c8d400 ffff ffff ffff ffff ffff ffff ffff ffff
*
4000000


Let's take a look at our hole then:

#./bmap file9  // bmap is modified to print hex offsets
[... lots of stuff ...]
3c82000-3c82c00: 26fd0400-26fd1000 (1000)
3c83000-3c83c00: 26fd3400-26fd4000 (1000)
3c84000-3c84c00: 26fc9c00-26fca800 (1000)
3c85000-3c85c00: 26fcc400-26fcd000 (1000)
3c86000-3c86c00: 26fcf400-26fd0000 (1000)
3c87000-3c87c00: 26fd2400-26fd3000 (1000)
3c88000-3c88c00: 26fd5400-26fd6000 (1000)
3c89000-3c8bc00: 26fd7400-26fda000 (3000)
3c8c000-3c8c000: 0-0 (400)
3c8c400-3c8c400: 0-0 (400)
3c8c800-3c8c800: 0-0 (400)
3c8cc00-3c8cc00: 0-0 (400)
3c8d000-3c8d000: 0-0 (400)
3c8d400-3c8dc00: 26fcb800-26fcc000 (c00)
3c8e000-3c8ec00: 26fce400-26fcf000 (1000)
3c8f000-3c8fc00: 26fd1400-26fd2000 (1000)
3c90000-3c99c00: 27924400-2792e000 (a000)
3c9a000-3c9ac00: 2792f000-2792fc00 (1000)
3c9b000-3c9bc00: 27931000-27931c00 (1000)
3c9c000-3c9cc00: 27933000-27933c00 (1000)
3c9d000-3c9dc00: 27935000-27935c00 (1000)
3c9e000-3c9ec00: 27938000-27938c00 (1000)
3c9f000-3c9fc00: 2793a000-2793ac00 (1000)
[... lots more stuff ...]

3.5G filesystem image bzip2s down to 500K if anybody wants it I
can send it privately.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>, "Martin J. Bligh" <mbligh@mbligh.org>,
	linux-ext4@vger.kernel.org, Ying Han <yinghan@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"linux-kernel" <linux-kernel@vger.kernel.org>,
	"linux-mm" <linux-mm@kvack.org>,
	guichaz@gmail.com, Alex Khesin <alexk@google.com>,
	Mike Waychison <mikew@google.com>,
	Rohit Seth <rohitseth@google.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.
Date: Wed, 25 Mar 2009 02:35:01 +1100	[thread overview]
Message-ID: <200903250235.02816.nickpiggin@yahoo.com.au> (raw)
In-Reply-To: <20090324033204.64f3da9d.akpm@linux-foundation.org>

On Tuesday 24 March 2009 21:32:04 Andrew Morton wrote:
> On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > fix it, although I guess you'd want to release it before calling
> > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > have a patch for this if it sounds reasonable.
> > >
> > >   Yes, that seems to be a bug - the function actually looked suspitious
> > > to me yesterday but I somehow convinced myself that it's fine. Probably
> > > because fsx-linux is single-threaded.
> >
> > After a whole lot of chasing my own tail in the VM and buffer layers,
> > I think it is a problem in ext2 (and I haven't been able to reproduce
> > with ext3 yet, which might lend weight to that, although as we have
> > seen, it is very timing dependent).
> >
> > That would be slightly unfortunate because we still have Jan's ext3
> > problem, and also another reported problem of corruption on ext3 (on
> > brd driver).
> >
> > Anyway, when I have reproduced the problem with the test case, the
> > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > point straight to the filesystem, because ext2 allocates blocks in this
> > case at writeout time, so if dirty bits are getting lost, then it would
> > be normal to see holes.
> >
> > I then put in a whole lot of extra infrastructure to track metadata about
> > each struct page (when it was last written out, when it last had the
> > number of writable ptes reach 0, when the dirty bits were last cleared
> > etc). And none of the normal asertions were triggering: eg. when any page
> > is removed from pagecache (except truncates), it has always had all its
> > buffers written out *after* all ptes were made readonly or unmapped. Lots
> > of other tests and crap like that.
> >
> > So I tried what I should have done to start with and did an e2fsck after
> > seeing corruption. Yes, it comes up with errors.
>
> Do you recall what the errors were?

OK, after running several tests in parallel and having 3 of them
blow up, I unmounted the fs (so error-case files are still intact).

# e2fsck -fn /dev/ram0
e2fsck 1.41.3 (12-Oct-2008)
Pass 1: Checking inodes, blocks, and sizes
Inode 16, i_blocks is 131594, should be 131566.  Fix? no

Inode 18, i_blocks is 131588, should be 131576.  Fix? no

Inode 21, i_blocks is 131594, should be 131552.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(628209--628220) -628231 -628233 -(638751--638755) 
-638765 -(646271--646295) -(646301--646304) -647609 -(651501--651505) -651509 
-(651719--651726) -(651732--651733) -(665666--665670)
Fix? no


/dev/ram0: ********** WARNING: Filesystem still has errors **********

/dev/ram0: 21/229376 files (4.8% non-contiguous), 407105/3670016 blocks

ino 16, 18, 21 of course are the files with errors.


inode 18 is the simplest case with just one hole, so let's look at that:

#hexdump file9
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
3c8c000 0000 0000 0000 0000 0000 0000 0000 0000
*
3c8d400 ffff ffff ffff ffff ffff ffff ffff ffff
*
4000000


Let's take a look at our hole then:

#./bmap file9  // bmap is modified to print hex offsets
[... lots of stuff ...]
3c82000-3c82c00: 26fd0400-26fd1000 (1000)
3c83000-3c83c00: 26fd3400-26fd4000 (1000)
3c84000-3c84c00: 26fc9c00-26fca800 (1000)
3c85000-3c85c00: 26fcc400-26fcd000 (1000)
3c86000-3c86c00: 26fcf400-26fd0000 (1000)
3c87000-3c87c00: 26fd2400-26fd3000 (1000)
3c88000-3c88c00: 26fd5400-26fd6000 (1000)
3c89000-3c8bc00: 26fd7400-26fda000 (3000)
3c8c000-3c8c000: 0-0 (400)
3c8c400-3c8c400: 0-0 (400)
3c8c800-3c8c800: 0-0 (400)
3c8cc00-3c8cc00: 0-0 (400)
3c8d000-3c8d000: 0-0 (400)
3c8d400-3c8dc00: 26fcb800-26fcc000 (c00)
3c8e000-3c8ec00: 26fce400-26fcf000 (1000)
3c8f000-3c8fc00: 26fd1400-26fd2000 (1000)
3c90000-3c99c00: 27924400-2792e000 (a000)
3c9a000-3c9ac00: 2792f000-2792fc00 (1000)
3c9b000-3c9bc00: 27931000-27931c00 (1000)
3c9c000-3c9cc00: 27933000-27933c00 (1000)
3c9d000-3c9dc00: 27935000-27935c00 (1000)
3c9e000-3c9ec00: 27938000-27938c00 (1000)
3c9f000-3c9fc00: 2793a000-2793ac00 (1000)
[... lots more stuff ...]

3.5G filesystem image bzip2s down to 500K if anybody wants it I
can send it privately.


  reply	other threads:[~2009-03-24 15:35 UTC|newest]

Thread overview: 121+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-03-18 19:44 ftruncate-mmap: pages are lost after writing to mmaped file Ying Han
2009-03-18 19:44 ` Ying Han
2009-03-18 22:11 ` Andrew Morton
2009-03-18 22:11   ` Andrew Morton
2009-03-18 22:40   ` Linus Torvalds
2009-03-18 22:40     ` Linus Torvalds
2009-03-18 23:18     ` Ying Han
2009-03-18 23:18       ` Ying Han
2009-03-18 23:36       ` Linus Torvalds
2009-03-18 23:36         ` Linus Torvalds
2009-03-18 23:54         ` Ying Han
2009-03-18 23:54           ` Ying Han
2009-03-19 15:48           ` Nick Piggin
2009-03-19 15:48             ` Nick Piggin
2009-03-19 16:16             ` Peter Zijlstra
2009-03-19 16:16               ` Peter Zijlstra
2009-03-19 16:36               ` Nick Piggin
2009-03-19 16:36                 ` Nick Piggin
2009-03-19 16:20             ` Linus Torvalds
2009-03-19 16:20               ` Linus Torvalds
2009-03-19 16:34               ` Nick Piggin
2009-03-19 16:34                 ` Nick Piggin
2009-03-19 16:51                 ` Linus Torvalds
2009-03-19 16:51                   ` Linus Torvalds
2009-03-19 17:03                   ` Jan Kara
2009-03-19 17:03                     ` Jan Kara
2009-03-19 17:06                     ` Jan Kara
2009-03-19 17:06                       ` Jan Kara
2009-03-19 20:05                     ` Linus Torvalds
2009-03-19 20:05                       ` Linus Torvalds
2009-03-19 20:21                   ` Linus Torvalds
2009-03-19 20:21                     ` Linus Torvalds
2009-03-19 21:17                     ` Ying Han
2009-03-19 21:17                       ` Ying Han
2009-03-19 22:16                     ` Jan Kara
2009-03-19 22:16                       ` Jan Kara
2009-03-19 16:46             ` Jan Kara
2009-03-19 16:46               ` Jan Kara
2009-03-24  7:44               ` Nick Piggin
2009-03-24  7:44                 ` Nick Piggin
2009-03-24 10:27                 ` Nick Piggin
2009-03-24 10:27                   ` Nick Piggin
2009-03-24 10:32                 ` Andrew Morton
2009-03-24 10:32                   ` Andrew Morton
2009-03-24 15:35                   ` Nick Piggin [this message]
2009-03-24 15:35                     ` Nick Piggin
2009-03-26 18:29                     ` Jan Kara
2009-03-26 18:29                       ` Jan Kara
2009-03-26  0:03                   ` Ying Han
2009-03-26  0:03                     ` Ying Han
2009-03-24 12:39                 ` Jan Kara
2009-03-24 12:39                   ` Jan Kara
2009-03-24 12:55                   ` Jan Kara
2009-03-24 12:55                     ` Jan Kara
2009-03-24 13:26                     ` Jan Kara
2009-03-24 13:26                       ` Jan Kara
2009-03-24 14:01                       ` Chris Mason
2009-03-24 14:01                         ` Chris Mason
2009-03-24 14:07                         ` Jan Kara
2009-03-24 14:07                           ` Jan Kara
2009-03-26  8:18                           ` Aneesh Kumar K.V
2009-03-26  8:18                             ` Aneesh Kumar K.V
2009-03-24 14:30                       ` Nick Piggin
2009-03-24 14:30                         ` Nick Piggin
2009-03-24 14:47                         ` Jan Kara
2009-03-24 14:47                           ` Jan Kara
2009-03-24 14:56                           ` Peter Zijlstra
2009-03-24 14:56                             ` Peter Zijlstra
2009-03-24 15:29                             ` Jan Kara
2009-03-24 15:29                               ` Jan Kara
2009-03-24 20:14                               ` OGAWA Hirofumi
2009-03-24 20:14                                 ` OGAWA Hirofumi
2009-03-26  8:47                               ` Aneesh Kumar K.V
2009-03-26  8:47                                 ` Aneesh Kumar K.V
2009-03-26 11:37                                 ` Jan Kara
2009-03-26 11:37                                   ` Jan Kara
2009-03-26 23:02                                 ` Linus Torvalds
2009-03-26 23:02                                   ` Linus Torvalds
2009-03-24 15:03                           ` Nick Piggin
2009-03-24 15:03                             ` Nick Piggin
2009-03-24 15:48                             ` Jan Kara
2009-03-24 15:48                               ` Jan Kara
2009-03-24 17:35                               ` Jan Kara
2009-03-24 17:35                                 ` Jan Kara
2009-03-24 17:35                                 ` Jan Kara
2009-04-01 22:36                                 ` Ying Han
2009-04-01 22:36                                   ` Ying Han
2009-04-02 10:11                                   ` Jan Kara
2009-04-02 10:11                                     ` Jan Kara
2009-04-02 11:24                                   ` Nick Piggin
2009-04-02 11:24                                     ` Nick Piggin
2009-04-02 11:34                                     ` Jan Kara
2009-04-02 11:34                                       ` Jan Kara
2009-04-02 15:51                                       ` Nick Piggin
2009-04-02 15:51                                         ` Nick Piggin
2009-04-02 17:44                                         ` Ying Han
2009-04-02 17:44                                           ` Ying Han
2009-04-02 22:52                                           ` Ying Han
2009-04-02 22:52                                             ` Ying Han
2009-04-02 23:39                                             ` Jan Kara
2009-04-02 23:39                                               ` Jan Kara
2009-04-03  0:25                                               ` Ying Han
2009-04-03  0:25                                                 ` Ying Han
2009-04-03  1:29                                               ` Ying Han
2009-04-03  1:29                                                 ` Ying Han
2009-04-03  9:41                                                 ` Jan Kara
2009-04-03  9:41                                                   ` Jan Kara
2009-04-03 21:34                                                   ` Ying Han
2009-04-03 21:34                                                     ` Ying Han
2009-04-03  0:13                                     ` Ying Han
2009-04-03  0:13                                       ` Ying Han
2009-03-27 20:35                 ` Ying Han
2009-03-27 20:35                   ` Ying Han
2009-03-20  0:34     ` Ying Han
2009-03-20  0:34       ` Ying Han
2009-03-20  0:49       ` Linus Torvalds
2009-03-20  0:49         ` Linus Torvalds
2009-03-20  7:00         ` Ying Han
2009-03-20  7:00           ` Ying Han
2009-03-25 23:15     ` Ying Han
2009-03-25 23:15       ` Ying Han

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200903250235.02816.nickpiggin@yahoo.com.au \
    --to=nickpiggin@yahoo.com.au \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=alexk@google.com \
    --cc=guichaz@gmail.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mbligh@mbligh.org \
    --cc=mikew@google.com \
    --cc=rohitseth@google.com \
    --cc=torvalds@linux-foundation.org \
    --cc=yinghan@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.