* fsync on ext[34] working only by an accident @ 2009-09-08 13:26 Jan Kara 2009-09-10 6:46 ` Aneesh Kumar K.V 0 siblings, 1 reply; 16+ messages in thread From: Jan Kara @ 2009-09-08 13:26 UTC (permalink / raw) To: linux-ext4 Hi, When looking at how ext3/4 handles fsync, I've realized I don't understand how writing out inode on fsync can work. The problem is that ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty the inode. It just copies the in-memory inode content to disk buffer. So in particular the inode looks clean to VFS and our check in ext?_sync_file() shouldn't trigger. The only obvious case when we call mark_inode_dirty() is from write_end functions when we update i_size but that's clearly not enough. Now I did some research why things seem to be actually working. The trick is that when allocating block, we call vfs_dq_alloc_block() which calls mark_inode_dirty(). But that's all what's keeping our fsync / writeout logic from breaking! There are even some cases when the logic actually is broken (I've tested it and it really does not work) - for example when you create an empty file, the inode won't get written when you fsync it. So what we should IMHO do is to convert all ext?_mark_inode_dirty() calls to simple mark_inode_dirty() (or even maybe introduce and use mark_inode_dirty_datasync() where appropriate). It will cost us some more CPU and stack space but if we optimize ext3_dirty_inode() for the case where handle is already started, it shouldn't be too bad. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-08 13:26 fsync on ext[34] working only by an accident Jan Kara @ 2009-09-10 6:46 ` Aneesh Kumar K.V 2009-09-10 8:50 ` Jan Kara 0 siblings, 1 reply; 16+ messages in thread From: Aneesh Kumar K.V @ 2009-09-10 6:46 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote: > Hi, > > When looking at how ext3/4 handles fsync, I've realized I don't > understand how writing out inode on fsync can work. The problem is that > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty > the inode. It just copies the in-memory inode content to disk buffer. > So in particular the inode looks clean to VFS and our check in > ext?_sync_file() shouldn't trigger. > The only obvious case when we call mark_inode_dirty() is from write_end > functions when we update i_size but that's clearly not enough. Now I did > some research why things seem to be actually working. The trick is that > when allocating block, we call vfs_dq_alloc_block() which calls > mark_inode_dirty(). But that's all what's keeping our fsync / writeout > logic from breaking! ext4_handle_dirty_metadata should do mark_inode_dirty right ? __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty); > There are even some cases when the logic actually is broken (I've tested > it and it really does not work) - for example when you create an empty > file, the inode won't get written when you fsync it. > So what we should IMHO do is to convert all ext?_mark_inode_dirty() > calls to simple mark_inode_dirty() (or even maybe introduce and use > mark_inode_dirty_datasync() where appropriate). It will cost us some more > CPU and stack space but if we optimize ext3_dirty_inode() for the case > where handle is already started, it shouldn't be too bad. -aneesh ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 6:46 ` Aneesh Kumar K.V @ 2009-09-10 8:50 ` Jan Kara 2009-09-10 9:04 ` Aneesh Kumar K.V 0 siblings, 1 reply; 16+ messages in thread From: Jan Kara @ 2009-09-10 8:50 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Jan Kara, linux-ext4 Hi, On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote: > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote: > > When looking at how ext3/4 handles fsync, I've realized I don't > > understand how writing out inode on fsync can work. The problem is that > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty > > the inode. It just copies the in-memory inode content to disk buffer. > > So in particular the inode looks clean to VFS and our check in > > ext?_sync_file() shouldn't trigger. > > The only obvious case when we call mark_inode_dirty() is from write_end > > functions when we update i_size but that's clearly not enough. Now I did > > some research why things seem to be actually working. The trick is that > > when allocating block, we call vfs_dq_alloc_block() which calls > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout > > logic from breaking! > > ext4_handle_dirty_metadata should do mark_inode_dirty right ? > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty); ext4_handle_dirty_metadata() marks the buffer dirty only when we do not have a journal (BTW, the inode that gets dirtied in the nojournal case is the block-device one, not the one whose metadata we mark as dirty, so it won't work there either - but Google guys are working on this I think). With a journal the function just calls jbd2_journal_dirty_metadata which does nothing with the inode. > > There are even some cases when the logic actually is broken (I've tested > > it and it really does not work) - for example when you create an empty > > file, the inode won't get written when you fsync it. > > So what we should IMHO do is to convert all ext?_mark_inode_dirty() > > calls to simple mark_inode_dirty() (or even maybe introduce and use > > mark_inode_dirty_datasync() where appropriate). It will cost us some more > > CPU and stack space but if we optimize ext3_dirty_inode() for the case > > where handle is already started, it shouldn't be too bad. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 8:50 ` Jan Kara @ 2009-09-10 9:04 ` Aneesh Kumar K.V 2009-09-10 9:15 ` Jan Kara 2009-09-10 9:15 ` Aneesh Kumar K.V 0 siblings, 2 replies; 16+ messages in thread From: Aneesh Kumar K.V @ 2009-09-10 9:04 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote: > Hi, > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote: > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote: > > > When looking at how ext3/4 handles fsync, I've realized I don't > > > understand how writing out inode on fsync can work. The problem is that > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty > > > the inode. It just copies the in-memory inode content to disk buffer. > > > So in particular the inode looks clean to VFS and our check in > > > ext?_sync_file() shouldn't trigger. > > > The only obvious case when we call mark_inode_dirty() is from write_end > > > functions when we update i_size but that's clearly not enough. Now I did > > > some research why things seem to be actually working. The trick is that > > > when allocating block, we call vfs_dq_alloc_block() which calls > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout > > > logic from breaking! > > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ? > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty); > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not > have a journal (BTW, the inode that gets dirtied in the nojournal case > is the block-device one, not the one whose metadata we mark as dirty, so > it won't work there either - but Google guys are working on this I think). > With a journal the function just calls jbd2_journal_dirty_metadata which > does nothing with the inode. When we don't have a journal handle we do that as a part of journal commit right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty I guess fsync only requires the meta data update to be in journal ? -aneesh ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 9:04 ` Aneesh Kumar K.V @ 2009-09-10 9:15 ` Jan Kara 2009-09-10 9:15 ` Aneesh Kumar K.V 1 sibling, 0 replies; 16+ messages in thread From: Jan Kara @ 2009-09-10 9:15 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Jan Kara, linux-ext4 On Thu 10-09-09 14:34:49, Aneesh Kumar K.V wrote: > On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote: > > Hi, > > > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote: > > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote: > > > > When looking at how ext3/4 handles fsync, I've realized I don't > > > > understand how writing out inode on fsync can work. The problem is that > > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty > > > > the inode. It just copies the in-memory inode content to disk buffer. > > > > So in particular the inode looks clean to VFS and our check in > > > > ext?_sync_file() shouldn't trigger. > > > > The only obvious case when we call mark_inode_dirty() is from write_end > > > > functions when we update i_size but that's clearly not enough. Now I did > > > > some research why things seem to be actually working. The trick is that > > > > when allocating block, we call vfs_dq_alloc_block() which calls > > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout > > > > logic from breaking! > > > > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ? > > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty > > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty); > > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not > > have a journal (BTW, the inode that gets dirtied in the nojournal case > > is the block-device one, not the one whose metadata we mark as dirty, so > > it won't work there either - but Google guys are working on this I think). > > With a journal the function just calls jbd2_journal_dirty_metadata which > > does nothing with the inode. > > When we don't have a journal handle we do that as a part of journal commit ^^^^^ you meant probably "do" > right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty Yes, that happens. But as I said above: a) mark_buffer_dirty() dirties blockdevice inode and thus not the one we check in ext4_sync_file(). b) this happens only after we commit the transaction and only if the buffer is not part of the next transaction. Believe me, I've actually checked with blktrace, that the code does not work in some cases ;). > I guess fsync only requires the meta data update to be in journal ? Yes, to be more precise, it requires transaction with the metadata update to be committed. And the problem is we force a transaction commit in ext4_sync_file() only if the inode has a dirty bit set - which is accidentally set only by quota code. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 9:04 ` Aneesh Kumar K.V 2009-09-10 9:15 ` Jan Kara @ 2009-09-10 9:15 ` Aneesh Kumar K.V 2009-09-10 10:52 ` Jan Kara 1 sibling, 1 reply; 16+ messages in thread From: Aneesh Kumar K.V @ 2009-09-10 9:15 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Thu, Sep 10, 2009 at 02:34:49PM +0530, Aneesh Kumar K.V wrote: > On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote: > > Hi, > > > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote: > > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote: > > > > When looking at how ext3/4 handles fsync, I've realized I don't > > > > understand how writing out inode on fsync can work. The problem is that > > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty > > > > the inode. It just copies the in-memory inode content to disk buffer. > > > > So in particular the inode looks clean to VFS and our check in > > > > ext?_sync_file() shouldn't trigger. > > > > The only obvious case when we call mark_inode_dirty() is from write_end > > > > functions when we update i_size but that's clearly not enough. Now I did > > > > some research why things seem to be actually working. The trick is that > > > > when allocating block, we call vfs_dq_alloc_block() which calls > > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout > > > > logic from breaking! > > > > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ? > > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty > > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty); > > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not > > have a journal (BTW, the inode that gets dirtied in the nojournal case > > is the block-device one, not the one whose metadata we mark as dirty, so > > it won't work there either - but Google guys are working on this I think). > > With a journal the function just calls jbd2_journal_dirty_metadata which > > does nothing with the inode. > > When we don't have a journal handle we do that as a part of journal commit > right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty > > I guess fsync only requires the meta data update to be in journal ? > Adding the file inode to the sb->s_dirty is done through block_write_end ? Why do you mention above that it is not "clearly not enough" ? -aneesh ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 9:15 ` Aneesh Kumar K.V @ 2009-09-10 10:52 ` Jan Kara 2009-09-10 11:04 ` Aneesh Kumar K.V 0 siblings, 1 reply; 16+ messages in thread From: Jan Kara @ 2009-09-10 10:52 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Jan Kara, linux-ext4 On Thu 10-09-09 14:45:51, Aneesh Kumar K.V wrote: > On Thu, Sep 10, 2009 at 02:34:49PM +0530, Aneesh Kumar K.V wrote: > > On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote: > > > Hi, > > > > > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote: > > > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote: > > > > > When looking at how ext3/4 handles fsync, I've realized I don't > > > > > understand how writing out inode on fsync can work. The problem is that > > > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty > > > > > the inode. It just copies the in-memory inode content to disk buffer. > > > > > So in particular the inode looks clean to VFS and our check in > > > > > ext?_sync_file() shouldn't trigger. > > > > > The only obvious case when we call mark_inode_dirty() is from write_end > > > > > functions when we update i_size but that's clearly not enough. Now I did > > > > > some research why things seem to be actually working. The trick is that > > > > > when allocating block, we call vfs_dq_alloc_block() which calls > > > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout > > > > > logic from breaking! > > > > > > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ? > > > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty > > > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty); > > > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not > > > have a journal (BTW, the inode that gets dirtied in the nojournal case > > > is the block-device one, not the one whose metadata we mark as dirty, so > > > it won't work there either - but Google guys are working on this I think). > > > With a journal the function just calls jbd2_journal_dirty_metadata which > > > does nothing with the inode. > > > > When we don't have a journal handle we do that as a part of journal commit > > right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty > > > > I guess fsync only requires the meta data update to be in journal ? > > > > Adding the file inode to the sb->s_dirty is done through block_write_end ? > Why do you mention above that it is not "clearly not enough" ? Where? I don't see block_write_end() marking the inode dirty anywhere... It calls __block_commit_write() and that dirties only buffers, not the inode. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 10:52 ` Jan Kara @ 2009-09-10 11:04 ` Aneesh Kumar K.V 2009-09-10 12:32 ` Jan Kara 2009-09-10 13:10 ` Theodore Tso 0 siblings, 2 replies; 16+ messages in thread From: Aneesh Kumar K.V @ 2009-09-10 11:04 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Thu, Sep 10, 2009 at 12:52:16PM +0200, Jan Kara wrote: > On Thu 10-09-09 14:45:51, Aneesh Kumar K.V wrote: > > On Thu, Sep 10, 2009 at 02:34:49PM +0530, Aneesh Kumar K.V wrote: > > > On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote: > > > > Hi, > > > > > > > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote: > > > > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote: > > > > > > When looking at how ext3/4 handles fsync, I've realized I don't > > > > > > understand how writing out inode on fsync can work. The problem is that > > > > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty > > > > > > the inode. It just copies the in-memory inode content to disk buffer. > > > > > > So in particular the inode looks clean to VFS and our check in > > > > > > ext?_sync_file() shouldn't trigger. > > > > > > The only obvious case when we call mark_inode_dirty() is from write_end > > > > > > functions when we update i_size but that's clearly not enough. Now I did > > > > > > some research why things seem to be actually working. The trick is that > > > > > > when allocating block, we call vfs_dq_alloc_block() which calls > > > > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout > > > > > > logic from breaking! > > > > > > > > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ? > > > > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty > > > > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty); > > > > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not > > > > have a journal (BTW, the inode that gets dirtied in the nojournal case > > > > is the block-device one, not the one whose metadata we mark as dirty, so > > > > it won't work there either - but Google guys are working on this I think). > > > > With a journal the function just calls jbd2_journal_dirty_metadata which > > > > does nothing with the inode. > > > > > > When we don't have a journal handle we do that as a part of journal commit > > > right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty > > > > > > I guess fsync only requires the meta data update to be in journal ? > > > > > > > Adding the file inode to the sb->s_dirty is done through block_write_end ? > > Why do you mention above that it is not "clearly not enough" ? > Where? I don't see block_write_end() marking the inode dirty anywhere... > It calls __block_commit_write() and that dirties only buffers, not the > inode. mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty -aneesh ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 11:04 ` Aneesh Kumar K.V @ 2009-09-10 12:32 ` Jan Kara 2009-09-10 13:10 ` Theodore Tso 1 sibling, 0 replies; 16+ messages in thread From: Jan Kara @ 2009-09-10 12:32 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Jan Kara, linux-ext4 On Thu 10-09-09 16:34:55, Aneesh Kumar K.V wrote: > On Thu, Sep 10, 2009 at 12:52:16PM +0200, Jan Kara wrote: > > On Thu 10-09-09 14:45:51, Aneesh Kumar K.V wrote: > > > On Thu, Sep 10, 2009 at 02:34:49PM +0530, Aneesh Kumar K.V wrote: > > > > On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote: > > > > > Hi, > > > > > > > > > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote: > > > > > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote: > > > > > > > When looking at how ext3/4 handles fsync, I've realized I don't > > > > > > > understand how writing out inode on fsync can work. The problem is that > > > > > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty > > > > > > > the inode. It just copies the in-memory inode content to disk buffer. > > > > > > > So in particular the inode looks clean to VFS and our check in > > > > > > > ext?_sync_file() shouldn't trigger. > > > > > > > The only obvious case when we call mark_inode_dirty() is from write_end > > > > > > > functions when we update i_size but that's clearly not enough. Now I did > > > > > > > some research why things seem to be actually working. The trick is that > > > > > > > when allocating block, we call vfs_dq_alloc_block() which calls > > > > > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout > > > > > > > logic from breaking! > > > > > > > > > > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ? > > > > > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty > > > > > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty); > > > > > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not > > > > > have a journal (BTW, the inode that gets dirtied in the nojournal case > > > > > is the block-device one, not the one whose metadata we mark as dirty, so > > > > > it won't work there either - but Google guys are working on this I think). > > > > > With a journal the function just calls jbd2_journal_dirty_metadata which > > > > > does nothing with the inode. > > > > > > > > When we don't have a journal handle we do that as a part of journal commit > > > > right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty > > > > > > > > I guess fsync only requires the meta data update to be in journal ? > > > > > > > > > > Adding the file inode to the sb->s_dirty is done through block_write_end ? > > > Why do you mention above that it is not "clearly not enough" ? > > Where? I don't see block_write_end() marking the inode dirty anywhere... > > It calls __block_commit_write() and that dirties only buffers, not the > > inode. > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty OK, but that sets only I_DIRTY_PAGES not I_DIRTY_DATA or I_DIRTY. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 11:04 ` Aneesh Kumar K.V 2009-09-10 12:32 ` Jan Kara @ 2009-09-10 13:10 ` Theodore Tso 2009-09-10 14:06 ` Jan Kara 2009-09-10 16:25 ` Aneesh Kumar K.V 1 sibling, 2 replies; 16+ messages in thread From: Theodore Tso @ 2009-09-10 13:10 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Jan Kara, linux-ext4 On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote: > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty We need to be careful here. First of all, mark_buffer_dirty() on the code paths you are talking about is being passed a metadata buffer head. As such, has Jan has pointed out, the bh is part of the buffer cache, so the page->mapping of associated with bh->b_page is the inode of the block device --- *not* the ext4 inode. Secondly, __set_page_dirty calls __mark_inode_dirty passing in I_DIRTY_PAGES --- which should be a hint. What Jan is talking about is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC: * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on * fdatasync(). i_atime is the usual cause. * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of * these changes separately from I_DIRTY_SYNC so that we * don't have to write inode on fdatasync() when only * mtime has changed in it. This is important because ext4_sync_file() (which is called by fsync() and fdatasync()) uses this logic to determine whether or not to call sync_inode(), which is what will force a commit when wbc.sync_mode is set to WB_SYNC_ALL. In fact, I think the problem is worse than Jan is pointing out, because it's not enough that vfs_fq_alloc_space() is calling mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is set, so that fdatasync() will force a commit. I think the right thing to do is to create an _ext[34]_mark_inode_dirty() which takes an extra argument, which controls whether or not we set I_DIRTY_SYNC or I_DIRTY_DATASYNC. In fact, most of the time, we want to be setting I_DIRTY_DATASYNC, so we should probably have ext[34]_mark_inode_dirty() call _ext[34]_mark_inode_dirty() with I_DIRTY_DATASYNC, and then create a ext[34]_mark_inode_nodatasync() version passes in I_DIRTY_SYNC. This will cause pdflush to call ext4_write_inode() more frequently, but pdflush calls write_inode with wait=0, and ext4_write_inode() is a no-op in that case. BTW, while I was looking into this, I noted that the comments ahead of ext[34]_mark_inode_dirty are out of date; they date back to a time when when prune_icache actually was responsible for cleaning dirty inodes; these days, that honor is owned by fs-writeback.c and pdflush.) Also, the second half of the comments in ext4_write_inode(), where they reference mark_inode_dirty() are also painfully out of date, and somewhat misleading as a result. Does this make sense to every one? - Ted ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 13:10 ` Theodore Tso @ 2009-09-10 14:06 ` Jan Kara 2009-09-10 16:52 ` Theodore Tso 2009-09-10 20:14 ` Mingming 2009-09-10 16:25 ` Aneesh Kumar K.V 1 sibling, 2 replies; 16+ messages in thread From: Jan Kara @ 2009-09-10 14:06 UTC (permalink / raw) To: Theodore Tso; +Cc: Aneesh Kumar K.V, Jan Kara, linux-ext4 On Thu 10-09-09 09:10:07, Theodore Tso wrote: > On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote: > > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty > > We need to be careful here. First of all, mark_buffer_dirty() on the > code paths you are talking about is being passed a metadata buffer > head. As such, has Jan has pointed out, the bh is part of the buffer > cache, so the page->mapping of associated with bh->b_page is the inode > of the block device --- *not* the ext4 inode. > > Secondly, __set_page_dirty calls __mark_inode_dirty passing in > I_DIRTY_PAGES --- which should be a hint. What Jan is talking about > is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC: > > * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on > * fdatasync(). i_atime is the usual cause. > * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of > * these changes separately from I_DIRTY_SYNC so that we > * don't have to write inode on fdatasync() when only > * mtime has changed in it. > > This is important because ext4_sync_file() (which is called by fsync() > and fdatasync()) uses this logic to determine whether or not to call > sync_inode(), which is what will force a commit when wbc.sync_mode is > set to WB_SYNC_ALL. Yes, this is exactly what I was trying to point out. > In fact, I think the problem is worse than Jan is pointing out, > because it's not enough that vfs_fq_alloc_space() is calling > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is > set, so that fdatasync() will force a commit. Actually no. mark_inode_dirty() dirties inode with I_DIRTY == (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work. The fact that quota *could* dirty the inode with I_DIRTY_SYNC only is right but that's a separate issue. > I think the right thing to do is to create an > _ext[34]_mark_inode_dirty() which takes an extra argument, which > controls whether or not we set I_DIRTY_SYNC or I_DIRTY_DATASYNC. In > fact, most of the time, we want to be setting I_DIRTY_DATASYNC, so we > should probably have ext[34]_mark_inode_dirty() call > _ext[34]_mark_inode_dirty() with I_DIRTY_DATASYNC, and then create a > ext[34]_mark_inode_nodatasync() version passes in I_DIRTY_SYNC. > > This will cause pdflush to call ext4_write_inode() more frequently, > but pdflush calls write_inode with wait=0, and ext4_write_inode() is a > no-op in that case. Thinking about it, it won't work so easily. The problem is that when pdflush decides to write the inode, it unconditionally clears dirty flags. We could redirty the inode in write_inode() but that's IMHO too ugly to bear it. We could use ext[34] internal inode dirty flags to track when transaction commit is needed on fsync and fdatasync... That would provide the integrity guarantees. The performance problem with this is that because these flags won't get cleared on transaction commit, we'd force transaction commit unnecessarily when it already happened before fsync. So either we can force commit only if inode buffer is part of the running transaction (but that's slightly hacky as in fact we want to force a commit whetever some metadata is part of the running transaction) or we can let inode track TIDs (transaction ID) which must be committed to return from fsync / fdatasync. The last possibility looks the best to me so I'd go for it. I can write a patch next week... > BTW, while I was looking into this, I noted that the comments ahead of > ext[34]_mark_inode_dirty are out of date; they date back to a time > when when prune_icache actually was responsible for cleaning dirty > inodes; these days, that honor is owned by fs-writeback.c and > pdflush.) Also, the second half of the comments in > ext4_write_inode(), where they reference mark_inode_dirty() are also > painfully out of date, and somewhat misleading as a result. Yeah, I also couldn't make sence from some of those comments probably because I lack enough history context ;). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 14:06 ` Jan Kara @ 2009-09-10 16:52 ` Theodore Tso 2009-09-14 16:00 ` Jan Kara 2009-09-10 20:14 ` Mingming 1 sibling, 1 reply; 16+ messages in thread From: Theodore Tso @ 2009-09-10 16:52 UTC (permalink / raw) To: Jan Kara; +Cc: Aneesh Kumar K.V, linux-ext4 On Thu, Sep 10, 2009 at 04:06:36PM +0200, Jan Kara wrote: > > In fact, I think the problem is worse than Jan is pointing out, > > because it's not enough that vfs_fq_alloc_space() is calling > > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch > > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is > > set, so that fdatasync() will force a commit. > Actually no. mark_inode_dirty() dirties inode with I_DIRTY == > (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work. > The fact that quota *could* dirty the inode with I_DIRTY_SYNC only > is right but that's a separate issue. Oops, you're right. I mixed up I_DIRTY and I_DIRTY_SYNC. Hmm, what's actually a bit surprising is that we don't have a mark_inode_dirty_datasync() which only sets I_DIRTY_DATASYNC. And shouldn't quota be dirtying the inode using I_DIRTY_DATASYNC only? After, all the reason why we need to do this is because it's messing with i_size, right? > Thinking about it, it won't work so easily. The problem is that when > pdflush decides to write the inode, it unconditionally clears dirty flags. > We could redirty the inode in write_inode() but that's IMHO too ugly to > bear it. Hmm, yes. I wonder if this is something we should change, and make it be the responsibility of the filesystem's write_inode method function to clear I_DIRTY_SYNC and I_DIRTY_DATASYNC flags. That would allow the file system to refuse to clean the inode for whatever reason the file system saw fit. That would require sweeping through all of the file systems, but it might be useful for more file systems other than ext3/ext4. The problem is it's a little late, given that 2.6.31 has already been released, to try to get consensus on that way of doing things. Tracking the TID is probably the best short-term way of handling this problem, although it means bloating the in-core inode by another four bytes, which is a bit of a shame. - Ted ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 16:52 ` Theodore Tso @ 2009-09-14 16:00 ` Jan Kara 0 siblings, 0 replies; 16+ messages in thread From: Jan Kara @ 2009-09-14 16:00 UTC (permalink / raw) To: Theodore Tso; +Cc: Jan Kara, Aneesh Kumar K.V, linux-ext4 On Thu 10-09-09 12:52:01, Theodore Tso wrote: > On Thu, Sep 10, 2009 at 04:06:36PM +0200, Jan Kara wrote: > > > In fact, I think the problem is worse than Jan is pointing out, > > > because it's not enough that vfs_fq_alloc_space() is calling > > > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch > > > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is > > > set, so that fdatasync() will force a commit. > > Actually no. mark_inode_dirty() dirties inode with I_DIRTY == > > (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work. > > The fact that quota *could* dirty the inode with I_DIRTY_SYNC only > > is right but that's a separate issue. > > Oops, you're right. I mixed up I_DIRTY and I_DIRTY_SYNC. Hmm, what's > actually a bit surprising is that we don't have a > mark_inode_dirty_datasync() which only sets I_DIRTY_DATASYNC. Yeah. But that's easy to fix ;). > And shouldn't quota be dirtying the inode using I_DIRTY_DATASYNC only? > After, all the reason why we need to do this is because it's messing > with i_size, right? Quota does not touch i_size. It touches only i_blocks - that's why it dirties the inode. Since i_blocks are not really needed to get to the data, I think I_DIRTY_SYNC for quota is more appropriate. > > Thinking about it, it won't work so easily. The problem is that when > > pdflush decides to write the inode, it unconditionally clears dirty flags. > > We could redirty the inode in write_inode() but that's IMHO too ugly to > > bear it. > > Hmm, yes. I wonder if this is something we should change, and make it > be the responsibility of the filesystem's write_inode method function > to clear I_DIRTY_SYNC and I_DIRTY_DATASYNC flags. That would allow > the file system to refuse to clean the inode for whatever reason the > file system saw fit. That would require sweeping through all of the > file systems, but it might be useful for more file systems other than > ext3/ext4. Possibly yes. But OTOH from the point of view of pdflush, the inode actually is clean. It could be even evicted from memory when needed before the transaction commits - which invalidates my approach to solving the fsync() trouble. Nasty. Also just not clearing dirty bits in write_inode() would probably rather confuse current writeback code because all inodes would stay dirty until a transaction commit with no way of getting rid of them. So pdflush would needlessly burn CPU cycles scanning them and trying to write them. Also it's not completely clear how we would clear the dirty bits after the transaction commits since during commit, it could become part of a freshly started transaction. Sigh, it's rather convoluted. Probably we have to somehow pin the inode in memory until the transaction commits to fix the "remember TID" approach. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 14:06 ` Jan Kara 2009-09-10 16:52 ` Theodore Tso @ 2009-09-10 20:14 ` Mingming 2009-09-14 15:25 ` Jan Kara 1 sibling, 1 reply; 16+ messages in thread From: Mingming @ 2009-09-10 20:14 UTC (permalink / raw) To: Jan Kara; +Cc: Theodore Tso, Aneesh Kumar K.V, linux-ext4 On Thu, 2009-09-10 at 16:06 +0200, Jan Kara wrote: > On Thu 10-09-09 09:10:07, Theodore Tso wrote: > > On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote: > > > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty > > > > We need to be careful here. First of all, mark_buffer_dirty() on the > > code paths you are talking about is being passed a metadata buffer > > head. As such, has Jan has pointed out, the bh is part of the buffer > > cache, so the page->mapping of associated with bh->b_page is the inode > > of the block device --- *not* the ext4 inode. > > > > Secondly, __set_page_dirty calls __mark_inode_dirty passing in > > I_DIRTY_PAGES --- which should be a hint. What Jan is talking about > > is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC: > > > > * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on > > * fdatasync(). i_atime is the usual cause. > > * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of > > * these changes separately from I_DIRTY_SYNC so that we > > * don't have to write inode on fdatasync() when only > > * mtime has changed in it. > > > > This is important because ext4_sync_file() (which is called by fsync() > > and fdatasync()) uses this logic to determine whether or not to call > > sync_inode(), which is what will force a commit when wbc.sync_mode is > > set to WB_SYNC_ALL. > Yes, this is exactly what I was trying to point out. > > > In fact, I think the problem is worse than Jan is pointing out, > > because it's not enough that vfs_fq_alloc_space() is calling > > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch > > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is > > set, so that fdatasync() will force a commit. > Actually no. mark_inode_dirty() dirties inode with I_DIRTY == > (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work. > The fact that quota *could* dirty the inode with I_DIRTY_SYNC only > is right but that's a separate issue. > > > I think the right thing to do is to create an > > _ext[34]_mark_inode_dirty() which takes an extra argument, which > > controls whether or not we set I_DIRTY_SYNC or I_DIRTY_DATASYNC. In > > fact, most of the time, we want to be setting I_DIRTY_DATASYNC, so we > > should probably have ext[34]_mark_inode_dirty() call > > _ext[34]_mark_inode_dirty() with I_DIRTY_DATASYNC, and then create a > > ext[34]_mark_inode_nodatasync() version passes in I_DIRTY_SYNC. > > > > This will cause pdflush to call ext4_write_inode() more frequently, > > but pdflush calls write_inode with wait=0, and ext4_write_inode() is a > > no-op in that case. > Thinking about it, it won't work so easily. The problem is that when > pdflush decides to write the inode, it unconditionally clears dirty flags. > We could redirty the inode in write_inode() but that's IMHO too ugly to > bear it. I am a little confused here, so pdflush could found the dirty inodes (due to quota) but it doesn't force journal comminit and write the inode to disk? Mingming ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 20:14 ` Mingming @ 2009-09-14 15:25 ` Jan Kara 0 siblings, 0 replies; 16+ messages in thread From: Jan Kara @ 2009-09-14 15:25 UTC (permalink / raw) To: Mingming; +Cc: Jan Kara, Theodore Tso, Aneesh Kumar K.V, linux-ext4 On Thu 10-09-09 13:14:46, Mingming wrote: > On Thu, 2009-09-10 at 16:06 +0200, Jan Kara wrote: > > On Thu 10-09-09 09:10:07, Theodore Tso wrote: > > > On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote: > > > > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty > > > > > > We need to be careful here. First of all, mark_buffer_dirty() on the > > > code paths you are talking about is being passed a metadata buffer > > > head. As such, has Jan has pointed out, the bh is part of the buffer > > > cache, so the page->mapping of associated with bh->b_page is the inode > > > of the block device --- *not* the ext4 inode. > > > > > > Secondly, __set_page_dirty calls __mark_inode_dirty passing in > > > I_DIRTY_PAGES --- which should be a hint. What Jan is talking about > > > is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC: > > > > > > * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on > > > * fdatasync(). i_atime is the usual cause. > > > * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of > > > * these changes separately from I_DIRTY_SYNC so that we > > > * don't have to write inode on fdatasync() when only > > > * mtime has changed in it. > > > > > > This is important because ext4_sync_file() (which is called by fsync() > > > and fdatasync()) uses this logic to determine whether or not to call > > > sync_inode(), which is what will force a commit when wbc.sync_mode is > > > set to WB_SYNC_ALL. > > Yes, this is exactly what I was trying to point out. > > > > > In fact, I think the problem is worse than Jan is pointing out, > > > because it's not enough that vfs_fq_alloc_space() is calling > > > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch > > > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is > > > set, so that fdatasync() will force a commit. > > Actually no. mark_inode_dirty() dirties inode with I_DIRTY == > > (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work. > > The fact that quota *could* dirty the inode with I_DIRTY_SYNC only > > is right but that's a separate issue. > > > > > I think the right thing to do is to create an > > > _ext[34]_mark_inode_dirty() which takes an extra argument, which > > > controls whether or not we set I_DIRTY_SYNC or I_DIRTY_DATASYNC. In > > > fact, most of the time, we want to be setting I_DIRTY_DATASYNC, so we > > > should probably have ext[34]_mark_inode_dirty() call > > > _ext[34]_mark_inode_dirty() with I_DIRTY_DATASYNC, and then create a > > > ext[34]_mark_inode_nodatasync() version passes in I_DIRTY_SYNC. > > > > > > This will cause pdflush to call ext4_write_inode() more frequently, > > > but pdflush calls write_inode with wait=0, and ext4_write_inode() is a > > > no-op in that case. > > Thinking about it, it won't work so easily. The problem is that when > > pdflush decides to write the inode, it unconditionally clears dirty flags. > > We could redirty the inode in write_inode() but that's IMHO too ugly to > > bear it. > > I am a little confused here, so pdflush could found the dirty inodes > (due to quota) but it doesn't force journal comminit and write the inode > to disk? pdflush never forces a journal commit. It just does a periodic writeout so there's no need to force a journal commit. It just calls write_inode(), which ext3/4 implement as NOP because the inode is already part of the running transaction when it is dirty. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: fsync on ext[34] working only by an accident 2009-09-10 13:10 ` Theodore Tso 2009-09-10 14:06 ` Jan Kara @ 2009-09-10 16:25 ` Aneesh Kumar K.V 1 sibling, 0 replies; 16+ messages in thread From: Aneesh Kumar K.V @ 2009-09-10 16:25 UTC (permalink / raw) To: Theodore Tso; +Cc: Jan Kara, linux-ext4 On Thu, Sep 10, 2009 at 09:10:07AM -0400, Theodore Tso wrote: > On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote: > > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty > > We need to be careful here. First of all, mark_buffer_dirty() on the > code paths you are talking about is being passed a metadata buffer > head. As such, has Jan has pointed out, the bh is part of the buffer > cache, so the page->mapping of associated with bh->b_page is the inode > of the block device --- *not* the ext4 inode. > > Secondly, __set_page_dirty calls __mark_inode_dirty passing in > I_DIRTY_PAGES --- which should be a hint. What Jan is talking about > is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC: > > * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on > * fdatasync(). i_atime is the usual cause. > * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of > * these changes separately from I_DIRTY_SYNC so that we > * don't have to write inode on fdatasync() when only > * mtime has changed in it. > > This is important because ext4_sync_file() (which is called by fsync() > and fdatasync()) uses this logic to determine whether or not to call > sync_inode(), which is what will force a commit when wbc.sync_mode is > set to WB_SYNC_ALL. > > In fact, I think the problem is worse than Jan is pointing out, > because it's not enough that vfs_fq_alloc_space() is calling > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is > set, so that fdatasync() will force a commit. > That explained it pretty nicely. Thanks -aneesh ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-09-14 16:00 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-09-08 13:26 fsync on ext[34] working only by an accident Jan Kara 2009-09-10 6:46 ` Aneesh Kumar K.V 2009-09-10 8:50 ` Jan Kara 2009-09-10 9:04 ` Aneesh Kumar K.V 2009-09-10 9:15 ` Jan Kara 2009-09-10 9:15 ` Aneesh Kumar K.V 2009-09-10 10:52 ` Jan Kara 2009-09-10 11:04 ` Aneesh Kumar K.V 2009-09-10 12:32 ` Jan Kara 2009-09-10 13:10 ` Theodore Tso 2009-09-10 14:06 ` Jan Kara 2009-09-10 16:52 ` Theodore Tso 2009-09-14 16:00 ` Jan Kara 2009-09-10 20:14 ` Mingming 2009-09-14 15:25 ` Jan Kara 2009-09-10 16:25 ` Aneesh Kumar K.V
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).