From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ted Ts'o
Subject: Re: [PATCH 3/3] ext4: fix block swap procedure on migration V2
Date: Sat, 29 Oct 2011 08:54:51 -0400
Message-ID: <20111029125451.GC19536@thunk.org>
References: <1316266379-18737-1-git-send-email-dmonakhov@openvz.org>
	<1316266379-18737-3-git-send-email-dmonakhov@openvz.org>
	<4CCE9F3D-0343-4FF2-A969-291335071D69@dilger.ca>
	<87wrd4zdjn.fsf@dmbot.sw.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Andreas Dilger, linux-ext4@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com
To: Dmitry Monakhov
Return-path:
Received: from li9-11.members.linode.com ([67.18.176.11]:40653
	"EHLO test.thunk.org" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S933761Ab1J2SMm (ORCPT );
	Sat, 29 Oct 2011 14:12:42 -0400
Content-Disposition: inline
In-Reply-To: <87wrd4zdjn.fsf@dmbot.sw.ru>
Sender: linux-ext4-owner@vger.kernel.org
List-ID:

On Mon, Sep 19, 2011 at 08:45:32PM +0400, Dmitry Monakhov wrote:
> > It would potentially be better to just leave the inode off the orphan
> > list, and in the _extremely_ rare case that there is a crash during
> > inode migration the inode is leaked until e2fsck is run. The migrate
> > will happen at most once for any filesystem, so the loss of a single
> > inode per crash will not be a serious issue IMHO.
>
> But we still need to tell e2fsck that it is not just an average inode,
> and it should be cleaned up without touching its data blocks. Otherwise
> fsck will complain about blocks being referenced by multiple inodes,
> which is a really scary message. So IMHO we still need a persistent flag.

The simplest way to solve this is to keep the n_links count on the
inode at zero. That will prevent e2fsck from treating the inode as
being in use, so it will simply ignore the blocks owned by the
temporary inode.

What this would mean is that we will leak the temporary inode as well
as the newly allocated index blocks until the next e2fsck, but that's
not a disaster, and it is a very rare case.

If we wanted to optimize things further, we could also add support to
__ext4_get_inode_loc(), ext4_mark_inode_dirty(), and ext4_write_inode()
so that if the inode number is some magic number, such as inode number
1 (which, as the bad block inode, we never touch directly), the inode
is considered to be in-memory only, and so we don't even bother
writing it to disk or journaling it. That way the migration code can
use an in-memory inode which is allocated by ext4_ext_migrate(), which
exists only as a pointer in that kernel stack frame, and which has no
existence on disk.

With this optimized approach, the only thing we will leak is one,
maybe two extent tree index blocks *if* we happen to be migrating an
inode during the time of a system crash. We could even avoid that by
adding support for blocks which are allocated in memory, but not (yet)
pushed out to disk, which we may need for some of the write path
improvements. But if we don't get to that right away, I think that's
fine....

						- Ted