From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
Received: with ECARTIS (v1.0.0; list xfs); Thu, 23 Oct 2008 23:08:21 -0700 (PDT)
Received: from cuda.sgi.com (cuda2.sgi.com [192.48.168.29]) by oss.sgi.com
	(8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m9O68BfJ011108
	for ; Thu, 23 Oct 2008 23:08:13 -0700
Received: from ipmail01.adl6.internode.on.net (localhost [127.0.0.1]) by
	cuda.sgi.com (Spam Firewall) with ESMTP id 81D19538780
	for ; Thu, 23 Oct 2008 23:08:10 -0700 (PDT)
Received: from ipmail01.adl6.internode.on.net (ipmail01.adl6.internode.on.net
	[203.16.214.146]) by cuda.sgi.com with ESMTP id G39SoEeukCGtC8Ly
	for ; Thu, 23 Oct 2008 23:08:10 -0700 (PDT)
Date: Fri, 24 Oct 2008 17:08:06 +1100
From: Dave Chinner
Subject: Re: [PATCH, RFC] Re: atime not written to disk
Message-ID: <20081024060806.GJ18495@disturbed>
References: <48FD74CC.907@sgi.com> <48FD7B69.3090600@wm.jp.nec.com>
	<20081022081710.GL18495@disturbed> <20081023042053.GY18495@disturbed>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: 
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
List-Id: xfs
To: Niv Sardi
Cc: Utako Kusaka, Timothy Shimmin, xfs

On Fri, Oct 24, 2008 at 04:37:59PM +1100, Niv Sardi wrote:
> Dave Chinner writes:
>
> > On Thu, Oct 23, 2008 at 01:53:23PM +1100, Niv Sardi wrote:
> >> Dave Chinner writes:
> >> [...]
> >> > As I mentioned on IRC, Tim, the following patch fixes the above test
> >> > case. It will make XFS behave like other filesystems w.r.t. atime,
> >> > instead of defaulting to relatime-like behaviour. This will have
> >> > performance impact unless ppl now add the relatime mount option.
> >>                                             ^^^^^^^^^^^^^^^^^^^
> >>
> >> I don't really like it, and don't think there is a real justification
> >> to do it.
>
> > You probably missed the context of that patch (which I mentioned on
> > IRC to Tim).
>
> Thank you for the refresh.
>
> > That was that it is part of a larger patch set that tracks dirty state
> > in the XFS inode radix tree. In that case, the ->dirty_inode callout
> > is used to set a tag in the per-ag inode radix tree. I want to
> > do it this way so there is a single place that the dirty tag
> > is set, whether it is due to an unlogged change (XFS or VFS) or
> > a transaction commit....
>
> So, to rephrase: you have a patchset that will use that callback better,
> and then we will not need the patch you actually sent? Why not wait for
> that patchset? This is not exactly a critical bug…

No, that patch is part of the patchset. In fact, it's the first patch
in a patchset of 4. The following patches are:

	xfs-iradix-set-dirty-tag
	xfs-iradix-use-dirty-tag-for-clustering
	xfs-iradix-use-dirty-tag-for-sync

i.e. using the radix tree for finding sequential dirty inode clusters
for writeback....

It's not complete yet, because I'm trying to decide whether it is
better to avoid inode writeback from ->write_inode altogether and only
do optimised ascending-order inode writeback from xfs_sync_inodes()
and/or AIL tail pushing....

> >> Why not only do:
> […]
> > This doesn't avoid the performance problem - it simply shifts it to
> > the reclaim code. Think about it - you traverse a million inodes
> > stat()ing them all (the nightly backup). Instead of having the now
> > dirty inodes written back as you go along, you wait until memory
> > reclaim turfs them out to mark them dirty and then write them back.
> > This means memory reclaim goes *much slower* because reclaiming the
> > memory now has to wait for I/O to complete.
>
> Yep, I just read the code. Do we have a way to know that we are
> unmounting? Maybe the good way to do it is to only write atime on
> unmount?
The check for that is:

	if (!(sb->s_flags & MS_ACTIVE)) {
		/* we are unmounting */
	}

Even so, we don't want unmount to take hours - especially since the
unmounts that would trigger this problem often occur in planned
maintenance windows that aren't hours long.....

> I'm concerned about the overhead, with your patch, that we'll be
> writing things even if nothing has changed; not sure about how many

We won't write an inode if it hasn't changed. The VFS calling
mark_inode_dirty_sync() is not "nothing".

> times ->dirty_inode is called and such. I don't like the idea of
> slowing things down for everyone or forcing people to use relatime.

I didn't suggest we force ppl to use relatime - I suggested that we
default to it behind the covers and then pay attention to all dirty
events coming from the VFS. This should result in practically no
increase in the number of atime-only inode writes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
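[Editor's note] The per-AG dirty-tag scheme discussed in the thread -
set the tag in one place, then have sync walk dirty inodes in ascending
order - can be modelled in miniature. This is a userspace sketch under
stated assumptions: the kernel side would use radix-tree tags
(radix_tree_tag_set() and tag-based gang lookups), not this toy array,
and all names here (ag_index, ag_tag_dirty, ag_next_dirty) are
illustrative, not the XFS implementation.

```c
#include <stdbool.h>

/*
 * Toy model of a per-AG inode index with a "dirty" tag per slot.
 * The real code keeps this in the per-AG inode radix tree; a flat
 * bool array is used here only to show the access pattern.
 */
#define AG_INODES 64

struct ag_index {
	bool dirty[AG_INODES];	/* one tag bit per inode slot */
};

/*
 * The single place the dirty tag is set, whether the change came from
 * an unlogged VFS update (e.g. atime) or a transaction commit.
 */
static void ag_tag_dirty(struct ag_index *ag, unsigned int ino)
{
	if (ino < AG_INODES)
		ag->dirty[ino] = true;
}

/*
 * Ascending scan: find the next dirty inode at or after 'start',
 * or -1 when none remain. Sync repeats this to write dirty inodes
 * in sequential clusters rather than in reclaim/LRU order.
 */
static int ag_next_dirty(const struct ag_index *ag, unsigned int start)
{
	for (unsigned int i = start; i < AG_INODES; i++)
		if (ag->dirty[i])
			return (int)i;
	return -1;
}
```

Repeatedly calling ag_next_dirty() from the last hit plus one is the
ascending-order writeback walk the patchset aims for; the tag also
spares sync from visiting clean inodes at all.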
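[Editor's note] The "default to relatime behind the covers" idea boils
down to a cheap timestamp comparison before dirtying the inode: skip an
atime-only write unless atime has fallen behind mtime or ctime, or has
aged past some bound. A minimal sketch, assuming plain time_t values;
the function name and the 24-hour refresh bound are assumptions made
for illustration, not what XFS or the VFS actually use.

```c
#include <time.h>

/* Illustrative bound: refresh a fresh atime at most once a day. */
#define ATIME_REFRESH_SECS (24 * 60 * 60)

/*
 * Decide whether an access should dirty the inode for an atime-only
 * write. Returns 1 when the write is needed, 0 when it can be skipped.
 */
static int atime_needs_update(time_t atime, time_t mtime, time_t ctime_,
			      time_t now)
{
	if (atime < mtime || atime < ctime_)
		return 1;	/* atime is stale relative to a real change */
	if (now - atime >= ATIME_REFRESH_SECS)
		return 1;	/* atime too old: refresh it anyway */
	return 0;		/* suppress the atime-only inode write */
}
```

Under this policy the nightly-backup stat() sweep dirties each inode at
most once per bound instead of on every access, which is why the mail
expects practically no increase in atime-only inode writes.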