atime not written to disk

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* atime not written to disk
@ 2008-10-21  6:21 Timothy Shimmin
  2008-10-21  6:49 ` Utako Kusaka
  0 siblings, 1 reply; 8+ messages in thread
From: Timothy Shimmin @ 2008-10-21  6:21 UTC (permalink / raw)
  To: xfs-oss

Hi,

Before I investigate further ;-),
it appears that in XFS (seen in recent xfs-dev tree and on older issp release
on default mkfs/mount options),
that the atime is not being written out to disk in xfs,
at least, in the simple scenario below.

emu:/home/tes # echo bill >/mnt/test/bill
emu:/home/tes # ls -l /mnt/test/bill
-rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill
emu:/home/tes # ls -lu /mnt/test/bill
-rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill

... wait a bit to change the atime...

emu:/home/tes # cat /mnt/test/bill
bill
emu:/home/tes # ls -lu /mnt/test/bill
-rw-r--r-- 1 root root 5 2008-10-21 16:11 /mnt/test/bill

emu:/home/tes # cd /
emu:/ # umount /mnt/test
emu:/ # mount /mnt/test
emu:/mnt/test # ls -lu /mnt/test/bill
-rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill

I believe that the atime is handled by the vfs and in xfs_iflush_int
we sync up with the linux inode.
Perhaps i_update_core needs to be set so that xfs_iflush_int
will proceed and call xfs_synchronize_atime() and
somehow that is not happening??

I haven't looked further yet, but I thought I'd ask on the list
in case others, such as Christoph and Dave have thoughts on this.

Thanks,
Tim.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: atime not written to disk
  2008-10-21  6:21 atime not written to disk Timothy Shimmin
@ 2008-10-21  6:49 ` Utako Kusaka
  2008-10-22  8:17   ` [PATCH, RFC] " Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Utako Kusaka @ 2008-10-21  6:49 UTC (permalink / raw)
  To: Timothy Shimmin; +Cc: xfs

Hi,

Your problem seems the same as mine.
http://oss.sgi.com/archives/xfs/2007-10/msg00168.html

Utako.

Timothy Shimmin wrote:
> Hi,
> 
> Before I investigate further ;-),
> it appears that in XFS (seen in recent xfs-dev tree and on older issp release
> on default mkfs/mount options),
> that the atime is not being written out to disk in xfs,
> at least, in the simple scenario below.
> 
> emu:/home/tes # echo bill >/mnt/test/bill
> emu:/home/tes # ls -l /mnt/test/bill
> -rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill
> emu:/home/tes # ls -lu /mnt/test/bill
> -rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill
> 
> ... wait a bit to change the atime...
> 
> emu:/home/tes # cat /mnt/test/bill
> bill
> emu:/home/tes # ls -lu /mnt/test/bill
> -rw-r--r-- 1 root root 5 2008-10-21 16:11 /mnt/test/bill
> 
> emu:/home/tes # cd /
> emu:/ # umount /mnt/test
> emu:/ # mount /mnt/test
> emu:/mnt/test # ls -lu /mnt/test/bill
> -rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill
> 
> I believe that the atime is handled by the vfs and in xfs_iflush_int
> we sync up with the linux inode.
> Perhaps i_update_core needs to be set so that xfs_iflush_int
> will proceed and call xfs_synchronize_atime() and
> somehow that is not happening??
> 
> I haven't looked further yet, but I thought I'd ask on the list
> in case others, such as Christoph and Dave have thoughts on this.
> 
> Thanks,
> Tim.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH, RFC] Re: atime not written to disk
  2008-10-21  6:49 ` Utako Kusaka
@ 2008-10-22  8:17   ` Dave Chinner
  2008-10-23  2:52     ` Niv Sardi
  2008-10-23  2:53     ` Niv Sardi
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2008-10-22  8:17 UTC (permalink / raw)
  To: Utako Kusaka; +Cc: Timothy Shimmin, xfs

On Tue, Oct 21, 2008 at 03:49:13PM +0900, Utako Kusaka wrote:
> Hi,
>
> Your problem seems the same as mine.
> http://oss.sgi.com/archives/xfs/2007-10/msg00168.html
>
> Utako.
>
> Timothy Shimmin wrote:
>> Hi,
>>
>> Before I investigate further ;-),
>> it appears that in XFS (seen in recent xfs-dev tree and on older issp release
>> on default mkfs/mount options),
>> that the atime is not being written out to disk in xfs,
>> at least, in the simple scenario below.
>>
>> emu:/home/tes # echo bill >/mnt/test/bill
>> emu:/home/tes # ls -l /mnt/test/bill
>> -rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill
>> emu:/home/tes # ls -lu /mnt/test/bill
>> -rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill
>>
>> ... wait a bit to change the atime...
>>
>> emu:/home/tes # cat /mnt/test/bill
>> bill
>> emu:/home/tes # ls -lu /mnt/test/bill
>> -rw-r--r-- 1 root root 5 2008-10-21 16:11 /mnt/test/bill
>>
>> emu:/home/tes # cd /
>> emu:/ # umount /mnt/test
>> emu:/ # mount /mnt/test
>> emu:/mnt/test # ls -lu /mnt/test/bill
>> -rw-r--r-- 1 root root 5 2008-10-21 16:03 /mnt/test/bill

As I mentioned on IRC, Tim, the following patch fixes the above test
case. It will make XFS behave like other filesystems w.r.t. atime,
instead of defaulting to relatime-like behaviour. This will have
performance impact unless ppl now add the relatime mount option.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


XFS: Implement ->dirty_inode callout

Hook up ->dirty_inode so that when the VFS dirties an inode we
can mark the XFS inode as "dirty with unlogged changes". This
allows events such as touch_atime() to propagate the dirty state
right through to XFS so it gets written back to disk.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_aops.c  |    1 -
 fs/xfs/linux-2.6/xfs_iops.c  |   14 +++-----------
 fs/xfs/linux-2.6/xfs_super.c |   23 +++++++++++++++++++++++
 fs/xfs/xfs_inode_item.c      |   14 +++++++++-----
 4 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 8fbc97d..6f4ebd0 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -189,7 +189,6 @@ xfs_setfilesize(
 
 	if (ip->i_d.di_size < isize) {
 		ip->i_d.di_size = isize;
-		ip->i_update_core = 1;
 		ip->i_update_size = 1;
 		xfs_mark_inode_dirty_sync(ip);
 	}
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 37bb101..b7deff9 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -117,19 +117,11 @@ xfs_ichgtime(
 	}
 
 	/*
-	 * We update the i_update_core field _after_ changing
-	 * the timestamps in order to coordinate properly with
-	 * xfs_iflush() so that we don't lose timestamp updates.
-	 * This keeps us from having to hold the inode lock
-	 * while doing this.  We use the SYNCHRONIZE macro to
-	 * ensure that the compiler does not reorder the update
-	 * of i_update_core above the timestamp updates above.
+	 * Update complete - now make sure everyone knows that the inode
+	 * is dirty.
 	 */
-	if (sync_it) {
-		SYNCHRONIZE();
-		ip->i_update_core = 1;
+	if (sync_it)
 		xfs_mark_inode_dirty_sync(ip);
-	}
 }
 
 /*
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 3ae8051..a5570b5 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -928,6 +928,28 @@ xfs_fs_inode_init_once(
 }
 
 /*
+ * Dirty the XFS inode when mark_inode_dirty_sync() is called so that
+ * we catch unlogged VFS level updates to the inode. Care must be taken
+ * here - the transaction code calls mark_inode_dirty_sync() to mark the
+ * VFS inode dirty in a transaction and clears the i_update_core field;
+ * it must clear the field after calling mark_inode_dirty_sync() to
+ * correctly indicate that the dirty state has been propagated into the
+ * inode log item.
+ *
+ * We need the barrier() to maintain correct ordering between unlogged
+ * updates and the transaction commit code that clears the i_update_core
+ * field. This requires all updates to be completed before marking the
+ * inode dirty.
+ */
+STATIC void
+xfs_fs_dirty_inode(
+	struct inode    *inode)
+{
+	barrier();
+	XFS_I(inode)->i_update_core = 1;
+}
+
+/*
  * Attempt to flush the inode, this will actually fail
  * if the inode is pinned, but we dirty the inode again
  * at the point when it is unpinned after a log write,
@@ -1712,6 +1734,7 @@ xfs_fs_get_sb(
 static struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
+	.dirty_inode		= xfs_fs_dirty_inode,
 	.write_inode		= xfs_fs_write_inode,
 	.clear_inode		= xfs_fs_clear_inode,
 	.put_super		= xfs_fs_put_super,
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index aa9bf05..89d480c 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -232,6 +232,15 @@ xfs_inode_item_format(
 	nvecs	     = 1;
 
 	/*
+	 * Make sure the linux inode is dirty. We do this before
+	 * clearing i_update_core as the VFS will call back into
+	 * XFS here and set i_update_core, so we need to dirty the
+	 * inode first so that the ordering of i_update_core and
+	 * unlogged modifications still works as described below.
+	 */
+	xfs_mark_inode_dirty_sync(ip);
+
+	/*
 	 * Clear i_update_core if the timestamps (or any other
 	 * non-transactional modification) need flushing/logging
 	 * and we're about to log them with the rest of the core.
@@ -275,11 +284,6 @@ xfs_inode_item_format(
 	 */
 	xfs_synchronize_atime(ip);
 
-	/*
-	 * make sure the linux inode is dirty
-	 */
-	xfs_mark_inode_dirty_sync(ip);
-
 	vecp->i_addr = (xfs_caddr_t)&ip->i_d;
 	vecp->i_len  = sizeof(xfs_dinode_core_t);
 	XLOG_VEC_SET_TYPE(vecp, XLOG_REG_TYPE_ICORE);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH, RFC] Re: atime not written to disk
  2008-10-22  8:17   ` [PATCH, RFC] " Dave Chinner
@ 2008-10-23  2:52     ` Niv Sardi
  2008-10-23  2:53     ` Niv Sardi
  1 sibling, 0 replies; 8+ messages in thread
From: Niv Sardi @ 2008-10-23  2:52 UTC (permalink / raw)
  To: Utako Kusaka; +Cc: Timothy Shimmin, xfs

Dave Chinner <david@fromorbit.com> writes:
[...]
> As I mentioned on IRC, Tim, the following patch fixes the above test
> case. It will make XFS behave like other filesystems w.r.t. atime,
> instead of defaulting to relatime-like behaviour. This will have
> performance impact unless ppl now add the relatime mount option.
 ^^^^^^^^^^^^^^^^^^^

I don't really like it, and don't think there is a real justification to
do it. Why not only do:

-- 
Niv Sardi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH, RFC] Re: atime not written to disk
  2008-10-22  8:17   ` [PATCH, RFC] " Dave Chinner
  2008-10-23  2:52     ` Niv Sardi
@ 2008-10-23  2:53     ` Niv Sardi
  2008-10-23  4:20       ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Niv Sardi @ 2008-10-23  2:53 UTC (permalink / raw)
  To: Utako Kusaka; +Cc: Timothy Shimmin, xfs

[-- Attachment #1: Type: text/plain, Size: 443 bytes --]

Dave Chinner <david@fromorbit.com> writes:
[...]
> As I mentioned on IRC, Tim, the following patch fixes the above test
> case. It will make XFS behave like other filesystems w.r.t. atime,
> instead of defaulting to relatime-like behaviour. This will have
> performance impact unless ppl now add the relatime mount option.
 ^^^^^^^^^^^^^^^^^^^

I don't really like it, and don't think there is a real justification to
do it. Why not only do:


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Mark-core-as-dirty-when-atime-updated-needed-for-XFS.patch --]
[-- Type: text/x-diff, Size: 1443 bytes --]

>From 96be907a11f2cad0d6c8696737bb144b1275ce5a Mon Sep 17 00:00:00 2001
From: Niv Sardi <xaiki@debian.org>
Date: Thu, 23 Oct 2008 13:41:09 +1100
Subject: [PATCH] Mark core as dirty when atime updated needed for XFS inode

Currently we're not writting atime updates back to disk on unmount
if it's the only change that happened. One solution would be to implement
->dirty_inode() but that might be overkill and has a perf hit, instead
just tag as needed for update in reclaim()

Signed-off-by: Niv Sardi <xaiki@debian.org>
---
 fs/xfs/xfs_vnodeops.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index e257f65..828d398 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -2840,6 +2840,7 @@ int
 xfs_reclaim(
 	xfs_inode_t	*ip)
 {
+	struct inode	*inode = VFS_I(ip);
 
 	xfs_itrace_entry(ip);
 
@@ -2856,10 +2857,13 @@ xfs_reclaim(
 	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
 
 	/*
-	 * Make sure the atime in the XFS inode is correct before freeing the
-	 * Linux inode.
+	 * Mark dirty and update atime only if different from the linux inode one.
 	 */
-	xfs_synchronize_atime(ip);
+	if (! timespec_equal(&inode->i_atime,
+			     (struct timespec *) &ip->i_d.di_atime)) {
+		xfs_synchronize_atime(ip);
+		ip->i_update_core = 1;
+	}
 
 	/*
 	 * If we have nothing to flush with this inode then complete the
-- 
1.5.6.2


[-- Attachment #3: Type: text/plain, Size: 15 bytes --]


-- 
Niv Sardi

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH, RFC] Re: atime not written to disk
  2008-10-23  2:53     ` Niv Sardi
@ 2008-10-23  4:20       ` Dave Chinner
  2008-10-24  5:37         ` Niv Sardi
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2008-10-23  4:20 UTC (permalink / raw)
  To: Niv Sardi; +Cc: Utako Kusaka, Timothy Shimmin, xfs

On Thu, Oct 23, 2008 at 01:53:23PM +1100, Niv Sardi wrote:
> Dave Chinner <david@fromorbit.com> writes:
> [...]
> > As I mentioned on IRC, Tim, the following patch fixes the above test
> > case. It will make XFS behave like other filesystems w.r.t. atime,
> > instead of defaulting to relatime-like behaviour. This will have
> > performance impact unless ppl now add the relatime mount option.
>  ^^^^^^^^^^^^^^^^^^^
> 
> I don't really like it, and don't think there is a real justification to
> do it.

You probably missed the context of that patch (which I mentioned on
IRC to Tim).

That was that it is part of a larger patch set that tracks dirty state
in the XFS inode radix tree. In that case, the ->dirty_inode callout
is used to set a tag in the per-ag inode radix tree. I want to
do it this way so there is a single place that the dirty tag
is set, whether it is due to an unlogged change (XFS or VFS) or
a transaction commit....

Also, if you read the referenced thread, Christoph made the comment
that the VFS was not letting us know that the inode had been dirtied and
that we really needed to know about that.  The patch I sent isn't
intended to do this, not as a fix for atime updates...

> Why not only do:
>
> From 96be907a11f2cad0d6c8696737bb144b1275ce5a Mon Sep 17 00:00:00 2001
> From: Niv Sardi <xaiki@debian.org>
> Date: Thu, 23 Oct 2008 13:41:09 +1100
> Subject: [PATCH] Mark core as dirty when atime updated needed for XFS inode
> 
> Currently we're not writting atime updates back to disk on unmount
> if it's the only change that happened. One solution would be to implement
> ->dirty_inode() but that might be overkill and has a perf hit, instead
> just tag as needed for update in reclaim()
> 
> Signed-off-by: Niv Sardi <xaiki@debian.org>
> ---
>  fs/xfs/xfs_vnodeops.c |   10 +++++++---
>  1 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
> index e257f65..828d398 100644
> --- a/fs/xfs/xfs_vnodeops.c
> +++ b/fs/xfs/xfs_vnodeops.c
> @@ -2840,6 +2840,7 @@ int
>  xfs_reclaim(
>  	xfs_inode_t	*ip)
>  {
> +	struct inode	*inode = VFS_I(ip);
>  
>  	xfs_itrace_entry(ip);
>  
> @@ -2856,10 +2857,13 @@ xfs_reclaim(
>  	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
>  
>  	/*
> -	 * Make sure the atime in the XFS inode is correct before freeing the
> -	 * Linux inode.
> +	 * Mark dirty and update atime only if different from the linux inode one.
>  	 */
> -	xfs_synchronize_atime(ip);
> +	if (! timespec_equal(&inode->i_atime,
> +			     (struct timespec *) &ip->i_d.di_atime)) {
> +		xfs_synchronize_atime(ip);
> +		ip->i_update_core = 1;
> +	}

This doesn't avoid the performance problem - it simply shifts it to
the reclaim code.  Think about it - you traverse a million inodes
stat()ing them all (the nightly backup). Instead of having the now
dirty inodes written back as you go along, you wait until memory
reclaim turfs them out to mark them dirty and then write them back.
This means memory reclaim goes *much slower* because reclaiming the
memory now has to wait for I/O to complete.

The speed of inode reclaim is already a problem when we only have
clean inodes.  If we now have to write back all these inodes, we'll
get at best a couple of thousand inodes cleaned a second (with
current writeback algorithms) and it could be as little as 50
inodes/s when software RAID5 on SATA disks is involved. That means
reclaim will be very slow, as will unmount. This could make low
memory situations effectively unrecoverable when the inode caches
dominate memory usage (e.g. your build servers after the nightly
backup has run).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH, RFC] Re: atime not written to disk
  2008-10-23  4:20       ` Dave Chinner
@ 2008-10-24  5:37         ` Niv Sardi
  2008-10-24  6:08           ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Niv Sardi @ 2008-10-24  5:37 UTC (permalink / raw)
  To: Utako Kusaka; +Cc: Timothy Shimmin, xfs

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Oct 23, 2008 at 01:53:23PM +1100, Niv Sardi wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>> [...]
>> > As I mentioned on IRC, Tim, the following patch fixes the above test
>> > case. It will make XFS behave like other filesystems w.r.t. atime,
>> > instead of defaulting to relatime-like behaviour. This will have
>> > performance impact unless ppl now add the relatime mount option.
>>  ^^^^^^^^^^^^^^^^^^^
>> 
>> I don't really like it, and don't think there is a real justification to
>> do it.
>
> You probably missed the context of that patch (which I mentioned on
> IRC to Tim).

Thank you for the refresh.

> That was that it is part of a larger patch set that tracks dirty state
> in the XFS inode radix tree. In that case, the ->dirty_inode callout
> is used to set a tag in the per-ag inode radix tree. I want to
> do it this way so there is a single place that the dirty tag
> is set, whether it is due to an unlogged change (XFS or VFS) or
> a transaction commit....

So, to refrase, you have a patchset, that will use that callback better,
and then we will not need the patch you actually sent ? why not wait for
that patchset, this is not exactly a critical bug…

> Also, if you read the referenced thread, Christoph made the comment
> that the VFS was not letting us know that the inode had been dirtied and
> that we really needed to know about that.  The patch I sent isn't
> intended to do this, not as a fix for atime updates...

>> Why not only do:
[…]
> This doesn't avoid the performance problem - it simply shifts it to
> the reclaim code.  Think about it - you traverse a million inodes
> stat()ing them all (the nightly backup). Instead of having the now
> dirty inodes written back as you go along, you wait until memory
> reclaim turfs them out to mark them dirty and then write them back.
> This means memory reclaim goes *much slower* because reclaiming the
> memory now has to wait for I/O to complete.

yep, I just read the code, do we have a way to know that we are
unmounting ? maybe the good way to do it is to only write atime on
unmount ? the call trace is:

 [xfs]xfs_fs_clear_inode+0xd4
 clear_inode+0x7d
 dispose_list+0x5b
 invalidate_inodes+0xe0
 generic_shutdown_super+0x3a
 kill_block_super+0x15
 deactivate_super+0x62
 mntput_no_expire+0xe9
 sys_umount+0x2d1

So it doesn't look obvious to me, but sure there is a way to know if
we're in umount right ?

I'm concerned about the overhead, with your patch, that we'll be
writting things even if nothing has changed, not sure about how many
times ->dirty_inodes is called and such. I don't like the idea to slow
things down for everyone or force people to use realtime

Cheers,
-- 
Niv Sardi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH, RFC] Re: atime not written to disk
  2008-10-24  5:37         ` Niv Sardi
@ 2008-10-24  6:08           ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2008-10-24  6:08 UTC (permalink / raw)
  To: Niv Sardi; +Cc: Utako Kusaka, Timothy Shimmin, xfs

On Fri, Oct 24, 2008 at 04:37:59PM +1100, Niv Sardi wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Thu, Oct 23, 2008 at 01:53:23PM +1100, Niv Sardi wrote:
> >> Dave Chinner <david@fromorbit.com> writes:
> >> [...]
> >> > As I mentioned on IRC, Tim, the following patch fixes the above test
> >> > case. It will make XFS behave like other filesystems w.r.t. atime,
> >> > instead of defaulting to relatime-like behaviour. This will have
> >> > performance impact unless ppl now add the relatime mount option.
> >>  ^^^^^^^^^^^^^^^^^^^
> >> 
> >> I don't really like it, and don't think there is a real justification to
> >> do it.
> >
> > You probably missed the context of that patch (which I mentioned on
> > IRC to Tim).
> 
> Thank you for the refresh.
> 
> > That was that it is part of a larger patch set that tracks dirty state
> > in the XFS inode radix tree. In that case, the ->dirty_inode callout
> > is used to set a tag in the per-ag inode radix tree. I want to
> > do it this way so there is a single place that the dirty tag
> > is set, whether it is due to an unlogged change (XFS or VFS) or
> > a transaction commit....
> 
> So, to refrase, you have a patchset, that will use that callback better,
> and then we will not need the patch you actually sent ? why not wait for
> that patchset, this is not exactly a critical bug…

No, that patch is part of the patchset. In fact - it's the first
patch in the patchset of 4 patches. The following patches are:

xfs-iradix-set-dirty-tag
xfs-iradix-use-dirty-tag-for-clustering
xfs-iradix-use-dirty-tag-for-sync

i.e. using the radix tree for finding sequential dirty inode
clusters for writeback....

It's not complete yet, because I'm trying to decide if it is better
to avoid inode writeback from ->write_inode altogether and only do
optimised ascending order inode writeback from from
xfs_sync_inodes() and/or AIL tail pushing....

> >> Why not only do:
> […]
> > This doesn't avoid the performance problem - it simply shifts it to
> > the reclaim code.  Think about it - you traverse a million inodes
> > stat()ing them all (the nightly backup). Instead of having the now
> > dirty inodes written back as you go along, you wait until memory
> > reclaim turfs them out to mark them dirty and then write them back.
> > This means memory reclaim goes *much slower* because reclaiming the
> > memory now has to wait for I/O to complete.
> 
> yep, I just read the code, do we have a way to know that we are
> unmounting ? maybe the good way to do it is to only write atime on
> unmount ? the call trace is:

	if (!(sb->s_flags & MS_ACTIVE)) {
		/* we are unmounting */
	}

Even so, we don't want unmount to take hours - especially in the
situation that unmounts that would trigger this problem often occur
in planned maintenance windows that aren't hours long.....

> I'm concerned about the overhead, with your patch, that we'll be
> writting things even if nothing has changed, not sure about how many

We'll won't write an inode if it hasn't changed. The VFS calling
mark_inode_dirty_sync() is not "nothing".

> times ->dirty_inodes is called and such. I don't like the idea to slow
> things down for everyone or force people to use realtime

I didn't suggest we force ppl to use relatime - I suggested that we
default to it behind the covers and then pay attention to all dirty
events coming form the VFS. This should result in practically no
increase in the number of atime-only inode writes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-10-24  6:08 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-21  6:21 atime not written to disk Timothy Shimmin
2008-10-21  6:49 ` Utako Kusaka
2008-10-22  8:17   ` [PATCH, RFC] " Dave Chinner
2008-10-23  2:52     ` Niv Sardi
2008-10-23  2:53     ` Niv Sardi
2008-10-23  4:20       ` Dave Chinner
2008-10-24  5:37         ` Niv Sardi
2008-10-24  6:08           ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox