[PATCH, RFC] Delayed logging of file sizes

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH, RFC] Delayed logging of file sizes
@ 2007-11-23  7:04 Lachlan McIlroy
  2007-11-25 22:59 ` David Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Lachlan McIlroy @ 2007-11-23  7:04 UTC (permalink / raw)
  To: xfs-dev, xfs-oss

[-- Attachment #1: Type: text/plain, Size: 2179 bytes --]

Here's a patch for an idea to address an issue we have with log
replay.  The problem stems from the fact that not all changes to
inodes result in transactions.  Some changes (such as file size
updates, timestamp updates) would generate too much log traffic
if we logged a transaction for every event.  So we update the inode
and flag it as dirty (ie set i_update_core/i_update_size).  If the
inode gets logged in a later transaction then the update gets
rolled into that transaction and the flag is cleared.  If the inode
gets flushed to disk before another transaction then the on-disk
version of the inode is newer than what is in the log.

On log replay we risk overwriting a newer inode on disk with an
older version of the inode from the log.  We try to prevent this with
the i_flushiter counter in the inode (the on-disk inode will have a
greater flushiter than the inode in the log) but this does not work
for newly created inode cluster buffers.  When log replay encounters
a newly allocated inode cluster buffer in the log it cannot determine
if the on-disk version of the cluster is older, newer or even valid.
Since we cannot determine if the on-disk inode cluster has been
initialised since it was logged we cannot read it to check the
i_flushiter values.  If we write out the log record anyway we risk
overwriting newer inode data, if we don't write out the log record
we risk exposing an uninitialised inode cluster.

The easy solution is to log everything so that log replay doesn't need
to check if the on-disk version is newer - it can just replay the log.
But logging everything would cause too much log traffic so this patch
is a compromise and it logs a transaction before we flush an inode to
disk only if it has changes that have not yet been logged.  With this
scheme we don't clear the i_update_core flag when flushing an inode -
we only do that when logging a transaction.  The flag is used to
determine if a transaction needs to be done.  It needs to be this way
because the log pushing code may need to flush an inode and we cannot
create a transaction at that point.

Anyway it's an idea and needs a little polishing so comments are welcome.

Lachlan

[-- Attachment #2: logsize.diff --]
[-- Type: text/x-patch, Size: 3454 bytes --]

--- fs/xfs/xfs_inode.c_1.487	2007-11-23 14:18:58.000000000 +1100
+++ fs/xfs/xfs_inode.c	2007-11-23 14:20:55.000000000 +1100
@@ -3331,21 +3331,6 @@ xfs_iflush_int(
 	dip = (xfs_dinode_t *)xfs_buf_offset(bp, ip->i_boffset);

 	/*
-	 * Clear i_update_core before copying out the data.
-	 * This is for coordination with our timestamp updates
-	 * that don't hold the inode lock. They will always
-	 * update the timestamps BEFORE setting i_update_core,
-	 * so if we clear i_update_core after they set it we
-	 * are guaranteed to see their updates to the timestamps.
-	 * I believe that this depends on strongly ordered memory
-	 * semantics, but we have that.  We use the SYNCHRONIZE
-	 * macro to make sure that the compiler does not reorder
-	 * the i_update_core access below the data copy below.
-	 */
-	ip->i_update_core = 0;
-	SYNCHRONIZE();
-
-	/*
 	 * Make sure to get the latest atime from the Linux inode.
 	 */
 	xfs_synchronize_atime(ip);
--- fs/xfs/xfs_vfsops.c_1.548	2007-11-23 14:26:52.000000000 +1100
+++ fs/xfs/xfs_vfsops.c	2007-11-23 14:18:03.000000000 +1100
@@ -1162,6 +1162,12 @@ xfs_sync_inodes(
 			if (mount_locked)
 				IPOINTER_INSERT(ip, mp);

+			if (ip->i_update_core) {
+				xfs_iunlock(ip, XFS_ILOCK_SHARED);
+				error = xfs_log_inode(ip, flags & SYNC_WAIT);
+				xfs_ilock(ip, XFS_ILOCK_SHARED);
+			}
+
 			if (flags & SYNC_WAIT) {
 				xfs_iflock(ip);
 				error = xfs_iflush(ip, XFS_IFLUSH_SYNC);
--- fs/xfs/xfs_vnodeops.c_1.726	2007-11-23 14:26:53.000000000 +1100
+++ fs/xfs/xfs_vnodeops.c	2007-11-23 14:18:03.000000000 +1100
@@ -3527,6 +3527,39 @@ xfs_rwunlock(

 int
+xfs_log_inode(
+	xfs_inode_t	*ip,
+	int		sync)
+{
+	xfs_trans_t	*tp;
+	int		error;
+
+	xfs_itrace_entry(ip);
+
+	if (ip->i_update_core == 0)
+		return 0;
+
+	tp = xfs_trans_alloc(ip->i_mount, XFS_TRANS_FSYNC_TS);
+	if ((error = xfs_trans_reserve(tp, 0,
+			XFS_FSYNC_TS_LOG_RES(ip->i_mount), 0, 0, 0))) {
+		xfs_trans_cancel(tp, 0);
+		return error;
+	}
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_synchronize_atime(ip);
+	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+	xfs_trans_ihold(tp, ip);
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	if (sync)
+		xfs_trans_set_sync(tp);
+	error = xfs_trans_commit(tp, 0);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	return error;
+}
+
+
+int
 xfs_inode_flush(
 	xfs_inode_t	*ip,
 	int		flags)
@@ -3579,6 +3612,10 @@ xfs_inode_flush(
 	if (flags & FLUSH_INODE) {
 		int	flush_flags;

+		error = xfs_log_inode(ip, flags & FLUSH_SYNC);
+		if (error)
+			return error;
+
 		if (flags & FLUSH_SYNC) {
 			xfs_ilock(ip, XFS_ILOCK_SHARED);
 			xfs_iflock(ip);
@@ -3751,6 +3788,15 @@ xfs_finish_reclaim(
 		if (ip->i_update_core ||
 		    ((ip->i_itemp != NULL) &&
 		     (ip->i_itemp->ili_format.ilf_fields != 0))) {
+
+			if (ip->i_update_core) {
+				xfs_ifunlock(ip);
+				xfs_iunlock(ip, XFS_ILOCK_EXCL);
+				error = xfs_log_inode(ip, sync_mode & XFS_IFLUSH_SYNC);
+				xfs_ilock(ip, XFS_ILOCK_EXCL);
+				xfs_iflock(ip);
+			}
+
 			error = xfs_iflush(ip, sync_mode);
 			/*
 			 * If we hit an error, typically because of filesystem
--- fs/xfs/xfs_vnodeops.h_1.4	2007-10-04 14:53:13.000000000 +1000
+++ fs/xfs/xfs_vnodeops.h	2007-10-04 18:11:51.000000000 +1000
@@ -80,4 +80,6 @@ int xfs_flushinval_pages(struct xfs_inod
 int xfs_flush_pages(struct xfs_inode *ip, xfs_off_t first,
 		xfs_off_t last, uint64_t flags, int fiopt);

+int xfs_log_inode(struct xfs_inode *ip, int sync);
+
 #endif /* _XFS_VNODEOPS_H */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-23  7:04 [PATCH, RFC] Delayed logging of file sizes Lachlan McIlroy
@ 2007-11-25 22:59 ` David Chinner
  2007-11-26  0:19   ` Lachlan McIlroy
  0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2007-11-25 22:59 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Fri, Nov 23, 2007 at 06:04:39PM +1100, Lachlan McIlroy wrote:
> The easy solution is to log everything so that log replay doesn't need
> to check if the on-disk version is newer - it can just replay the log.
> But logging everything would cause too much log traffic so this patch
> is a compromise and it logs a transaction before we flush an inode to
> disk only if it has changes that have not yet been logged.

The problem with this is that the inode will be marked dirty during the
transaction, so we'll never be able to clean an inode if we issue a
transaction during inode writeback.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-25 22:59 ` David Chinner
@ 2007-11-26  0:19   ` Lachlan McIlroy
  2007-11-26  1:10     ` David Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Lachlan McIlroy @ 2007-11-26  0:19 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs-dev, xfs-oss

David Chinner wrote:
> On Fri, Nov 23, 2007 at 06:04:39PM +1100, Lachlan McIlroy wrote:
>> The easy solution is to log everything so that log replay doesn't need
>> to check if the on-disk version is newer - it can just replay the log.
>> But logging everything would cause too much log traffic so this patch
>> is a compromise and it logs a transaction before we flush an inode to
>> disk only if it has changes that have not yet been logged.
> 
> The problem with this is that the inode will be marked dirty during the
> transaction, so we'll never be able to clean an inode if we issue a
> transaction during inode writeback.
> 

Ah, yeah, good point.  I wrote this patch back before that "dirty inode
on transaction" patch went in.  For this transaction though the changes
to the inode have already been made (ie when we set i_update_core and
called mark_inode_dirty_sync()) so there is no need to dirty it in this
transaction.  I'll keep digging.  Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-26  0:19   ` Lachlan McIlroy
@ 2007-11-26  1:10     ` David Chinner
  2007-11-26  1:29       ` Lachlan McIlroy
  0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2007-11-26  1:10 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss

On Mon, Nov 26, 2007 at 11:19:57AM +1100, Lachlan McIlroy wrote:
> David Chinner wrote:
> >On Fri, Nov 23, 2007 at 06:04:39PM +1100, Lachlan McIlroy wrote:
> >>The easy solution is to log everything so that log replay doesn't need
> >>to check if the on-disk version is newer - it can just replay the log.
> >>But logging everything would cause too much log traffic so this patch
> >>is a compromise and it logs a transaction before we flush an inode to
> >>disk only if it has changes that have not yet been logged.
> >
> >The problem with this is that the inode will be marked dirty during the
> >transaction, so we'll never be able to clean an inode if we issue a
> >transaction during inode writeback.
> 
> Ah, yeah, good point.  I wrote this patch back before that "dirty inode
> on transaction" patch went in.

Wouldn't have made aany difference - the inode woul dbe marked dirty
at transaction completion...

> For this transaction though the changes
> to the inode have already been made (ie when we set i_update_core and
> called mark_inode_dirty_sync()) so there is no need to dirty it in this
> transaction.  I'll keep digging.  Thanks.

I wouldn't worry too much about this problem right now - I'm working
on moving the dirty state into the inode radix trees so i_update_core
might even go away completely soon....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-26  1:10     ` David Chinner
@ 2007-11-26  1:29       ` Lachlan McIlroy
  2007-11-26  2:15         ` David Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Lachlan McIlroy @ 2007-11-26  1:29 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs-dev, xfs-oss

David Chinner wrote:
> On Mon, Nov 26, 2007 at 11:19:57AM +1100, Lachlan McIlroy wrote:
>> David Chinner wrote:
>>> On Fri, Nov 23, 2007 at 06:04:39PM +1100, Lachlan McIlroy wrote:
>>>> The easy solution is to log everything so that log replay doesn't need
>>>> to check if the on-disk version is newer - it can just replay the log.
>>>> But logging everything would cause too much log traffic so this patch
>>>> is a compromise and it logs a transaction before we flush an inode to
>>>> disk only if it has changes that have not yet been logged.
>>> The problem with this is that the inode will be marked dirty during the
>>> transaction, so we'll never be able to clean an inode if we issue a
>>> transaction during inode writeback.
>> Ah, yeah, good point.  I wrote this patch back before that "dirty inode
>> on transaction" patch went in.
> 
> Wouldn't have made aany difference - the inode woul dbe marked dirty
> at transaction completion...
> 
>> For this transaction though the changes
>> to the inode have already been made (ie when we set i_update_core and
>> called mark_inode_dirty_sync()) so there is no need to dirty it in this
>> transaction.  I'll keep digging.  Thanks.
> 
> I wouldn't worry too much about this problem right now - I'm working
> on moving the dirty state into the inode radix trees so i_update_core
> might even go away completely soon....
> 

Which problem?  Just the bit about dirtying the inode or will your changes
allow us to log all inode changes?

What's the motivation for moving the dirty state?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-26  1:29       ` Lachlan McIlroy
@ 2007-11-26  2:15         ` David Chinner
  2007-11-26  3:16           ` Lachlan McIlroy
  0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2007-11-26  2:15 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss

On Mon, Nov 26, 2007 at 12:29:36PM +1100, Lachlan McIlroy wrote:
> David Chinner wrote:
> >On Mon, Nov 26, 2007 at 11:19:57AM +1100, Lachlan McIlroy wrote:
> >>David Chinner wrote:
> >>>On Fri, Nov 23, 2007 at 06:04:39PM +1100, Lachlan McIlroy wrote:
> >>>>The easy solution is to log everything so that log replay doesn't need
> >>>>to check if the on-disk version is newer - it can just replay the log.
> >>>>But logging everything would cause too much log traffic so this patch
> >>>>is a compromise and it logs a transaction before we flush an inode to
> >>>>disk only if it has changes that have not yet been logged.
> >>>The problem with this is that the inode will be marked dirty during the
> >>>transaction, so we'll never be able to clean an inode if we issue a
> >>>transaction during inode writeback.
> >>Ah, yeah, good point.  I wrote this patch back before that "dirty inode
> >>on transaction" patch went in.
> >
> >Wouldn't have made aany difference - the inode woul dbe marked dirty
> >at transaction completion...
> >
> >>For this transaction though the changes
> >>to the inode have already been made (ie when we set i_update_core and
> >>called mark_inode_dirty_sync()) so there is no need to dirty it in this
> >>transaction.  I'll keep digging.  Thanks.
> >
> >I wouldn't worry too much about this problem right now - I'm working
> >on moving the dirty state into the inode radix trees so i_update_core
> >might even go away completely soon....
> >
> 
> Which problem?  Just the bit about dirtying the inode or will your changes
> allow us to log all inode changes?

Trying to change XFS to logging all updates.

> What's the motivation for moving the dirty state?

Better inode writeback clustering. i.e. it's easy to find all the dirty
inodes and then we can write them in larger contiguous chunks. The first
"hack" at this I did tracked only inodes in the AIL. Sequential create
of small files improved by about 20% with better clustering during
tail pushing operations. I'm trying to make it track all dirty inodes
at this point (via ->dirty_inode). This may mean that i_update_core
is not needed to track whether an inode needs writeback or not.

Not to mention all that horrible IPOINTER crap can get removed from 
xfs_sync_inodes() because finding dirty inodes is now a lockless radix
tree traverse based on a dirty tag lookup.

That also means the global mount inodes list can be replaced by a lockless radix
tree traverse, so we can lose another 2 pointers in the xfs_inode_t and lock
operations out of the inode get and reclaim paths.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-26  2:15         ` David Chinner
@ 2007-11-26  3:16           ` Lachlan McIlroy
  2007-11-26  5:03             ` David Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Lachlan McIlroy @ 2007-11-26  3:16 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs-dev, xfs-oss

David Chinner wrote:
> On Mon, Nov 26, 2007 at 12:29:36PM +1100, Lachlan McIlroy wrote:
>> David Chinner wrote:
>>> On Mon, Nov 26, 2007 at 11:19:57AM +1100, Lachlan McIlroy wrote:
>>>> David Chinner wrote:
>>>>> On Fri, Nov 23, 2007 at 06:04:39PM +1100, Lachlan McIlroy wrote:
>>>>>> The easy solution is to log everything so that log replay doesn't need
>>>>>> to check if the on-disk version is newer - it can just replay the log.
>>>>>> But logging everything would cause too much log traffic so this patch
>>>>>> is a compromise and it logs a transaction before we flush an inode to
>>>>>> disk only if it has changes that have not yet been logged.
>>>>> The problem with this is that the inode will be marked dirty during the
>>>>> transaction, so we'll never be able to clean an inode if we issue a
>>>>> transaction during inode writeback.
>>>> Ah, yeah, good point.  I wrote this patch back before that "dirty inode
>>>> on transaction" patch went in.
>>> Wouldn't have made aany difference - the inode woul dbe marked dirty
>>> at transaction completion...
>>>
>>>> For this transaction though the changes
>>>> to the inode have already been made (ie when we set i_update_core and
>>>> called mark_inode_dirty_sync()) so there is no need to dirty it in this
>>>> transaction.  I'll keep digging.  Thanks.
>>> I wouldn't worry too much about this problem right now - I'm working
>>> on moving the dirty state into the inode radix trees so i_update_core
>>> might even go away completely soon....
>>>
>> Which problem?  Just the bit about dirtying the inode or will your changes
>> allow us to log all inode changes?
> 
> Trying to change XFS to logging all updates.

That would be great.  But what about the increase in log traffic that has
deterred us from doing this in the past?

> 
>> What's the motivation for moving the dirty state?
> 
> Better inode writeback clustering. i.e. it's easy to find all the dirty
> inodes and then we can write them in larger contiguous chunks. The first
> "hack" at this I did tracked only inodes in the AIL. Sequential create
> of small files improved by about 20% with better clustering during
> tail pushing operations. I'm trying to make it track all dirty inodes
> at this point (via ->dirty_inode). This may mean that i_update_core
> is not needed to track whether an inode needs writeback or not.

Okay, I'm interested to see what you come up with.

> 
> Not to mention all that horrible IPOINTER crap can get removed from 
> xfs_sync_inodes() because finding dirty inodes is now a lockless radix
> tree traverse based on a dirty tag lookup.

Oh good, that macro hackery is ugly.

> 
> That also means the global mount inodes list can be replaced by a lockless radix
> tree traverse, so we can lose another 2 pointers in the xfs_inode_t and lock
> operations out of the inode get and reclaim paths.
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-26  3:16           ` Lachlan McIlroy
@ 2007-11-26  5:03             ` David Chinner
  2007-11-27  3:30               ` Lachlan McIlroy
  0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2007-11-26  5:03 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss

On Mon, Nov 26, 2007 at 02:16:34PM +1100, Lachlan McIlroy wrote:
> David Chinner wrote:
> >On Mon, Nov 26, 2007 at 12:29:36PM +1100, Lachlan McIlroy wrote:
> >>David Chinner wrote:
> >>>On Mon, Nov 26, 2007 at 11:19:57AM +1100, Lachlan McIlroy wrote:
> >>>>David Chinner wrote:
> >>>>>On Fri, Nov 23, 2007 at 06:04:39PM +1100, Lachlan McIlroy wrote:
> >>>>>>The easy solution is to log everything so that log replay doesn't need
> >>>>>>to check if the on-disk version is newer - it can just replay the log.
> >>>>>>But logging everything would cause too much log traffic so this patch
> >>>>>>is a compromise and it logs a transaction before we flush an inode to
> >>>>>>disk only if it has changes that have not yet been logged.
> >>>>>The problem with this is that the inode will be marked dirty during the
> >>>>>transaction, so we'll never be able to clean an inode if we issue a
> >>>>>transaction during inode writeback.
> >>>>Ah, yeah, good point.  I wrote this patch back before that "dirty inode
> >>>>on transaction" patch went in.
> >>>Wouldn't have made aany difference - the inode woul dbe marked dirty
> >>>at transaction completion...
> >>>
> >>>>For this transaction though the changes
> >>>>to the inode have already been made (ie when we set i_update_core and
> >>>>called mark_inode_dirty_sync()) so there is no need to dirty it in this
> >>>>transaction.  I'll keep digging.  Thanks.
> >>>I wouldn't worry too much about this problem right now - I'm working
> >>>on moving the dirty state into the inode radix trees so i_update_core
> >>>might even go away completely soon....
> >>>
> >>Which problem?  Just the bit about dirtying the inode or will your changes
> >>allow us to log all inode changes?
> >
> >Trying to change XFS to logging all updates.
> 
> That would be great.  But what about the increase in log traffic that has
> deterred us from doing this in the past?

Sorry, i wasn't particularly clear. What I mean was that i_update_core
might disappear completely with the changes I'm making.

Basically, we have three different methods of marking the inode dirty
at the moment - on  the linux inode (mark_inode_dirty[_sync]()), the
i_update_core = 1 for unlogged changes and logged changes are tracked via the
inode log item in the AIL.

One top of that, we have three different methods of flushing them - one
from the generic code for inodes dirtied by mark_inode_dirty(), one from
xfssyncd for inodes that are only dirtied by setting i_update_core = 1
and the other from the xfsaild when log tail pushing.

Ideally we should only have a single method for pushing out inodes. The first
step to that is tracking the dirty state in a single tree (the inode radix
trees). That means we have to hook ->dirty_inode() to catch all dirtying via
mark_inode_dirty[_sync]() and mark the inodes dirty in the radix tree. Then we
need to use xfs_mark_inode_dirty_sync() everywhere that we dirty the inode.

Once we have all the dirty state in the radix trees we can now get rid of
i_update_core and i_update_size - all they do is mark the inode dirty and
we don't really care about the difference between them(*) - and just use
the dirty bit in the radix tree when necessary.

To flush the dirty inodes we just do radix_tree_gang_lookup_tag_range()
calls to do ascending cluster order writeback. This will replace the
mount inode list walking in xfs_sync_inodes() and other places to find
dirty inodes.

/me puts on flame-proof suite

I'd even like to go as far as a two pass writeback algorithm; pass
one only writes data, and pass two only writes inodes. The second pass
for XFS needs to be delayed until data writeback is complete because of
delalloc and inode size updates redirtying the inode. The current
mechanism means we often do two inode writes for the one data write...

Basically, our writeback code is a mess and I want to clean it up
before we try to deal with the unlogged changes....

Cheers,

Dave.

(*) Even for FDATASYNC we should always force the log out because we may
have delayed allocation transactions still sitting in iclog buffers. This,
AFAICT, is a bug in the current implementation.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-26  5:03             ` David Chinner
@ 2007-11-27  3:30               ` Lachlan McIlroy
  2007-11-27 10:53                 ` David Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Lachlan McIlroy @ 2007-11-27  3:30 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs-dev, xfs-oss

David Chinner wrote:
> On Mon, Nov 26, 2007 at 02:16:34PM +1100, Lachlan McIlroy wrote:
>> David Chinner wrote:
>>> On Mon, Nov 26, 2007 at 12:29:36PM +1100, Lachlan McIlroy wrote:
>>>> David Chinner wrote:
>>>>> On Mon, Nov 26, 2007 at 11:19:57AM +1100, Lachlan McIlroy wrote:
>>>>>> David Chinner wrote:
>>>>>>> On Fri, Nov 23, 2007 at 06:04:39PM +1100, Lachlan McIlroy wrote:
>>>>>>>> The easy solution is to log everything so that log replay doesn't need
>>>>>>>> to check if the on-disk version is newer - it can just replay the log.
>>>>>>>> But logging everything would cause too much log traffic so this patch
>>>>>>>> is a compromise and it logs a transaction before we flush an inode to
>>>>>>>> disk only if it has changes that have not yet been logged.
>>>>>>> The problem with this is that the inode will be marked dirty during the
>>>>>>> transaction, so we'll never be able to clean an inode if we issue a
>>>>>>> transaction during inode writeback.
>>>>>> Ah, yeah, good point.  I wrote this patch back before that "dirty inode
>>>>>> on transaction" patch went in.
>>>>> Wouldn't have made aany difference - the inode woul dbe marked dirty
>>>>> at transaction completion...
>>>>>
>>>>>> For this transaction though the changes
>>>>>> to the inode have already been made (ie when we set i_update_core and
>>>>>> called mark_inode_dirty_sync()) so there is no need to dirty it in this
>>>>>> transaction.  I'll keep digging.  Thanks.
>>>>> I wouldn't worry too much about this problem right now - I'm working
>>>>> on moving the dirty state into the inode radix trees so i_update_core
>>>>> might even go away completely soon....
>>>>>
>>>> Which problem?  Just the bit about dirtying the inode or will your changes
>>>> allow us to log all inode changes?
>>> Trying to change XFS to logging all updates.
>> That would be great.  But what about the increase in log traffic that has
>> deterred us from doing this in the past?
> 
> Sorry, i wasn't particularly clear. What I mean was that i_update_core
> might disappear completely with the changes I'm making.
> 
> Basically, we have three different methods of marking the inode dirty
> at the moment - on  the linux inode (mark_inode_dirty[_sync]()), the
> i_update_core = 1 for unlogged changes and logged changes are tracked via the
> inode log item in the AIL.
> 
> One top of that, we have three different methods of flushing them - one
> from the generic code for inodes dirtied by mark_inode_dirty(), one from
> xfssyncd for inodes that are only dirtied by setting i_update_core = 1
> and the other from the xfsaild when log tail pushing.
> 
> Ideally we should only have a single method for pushing out inodes. The first
> step to that is tracking the dirty state in a single tree (the inode radix
> trees). That means we have to hook ->dirty_inode() to catch all dirtying via
> mark_inode_dirty[_sync]() and mark the inodes dirty in the radix tree. Then we
> need to use xfs_mark_inode_dirty_sync() everywhere that we dirty the inode.
Don't we already call mark_inode_dirty[_sync]() everywhere we dirty the inode?

> 
> Once we have all the dirty state in the radix trees we can now get rid of
> i_update_core and i_update_size - all they do is mark the inode dirty and
> we don't really care about the difference between them(*) - and just use
> the dirty bit in the radix tree when necessary.
If we want to check if an inode is dirty do we have to look up the dirty
bit in the tree or is there some easy way to get it from the inode?

By consolidating the different ways of dirtying an inode we lose the ability
to know why it is dirty and what action needs to be done to undirty it.
For example if the inode log item has bits set then we know we have to flush
the log otherwise there is no need.  With a general purpose dirty bit we will
have to flush regardless.  And my recent attempt to fix the log replay issue
relies on i_update_core to indicate there are unlogged changes - I don't see
how that will work with these changes.

> 
> To flush the dirty inodes we just do radix_tree_gang_lookup_tag_range()
> calls to do ascending cluster order writeback. This will replace the
> mount inode list walking in xfs_sync_inodes() and other places to find
> dirty inodes.
> 
> /me puts on flame-proof suite
> 
> I'd even like to go as far as a two pass writeback algorithm; pass
> one only writes data, and pass two only writes inodes. The second pass
> for XFS needs to be delayed until data writeback is complete because of
> delalloc and inode size updates redirtying the inode. The current
> mechanism means we often do two inode writes for the one data write...
> 
> Basically, our writeback code is a mess and I want to clean it up
> before we try to deal with the unlogged changes....
> 
> Cheers,
> 
> Dave.
> 
> (*) Even for FDATASYNC we should always force the log out because we may
> have delayed allocation transactions still sitting in iclog buffers. This,
> AFAICT, is a bug in the current implementation.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-27  3:30               ` Lachlan McIlroy
@ 2007-11-27 10:53                 ` David Chinner
  2007-11-28  0:43                   ` Lachlan McIlroy
  0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2007-11-27 10:53 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss

On Tue, Nov 27, 2007 at 02:30:25PM +1100, Lachlan McIlroy wrote:
> David Chinner wrote:
> >Sorry, i wasn't particularly clear. What I mean was that i_update_core
> >might disappear completely with the changes I'm making.
> >
> >Basically, we have three different methods of marking the inode dirty
> >at the moment - on  the linux inode (mark_inode_dirty[_sync]()), the
> >i_update_core = 1 for unlogged changes and logged changes are tracked via 
> >the
> >inode log item in the AIL.
> >
> >One top of that, we have three different methods of flushing them - one
> >from the generic code for inodes dirtied by mark_inode_dirty(), one from
> >xfssyncd for inodes that are only dirtied by setting i_update_core = 1
> >and the other from the xfsaild when log tail pushing.
> >
> >Ideally we should only have a single method for pushing out inodes. The 
> >first
> >step to that is tracking the dirty state in a single tree (the inode radix
> >trees). That means we have to hook ->dirty_inode() to catch all dirtying 
> >via
> >mark_inode_dirty[_sync]() and mark the inodes dirty in the radix tree. 
> >Then we
> >need to use xfs_mark_inode_dirty_sync() everywhere that we dirty the inode.
> Don't we already call mark_inode_dirty[_sync]() everywhere we dirty the 
> inode?

Maybe. Maybe not. Tell me - does xfs_ichgtime() do the right thing?

[ I do know the answer to this question and there's a day of kdb tracing
behind the answer. I wrote a 15 line comment to explain what was going
on in one of my patches. ]

> >Once we have all the dirty state in the radix trees we can now get rid of
> >i_update_core and i_update_size - all they do is mark the inode dirty and
> >we don't really care about the difference between them(*) - and just use
> >the dirty bit in the radix tree when necessary.
> If we want to check if an inode is dirty do we have to look up the dirty
> bit in the tree or is there some easy way to get it from the inode?

xfs_inode_clean(ip) is my preferred interface. How that is finally
implemented will be determined by how this all cleans up and what
performs the best. If lockless tree lookups don't cause performance
problems, then there is little reason to keep redundant information
around.

> By consolidating the different ways of dirtying an inode we lose the ability
> to know why it is dirty and what action needs to be done to undirty it.

The only way to undirty an inode is to write it to disk.

> For example if the inode log item has bits set then we know we have to flush
> the log otherwise there is no need.  With a general purpose dirty bit we 

No, if the log item is present and dirty (i.e. inode is in the AIL),
all it means is that we need to attach a callback to the buffer
(xfs_iflush_done) when dispatching the I/O to do processing of the
log item on I/O completion. Whether i_update_core is set or not
in this case is irrelevant - the log item state overrides that.

> will
> have to flush regardless.  And my recent attempt to fix the log replay issue
> relies on i_update_core to indicate there are unlogged changes - I don't see
> how that will work with these changes.

But your changes could not be implemented, either. You can't log the inode
to clean it - it merely transfers the writeback from one list to
another.

So, the cleaner fix is to do this - change the xfs_inode_flush()
just to unconditionally log the inode and don't do inode writeback *at
all* from there. That will catch all cases of unlogged changes and leave
inode writeback to tail-pushing or xfssyncd which can be driven by
the radix tree.

Basically, if we only ever write state to disk that we've logged,
then we are home free. That means the only time we should update
the unlogged fields - timestamps and inode size - is during a
transaction commit and not during inode writeback. If we do that
then i_update_core and i_update_size go away completely and the
only place we need track inode dirty state in XFS is when the
inodes are in the AIL list.

Even better: this removes one of the three places where we do inode
writeback and is a significant step towards:

> >I'd even like to go as far as a two pass writeback algorithm; pass
> >one only writes data, and pass two only writes inodes. The second pass
> >for XFS needs to be delayed until data writeback is complete because of
> >delalloc and inode size updates redirtying the inode. The current
> >mechanism means we often do two inode writes for the one data write...

----

What I'm trying to say is that I don't think we can cleanly fix the problem
with the current structure, so let's not waste time on it. A cleaner
fix should just fall out a simpler writeback structure.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-27 10:53                 ` David Chinner
@ 2007-11-28  0:43                   ` Lachlan McIlroy
  2007-11-28  2:01                     ` David Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Lachlan McIlroy @ 2007-11-28  0:43 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs-dev, xfs-oss

David Chinner wrote:
> On Tue, Nov 27, 2007 at 02:30:25PM +1100, Lachlan McIlroy wrote:
>> David Chinner wrote:
>>> Sorry, i wasn't particularly clear. What I mean was that i_update_core
>>> might disappear completely with the changes I'm making.
>>>
>>> Basically, we have three different methods of marking the inode dirty
>>> at the moment - on  the linux inode (mark_inode_dirty[_sync]()), the
>>> i_update_core = 1 for unlogged changes and logged changes are tracked via 
>>> the
>>> inode log item in the AIL.
>>>
>>> One top of that, we have three different methods of flushing them - one
>> >from the generic code for inodes dirtied by mark_inode_dirty(), one from
>>> xfssyncd for inodes that are only dirtied by setting i_update_core = 1
>>> and the other from the xfsaild when log tail pushing.
>>>
>>> Ideally we should only have a single method for pushing out inodes. The 
>>> first
>>> step to that is tracking the dirty state in a single tree (the inode radix
>>> trees). That means we have to hook ->dirty_inode() to catch all dirtying 
>>> via
>>> mark_inode_dirty[_sync]() and mark the inodes dirty in the radix tree. 
>>> Then we
>>> need to use xfs_mark_inode_dirty_sync() everywhere that we dirty the inode.
>> Don't we already call mark_inode_dirty[_sync]() everywhere we dirty the 
>> inode?
> 
> Maybe. Maybe not. Tell me - does xfs_ichgtime() do the right thing?
> 
> [ I do know the answer to this question and there's a day of kdb tracing
> behind the answer. I wrote a 15 line comment to explain what was going
> on in one of my patches. ]

Are you referring to the !(inode->i_state & I_LOCK) check?
Anyway, since you know the answer why don't you enlighten me?

> 
>>> Once we have all the dirty state in the radix trees we can now get rid of
>>> i_update_core and i_update_size - all they do is mark the inode dirty and
>>> we don't really care about the difference between them(*) - and just use
>>> the dirty bit in the radix tree when necessary.
>> If we want to check if an inode is dirty do we have to look up the dirty
>> bit in the tree or is there some easy way to get it from the inode?
> 
> xfs_inode_clean(ip) is my preferred interface. How that is finally
> implemented will be determined by how this all cleans up and what
> performs the best. If lockless tree lookups don't cause performance
> problems, then there is little reason to keep redundant information
> around.
I can't imagine that a tree lookup (lockless or not) would be faster
than dereferencing fields from the inode.  If keeping the inode's dirty
flags and the ones in the radix tree in sync is an issue then maybe
tree lookups are a performance hit we can live with.

> 
>> By consolidating the different ways of dirtying an inode we lose the ability
>> to know why it is dirty and what action needs to be done to undirty it.
> 
> The only way to undirty an inode is to write it to disk.
True.  I was thinking about what may need to be done before we write it
to disk such as flushing the log but that would just be dependent on
whether the inode is pinned?

> 
>> For example if the inode log item has bits set then we know we have to flush
>> the log otherwise there is no need.  With a general purpose dirty bit we 
> 
> No, if the log item is present and dirty (i.e. inode is in the AIL),
> all it means is that we need to attach a callback to the buffer
> (xfs_iflush_done) when dispatching the I/O to do processing of the
> log item on I/O completion. Whether i_update_core is set or not
> in this case is irrelevant - the log item state overrides that.
> 
>> will
>> have to flush regardless.  And my recent attempt to fix the log replay issue
>> relies on i_update_core to indicate there are unlogged changes - I don't see
>> how that will work with these changes.
> 
> But your changes could not be implemented, either. You can't log the inode
> to clean it - it merely transfers the writeback from one list to
> another.
Could not be implemented?  What was that patch I sent around then?  It was
implemented and it did work - it got XFSQA test 182 to finally pass.  But
sure it wasn't an ideal approach.  I even fixed it so that it didn't dirty
the inode during the transaction.

> 
> So, the cleaner fix is to do this - change the xfs_inode_flush()
> just to unconditionally log the inode and don't do inode writeback *at
> all* from there. That will catch all cases of unlogged changes and leave
> inode writeback to tail-pushing or xfssyncd which can be driven by
> the radix tree.
Huh?  Aren't we trying to minimize the number of transactions we do?  My
changes introduce new transactions but only when we have to.  You're saying
here that we log the inode unconditionally - how is that better?  I'm not
trying to defend my changes here (I don't care how the problem gets fixed)
- I'm just trying to understand why your suggestions are a good idea.

I do like the way it simplifies inode writeback though - a sync would
optionally log all the inodes and then just flush the log and that's it
(I think).

> 
> Basically, if we only ever write state to disk that we've logged,
> then we are home free. That means the only time we should update
> the unlogged fields - timestamps and inode size - is during a
> transaction commit and not during inode writeback. If we do that
> then i_update_core and i_update_size go away completely and the
> only place we need track inode dirty state in XFS is when the
> inodes are in the AIL list.
> 
> Even better: this removes one of the three places where we do inode
> writeback and is a significant step towards:
> 
>>> I'd even like to go as far as a two pass writeback algorithm; pass
>>> one only writes data, and pass two only writes inodes. The second pass
>>> for XFS needs to be delayed until data writeback is complete because of
>>> delalloc and inode size updates redirtying the inode. The current
>>> mechanism means we often do two inode writes for the one data write...
> 
> ----
> 
> What I'm trying to say is that I don't think we can cleanly fix the problem
> with the current structure, so let's not waste time on it. A cleaner
> fix should just fall out a simpler writeback structure.
> 
Fair enough.  I'll wait for the patches.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-28  0:43                   ` Lachlan McIlroy
@ 2007-11-28  2:01                     ` David Chinner
  2007-11-28  4:18                       ` Lachlan McIlroy
  0 siblings, 1 reply; 14+ messages in thread
From: David Chinner @ 2007-11-28  2:01 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss

On Wed, Nov 28, 2007 at 11:43:26AM +1100, Lachlan McIlroy wrote:
> David Chinner wrote:
> >On Tue, Nov 27, 2007 at 02:30:25PM +1100, Lachlan McIlroy wrote:
> >>David Chinner wrote:
> >>>Sorry, i wasn't particularly clear. What I mean was that i_update_core
> >>>might disappear completely with the changes I'm making.
> >>>
> >>>Basically, we have three different methods of marking the inode dirty
> >>>at the moment - on  the linux inode (mark_inode_dirty[_sync]()), the
> >>>i_update_core = 1 for unlogged changes and logged changes are tracked 
> >>>via the
> >>>inode log item in the AIL.
> >>>
> >>>One top of that, we have three different methods of flushing them - one
> >>>from the generic code for inodes dirtied by mark_inode_dirty(), one from
> >>>xfssyncd for inodes that are only dirtied by setting i_update_core = 1
> >>>and the other from the xfsaild when log tail pushing.
> >>>
> >>>Ideally we should only have a single method for pushing out inodes. The 
> >>>first
> >>>step to that is tracking the dirty state in a single tree (the inode 
> >>>radix
> >>>trees). That means we have to hook ->dirty_inode() to catch all dirtying 
> >>>via
> >>>mark_inode_dirty[_sync]() and mark the inodes dirty in the radix tree. 
> >>>Then we
> >>>need to use xfs_mark_inode_dirty_sync() everywhere that we dirty the 
> >>>inode.
> >>Don't we already call mark_inode_dirty[_sync]() everywhere we dirty the 
> >>inode?
> >
> >Maybe. Maybe not. Tell me - does xfs_ichgtime() do the right thing?
> >
> >[ I do know the answer to this question and there's a day of kdb tracing
> >behind the answer. I wrote a 15 line comment to explain what was going
> >on in one of my patches. ]
> 
> Are you referring to the !(inode->i_state & I_LOCK) check?

Yup.

> Anyway, since you know the answer why don't you enlighten me?

When allocating a new inode, we mark the inode dirty when first
setting the timestamps in xfs_dir_ialloc(). At the time this happens
the inode is I_LOCK|I_NEW and hence mark_inode_dirty_sync() would just
mark the inode dirty and *not* move it to the dirty list.

Because unlock_new_inode() does not check the dirty state when
removing the I_LOCK state, the inode is never moved to the dirty list
if it is already dirty (unlike __sync_single_inode()).

Further calls to mark_inode_dirty_sync() see the inode as dirty and
don't move it to the dirty list, either. Hence the inode would never
get flushed out by the generic code if we called
mark_inode_dirty_sync() in that location.

Why is it wrong? It should be checking I_NEW, not I_LOCK because all
other cases where I_LOCK might be set are covered by the code that
unlocks the inode.

> >>>Once we have all the dirty state in the radix trees we can now get rid of
> >>>i_update_core and i_update_size - all they do is mark the inode dirty and
> >>>we don't really care about the difference between them(*) - and just use
> >>>the dirty bit in the radix tree when necessary.
> >>If we want to check if an inode is dirty do we have to look up the dirty
> >>bit in the tree or is there some easy way to get it from the inode?
> >
> >xfs_inode_clean(ip) is my preferred interface. How that is finally
> >implemented will be determined by how this all cleans up and what
> >performs the best. If lockless tree lookups don't cause performance
> >problems, then there is little reason to keep redundant information
> >around.
> I can't imagine that a tree lookup (lockless or not) would be faster
> than dereferencing fields from the inode.  If keeping the inode's dirty
> flags and the ones in the radix tree in sync is an issue then maybe
> tree lookups are a performance hit we can live with.

I'm hoping to avoid this problem altogether by removing as many
"is the inode dirty" checks as possible. If inode writeback is
driven exclusively by the radix tree dirty bit via a traversal
and we only write back logged changes, then I don't think we need
to be checking if the inode is clean very often.

That is, if we see the inode in xfs_flush_inode() then it is
dirty at the linux level, so we log the inode. That makes the
inode clean at the linux layer and dirty at the XFS level, and
we know that as long as the inode remains in the AIL it is dirty. 

We only ever flush inodes based on a AIL push (which doesn't
require dirty bits) or via the syncd, which looks up dirty
inodes via the radix tree tag, and hence most of the dirty
checks on the inode can go away because we don't need to
check it during writeback now.

> >>By consolidating the different ways of dirtying an inode we lose the 
> >>ability
> >>to know why it is dirty and what action needs to be done to undirty it.
> >
> >The only way to undirty an inode is to write it to disk.
> True.  I was thinking about what may need to be done before we write it
> to disk such as flushing the log but that would just be dependent on
> whether the inode is pinned?

Right, flushing the log is only needed if it is pinned.

> >>For example if the inode log item has bits set then we know we have to 
> >>flush
> >>the log otherwise there is no need.  With a general purpose dirty bit we 
> >
> >No, if the log item is present and dirty (i.e. inode is in the AIL),
> >all it means is that we need to attach a callback to the buffer
> >(xfs_iflush_done) when dispatching the I/O to do processing of the
> >log item on I/O completion. Whether i_update_core is set or not
> >in this case is irrelevant - the log item state overrides that.
> >
> >>will
> >>have to flush regardless.  And my recent attempt to fix the log replay 
> >>issue
> >>relies on i_update_core to indicate there are unlogged changes - I don't 
> >>see
> >>how that will work with these changes.
> >
> >But your changes could not be implemented, either. You can't log the inode
> >to clean it - it merely transfers the writeback from one list to
> >another.
> Could not be implemented?  What was that patch I sent around then? 

Sorry, I missed an important work there - could not be implemented
_efficiently_.

Basically, you are logging the inode, then call xfs_iflush, which
immediately sees it pinned and forces the log. That's an extra
transaction *and* log I/O for every inode we write. That defeats all
inode clustering and and will seriously harm performance.

Also, the change fails to log changes to inodes in the same
cluster that get written out because they are dirty.

> >So, the cleaner fix is to do this - change the xfs_inode_flush()
> >just to unconditionally log the inode and don't do inode writeback *at
> >all* from there. That will catch all cases of unlogged changes and leave
> >inode writeback to tail-pushing or xfssyncd which can be driven by
> >the radix tree.
> Huh?  Aren't we trying to minimize the number of transactions we do?  My
> changes introduce new transactions but only when we have to.  You're saying
> here that we log the inode unconditionally - how is that better?  I'm not
> trying to defend my changes here (I don't care how the problem gets fixed)
> - I'm just trying to understand why your suggestions are a good idea.

Because we can log entire inode cluster's worth of changes in a single
transaction. One transaction vs one I/O - it's a decent tradeoff to
avoid this problem, esp. as we'll get improved inode writeback clustering
if we flush from the radix tree (i.e. clusters get flushed in ascending
inode number order).....

> I do like the way it simplifies inode writeback though - a sync would
> optionally log all the inodes and then just flush the log and that's it
> (I think).

Yup, pretty much.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-28  2:01                     ` David Chinner
@ 2007-11-28  4:18                       ` Lachlan McIlroy
  2007-11-28  9:07                         ` David Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Lachlan McIlroy @ 2007-11-28  4:18 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs-dev, xfs-oss

David Chinner wrote:
> On Wed, Nov 28, 2007 at 11:43:26AM +1100, Lachlan McIlroy wrote:
>> David Chinner wrote:
>>> On Tue, Nov 27, 2007 at 02:30:25PM +1100, Lachlan McIlroy wrote:
>>>> David Chinner wrote:
>>>>> Sorry, i wasn't particularly clear. What I mean was that i_update_core
>>>>> might disappear completely with the changes I'm making.
>>>>>
>>>>> Basically, we have three different methods of marking the inode dirty
>>>>> at the moment - on  the linux inode (mark_inode_dirty[_sync]()), the
>>>>> i_update_core = 1 for unlogged changes and logged changes are tracked 
>>>>> via the
>>>>> inode log item in the AIL.
>>>>>
>>>>> One top of that, we have three different methods of flushing them - one
>>>> >from the generic code for inodes dirtied by mark_inode_dirty(), one from
>>>>> xfssyncd for inodes that are only dirtied by setting i_update_core = 1
>>>>> and the other from the xfsaild when log tail pushing.
>>>>>
>>>>> Ideally we should only have a single method for pushing out inodes. The 
>>>>> first
>>>>> step to that is tracking the dirty state in a single tree (the inode 
>>>>> radix
>>>>> trees). That means we have to hook ->dirty_inode() to catch all dirtying 
>>>>> via
>>>>> mark_inode_dirty[_sync]() and mark the inodes dirty in the radix tree. 
>>>>> Then we
>>>>> need to use xfs_mark_inode_dirty_sync() everywhere that we dirty the 
>>>>> inode.
>>>> Don't we already call mark_inode_dirty[_sync]() everywhere we dirty the 
>>>> inode?
>>> Maybe. Maybe not. Tell me - does xfs_ichgtime() do the right thing?
>>>
>>> [ I do know the answer to this question and there's a day of kdb tracing
>>> behind the answer. I wrote a 15 line comment to explain what was going
>>> on in one of my patches. ]
>> Are you referring to the !(inode->i_state & I_LOCK) check?
> 
> Yup.
I've never liked that check, can we just get rid of it?

> 
>> Anyway, since you know the answer why don't you enlighten me?
> 
> When allocating a new inode, we mark the inode dirty when first
> setting the timestamps in xfs_dir_ialloc(). At the time this happens
> the inode is I_LOCK|I_NEW and hence mark_inode_dirty_sync() would just
> mark the inode dirty and *not* move it to the dirty list.
> 
> Because unlock_new_inode() does not check the dirty state when
> removing the I_LOCK state, the inode is never moved to the dirty list
> if it is already dirty (unlike __sync_single_inode()).
> 
> Further calls to mark_inode_dirty_sync() see the inode as dirty and
> don't move it to the dirty list, either. Hence the inode would never
> get flushed out by the generic code if we called
> mark_inode_dirty_sync() in that location.
> 
> Why is it wrong? It should be checking I_NEW, not I_LOCK because all
> other cases where I_LOCK might be set are covered by the code that
> unlocks the inode.
> 
>>>>> Once we have all the dirty state in the radix trees we can now get rid of
>>>>> i_update_core and i_update_size - all they do is mark the inode dirty and
>>>>> we don't really care about the difference between them(*) - and just use
>>>>> the dirty bit in the radix tree when necessary.
>>>> If we want to check if an inode is dirty do we have to look up the dirty
>>>> bit in the tree or is there some easy way to get it from the inode?
>>> xfs_inode_clean(ip) is my preferred interface. How that is finally
>>> implemented will be determined by how this all cleans up and what
>>> performs the best. If lockless tree lookups don't cause performance
>>> problems, then there is little reason to keep redundant information
>>> around.
>> I can't imagine that a tree lookup (lockless or not) would be faster
>> than dereferencing fields from the inode.  If keeping the inode's dirty
>> flags and the ones in the radix tree in sync is an issue then maybe
>> tree lookups are a performance hit we can live with.
> 
> I'm hoping to avoid this problem altogether by removing as many
> "is the inode dirty" checks as possible. If inode writeback is
> driven exclusively by the radix tree dirty bit via a traversal
> and we only write back logged changes, then I don't think we need
> to be checking if the inode is clean very often.
> 
> That is, if we see the inode in xfs_flush_inode() then it is
> dirty at the linux level, so we log the inode. That makes the
> inode clean at the linux layer and dirty at the XFS level, and
> we know that as long as the inode remains in the AIL it is dirty. 
> 
> We only ever flush inodes based on a AIL push (which doesn't
> require dirty bits) or via the syncd, which looks up dirty
> inodes via the radix tree tag, and hence most of the dirty
> checks on the inode can go away because we don't need to
> check it during writeback now.
> 
>>>> By consolidating the different ways of dirtying an inode we lose the 
>>>> ability
>>>> to know why it is dirty and what action needs to be done to undirty it.
>>> The only way to undirty an inode is to write it to disk.
>> True.  I was thinking about what may need to be done before we write it
>> to disk such as flushing the log but that would just be dependent on
>> whether the inode is pinned?
> 
> Right, flushing the log is only needed if it is pinned.
> 
>>>> For example if the inode log item has bits set then we know we have to 
>>>> flush
>>>> the log otherwise there is no need.  With a general purpose dirty bit we 
>>> No, if the log item is present and dirty (i.e. inode is in the AIL),
>>> all it means is that we need to attach a callback to the buffer
>>> (xfs_iflush_done) when dispatching the I/O to do processing of the
>>> log item on I/O completion. Whether i_update_core is set or not
>>> in this case is irrelevant - the log item state overrides that.
>>>
>>>> will
>>>> have to flush regardless.  And my recent attempt to fix the log replay 
>>>> issue
>>>> relies on i_update_core to indicate there are unlogged changes - I don't 
>>>> see
>>>> how that will work with these changes.
>>> But your changes could not be implemented, either. You can't log the inode
>>> to clean it - it merely transfers the writeback from one list to
>>> another.
>> Could not be implemented?  What was that patch I sent around then? 
> 
> Sorry, I missed an important work there - could not be implemented
> _efficiently_.
> 
> Basically, you are logging the inode, then call xfs_iflush, which
> immediately sees it pinned and forces the log. That's an extra
> transaction *and* log I/O for every inode we write. That defeats all
> inode clustering and and will seriously harm performance.
I didn't see another way around it.  We only need to force the log for
pinned inodes if it is a sync writeback, otherwise we can just try again
later.

> 
> Also, the change fails to log changes to inodes in the same
> cluster that get written out because they are dirty.
That's where it all sort of falls apart.  I didn't want to log the inode
in xfs_iflush_int() because we have the flush lock held and I was pretty
sure logging a transaction with the flush lock held would be a bad idea.
That's why I specifically removed the code that resets i_update_core in
xfs_iflush_int() - so that other inodes in the same cluster will still be
flagged as having unlogged changes even after the inodes have been synced
to disk.  But as I said it was an idea that needed some polishing.

> 
>>> So, the cleaner fix is to do this - change the xfs_inode_flush()
>>> just to unconditionally log the inode and don't do inode writeback *at
>>> all* from there. That will catch all cases of unlogged changes and leave
>>> inode writeback to tail-pushing or xfssyncd which can be driven by
>>> the radix tree.
>> Huh?  Aren't we trying to minimize the number of transactions we do?  My
>> changes introduce new transactions but only when we have to.  You're saying
>> here that we log the inode unconditionally - how is that better?  I'm not
>> trying to defend my changes here (I don't care how the problem gets fixed)
>> - I'm just trying to understand why your suggestions are a good idea.
> 
> Because we can log entire inode cluster's worth of changes in a single
> transaction. One transaction vs one I/O - it's a decent tradeoff to
> avoid this problem, esp. as we'll get improved inode writeback clustering
> if we flush from the radix tree (i.e. clusters get flushed in ascending
> inode number order).....
That should help a lot but it will use even more space in the log - quite a
lot more if just one inode in the cluster needs to be logged.  Do you plan
to do this in the write_inode path?  If so we'll have inodes that have been
logged (with a previous cluster) that still have I_DIRTY set.  When these
inodes go through the write_inode path we'll need to skip the transaction.

> 
>> I do like the way it simplifies inode writeback though - a sync would
>> optionally log all the inodes and then just flush the log and that's it
>> (I think).
> 
> Yup, pretty much.
> 
> Cheers,
> 
> Dave.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC] Delayed logging of file sizes
  2007-11-28  4:18                       ` Lachlan McIlroy
@ 2007-11-28  9:07                         ` David Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: David Chinner @ 2007-11-28  9:07 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss

On Wed, Nov 28, 2007 at 03:18:54PM +1100, Lachlan McIlroy wrote:
> David Chinner wrote:
> >>>[ I do know the answer to this question and there's a day of kdb tracing
> >>>behind the answer. I wrote a 15 line comment to explain what was going
> >>>on in one of my patches. ]
> >>Are you referring to the !(inode->i_state & I_LOCK) check?
> >
> >Yup.
> I've never liked that check, can we just get rid of it?

No - removing the check is what lead me to understand why it
was necessary.

> >Sorry, I missed an important work there - could not be implemented
> >_efficiently_.
> >
> >Basically, you are logging the inode, then call xfs_iflush, which
> >immediately sees it pinned and forces the log. That's an extra
> >transaction *and* log I/O for every inode we write. That defeats all
> >inode clustering and and will seriously harm performance.
> I didn't see another way around it.  We only need to force the log for
> pinned inodes if it is a sync writeback, otherwise we can just try again
> later.

But we don't do that right now - we call xfs_ipinwait() in xfs_iflush()
which forces the log.

> >Also, the change fails to log changes to inodes in the same
> >cluster that get written out because they are dirty.
> That's where it all sort of falls apart.  I didn't want to log the inode
> in xfs_iflush_int() because we have the flush lock held and I was pretty
> sure logging a transaction with the flush lock held would be a bad idea.
> That's why I specifically removed the code that resets i_update_core in
> xfs_iflush_int() - so that other inodes in the same cluster will still be
> flagged as having unlogged changes even after the inodes have been synced
> to disk.  But as I said it was an idea that needed some polishing.

It's messy, and if we are logging changes then we should never write
to disk unlogged changes....

> >>>So, the cleaner fix is to do this - change the xfs_inode_flush()
> >>>just to unconditionally log the inode and don't do inode writeback *at
> >>>all* from there. That will catch all cases of unlogged changes and leave
> >>>inode writeback to tail-pushing or xfssyncd which can be driven by
> >>>the radix tree.
> >>Huh?  Aren't we trying to minimize the number of transactions we do?  My
> >>changes introduce new transactions but only when we have to.  You're 
> >>saying
> >>here that we log the inode unconditionally - how is that better?  I'm not
> >>trying to defend my changes here (I don't care how the problem gets fixed)
> >>- I'm just trying to understand why your suggestions are a good idea.
> >
> >Because we can log entire inode cluster's worth of changes in a single
> >transaction. One transaction vs one I/O - it's a decent tradeoff to
> >avoid this problem, esp. as we'll get improved inode writeback clustering
> >if we flush from the radix tree (i.e. clusters get flushed in ascending
> >inode number order).....
> That should help a lot but it will use even more space in the log - quite a
> lot more if just one inode in the cluster needs to be logged.

32 inodes * 100 bytes for the inode core - 3k per inode is we log the
entire cluster. But we know what is dirty and what isn't, so that's worst
case. i.e. we only log those that are dirty. It's not log space I'm
worried about here - it's the transaction overhead....

Still, doing it as a cluster is probably premature optimisation. Lets
see what logging only the dirty inodes gets us and go from there.

> Do you plan
> to do this in the write_inode path?  If so we'll have inodes that have been
> logged (with a previous cluster) that still have I_DIRTY set.

I'll cross that one if we need to - we can probably just clear it if
the inode is not otherwise dirty...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2007-11-28  9:07 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-11-23  7:04 [PATCH, RFC] Delayed logging of file sizes Lachlan McIlroy
2007-11-25 22:59 ` David Chinner
2007-11-26  0:19   ` Lachlan McIlroy
2007-11-26  1:10     ` David Chinner
2007-11-26  1:29       ` Lachlan McIlroy
2007-11-26  2:15         ` David Chinner
2007-11-26  3:16           ` Lachlan McIlroy
2007-11-26  5:03             ` David Chinner
2007-11-27  3:30               ` Lachlan McIlroy
2007-11-27 10:53                 ` David Chinner
2007-11-28  0:43                   ` Lachlan McIlroy
2007-11-28  2:01                     ` David Chinner
2007-11-28  4:18                       ` Lachlan McIlroy
2007-11-28  9:07                         ` David Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox