public inbox for linux-xfs@vger.kernel.org
* XFS metadata flushing design - current and future
@ 2011-08-27  8:03 Christoph Hellwig
  2011-08-29  1:01 ` Dave Chinner
  2011-09-09 22:31 ` Stewart Smith
  0 siblings, 2 replies; 9+ messages in thread
From: Christoph Hellwig @ 2011-08-27  8:03 UTC (permalink / raw)
  To: xfs

Here is a little writeup I did about how we handle dirty metadata
flushing in XFS currently, and how we can improve on it in the
relatively short term:


---
Metadata flushing in XFS
========================

This document describes the current state of the handling of dirty XFS
in-core metadata, and how it gets flushed to disk, as well as ideas on how
to simplify it in the future.


Buffers
-------

All metadata in XFS is read and written using buffers as the lowest layer.
There are two ways to write a buffer back to disk: delwri and sync.
Delwri means the buffer gets added to a delayed write list, which a
background thread writes back periodically or when forced to.  Synchronous
writes mean the buffer is written back immediately, and the caller waits
for completion synchronously.
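The two paths can be pictured with a toy userspace model.  This is purely
illustrative - the names (Buffer, delwri_queue, background_flush) are made
up for the sketch and are not the XFS buffer cache API:

```python
# Toy model of the two buffer write-back paths: delwri queues the buffer
# for a background pass, sync writes it out immediately.

class Buffer:
    def __init__(self, blockno):
        self.blockno = blockno
        self.dirty = True
        self.on_disk = False

    def write(self):
        # Stand-in for the actual I/O submission and completion.
        self.on_disk = True
        self.dirty = False

delwri_queue = []

def delwri_add(buf):
    # Delwri: just queue the buffer; a background thread writes it later.
    if buf not in delwri_queue:
        delwri_queue.append(buf)

def background_flush():
    # One periodic pass of the background writer thread.
    while delwri_queue:
        delwri_queue.pop(0).write()

def sync_write(buf):
    # Sync: write immediately, "waiting" for completion inline.
    buf.write()

a, b = Buffer(8), Buffer(16)
delwri_add(a)        # a stays dirty until the background pass runs
sync_write(b)        # b is on disk as soon as the call returns
assert a.dirty and b.on_disk
background_flush()
assert a.on_disk and not delwri_queue
```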

Logging and the Active Item List (AIL)
--------------------------------------

The prime method of metadata writeback in XFS is by logging the changes
into the transaction log, and writing back the changes to the original
location in the background.  The prime data structure to drive the
asynchronous write back is the Active Item List or AIL.  The AIL contains
a list of all changes in the log that need to be written back, ordered
by the time they were committed to the log using the Log Sequence
Number (LSN).  The AIL is periodically pushed out to try to move the
log tail LSN forward.  In addition the sync worker periodically attempts
to push out all items in the AIL.
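As a toy sketch of that structure (hypothetical names, not the kernel data
structures): items sit in the AIL sorted by commit LSN, the oldest item pins
the log tail, and pushing writes back everything up to a target LSN so the
tail can move forward:

```python
# Toy model of the AIL: LSN-ordered list of dirty items; pushing
# removes (i.e. writes back) the oldest items, moving the tail LSN.
import bisect

ail = []  # sorted list of (lsn, item name)

def ail_insert(lsn, name):
    bisect.insort(ail, (lsn, name))

def ail_push(target_lsn):
    # Write back every item committed at or below target_lsn.
    while ail and ail[0][0] <= target_lsn:
        ail.pop(0)

def log_tail_lsn():
    # The log tail is pinned by the oldest item still dirty in the AIL.
    return ail[0][0] if ail else None

ail_insert(103, "inode 42")
ail_insert(101, "agf 0")
ail_insert(102, "dquot 7")
assert log_tail_lsn() == 101
ail_push(102)                   # periodic push to a target LSN
assert log_tail_lsn() == 103    # tail moved forward
ail_push(10**9)                 # sync worker: push out everything
assert log_tail_lsn() is None
```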

Non-transaction metadata updates
--------------------------------

XFS still has a few places where it updates metadata non-transactionally.

The prime causes of non-transactional metadata updates are timestamps in the
inode, and inode size updates from extending writes.  These are handled
by marking the inode dirty in the VFS and XFS inodes, and either relying
on transactional updates to piggy-back these updates, or on the VFS
periodic writeback thread to call into the ->write_inode method in
XFS to write these changes back.  ->write_inode either starts delwri
buffer writeback on the inode, or starts a new transaction to log
the inode core containing these changes.

The dquot structures may be scheduled for delwri writeback after a
quota check during an unclean mount.

Extended attribute payloads that are stored outside the main attribute
btree are written back synchronously using buffers.

New allocation group headers written during a filesystem resizing are
written synchronously using buffers.

The superblock is written synchronously using buffers during umount
and sync operations.

Log recovery writes back various pieces of metadata synchronously
or using delwri buffers.


Other flushing methods
----------------------

For historical reasons we still have a few places that flush XFS metadata
using methods other than logging and the AIL or explicit synchronous
or delwri writes.

Filesystem freezing loops over all inodes in the system to flush out
inodes marked dirty directly using xfs_iflush.

The quotacheck code marks dquots dirty, just to flush them at the end of
the quotacheck operation.

The periodic and explicit sync code walks through all dquots and writes
back all dirty dquots directly.


Future directions
-----------------

We should get rid of both the reliance on the VFS dirty inode writeback
tracking, and XFS-internal non-AIL metadata flushing.

To get rid of the VFS writeback we'll just need to log all time stamp and size
updates explicitly when they happen.  This could be done today, but the
overhead for frequent transactions in that area is deemed too high, especially
with delayed logging enabled.  We plan to deprecate the non-delaylog mode
by Linux 3.3, and introduce a new fast-path for inode core updates that
will allow us to use direct logging for these updates without introducing
large overhead.

The explicit inode flushing using xfs_sync_attr looks like an attempt to
make sure we do not have any inodes in the AIL when freezing a filesystem.
A better replacement would be a call into the AIL code that allows us to
completely empty the AIL before a freeze.

The explicit quota flushing needs a bit more work.  First, quotacheck needs
to be converted to queue up inodes to the delwri list immediately when
updating the dquot for each inode.  Second, the code in xfs_qm_scall_setqlim
that attaches a dquot to the transaction, but marks it dirty manually instead
of through the transaction interface, needs a detailed audit.  After this
we should be able to get rid of all explicit xfs_qm_sync calls.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS metadata flushing design - current and future
  2011-08-27  8:03 XFS metadata flushing design - current and future Christoph Hellwig
@ 2011-08-29  1:01 ` Dave Chinner
  2011-08-29  6:33   ` Christoph Hellwig
  2011-08-29 12:33   ` Christoph Hellwig
  2011-09-09 22:31 ` Stewart Smith
  1 sibling, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2011-08-29  1:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Sat, Aug 27, 2011 at 04:03:21AM -0400, Christoph Hellwig wrote:
> Here is a little writeup I did about how we handle dirty metadata
> flushing in XFS currently, and how we can improve on it in the
> relatively short term:
> 
> 
> ---
> Metadata flushing in XFS
> ========================
> 
> This document describes the state of the handling of dirty XFS in-core
> metadata, and how it gets flushed to disk, as well as ideas how to
> simplify it in the future.
> 
> 
> Buffers
> -------
> 
> All metadata is XFS is read and written using buffers as the lowest layer.
> There are two ways to write a buffer back to disk: delwri and sync.
> Delwri means the buffers gets added to a delayed write list, which a
> background thread writes back periodically or when forced to.  Synchronous
> writes means the buffer is written back immediately, and the callers waits
> for completion synchronously.

Right, that's how buffers are flushed, but for some metadata there
is a layer above this - the in-memory object that needs to be
flushed to the buffer before the buffer can be written.  Inodes and
dquots fall into this category, so describing how they are flushed
would also be a good idea.  Something like:

----

High Level Objects
------------------

Some objects are logged directly when changed, rather than modified
in buffers first.  When these items are written back, they first need
to be flushed to the backing buffer, and then IO issued on the
backing buffer.  These objects can be written in two ways: delwri and
sync.

Delwri means the object is locked and written to the backing buffer,
and the buffer is then written via its delwri mechanism.  The object
remains locked (and so cannot be written to the buffer again) until
the backing buffer is written to disk and marked clean.  This allows
multiple objects in the one buffer to be written at different times
but be cleaned in a single buffer IO.

Sync means the object is locked and written to the backing buffer,
and the buffer is written immediately to disk via its sync
mechanism.  The object remains locked until the buffer IO completes.

Objects need to be attached to the buffer with a callback so that
they can be updated and unlocked when buffer IO completes. Buffer IO
completion will walk the callback list to do this processing.
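A toy model of that object-to-buffer relationship (illustrative names only,
not the kernel structures): several objects flush into one backing buffer,
stay flush-locked, and are cleaned and unlocked by the completion callback
walk when the single buffer IO finishes:

```python
# Toy model: objects flush-lock when written into a backing buffer;
# buffer IO completion walks the attached callbacks to unlock them all.

class Obj:
    def __init__(self, name):
        self.name = name
        self.dirty = True
        self.flush_locked = False

class BackingBuffer:
    def __init__(self):
        self.callbacks = []

    def flush_object(self, obj):
        # Copy the object's state into the buffer and flush-lock it so
        # it cannot be flushed again until the buffer IO completes.
        assert not obj.flush_locked
        obj.flush_locked = True
        self.callbacks.append(obj)

    def io_complete(self):
        # Buffer IO done: walk the callback list, cleaning and
        # unlocking every object that was flushed into this buffer.
        for obj in self.callbacks:
            obj.dirty = False
            obj.flush_locked = False
        self.callbacks = []

buf = BackingBuffer()
inodes = [Obj("inode %d" % i) for i in range(3)]
for ino in inodes:
    buf.flush_object(ino)     # flushed at different times...
buf.io_complete()             # ...cleaned by a single buffer IO
assert all(not i.dirty and not i.flush_locked for i in inodes)
```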

----


> Logging and the Active Item List (AIL)
> --------------------------------------
> 
> The prime method of metadata writeback in XFS is by logging the changes
> into the transaction log, and writing back the changes to the original
> location in the background.  The prime data structure to drive the
> asynchronous write back is the Active Item List or AIL.  The AIL contains
> a list of all changes in the log that need to be written back, ordered
> by on the time they were committed to the log using the Log Sequence
> Number (LSN).  The AIL is periodically pushed out to try to move the
> log tail LSN forward.  In addition periodically the sync worker attempts
> to push out all items in the AIL.
> 
> Non-transaction metadata updates
> --------------------------------
> 
> XFS still has a few updates where update metadata non-transactional.
> 
> The prime cause for non-transaction metadata updates are timestamps in the
> inode, and inode size updates from extending writes.  These are handled
> by marking the inode dirty in the VFS and XFS inodes, and either relying
> on transactional updates to piggy-back these updates, or on the VFS
> periodic writeback thread to call into the ->write_inode method in
> XFS to write these changes back.  ->write_inode either starts delwri
> buffer writeback on the inode, or starts a new transaction to log
> the inode core containing these changes.
> 
> The dquot structures may be scheduled for delwri writeback after a
> quota check during an unclean mount.
> 
> Extended attribute payloads that are stored outside the main attribute
> btree are written back synchronously using buffers.
> 
> New allocation group headers written during a filesystem resizing are
> written synchronously using buffers.
> 
> The superblock is written synchronously using buffers during umount
> and sync operations.
> 
> Log recovery writes back various pieces of metadata synchronously
> or using delwri buffers.
> 
> 
> Other flushing methods
> ----------------------
> 
> For historical reasons we still have a few places that flush XFS metadata
> using others methods than logging and the AIL or explicit synchronous
> or delwri writes.
> 
> Filesystem freezing loops over all inodes in the system to flush out
> inodes marked dirty directly using xfs_iflush.
> 
> The quotacheck code marks dquots dirty, just to flush them at the end of
> the quotacheck operation.

This is safe because the filesystem isn't "open for business" until
the quotacheck completes. The quotacheck needed flags aren't cleared
until all the updates are on disk, so this doesn't need to be done
transactionally.

> The periodic and explicit sync code walks through all dqouts and writes
> back all dirty dquots directly.
>
> Future directions
> -----------------
> 
> We should get rid of both the reliance of the VFS writeback tracking, and
> XFS-internal non-AIL metadata flushing.

I'm assuming you mean VFS level dirty inode writeback tracking, not
dirty page cache tracking?

> To get rid of the VFS writeback we'll just need to log all time stamps and size
> updates explicitly when they happens.  This could be done today, but the
> overhead for frequent transactions in that area is deemed to high, especially
> with delayed logging enabled.  We plan to deprecate the non-delaylog mode
> by Linux 3.3, and introduce a new fast-path for inode core updates that
> will allow to use direct logging for this updates without introducing
> large overhead.
> 
> The explicit inode flushing using xfs_sync_attr looks like an attempt to
> make sure we do not have any inodes in the AIL when freezing a filesystem.
> A better replacement would be a call into the AIL code that allows to
> completely empty the AIL before a freeze.

Agreed, good simplification, and would enable us to get rid of some
of the kludgy code in freeze.

> The explicit quota flushing needs a bit more work.  First quota check needs
> to be converted to queue up inodes to the delwri list immediately when
> updating the dquot for each inode.  Second the code in xfs_qm_scall_setqlim
> that attached a dquot to the transaction, but marks it dirty manually instead
> of through the transaction interface needs a detailed audit.  After this
> we should be able to get rid of all explicit xfs_qm_sync calls.

Yes, it would be great to remove the need for explicit quota
flushing.


Another thing I've noticed is that AIL pushing of dirty inodes can
be quite inefficient from a CPU usage perspective. Inodes that have
already been flushed to their backing buffer result in an
IOP_PUSHBUF call when the AIL tries to push them. Pushing the buffer
requires a buffer cache search, followed by a delwri list promotion.
However, the initial xfs_iflush() call on a dirty inode also
clusters all the other remaining dirty inodes in the cluster into
the same backing buffer. When the AIL hits those other dirty inodes,
they are already flush locked and so we do an IOP_PUSHBUF call on
every other dirty inode. So on a completely dirty inode cluster, we
do ~30 needless buffer cache searches and buffer delwri promotions,
all for the same buffer. That's a lot of extra work we don't need to
be doing - ~10% of the buffer cache lookups come from IOP_PUSHBUF
under inode intensive metadata workloads:

	xs_push_ail_pushbuf...       5665434
	xs_iflush_count.......        173551
	xs_icluster_flushcnt..        171554
	xs_icluster_flushinode       5316393
	pb_get................      63362891

This shows we've done 171k explicit inode cluster flushes when
writing inodes, and we've clustered 5.3M inodes in those cluster
writes. We also have 5.6M IOP_PUSHBUF calls, which indicates most
of them are coming from finding flush locked inodes. There have been
63M buffer cache lookups, so we're causing roughly 8% of buffer
cache lookups just through flushing inodes from the AIL.
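A quick back-of-envelope check of those numbers, under the stated assumption
of roughly 32 inodes per cluster buffer:

```python
# Sanity-check the stats quoted above. The 32-inodes-per-cluster figure
# is an assumption for the arithmetic, matching the "~30 needless
# searches" estimate.

inodes_per_cluster = 32

# One xfs_iflush() clusters the whole buffer; the AIL then finds every
# remaining inode flush locked and does a pushbuf (buffer cache search
# plus delwri promotion) for each of them.
redundant_lookups = inodes_per_cluster - 1
assert redundant_lookups == 31

# The counters from the workload:
pushbuf_calls = 5_665_434
buffer_lookups = 63_362_891
share = pushbuf_calls / buffer_lookups
assert 0.08 < share < 0.10          # "roughly 8%" of all lookups

clustered = 5_316_393
explicit_flushes = 171_554
assert round(clustered / explicit_flushes) == 31   # ~31 inodes clustered per flush
```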

Also, larger inode buffers to reduce the amount of IO we do to both
read and write inodes might also provide significant benefits by
reducing the amount of IO and number of buffers we need to track in
the cache...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS metadata flushing design - current and future
  2011-08-29  1:01 ` Dave Chinner
@ 2011-08-29  6:33   ` Christoph Hellwig
  2011-08-29 12:33   ` Christoph Hellwig
  1 sibling, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2011-08-29  6:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Aug 29, 2011 at 11:01:49AM +1000, Dave Chinner wrote:
> Right, that's how buffers are flushed, but for some metadata there
> is a layer above this - the in-memory object that needs to be
> flushed to the buffer before the buffer can be written. Inodes and
> dquots fall into this category, so describing how they are flushed
> would also be a good idea. something like:

Sounds fine.

> Delwri means the object is locked and written to the backing buffer,
> and the buffer is then written via it's delwri mechanism. The object
> remains locked (and so cannot be written to the buffer again) until
> the backing buffer is written to disk and marked clean. This allows
> multiple objects in the one buffer to be written at different times
> but be cleaned in a single buffer IO.

Locked is a bit too simple here - we keep the flush lock, but not the
main object lock.

> > inodes marked dirty directly using xfs_iflush.
> > 
> > The quotacheck code marks dquots dirty, just to flush them at the end of
> > the quotacheck operation.
> 
> This is safe because the filesystem isn't "open for business" until
> the quotacheck completes. The quotacheck needed flags aren't cleared
> until all the updates are on disk, so this doesn't need tobe done
> transactionally.

Yes, it's safe - but another different layer of dirty metadata to track.

> > 
> > We should get rid of both the reliance of the VFS writeback tracking, and
> > XFS-internal non-AIL metadata flushing.
> 
> I'm assuming you mean VFS level dirty inode writeback tracking, not
> dirty page cache tracking?

Yes, I'll clarify it.


* Re: XFS metadata flushing design - current and future
  2011-08-29  1:01 ` Dave Chinner
  2011-08-29  6:33   ` Christoph Hellwig
@ 2011-08-29 12:33   ` Christoph Hellwig
  2011-08-30  1:28     ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2011-08-29 12:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Mon, Aug 29, 2011 at 11:01:49AM +1000, Dave Chinner wrote:
> Another thing I've noticed is that AIL pushing of dirty inodes can
> be quite inefficient from a CPU usage perspective. Inodes that have
> already been flushed to their backing buffer results in a
> IOP_PUSHBUF call when the AIL tries to push them. Pushing the buffer
> requires a buffer cache search, followed by a delwri list promotion.
> However, the initial xfs_iflush() call on a dirty inode also
> clusters all the other remaining dirty inodes in the buffer to the
> buffer. When the AIl hits those other dirty inodes, they are already
> locked and so we do a IOP_PUSHBUF call. On every other dirty inode.
> So on a completely dirty inode cluster, we do ~30 needless buffer
> cache searches and buffer delwri promotions all for the same buffer.
> That's a lot of extra work we don't need to be doing - ~10% of the
> buffer cache lookups come from IOP_PUSHBUF under inode intensive
> metadata workloads:

One really stupid thing we do in that area is that the xfs_iflush from
xfs_inode_item_push puts the buffer at the end of the delwri list and
expects it to be aged, just so that the first xfs_inode_item_pushbuf
can promote it to the front of the list.  Now that we mostly write
metadata from AIL pushing we should not do an additional pass of aging
on that - that's what we already have the AIL for.  Once we do that we
should be able to remove the buffer promotion and make the pushbuf a
no-op.  The only thing this might interact with in a not so nice way
would be inode reclaim if it still did delwri writes with the delay
period, but we might be able to get away without that one as well.

> Also, larger inode buffers to reduce the amount of IO we do to both
> read and write inodes might also provide significant benefits by
> reducing the amount of IO and number of buffers we need to track in
> the cache...

We could try to go for large in-core clusters.  That is, try to always
allocate N aligned inode clusters together, and always read/write
clusters in that alignment together if possible.


* Re: XFS metadata flushing design - current and future
  2011-08-29 12:33   ` Christoph Hellwig
@ 2011-08-30  1:28     ` Dave Chinner
  2011-08-30  5:09       ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2011-08-30  1:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Aug 29, 2011 at 08:33:18AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 29, 2011 at 11:01:49AM +1000, Dave Chinner wrote:
> > Another thing I've noticed is that AIL pushing of dirty inodes can
> > be quite inefficient from a CPU usage perspective. Inodes that have
> > already been flushed to their backing buffer results in a
> > IOP_PUSHBUF call when the AIL tries to push them. Pushing the buffer
> > requires a buffer cache search, followed by a delwri list promotion.
> > However, the initial xfs_iflush() call on a dirty inode also
> > clusters all the other remaining dirty inodes in the buffer to the
> > buffer. When the AIl hits those other dirty inodes, they are already
> > locked and so we do a IOP_PUSHBUF call. On every other dirty inode.
> > So on a completely dirty inode cluster, we do ~30 needless buffer
> > cache searches and buffer delwri promotions all for the same buffer.
> > That's a lot of extra work we don't need to be doing - ~10% of the
> > buffer cache lookups come from IOP_PUSHBUF under inode intensive
> > metadata workloads:
> 
> One really stupid thing we do in that area is that the xfs_iflush from
> xfs_inode_item_push puts the buffer at the end of the delwri list and
> expects it to be aged, just so that the first xfs_inode_item_pushbuf
> can promote it to the front of the list.  Now that we mostly write
> metadata from AIL pushing

Actually, when we have lots of data to be written, we still call
xfs_iflush() a lot from .write_inode. That's where the delayed write
buffer aging has significant benefit.

> we should not do an additional pass of aging
> on that - that's what we already the AIL for.  Once we did that we
> should be able to remove the buffer promotion and make the pushuf a
> no-op. 

If we remove the promotion, then it can be up to 15s before the IO
is actually dispatched, resulting in long stalls until the buffer
ages out. That's the problem that I introduced the promotion to fix.
Yes, the inode writeback code has changed a bit since then, but not
significantly enough to remove that problem.

> The only thing this might interact with in a not so nice way
> would be inode reclaim if it still did delwri writes with the delay
> period, but we might be able to get away without that one as well.

Right - if we only ever call xfs_iflush() from IOP_PUSH() and
shrinker based inode reclaim, then I think this problem mostly
goes away. We'd still need the shrinker path to be able to call 
xfs_iflush() for synchronous inode cluster writeback as that is the
method by which we ensure memory reclaim makes progress....

To make this work, rather than doing the current "pushbuf" operation
on inodes, let's make xfs_iflush() return the backing buffer locked
rather than submitting IO on it directly. Then the caller can submit
the buffer IO however it wants. That way reclaim can do synchronous
IO, and for the AIL pusher we can add the buffer to a local list
that we can then submit for IO rather than the current xfsbufd
wakeup call we do. All inodes we see flush locked in IOP_PUSH we can
then ignore, knowing that they are either currently under IO or on
the local list pending IO submission. Either way, we don't need to
try a pushbuf operation on flush locked inodes.

[ IOWs, the xfs_inode_item_push() simply locks the inode and returns
PUSHBUF if it needs flushing, then xfs_inode_item_pushbuf() calls
xfs_iflush() and gets the dirty buffer back, which it then adds
to a local dispatch list rather than submitting IO directly. ]
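A toy sketch of that proposed flow (hypothetical names throughout, not the
actual kernel interfaces): the flush returns the locked backing buffer
instead of issuing IO, flush locked inodes are simply skipped, and the AIL
pusher batches the buffers on a local dispatch list:

```python
# Toy model of the proposal: one flush per cluster buffer, no redundant
# pushbuf lookups, batched submission at the end of the push pass.

class Buf:
    def __init__(self, blockno):
        self.blockno = blockno
        self.submitted = False

def iflush(inode, all_inodes, buf):
    # Flush the inode and cluster every other dirty inode backed by the
    # same buffer, flush-locking them all; return the buffer locked and
    # let the *caller* decide how to submit the IO.
    for other in all_inodes:
        if other["buf"] is buf:
            other["flush_locked"] = True
    return buf

def ail_push_pass(dirty_inodes):
    dispatch = []
    for ino in dirty_inodes:
        if ino["flush_locked"]:
            continue    # already under IO or on the dispatch list
        dispatch.append(iflush(ino, dirty_inodes, ino["buf"]))
    # Submit the whole batch once the pass is complete.
    for buf in dispatch:
        buf.submitted = True
    return dispatch

shared = Buf(64)
inodes = [{"buf": shared, "flush_locked": False} for _ in range(32)]
done = ail_push_pass(inodes)
assert len(done) == 1    # one flush, one submission for the whole cluster
assert shared.submitted
```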

---

FWIW, what this discussion is indicating to me is that we should
timestamp entries in the AIL so we can push it to a time threshold
as well as a LSN threshold.

That is, whenever we insert a new entry into the AIL, we not only
update the LSN and position, we also update the insert time of the
buffer. We can then update the time based threshold every few
seconds and have the AIL wq walker walk until the time threshold is
reached pushing items to disk. This would also change the xfssyncd
behaviour from "push the entire AIL" to "push anything older than
30s", which is much more desirable from a sustained workload POV.
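Extending the toy AIL sketch with per-item insert timestamps (again,
illustrative names only) shows how an age threshold would work:

```python
# Toy AIL with (lsn, insert_time) per entry, pushed to an age threshold.
import bisect

ail = []  # sorted (lsn, insert_time, item name)

def ail_insert(lsn, now, name):
    bisect.insort(ail, (lsn, now, name))

def ail_push_older_than(now, max_age):
    # LSN order is also (roughly) insertion order, so walking from the
    # head until we meet a young enough item pushes out all old items.
    while ail and now - ail[0][1] >= max_age:
        ail.pop(0)

ail_insert(100, 0, "old inode")
ail_insert(101, 25, "newer inode")
ail_push_older_than(now=40, max_age=30)   # "push anything older than 30s"
assert [e[2] for e in ail] == ["newer inode"]
```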

If we then modify the way all buffers are treated such that
the AIL is the "delayed write list" (i.e. we take a reference count
to the buffer when it is first logged and not in the AIL) and the
pushbuf operations simply add the buffer to the local dispatch list,
we can get rid of the delwri buffer list altogether. That also gets
rid of the xfsbufd, too, as the AIL handles the aging, reference
counting and writeback of the buffers entirely....

> > Also, larger inode buffers to reduce the amount of IO we do to both
> > read and write inodes might also provide significant benefits by
> > reducing the amount of IO and number of buffers we need to track in
> > the cache...
> 
> We could try to get for large in-core clusters.  That is try to always
> allocate N aligned inode clusters together, and always read/write
> clusters in that alignment together if possible.

Well, just increasing the cluster buffer to cover an entire inode
chunk for common inode sizes (16k for 256 byte inodes and 32k for
512 byte inodes) would make a significant difference, I think.
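The buffer sizes quoted there, spelled out: an inode chunk is 64 inodes, so
a cluster buffer covering a whole chunk comes out at:

```python
# 64 inodes per chunk times the common inode sizes.
inodes_per_chunk = 64
assert inodes_per_chunk * 256 == 16 * 1024   # 16k for 256 byte inodes
assert inodes_per_chunk * 512 == 32 * 1024   # 32k for 512 byte inodes
```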

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS metadata flushing design - current and future
  2011-08-30  1:28     ` Dave Chinner
@ 2011-08-30  5:09       ` Christoph Hellwig
  2011-08-30  7:06         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2011-08-30  5:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Aug 30, 2011 at 11:28:21AM +1000, Dave Chinner wrote:
> > One really stupid thing we do in that area is that the xfs_iflush from
> > xfs_inode_item_push puts the buffer at the end of the delwri list and
> > expects it to be aged, just so that the first xfs_inode_item_pushbuf
> > can promote it to the front of the list.  Now that we mostly write
> > metadata from AIL pushing
> 
> Actually, when we have lots of data to be written, we still call
> xfs_iflush() a lot from .write_inode. That's where the delayed write
> buffer aging has significant benefit.

One thing we could try is to log the inode there even for non-blocking
write_inode calls to unify the code paths.  But I suspect simply waiting
for 3.3 and making future progress based on the removal of ->write_inode
is going to save us a lot of work dealing with the different behaviour.

> [ IOWs, the xfs_inode_item_push() simply locks the inode and returns

              xfs_inode_item_trylock?

> PUSHBUF if it needs flushing, then xfs_inode_item_pushbuf() calls
> xfs_iflush() and gets the dirty buffer back, which it then adds the
> buffer to a local dispatch list rather than submitting IO directly. ]

Why not keep using _push for this?

> FWIW, what this discussion is indicating to me is that we should
> timestamp entries in the AIL so we can push it to a time threshold
> as well as a LSN threshold.
> 
> That is, whenever we insert a new entry into the AIL, we not only
> update the LSN and position, we also update the insert time of the
> buffer. We can then update the time based threshold every few
> seconds and have the AIL wq walker walk until the time threshold is
> reached pushing items to disk. This would also make the xfssyncd
> "push the entire AIL" change to "push anything older than 30s" which
> is much more desirable from a sustained workload POV.

Sounds reasonable.  Except that we need to do the timestamp on the log
item and not the buffer, given that we might often not even have a
buffer at AIL insertion time.

> If we then modify the way all buffers are treated such that
> the AIL is the "delayed write list" (i.e. we take a reference count
> to the buffer when it is first logged and not in the AIL) and the
> pushbuf operations simply add the buffer to the local dispatch list,
> we can get rid of the delwri buffer list altogether. That also gets
> rid of the xfsbufd, too, as the AIL handles the aging, reference
> counting and writeback of the buffers entirely....

That would be nice.  We'll need some way to deal with the delwri
buffers from quotacheck and log recovery if we do this, but we could
just revert to the good old async buffers if we want to keep things
simple.  Alternatively we could keep local buffer submission lists
in quotacheck and pass 2 of log recovery, similar to what you suggested
for the AIL worker.

We could also check if we can get away with not needing lists managed
by us at all and rely on the on-stack plugging, which I'm about to move
up from the request layer to the bio layer, thus making it generally
useful.

> > > Also, larger inode buffers to reduce the amount of IO we do to both
> > > read and write inodes might also provide significant benefits by
> > > reducing the amount of IO and number of buffers we need to track in
> > > the cache...
> > 
> > We could try to get for large in-core clusters.  That is try to always
> > allocate N aligned inode clusters together, and always read/write
> > clusters in that alignment together if possible.
> 
> Well, just increasing the cluster buffer to cover an entire inode
> chunk for common inode sizes (16k for 256 byte inodes and 32k for
> 512 byte inodes) would make a significant difference, I think.

Ok, starting out simple might make most sense.


* Re: XFS metadata flushing design - current and future
  2011-08-30  5:09       ` Christoph Hellwig
@ 2011-08-30  7:06         ` Dave Chinner
  2011-08-30  7:10           ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2011-08-30  7:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Aug 30, 2011 at 01:09:59AM -0400, Christoph Hellwig wrote:
> On Tue, Aug 30, 2011 at 11:28:21AM +1000, Dave Chinner wrote:
> > > One really stupid thing we do in that area is that the xfs_iflush from
> > > xfs_inode_item_push puts the buffer at the end of the delwri list and
> > > expects it to be aged, just so that the first xfs_inode_item_pushbuf
> > > can promote it to the front of the list.  Now that we mostly write
> > > metadata from AIL pushing
> > 
> > Actually, when we have lots of data to be written, we still call
> > xfs_iflush() a lot from .write_inode. That's where the delayed write
> > buffer aging has significant benefit.
> 
> One thing we could try is to log the inode there even for non-blocking
> write_inode calls to unifty the code path.  But I suspect simply waiting
> for 3.3 and making future progress based on the removal of ->write_inode
> is going to to save us a lot of work and deal with different behaviour.

*nod*

> > [ IOWs, the xfs_inode_item_push() simply locks the inode and returns
> 
>               xfs_inode_item_trylock?

yeah, sorry, got it mixed up.

> > PUSHBUF if it needs flushing, then xfs_inode_item_pushbuf() calls
> > xfs_iflush() and gets the dirty buffer back, which it then adds the
> > buffer to a local dispatch list rather than submitting IO directly. ]
> 
> Why not keep using _push for this?

Because if we are returning a buffer, then it makes sense to use
pushbuf and change the prototype for that operation and leave
IOP_PUSH completely unchanged...

> > FWIW, what this discussion is indicating to me is that we should
> > timestamp entries in the AIL so we can push it to a time threshold
> > as well as a LSN threshold.
> > 
> > That is, whenever we insert a new entry into the AIL, we not only
> > update the LSN and position, we also update the insert time of the
> > buffer. We can then update the time based threshold every few
> > seconds and have the AIL wq walker walk until the time threshold is
> > reached pushing items to disk. This would also make the xfssyncd
> > "push the entire AIL" change to "push anything older than 30s" which
> > is much more desirable from a sustained workload POV.
> 
> Sounds reasonable.  Except that we need to do the timestamp on the log
> item and not the buffer given that we might often not even have a
> buffer at AIL insertation time.

yeah, that's what I intended it to mean, sorry if it wasn't clear.

> > If we then modify the way all buffers are treated such that
> > the AIL is the "delayed write list" (i.e. we take a reference count
> > to the buffer when it is first logged and not in the AIL) and the
> > pushbuf operations simply add the buffer to the local dispatch list,
> > we can get rid of the delwri buffer list altogether. That also gets
> > rid of the xfsbufd, too, as the AIL handles the aging, reference
> > counting and writeback of the buffers entirely....
> 
> That would be nice.  We'll need some ways to deal with the delwri
> buffers from quotacheck and log recovery if we do this, but we could
> just revert to the good old async buffers if we want to keep things
> simple.  Alternatively we could keep local buffer submission lists
> in quotacheck and pass2 of log recovery, similar to what you suggested
> for the AIL worked.
> 
> We could also check if we can get away with not needing lists managed
> by us at all and rely on the on-stack plugging, which I'm about to move
> up from the request layer to the bio layer and thus making generally
> useful.

The advantage of using our own list is that we can still then sort
them (might be thousands of buffers we queue in a single pass)
before submitting them for IO.  The on-stack plugging doesn't allow
this at all, IIUC, as it is really just a FIFO list above the IO
scheduler queues....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS metadata flushing design - current and future
  2011-08-30  7:06         ` Dave Chinner
@ 2011-08-30  7:10           ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2011-08-30  7:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Tue, Aug 30, 2011 at 05:06:42PM +1000, Dave Chinner wrote:
> 
> The advantage of using our own list is that we can still then sort
> them (might be thousands of buffers we queue in a single pass)
> before submitting them for IO.  The on-stack plugging doesn't allow
> this at all, IIUC, as itis really just a FIFO list above the IO
> scheduler queues....

The on-stack plugging already sorts - manually in that code for the
request queues (not interesting for us), and currently by block number
as well using the elevator.  But the current code is more of a
guideline; I have some fairly big changes for that area of the block
layer in the pipeline.


* Re: XFS metadata flushing design - current and future
  2011-08-27  8:03 XFS metadata flushing design - current and future Christoph Hellwig
  2011-08-29  1:01 ` Dave Chinner
@ 2011-09-09 22:31 ` Stewart Smith
  1 sibling, 0 replies; 9+ messages in thread
From: Stewart Smith @ 2011-09-09 22:31 UTC (permalink / raw)
  To: Christoph Hellwig, xfs

On Sat, 27 Aug 2011 04:03:21 -0400, Christoph Hellwig <hch@infradead.org> wrote:
> All metadata is XFS is read and written using buffers as the lowest
> layer.

s/is/in/

pretty minor :)

-- 
Stewart Smith

