* XFS metadata flushing design - current and future
@ 2011-08-27  8:03 Christoph Hellwig
  2011-08-29  1:01 ` Dave Chinner
  2011-09-09 22:31 ` Stewart Smith
  0 siblings, 2 replies; 9+ messages in thread

From: Christoph Hellwig @ 2011-08-27 8:03 UTC (permalink / raw)
To: xfs

Here is a little writeup I did about how we handle dirty metadata flushing in XFS currently, and how we can improve on it in the relatively short term:

---
Metadata flushing in XFS
========================

This document describes the current state of the handling of dirty XFS in-core metadata and how it gets flushed to disk, as well as ideas on how to simplify it in the future.

Buffers
-------

All metadata is XFS is read and written using buffers as the lowest layer. There are two ways to write a buffer back to disk: delwri and sync. Delwri means the buffer gets added to a delayed write list, which a background thread writes back periodically or when forced to. Synchronous writes mean the buffer is written back immediately, and the caller waits for completion.

Logging and the Active Item List (AIL)
--------------------------------------

The prime method of metadata writeback in XFS is by logging the changes into the transaction log, and writing back the changes to the original location in the background. The prime data structure to drive the asynchronous writeback is the Active Item List or AIL. The AIL contains a list of all changes in the log that need to be written back, ordered by the time they were committed to the log using the Log Sequence Number (LSN). The AIL is periodically pushed out to try to move the log tail LSN forward. In addition, the sync worker periodically attempts to push out all items in the AIL.

Non-transaction metadata updates
--------------------------------

XFS still has a few places where we update metadata non-transactionally.

The prime causes of non-transactional metadata updates are timestamps in the inode, and inode size updates from extending writes.
These are handled by marking the inode dirty in the VFS and XFS inodes, and either relying on transactional updates to piggy-back these updates, or on the VFS periodic writeback thread to call into the ->write_inode method in XFS to write these changes back. ->write_inode either starts delwri buffer writeback on the inode, or starts a new transaction to log the inode core containing these changes.

The dquot structures may be scheduled for delwri writeback after a quota check during an unclean mount.

Extended attribute payloads that are stored outside the main attribute btree are written back synchronously using buffers.

New allocation group headers written during a filesystem resizing are written synchronously using buffers.

The superblock is written synchronously using buffers during umount and sync operations.

Log recovery writes back various pieces of metadata synchronously or using delwri buffers.

Other flushing methods
----------------------

For historical reasons we still have a few places that flush XFS metadata using other methods than logging and the AIL or explicit synchronous or delwri writes.

Filesystem freezing loops over all inodes in the system to flush out inodes marked dirty directly using xfs_iflush.

The quotacheck code marks dquots dirty, just to flush them at the end of the quotacheck operation.

The periodic and explicit sync code walks through all dquots and writes back all dirty dquots directly.

Future directions
-----------------

We should get rid of both the reliance on the VFS writeback tracking, and the XFS-internal non-AIL metadata flushing.

To get rid of the VFS writeback we'll just need to log all timestamp and size updates explicitly when they happen. This could be done today, but the overhead for frequent transactions in that area is deemed too high, especially with delayed logging enabled.
We plan to deprecate the non-delaylog mode by Linux 3.3, and introduce a new fast path for inode core updates that will allow us to use direct logging for these updates without introducing large overhead.

The explicit inode flushing using xfs_sync_attr looks like an attempt to make sure we do not have any inodes in the AIL when freezing a filesystem. A better replacement would be a call into the AIL code that allows us to completely empty the AIL before a freeze.

The explicit quota flushing needs a bit more work. First, quotacheck needs to be converted to queue up inodes to the delwri list immediately when updating the dquot for each inode. Second, the code in xfs_qm_scall_setqlim that attaches a dquot to the transaction but marks it dirty manually instead of through the transaction interface needs a detailed audit. After this we should be able to get rid of all explicit xfs_qm_sync calls.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread
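The AIL described in the writeup above — a list of logged changes ordered by commit LSN, pushed to move the log tail forward — can be modelled with a toy userspace sketch. The types and names here are illustrative only, not the kernel's actual `xfs_ail`/`xfs_log_item` structures:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t xfs_lsn_t;

/* Toy AIL: a small sorted array of log items keyed by commit LSN. */
struct toy_item { xfs_lsn_t lsn; int written; };

struct toy_ail {
    struct toy_item items[16];
    int count;
};

/* Insert keeps the list sorted by LSN, oldest (smallest) first. */
static void ail_insert(struct toy_ail *ail, xfs_lsn_t lsn)
{
    int i = ail->count++;
    while (i > 0 && ail->items[i - 1].lsn > lsn) {
        ail->items[i] = ail->items[i - 1];
        i--;
    }
    ail->items[i].lsn = lsn;
    ail->items[i].written = 0;
}

/* The tail LSN is the oldest item not yet written back; it pins the log. */
static xfs_lsn_t ail_tail_lsn(const struct toy_ail *ail)
{
    for (int i = 0; i < ail->count; i++)
        if (!ail->items[i].written)
            return ail->items[i].lsn;
    return 0;   /* AIL empty: the log tail is free to move */
}

/* Pushing writes back every item at or below the target LSN, which is
 * what allows the log tail to move forward. */
static void ail_push(struct toy_ail *ail, xfs_lsn_t target)
{
    for (int i = 0; i < ail->count; i++)
        if (ail->items[i].lsn <= target)
            ail->items[i].written = 1;
}
```

For example, after inserting items at LSNs 10, 5 and 20, the tail sits at 5; pushing to target 10 writes back the two oldest items and moves the tail to 20.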
* Re: XFS metadata flushing design - current and future
  2011-08-27  8:03 XFS metadata flushing design - current and future Christoph Hellwig
@ 2011-08-29  1:01 ` Dave Chinner
  2011-08-29  6:33   ` Christoph Hellwig
  2011-08-29 12:33   ` Christoph Hellwig
  2011-09-09 22:31 ` Stewart Smith
  1 sibling, 2 replies; 9+ messages in thread

From: Dave Chinner @ 2011-08-29 1:01 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs

On Sat, Aug 27, 2011 at 04:03:21AM -0400, Christoph Hellwig wrote:
> Here is a little writeup I did about how we handle dirty metadata
> flushing in XFS currently, and how we can improve on it in the
> relatively short term:
>
> ---
> Metadata flushing in XFS
> ========================
>
> This document describes the state of the handling of dirty XFS in-core
> metadata, and how it gets flushed to disk, as well as ideas how to
> simplify it in the future.
>
> Buffers
> -------
>
> All metadata is XFS is read and written using buffers as the lowest layer.
> There are two ways to write a buffer back to disk: delwri and sync.
> Delwri means the buffer gets added to a delayed write list, which a
> background thread writes back periodically or when forced to. Synchronous
> writes mean the buffer is written back immediately, and the caller waits
> for completion.

Right, that's how buffers are flushed, but for some metadata there is a layer above this - the in-memory object that needs to be flushed to the buffer before the buffer can be written. Inodes and dquots fall into this category, so describing how they are flushed would also be a good idea. Something like:

----
High Level Objects
------------------

Some objects are logged directly when changed, rather than modified in buffers first. When these items are written back, they first need to be flushed to the backing buffer, and then IO is issued on the backing buffer. These objects can be written in two ways: delwri and sync.
Delwri means the object is locked and written to the backing buffer, and the buffer is then written via its delwri mechanism. The object remains locked (and so cannot be written to the buffer again) until the backing buffer is written to disk and marked clean. This allows multiple objects in the one buffer to be written at different times but be cleaned in a single buffer IO.

Sync means the object is locked and written to the backing buffer, and the buffer is written immediately to disk via its sync mechanism. The object remains locked until the buffer IO completes.

Objects need to be attached to the buffer with a callback so that they can be updated and unlocked when buffer IO completes. Buffer IO completion will walk the callback list to do this processing.
----

> Logging and the Active Item List (AIL)
> --------------------------------------
>
> The prime method of metadata writeback in XFS is by logging the changes
> into the transaction log, and writing back the changes to the original
> location in the background. The prime data structure to drive the
> asynchronous writeback is the Active Item List or AIL. The AIL contains
> a list of all changes in the log that need to be written back, ordered
> by the time they were committed to the log using the Log Sequence
> Number (LSN). The AIL is periodically pushed out to try to move the
> log tail LSN forward. In addition, the sync worker periodically attempts
> to push out all items in the AIL.
>
> Non-transaction metadata updates
> --------------------------------
>
> XFS still has a few places where we update metadata non-transactionally.
>
> The prime causes of non-transactional metadata updates are timestamps in the
> inode, and inode size updates from extending writes.
> These are handled
> by marking the inode dirty in the VFS and XFS inodes, and either relying
> on transactional updates to piggy-back these updates, or on the VFS
> periodic writeback thread to call into the ->write_inode method in
> XFS to write these changes back. ->write_inode either starts delwri
> buffer writeback on the inode, or starts a new transaction to log
> the inode core containing these changes.
>
> The dquot structures may be scheduled for delwri writeback after a
> quota check during an unclean mount.
>
> Extended attribute payloads that are stored outside the main attribute
> btree are written back synchronously using buffers.
>
> New allocation group headers written during a filesystem resizing are
> written synchronously using buffers.
>
> The superblock is written synchronously using buffers during umount
> and sync operations.
>
> Log recovery writes back various pieces of metadata synchronously
> or using delwri buffers.
>
> Other flushing methods
> ----------------------
>
> For historical reasons we still have a few places that flush XFS metadata
> using other methods than logging and the AIL or explicit synchronous
> or delwri writes.
>
> Filesystem freezing loops over all inodes in the system to flush out
> inodes marked dirty directly using xfs_iflush.
>
> The quotacheck code marks dquots dirty, just to flush them at the end of
> the quotacheck operation.

This is safe because the filesystem isn't "open for business" until the quotacheck completes. The quotacheck needed flags aren't cleared until all the updates are on disk, so this doesn't need to be done transactionally.

> The periodic and explicit sync code walks through all dquots and writes
> back all dirty dquots directly.
>
> Future directions
> -----------------
>
> We should get rid of both the reliance on the VFS writeback tracking, and
> the XFS-internal non-AIL metadata flushing.

I'm assuming you mean VFS level dirty inode writeback tracking, not dirty page cache tracking?
> To get rid of the VFS writeback we'll just need to log all timestamp and size
> updates explicitly when they happen. This could be done today, but the
> overhead for frequent transactions in that area is deemed too high, especially
> with delayed logging enabled. We plan to deprecate the non-delaylog mode
> by Linux 3.3, and introduce a new fast path for inode core updates that
> will allow us to use direct logging for these updates without introducing
> large overhead.
>
> The explicit inode flushing using xfs_sync_attr looks like an attempt to
> make sure we do not have any inodes in the AIL when freezing a filesystem.
> A better replacement would be a call into the AIL code that allows us to
> completely empty the AIL before a freeze.

Agreed, good simplification, and it would enable us to get rid of some of the kludgy code in freeze.

> The explicit quota flushing needs a bit more work. First, quotacheck needs
> to be converted to queue up inodes to the delwri list immediately when
> updating the dquot for each inode. Second, the code in xfs_qm_scall_setqlim
> that attaches a dquot to the transaction but marks it dirty manually instead
> of through the transaction interface needs a detailed audit. After this
> we should be able to get rid of all explicit xfs_qm_sync calls.

Yes, it would be great to remove the need for explicit quota flushing.

Another thing I've noticed is that AIL pushing of dirty inodes can be quite inefficient from a CPU usage perspective. Inodes that have already been flushed to their backing buffer result in an IOP_PUSHBUF call when the AIL tries to push them. Pushing the buffer requires a buffer cache search, followed by a delwri list promotion. However, the initial xfs_iflush() call on a dirty inode also clusters all the other remaining dirty inodes into the same buffer. When the AIL hits those other dirty inodes, they are already locked and so we do an IOP_PUSHBUF call on every other dirty inode.
So on a completely dirty inode cluster, we do ~30 needless buffer cache searches and buffer delwri promotions, all for the same buffer. That's a lot of extra work we don't need to be doing - ~10% of the buffer cache lookups come from IOP_PUSHBUF under inode intensive metadata workloads:

	xs_push_ail_pushbuf...   5665434
	xs_iflush_count.......    173551
	xs_icluster_flushcnt..    171554
	xs_icluster_flushinode   5316393
	pb_get................  63362891

This shows we've done 171k explicit inode cluster flushes when writing inodes, and we've clustered 5.3M inodes in those cluster writes. We also have 5.6M IOP_PUSHBUF calls, which indicates most of them are coming from finding flush locked inodes. There have been 63M buffer cache lookups, so we're causing roughly 8% of buffer cache lookups just through flushing inodes from the AIL.

Also, larger inode buffers to reduce the amount of IO we do to both read and write inodes might also provide significant benefits by reducing the amount of IO and the number of buffers we need to track in the cache...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
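The two-level flushing Dave describes earlier in this message — flush-lock an object, copy its state into the backing buffer, and attach a completion callback so buffer IO done processing can clean and unlock it — can be sketched roughly as follows. This is an illustrative userspace toy, not the real `xfs_buf`/log item callback interfaces:

```c
#include <stddef.h>

/* A flushable object: flush-locked while its state is in flight in
 * the backing buffer, unlocked again at buffer IO completion. */
struct toy_object {
    int dirty;
    int flush_locked;
    struct toy_object *next;    /* linkage on the buffer callback list */
};

struct toy_buf {
    struct toy_object *callbacks;   /* objects flushed into this buffer */
};

/* Delwri-style flush: take the flush lock, copy the object into the
 * buffer, and queue it on the buffer's completion callback list.
 * Several objects in the one buffer can be flushed at different times
 * but are all cleaned by a single buffer IO. */
static int object_flush(struct toy_object *obj, struct toy_buf *bp)
{
    if (obj->flush_locked)
        return -1;              /* already in flight; nothing to do */
    obj->flush_locked = 1;
    obj->next = bp->callbacks;
    bp->callbacks = obj;
    return 0;
}

/* Buffer IO completion walks the callback list, marking each attached
 * object clean and dropping its flush lock. */
static void buf_iodone(struct toy_buf *bp)
{
    for (struct toy_object *obj = bp->callbacks; obj; obj = obj->next) {
        obj->dirty = 0;
        obj->flush_locked = 0;
    }
    bp->callbacks = NULL;
}
```

Note the `flush_locked` check is also what makes the IOP_PUSHBUF pattern above visible: a second attempt to flush an in-flight object finds the flush lock held and has to fall back to pushing the backing buffer instead.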
* Re: XFS metadata flushing design - current and future
  2011-08-29  1:01 ` Dave Chinner
@ 2011-08-29  6:33   ` Christoph Hellwig
  2011-08-29 12:33   ` Christoph Hellwig
  1 sibling, 0 replies; 9+ messages in thread

From: Christoph Hellwig @ 2011-08-29 6:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Mon, Aug 29, 2011 at 11:01:49AM +1000, Dave Chinner wrote:
> Right, that's how buffers are flushed, but for some metadata there
> is a layer above this - the in-memory object that needs to be
> flushed to the buffer before the buffer can be written. Inodes and
> dquots fall into this category, so describing how they are flushed
> would also be a good idea. Something like:

Sounds fine.

> Delwri means the object is locked and written to the backing buffer,
> and the buffer is then written via its delwri mechanism. The object
> remains locked (and so cannot be written to the buffer again) until
> the backing buffer is written to disk and marked clean. This allows
> multiple objects in the one buffer to be written at different times
> but be cleaned in a single buffer IO.

Locked is a bit too simple here - we keep the flush lock, but not the main object lock.

> > inodes marked dirty directly using xfs_iflush.
> >
> > The quotacheck code marks dquots dirty, just to flush them at the end of
> > the quotacheck operation.
>
> This is safe because the filesystem isn't "open for business" until
> the quotacheck completes. The quotacheck needed flags aren't cleared
> until all the updates are on disk, so this doesn't need to be done
> transactionally.

Yes, it's safe - but another different layer of dirty metadata to track.

> > We should get rid of both the reliance on the VFS writeback tracking, and
> > the XFS-internal non-AIL metadata flushing.
>
> I'm assuming you mean VFS level dirty inode writeback tracking, not
> dirty page cache tracking?

Yes, I'll clarify it.
* Re: XFS metadata flushing design - current and future
  2011-08-29  1:01 ` Dave Chinner
  2011-08-29  6:33   ` Christoph Hellwig
@ 2011-08-29 12:33   ` Christoph Hellwig
  2011-08-30  1:28     ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread

From: Christoph Hellwig @ 2011-08-29 12:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Mon, Aug 29, 2011 at 11:01:49AM +1000, Dave Chinner wrote:
> Another thing I've noticed is that AIL pushing of dirty inodes can
> be quite inefficient from a CPU usage perspective. Inodes that have
> already been flushed to their backing buffer result in an
> IOP_PUSHBUF call when the AIL tries to push them. Pushing the buffer
> requires a buffer cache search, followed by a delwri list promotion.
> However, the initial xfs_iflush() call on a dirty inode also
> clusters all the other remaining dirty inodes into the same
> buffer. When the AIL hits those other dirty inodes, they are already
> locked and so we do an IOP_PUSHBUF call on every other dirty inode.
> So on a completely dirty inode cluster, we do ~30 needless buffer
> cache searches and buffer delwri promotions, all for the same buffer.
> That's a lot of extra work we don't need to be doing - ~10% of the
> buffer cache lookups come from IOP_PUSHBUF under inode intensive
> metadata workloads:

One really stupid thing we do in that area is that the xfs_iflush from xfs_inode_item_push puts the buffer at the end of the delwri list and expects it to be aged, just so that the first xfs_inode_item_pushbuf can promote it to the front of the list. Now that we mostly write metadata from AIL pushing we should not do an additional pass of aging on that - that's what we already use the AIL for. Once we did that we should be able to remove the buffer promotion and make the pushbuf a no-op.
The only thing this might interact with in a not so nice way would be inode reclaim, if it still did delwri writes with the delay period, but we might be able to get away without that one as well.

> Also, larger inode buffers to reduce the amount of IO we do to both
> read and write inodes might also provide significant benefits by
> reducing the amount of IO and the number of buffers we need to track in
> the cache...

We could try to go for large in-core clusters. That is, try to always allocate N aligned inode clusters together, and always read/write clusters in that alignment together if possible.
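The delwri list behaviour under discussion — buffers queued at the tail to age, with the first pushbuf promoting them to the front — can be modelled as a toy singly linked queue. This is an editor's sketch with illustrative names, not the xfsbufd code itself:

```c
#include <stddef.h>

/* Toy delwri queue: buffers age at the tail; promotion moves a buffer
 * to the head so the next flush pass dispatches it immediately. */
struct toy_dbuf {
    int id;
    struct toy_dbuf *next;
};

struct toy_delwri {
    struct toy_dbuf *head;
};

/* Queue at the tail: newly dirtied buffers sit behind older ones and
 * age before the background thread writes them. */
static void delwri_queue(struct toy_delwri *q, struct toy_dbuf *bp)
{
    struct toy_dbuf **pp = &q->head;
    while (*pp)
        pp = &(*pp)->next;
    bp->next = NULL;
    *pp = bp;
}

/* Promotion (roughly what xfs_inode_item_pushbuf achieves): unlink the
 * buffer and splice it to the front, bypassing the aging delay. */
static void delwri_promote(struct toy_delwri *q, struct toy_dbuf *bp)
{
    struct toy_dbuf **pp = &q->head;
    while (*pp && *pp != bp)
        pp = &(*pp)->next;
    if (!*pp)
        return;                 /* not on the list */
    *pp = bp->next;
    bp->next = q->head;
    q->head = bp;
}
```

In this model, Christoph's proposal amounts to queueing AIL-pushed buffers at the head in the first place, so the later promotion step has nothing left to do.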
* Re: XFS metadata flushing design - current and future
  2011-08-29 12:33   ` Christoph Hellwig
@ 2011-08-30  1:28     ` Dave Chinner
  2011-08-30  5:09       ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread

From: Dave Chinner @ 2011-08-30 1:28 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs

On Mon, Aug 29, 2011 at 08:33:18AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 29, 2011 at 11:01:49AM +1000, Dave Chinner wrote:
> > Another thing I've noticed is that AIL pushing of dirty inodes can
> > be quite inefficient from a CPU usage perspective. Inodes that have
> > already been flushed to their backing buffer result in an
> > IOP_PUSHBUF call when the AIL tries to push them. Pushing the buffer
> > requires a buffer cache search, followed by a delwri list promotion.
> > However, the initial xfs_iflush() call on a dirty inode also
> > clusters all the other remaining dirty inodes into the same
> > buffer. When the AIL hits those other dirty inodes, they are already
> > locked and so we do an IOP_PUSHBUF call on every other dirty inode.
> > So on a completely dirty inode cluster, we do ~30 needless buffer
> > cache searches and buffer delwri promotions, all for the same buffer.
> > That's a lot of extra work we don't need to be doing - ~10% of the
> > buffer cache lookups come from IOP_PUSHBUF under inode intensive
> > metadata workloads:
>
> One really stupid thing we do in that area is that the xfs_iflush from
> xfs_inode_item_push puts the buffer at the end of the delwri list and
> expects it to be aged, just so that the first xfs_inode_item_pushbuf
> can promote it to the front of the list. Now that we mostly write
> metadata from AIL pushing

Actually, when we have lots of data to be written, we still call xfs_iflush() a lot from .write_inode. That's where the delayed write buffer aging has significant benefit.

> we should not do an additional pass of aging
> on that - that's what we already use the AIL for.
> Once we did that we
> should be able to remove the buffer promotion and make the pushbuf a
> no-op.

If we remove the promotion, then it can be up to 15s before the IO is actually dispatched, resulting in long stalls until the buffer ages out. That's the problem that I introduced the promotion to fix. Yes, the inode writeback code has changed a bit since then, but not significantly enough to remove that problem.

> The only thing this might interact with in a not so nice way
> would be inode reclaim, if it still did delwri writes with the delay
> period, but we might be able to get away without that one as well.

Right - if we only ever call xfs_iflush() from IOP_PUSH() and shrinker based inode reclaim, then I think this problem mostly goes away. We'd still need the shrinker path to be able to call xfs_iflush() for synchronous inode cluster writeback, as that is the method by which we ensure memory reclaim makes progress....

To make this work, rather than doing the current "pushbuf" operation on inodes, let's make xfs_iflush() return the backing buffer locked rather than submitting IO on it directly. Then the caller can submit the buffer IO however it wants. That way reclaim can do synchronous IO, and for the AIL pusher we can add the buffer to a local list that we can then submit for IO rather than the current xfsbufd wakeup call we do. All inodes we see flush locked in IOP_PUSH we can then ignore, knowing that they are either currently under IO or on the local list pending IO submission. Either way, we don't need to try a pushbuf operation on flush locked inodes.

[ IOWs, the xfs_inode_item_push() simply locks the inode and returns PUSHBUF if it needs flushing, then xfs_inode_item_pushbuf() calls xfs_iflush() and gets the dirty buffer back, which it then adds to a local dispatch list rather than submitting IO directly.
]

---

FWIW, what this discussion is indicating to me is that we should timestamp entries in the AIL so we can push it to a time threshold as well as an LSN threshold.

That is, whenever we insert a new entry into the AIL, we not only update the LSN and position, we also update the insert time of the buffer. We can then update the time based threshold every few seconds and have the AIL wq walker walk until the time threshold is reached, pushing items to disk. This would also make the xfssyncd "push the entire AIL" change to "push anything older than 30s", which is much more desirable from a sustained workload POV.

If we then modify the way all buffers are treated such that the AIL is the "delayed write list" (i.e. we take a reference count on the buffer when it is first logged and not in the AIL) and the pushbuf operations simply add the buffer to the local dispatch list, we can get rid of the delwri buffer list altogether. That also gets rid of the xfsbufd, too, as the AIL handles the aging, reference counting and writeback of the buffers entirely....

> > Also, larger inode buffers to reduce the amount of IO we do to both
> > read and write inodes might also provide significant benefits by
> > reducing the amount of IO and the number of buffers we need to track in
> > the cache...
>
> We could try to go for large in-core clusters. That is, try to always
> allocate N aligned inode clusters together, and always read/write
> clusters in that alignment together if possible.

Well, just increasing the cluster buffer to cover an entire inode chunk for common inode sizes (16k for 256 byte inodes and 32k for 512 byte inodes) would make a significant difference, I think.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
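Dave's time-threshold idea — stamp each entry at AIL insertion and push everything older than a cutoff, regardless of LSN — might look something like this in outline. This is a hypothetical userspace sketch; field and function names are invented for illustration:

```c
/* Toy AIL entry carrying an insertion timestamp alongside its LSN. */
struct toy_log_item {
    unsigned long lsn;
    long inserted;      /* seconds, stamped at AIL insertion time */
    int pushed;
};

/* Push every item that has been sitting in the AIL for at least 'age'
 * seconds, independent of its LSN. This turns the xfssyncd "push the
 * entire AIL" pass into "push anything older than 30s". Returns the
 * number of items pushed. */
static int ail_push_by_age(struct toy_log_item *items, int n,
                           long now, long age)
{
    int pushed = 0;
    for (int i = 0; i < n; i++) {
        if (!items[i].pushed && now - items[i].inserted >= age) {
            items[i].pushed = 1;
            pushed++;
        }
    }
    return pushed;
}
```

With a 30s threshold, only entries inserted at least 30 seconds ago are pushed; recently logged items keep aging in the AIL, which is the sustained-workload behaviour Dave is after.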
* Re: XFS metadata flushing design - current and future
  2011-08-30  1:28     ` Dave Chinner
@ 2011-08-30  5:09       ` Christoph Hellwig
  2011-08-30  7:06         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread

From: Christoph Hellwig @ 2011-08-30 5:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Tue, Aug 30, 2011 at 11:28:21AM +1000, Dave Chinner wrote:
> > One really stupid thing we do in that area is that the xfs_iflush from
> > xfs_inode_item_push puts the buffer at the end of the delwri list and
> > expects it to be aged, just so that the first xfs_inode_item_pushbuf
> > can promote it to the front of the list. Now that we mostly write
> > metadata from AIL pushing
>
> Actually, when we have lots of data to be written, we still call
> xfs_iflush() a lot from .write_inode. That's where the delayed write
> buffer aging has significant benefit.

One thing we could try is to log the inode there even for non-blocking write_inode calls to unify the code path. But I suspect simply waiting for 3.3 and making future progress based on the removal of ->write_inode is going to save us a lot of work and avoid having to deal with different behaviour.

> [ IOWs, the xfs_inode_item_push() simply locks the inode and returns

xfs_inode_item_trylock?

> PUSHBUF if it needs flushing, then xfs_inode_item_pushbuf() calls
> xfs_iflush() and gets the dirty buffer back, which it then adds
> to a local dispatch list rather than submitting IO directly. ]

Why not keep using _push for this?

> FWIW, what this discussion is indicating to me is that we should
> timestamp entries in the AIL so we can push it to a time threshold
> as well as an LSN threshold.
>
> That is, whenever we insert a new entry into the AIL, we not only
> update the LSN and position, we also update the insert time of the
> buffer. We can then update the time based threshold every few
> seconds and have the AIL wq walker walk until the time threshold is
> reached, pushing items to disk.
> This would also make the xfssyncd
> "push the entire AIL" change to "push anything older than 30s", which
> is much more desirable from a sustained workload POV.

Sounds reasonable. Except that we need to do the timestamp on the log item and not the buffer, given that we might often not even have a buffer at AIL insertion time.

> If we then modify the way all buffers are treated such that
> the AIL is the "delayed write list" (i.e. we take a reference count
> on the buffer when it is first logged and not in the AIL) and the
> pushbuf operations simply add the buffer to the local dispatch list,
> we can get rid of the delwri buffer list altogether. That also gets
> rid of the xfsbufd, too, as the AIL handles the aging, reference
> counting and writeback of the buffers entirely....

That would be nice. We'll need some way to deal with the delwri buffers from quotacheck and log recovery if we do this, but we could just revert to the good old async buffers if we want to keep things simple. Alternatively we could keep local buffer submission lists in quotacheck and pass 2 of log recovery, similar to what you suggested for the AIL walker.

We could also check if we can get away with not needing lists managed by us at all and rely on the on-stack plugging, which I'm about to move up from the request layer to the bio layer and thus make generally useful.

> > > Also, larger inode buffers to reduce the amount of IO we do to both
> > > read and write inodes might also provide significant benefits by
> > > reducing the amount of IO and the number of buffers we need to track in
> > > the cache...
> >
> > We could try to go for large in-core clusters. That is, try to always
> > allocate N aligned inode clusters together, and always read/write
> > clusters in that alignment together if possible.
>
> Well, just increasing the cluster buffer to cover an entire inode
> chunk for common inode sizes (16k for 256 byte inodes and 32k for
> 512 byte inodes) would make a significant difference, I think.

Ok, starting out simple might make most sense.
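The buffer sizes Dave quotes follow directly from XFS allocating inodes in chunks of 64 (XFS_INODES_PER_CHUNK), so a cluster buffer covering a whole chunk is 64 times the inode size:

```c
/* XFS allocates inodes in chunks of 64; a buffer covering a whole
 * chunk is therefore 64 * inode size. */
enum { TOY_INODES_PER_CHUNK = 64 };

static int chunk_buffer_size(int inode_size)
{
    return TOY_INODES_PER_CHUNK * inode_size;
}
```

That gives 64 × 256 = 16k for 256 byte inodes and 64 × 512 = 32k for 512 byte inodes, matching the figures in the message above.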
* Re: XFS metadata flushing design - current and future
  2011-08-30  5:09       ` Christoph Hellwig
@ 2011-08-30  7:06         ` Dave Chinner
  2011-08-30  7:10           ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread

From: Dave Chinner @ 2011-08-30 7:06 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs

On Tue, Aug 30, 2011 at 01:09:59AM -0400, Christoph Hellwig wrote:
> On Tue, Aug 30, 2011 at 11:28:21AM +1000, Dave Chinner wrote:
> > > One really stupid thing we do in that area is that the xfs_iflush from
> > > xfs_inode_item_push puts the buffer at the end of the delwri list and
> > > expects it to be aged, just so that the first xfs_inode_item_pushbuf
> > > can promote it to the front of the list. Now that we mostly write
> > > metadata from AIL pushing
> >
> > Actually, when we have lots of data to be written, we still call
> > xfs_iflush() a lot from .write_inode. That's where the delayed write
> > buffer aging has significant benefit.
>
> One thing we could try is to log the inode there even for non-blocking
> write_inode calls to unify the code path. But I suspect simply waiting
> for 3.3 and making future progress based on the removal of ->write_inode
> is going to save us a lot of work and avoid having to deal with different
> behaviour.

*nod*

> > [ IOWs, the xfs_inode_item_push() simply locks the inode and returns
>
> xfs_inode_item_trylock?

Yeah, sorry, got it mixed up.

> > PUSHBUF if it needs flushing, then xfs_inode_item_pushbuf() calls
> > xfs_iflush() and gets the dirty buffer back, which it then adds
> > to a local dispatch list rather than submitting IO directly. ]
>
> Why not keep using _push for this?

Because if we are returning a buffer, then it makes sense to use pushbuf and change the prototype for that operation, and leave IOP_PUSH completely unchanged...

> > FWIW, what this discussion is indicating to me is that we should
> > timestamp entries in the AIL so we can push it to a time threshold
> > as well as an LSN threshold.
> >
> > That is, whenever we insert a new entry into the AIL, we not only
> > update the LSN and position, we also update the insert time of the
> > buffer. We can then update the time based threshold every few
> > seconds and have the AIL wq walker walk until the time threshold is
> > reached, pushing items to disk. This would also make the xfssyncd
> > "push the entire AIL" change to "push anything older than 30s", which
> > is much more desirable from a sustained workload POV.
>
> Sounds reasonable. Except that we need to do the timestamp on the log
> item and not the buffer, given that we might often not even have a
> buffer at AIL insertion time.

Yeah, that's what I intended that to mean, sorry if it wasn't clear.

> > If we then modify the way all buffers are treated such that
> > the AIL is the "delayed write list" (i.e. we take a reference count
> > on the buffer when it is first logged and not in the AIL) and the
> > pushbuf operations simply add the buffer to the local dispatch list,
> > we can get rid of the delwri buffer list altogether. That also gets
> > rid of the xfsbufd, too, as the AIL handles the aging, reference
> > counting and writeback of the buffers entirely....
>
> That would be nice. We'll need some way to deal with the delwri
> buffers from quotacheck and log recovery if we do this, but we could
> just revert to the good old async buffers if we want to keep things
> simple. Alternatively we could keep local buffer submission lists
> in quotacheck and pass 2 of log recovery, similar to what you suggested
> for the AIL walker.
>
> We could also check if we can get away with not needing lists managed
> by us at all and rely on the on-stack plugging, which I'm about to move
> up from the request layer to the bio layer and thus make generally
> useful.

The advantage of using our own list is that we can still then sort them (might be thousands of buffers we queue in a single pass) before submitting them for IO.
The on-stack plugging doesn't allow this at all, IIUC, as it is really just a FIFO list above the IO scheduler queues....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
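Sorting a locally gathered dispatch list before submission — the advantage Dave points out over a FIFO plug list — is essentially a sort by disk block number. A minimal sketch with invented types (the real code would operate on xfs_buf structures):

```c
#include <stdint.h>
#include <stdlib.h>

struct toy_iobuf {
    uint64_t blkno;     /* on-disk block number of the buffer */
};

static int cmp_blkno(const void *a, const void *b)
{
    const struct toy_iobuf *ba = a;
    const struct toy_iobuf *bb = b;
    if (ba->blkno < bb->blkno)
        return -1;
    return ba->blkno > bb->blkno;
}

/* Sort the gathered buffers into ascending disk order before handing
 * them to the block layer, turning a scatter of delwri buffers into
 * mostly-sequential IO. */
static void dispatch_sort(struct toy_iobuf *bufs, size_t n)
{
    qsort(bufs, n, sizeof(*bufs), cmp_blkno);
}
```

With a FIFO plug list the buffers would be issued in queueing order; sorting a private list first is what lets thousands of queued buffers be submitted in ascending block order.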
* Re: XFS metadata flushing design - current and future
  2011-08-30  7:06         ` Dave Chinner
@ 2011-08-30  7:10           ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread

From: Christoph Hellwig @ 2011-08-30 7:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Tue, Aug 30, 2011 at 05:06:42PM +1000, Dave Chinner wrote:
> The advantage of using our own list is that we can still then sort
> them (might be thousands of buffers we queue in a single pass)
> before submitting them for IO. The on-stack plugging doesn't allow
> this at all, IIUC, as it is really just a FIFO list above the IO
> scheduler queues....

The on-stack plugging already sorts - manually in that code for the request queues (not interesting for us), and currently by block number as well using the elevator. But the current code is more of a guideline; I have some fairly big changes for that area of the block layer in the pipeline.
* Re: XFS metadata flushing design - current and future
  2011-08-27  8:03 XFS metadata flushing design - current and future Christoph Hellwig
  2011-08-29  1:01 ` Dave Chinner
@ 2011-09-09 22:31 ` Stewart Smith
  1 sibling, 0 replies; 9+ messages in thread

From: Stewart Smith @ 2011-09-09 22:31 UTC (permalink / raw)
To: Christoph Hellwig, xfs

On Sat, 27 Aug 2011 04:03:21 -0400, Christoph Hellwig <hch@infradead.org> wrote:
> All metadata is XFS is read and written using buffers as the lowest
> layer.

s/is/in/

pretty minor :)
--
Stewart Smith
end of thread, other threads:[~2011-09-09 22:31 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-27  8:03 XFS metadata flushing design - current and future Christoph Hellwig
2011-08-29  1:01 ` Dave Chinner
2011-08-29  6:33   ` Christoph Hellwig
2011-08-29 12:33   ` Christoph Hellwig
2011-08-30  1:28     ` Dave Chinner
2011-08-30  5:09       ` Christoph Hellwig
2011-08-30  7:06         ` Dave Chinner
2011-08-30  7:10           ` Christoph Hellwig
2011-09-09 22:31 ` Stewart Smith