Re: [PATCH 2/4] design: incorporate the delayed logging design doc

linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Allison Henderson <allison.henderson@oracle.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org, azarus@posteo.net
Subject: Re: [PATCH 2/4] design: incorporate the delayed logging design doc
Date: Sun, 13 May 2018 19:29:03 -0700	[thread overview]
Message-ID: <2d96508e-8c89-7a33-71f1-576f49fafe0c@oracle.com> (raw)
In-Reply-To: <152623083653.10242.6271621221466813278.stgit@magnolia>

Looks good.  Thx!

Reviewed by: Allison Henderson <allison.henderson@oracle.com>

On 05/13/2018 10:00 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Move the delayed logging design documentation into the design book.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   .../delayed_logging.asciidoc                       |  808 ++++++++++++++++++++
>   design/XFS_Filesystem_Structure/docinfo.xml        |   14
>   .../xfs_filesystem_structure.asciidoc              |    2
>   design/xfs-delayed-logging-design.asciidoc         |  810 --------------------
>   4 files changed, 824 insertions(+), 810 deletions(-)
>   create mode 100644 design/XFS_Filesystem_Structure/delayed_logging.asciidoc
>   delete mode 100644 design/xfs-delayed-logging-design.asciidoc
>
>
> diff --git a/design/XFS_Filesystem_Structure/delayed_logging.asciidoc b/design/XFS_Filesystem_Structure/delayed_logging.asciidoc
> new file mode 100644
> index 0000000..e9a336f
> --- /dev/null
> +++ b/design/XFS_Filesystem_Structure/delayed_logging.asciidoc
> @@ -0,0 +1,808 @@
> += Delayed Logging
> +
> +== Introduction to Re-logging in XFS
> +
> +XFS logging is a combination of logical and physical logging. Some objects,
> +such as inodes and dquots, are logged in logical format where the details
> +logged are made up of the changes to in-core structures rather than on-disk
> +structures. Other objects - typically buffers - have their physical changes
> +logged. The reason for these differences is to reduce the amount of log space
> +required for objects that are frequently logged. Some parts of inodes are more
> +frequently logged than others, and inodes are typically more frequently logged
> +than any other object (except maybe the superblock buffer) so keeping the
> +amount of metadata logged low is of prime importance.
> +
> +The reason that this is such a concern is that XFS allows multiple separate
> +modifications to a single object to be carried in the log at any given time.
> +This allows the log to avoid needing to flush each change to disk before
> +recording a new change to the object. XFS does this via a method called
> +"re-logging". Conceptually, this is quite simple - all it requires is that any
> +new change to the object is recorded with a *new copy* of all the existing
> +changes in the new transaction that is written to the log.
> +
> +That is, if we have a sequence of changes A through to F, and the object was
> +written to disk after change D, we would see in the log the following series
> +of transactions, their contents and the log sequence number (LSN) of the
> +transaction:
> +
> +....
> +	Transaction		Contents	LSN
> +	   A			   A		   X
> +	   B			  A+B		  X+n
> +	   C			 A+B+C		 X+n+m
> +	   D			A+B+C+D		X+n+m+o
> +	    <object written to disk>
> +	   E			   E		   Y (> X+n+m+o)
> +	   F			  E+F		  Y+p
> +....
> +
> +In other words, each time an object is relogged, the new transaction contains
> +the aggregation of all the previous changes currently held only in the log.
> +
> +This relogging technique also allows objects to be moved forward in the log so
> +that an object being relogged does not prevent the tail of the log from ever
> +moving forward.  This can be seen in the table above by the changing
> +(increasing) LSN of each subsequent transaction - the LSN is effectively a
> +direct encoding of the location in the log of the transaction.
> +
> +This relogging is also used to implement long-running, multiple-commit
> +transactions.  These transaction are known as rolling transactions, and require
> +a special log reservation known as a permanent transaction reservation. A
> +typical example of a rolling transaction is the removal of extents from an
> +inode which can only be done at a rate of two extents per transaction because
> +of reservation size limitations. Hence a rolling extent removal transaction
> +keeps relogging the inode and btree buffers as they get modified in each
> +removal operation. This keeps them moving forward in the log as the operation
> +progresses, ensuring that current operation never gets blocked by itself if the
> +log wraps around.
> +
> +Hence it can be seen that the relogging operation is fundamental to the correct
> +working of the XFS journalling subsystem. From the above description, most
> +people should be able to see why the XFS metadata operations writes so much to
> +the log - repeated operations to the same objects write the same changes to
> +the log over and over again. Worse is the fact that objects tend to get
> +dirtier as they get relogged, so each subsequent transaction is writing more
> +metadata into the log.
> +
> +Another feature of the XFS transaction subsystem is that most transactions are
> +asynchronous. That is, they don't commit to disk until either a log buffer is
> +filled (a log buffer can hold multiple transactions) or a synchronous operation
> +forces the log buffers holding the transactions to disk. This means that XFS is
> +doing aggregation of transactions in memory - batching them, if you like - to
> +minimise the impact of the log IO on transaction throughput.
> +
> +The limitation on asynchronous transaction throughput is the number and size of
> +log buffers made available by the log manager. By default there are 8 log
> +buffers available and the size of each is 32kB - the size can be increased up
> +to 256kB by use of a mount option.
> +
> +Effectively, this gives us the maximum bound of outstanding metadata changes
> +that can be made to the filesystem at any point in time - if all the log
> +buffers are full and under IO, then no more transactions can be committed until
> +the current batch completes. It is now common for a single current CPU core to
> +be to able to issue enough transactions to keep the log buffers full and under
> +IO permanently. Hence the XFS journalling subsystem can be considered to be IO
> +bound.
> +
> +== Delayed Logging Concepts
> +
> +The key thing to note about the asynchronous logging combined with the
> +relogging technique XFS uses is that we can be relogging changed objects
> +multiple times before they are committed to disk in the log buffers. If we
> +return to the previous relogging example, it is entirely possible that
> +transactions A through D are committed to disk in the same log buffer.
> +
> +That is, a single log buffer may contain multiple copies of the same object,
> +but only one of those copies needs to be there - the last one "D", as it
> +contains all the changes from the previous changes. In other words, we have one
> +necessary copy in the log buffer, and three stale copies that are simply
> +wasting space. When we are doing repeated operations on the same set of
> +objects, these "stale objects" can be over 90% of the space used in the log
> +buffers. It is clear that reducing the number of stale objects written to the
> +log would greatly reduce the amount of metadata we write to the log, and this
> +is the fundamental goal of delayed logging.
> +
> +From a conceptual point of view, XFS is already doing relogging in memory (where
> +memory == log buffer), only it is doing it extremely inefficiently. It is using
> +logical to physical formatting to do the relogging because there is no
> +infrastructure to keep track of logical changes in memory prior to physically
> +formatting the changes in a transaction to the log buffer. Hence we cannot avoid
> +accumulating stale objects in the log buffers.
> +
> +Delayed logging is the name we've given to keeping and tracking transactional
> +changes to objects in memory outside the log buffer infrastructure. Because of
> +the relogging concept fundamental to the XFS journalling subsystem, this is
> +actually relatively easy to do - all the changes to logged items are already
> +tracked in the current infrastructure. The big problem is how to accumulate
> +them and get them to the log in a consistent, recoverable manner.
> +Describing the problems and how they have been solved is the focus of this
> +document.
> +
> +One of the key changes that delayed logging makes to the operation of the
> +journalling subsystem is that it disassociates the amount of outstanding
> +metadata changes from the size and number of log buffers available. In other
> +words, instead of there only being a maximum of 2MB of transaction changes not
> +written to the log at any point in time, there may be a much greater amount
> +being accumulated in memory. Hence the potential for loss of metadata on a
> +crash is much greater than for the existing logging mechanism.
> +
> +It should be noted that this does not change the guarantee that log recovery
> +will result in a consistent filesystem. What it does mean is that as far as the
> +recovered filesystem is concerned, there may be many thousands of transactions
> +that simply did not occur as a result of the crash. This makes it even more
> +important that applications that care about their data use fsync() where they
> +need to ensure application level data integrity is maintained.
> +
> +It should be noted that delayed logging is not an innovative new concept that
> +warrants rigorous proofs to determine whether it is correct or not. The method
> +of accumulating changes in memory for some period before writing them to the
> +log is used effectively in many filesystems including ext3 and ext4. Hence
> +no time is spent in this document trying to convince the reader that the
> +concept is sound. Instead it is simply considered a "solved problem" and as
> +such implementing it in XFS is purely an exercise in software engineering.
> +
> +The fundamental requirements for delayed logging in XFS are simple:
> +
> +	. Reduce the amount of metadata written to the log by at least
> +	   an order of magnitude.
> +	. Supply sufficient statistics to validate Requirement #1.
> +	. Supply sufficient new tracing infrastructure to be able to debug
> +	   problems with the new code.
> +	. No on-disk format change (metadata or log format).
> +	. Enable and disable with a mount option.
> +	. No performance regressions for synchronous transaction workloads.
> +
> +== Delayed Logging Design
> +
> +=== Storing Changes
> +
> +The problem with accumulating changes at a logical level (i.e. just using the
> +existing log item dirty region tracking) is that when it comes to writing the
> +changes to the log buffers, we need to ensure that the object we are formatting
> +is not changing while we do this. This requires locking the object to prevent
> +concurrent modification. Hence flushing the logical changes to the log would
> +require us to lock every object, format them, and then unlock them again.
> +
> +This introduces lots of scope for deadlocks with transactions that are already
> +running. For example, a transaction has object A locked and modified, but needs
> +the delayed logging tracking lock to commit the transaction. However, the
> +flushing thread has the delayed logging tracking lock already held, and is
> +trying to get the lock on object A to flush it to the log buffer. This appears
> +to be an unsolvable deadlock condition, and it was solving this problem that
> +was the barrier to implementing delayed logging for so long.
> +
> +The solution is relatively simple - it just took a long time to recognise it.
> +Put simply, the current logging code formats the changes to each item into an
> +vector array that points to the changed regions in the item. The log write code
> +simply copies the memory these vectors point to into the log buffer during
> +transaction commit while the item is locked in the transaction. Instead of
> +using the log buffer as the destination of the formatting code, we can use an
> +allocated memory buffer big enough to fit the formatted vector.
> +
> +If we then copy the vector into the memory buffer and rewrite the vector to
> +point to the memory buffer rather than the object itself, we now have a copy of
> +the changes in a format that is compatible with the log buffer writing code.
> +that does not require us to lock the item to access. This formatting and
> +rewriting can all be done while the object is locked during transaction commit,
> +resulting in a vector that is transactionally consistent and can be accessed
> +without needing to lock the owning item.
> +
> +Hence we avoid the need to lock items when we need to flush outstanding
> +asynchronous transactions to the log. The differences between the existing
> +formatting method and the delayed logging formatting can be seen in the
> +diagram below.
> +
> +Current format log vector:
> +....
> +Object    +---------------------------------------------+
> +Vector 1      +----+
> +Vector 2                    +----+
> +Vector 3                                   +----------+
> +....
> +
> +After formatting:
> +
> +....
> +Log Buffer    +-V1-+-V2-+----V3----+
> +....
> +
> +Delayed logging vector:
> +
> +....
> +Object    +---------------------------------------------+
> +Vector 1      +----+
> +Vector 2                    +----+
> +Vector 3                                   +----------+
> +....
> +
> +After formatting:
> +
> +....
> +Memory Buffer +-V1-+-V2-+----V3----+
> +Vector 1      +----+
> +Vector 2           +----+
> +Vector 3                +----------+
> +....
> +
> +The memory buffer and associated vector need to be passed as a single object,
> +but still need to be associated with the parent object so if the object is
> +relogged we can replace the current memory buffer with a new memory buffer that
> +contains the latest changes.
> +
> +The reason for keeping the vector around after we've formatted the memory
> +buffer is to support splitting vectors across log buffer boundaries correctly.
> +If we don't keep the vector around, we do not know where the region boundaries
> +are in the item, so we'd need a new encapsulation method for regions in the log
> +buffer writing (i.e. double encapsulation). This would be an on-disk format
> +change and as such is not desirable.  It also means we'd have to write the log
> +region headers in the formatting stage, which is problematic as there is per
> +region state that needs to be placed into the headers during the log write.
> +
> +Hence we need to keep the vector, but by attaching the memory buffer to it and
> +rewriting the vector addresses to point at the memory buffer we end up with a
> +self-describing object that can be passed to the log buffer write code to be
> +handled in exactly the same manner as the existing log vectors are handled.
> +Hence we avoid needing a new on-disk format to handle items that have been
> +relogged in memory.
> +
> +
> +=== Tracking Changes
> +
> +Now that we can record transactional changes in memory in a form that allows
> +them to be used without limitations, we need to be able to track and accumulate
> +them so that they can be written to the log at some later point in time.  The
> +log item is the natural place to store this vector and buffer, and also makes sense
> +to be the object that is used to track committed objects as it will always
> +exist once the object has been included in a transaction.
> +
> +The log item is already used to track the log items that have been written to
> +the log but not yet written to disk. Such log items are considered "active"
> +and as such are stored in the Active Item List (AIL) which is a LSN-ordered
> +double linked list. Items are inserted into this list during log buffer IO
> +completion, after which they are unpinned and can be written to disk. An object
> +that is in the AIL can be relogged, which causes the object to be pinned again
> +and then moved forward in the AIL when the log buffer IO completes for that
> +transaction.
> +
> +Essentially, this shows that an item that is in the AIL can still be modified
> +and relogged, so any tracking must be separate to the AIL infrastructure. As
> +such, we cannot reuse the AIL list pointers for tracking committed items, nor
> +can we store state in any field that is protected by the AIL lock. Hence the
> +committed item tracking needs it's own locks, lists and state fields in the log
> +item.
> +
> +Similar to the AIL, tracking of committed items is done through a new list
> +called the Committed Item List (CIL).  The list tracks log items that have been
> +committed and have formatted memory buffers attached to them. It tracks objects
> +in transaction commit order, so when an object is relogged it is removed from
> +it's place in the list and re-inserted at the tail. This is entirely arbitrary
> +and done to make it easy for debugging - the last items in the list are the
> +ones that are most recently modified. Ordering of the CIL is not necessary for
> +transactional integrity (as discussed in the next section) so the ordering is
> +done for convenience/sanity of the developers.
> +
> +
> +=== Checkpoints
> +
> +When we have a log synchronisation event, commonly known as a "log force",
> +all the items in the CIL must be written into the log via the log buffers.
> +We need to write these items in the order that they exist in the CIL, and they
> +need to be written as an atomic transaction. The need for all the objects to be
> +written as an atomic transaction comes from the requirements of relogging and
> +log replay - all the changes in all the objects in a given transaction must
> +either be completely replayed during log recovery, or not replayed at all. If
> +a transaction is not replayed because it is not complete in the log, then
> +no later transactions should be replayed, either.
> +
> +To fulfill this requirement, we need to write the entire CIL in a single log
> +transaction. Fortunately, the XFS log code has no fixed limit on the size of a
> +transaction, nor does the log replay code. The only fundamental limit is that
> +the transaction cannot be larger than just under half the size of the log.  The
> +reason for this limit is that to find the head and tail of the log, there must
> +be at least one complete transaction in the log at any given time. If a
> +transaction is larger than half the log, then there is the possibility that a
> +crash during the write of a such a transaction could partially overwrite the
> +only complete previous transaction in the log. This will result in a recovery
> +failure and an inconsistent filesystem and hence we must enforce the maximum
> +size of a checkpoint to be slightly less than a half the log.
> +
> +Apart from this size requirement, a checkpoint transaction looks no different
> +to any other transaction - it contains a transaction header, a series of
> +formatted log items and a commit record at the tail. From a recovery
> +perspective, the checkpoint transaction is also no different - just a lot
> +bigger with a lot more items in it. The worst case effect of this is that we
> +might need to tune the recovery transaction object hash size.
> +
> +Because the checkpoint is just another transaction and all the changes to log
> +items are stored as log vectors, we can use the existing log buffer writing
> +code to write the changes into the log. To do this efficiently, we need to
> +minimise the time we hold the CIL locked while writing the checkpoint
> +transaction. The current log write code enables us to do this easily with the
> +way it separates the writing of the transaction contents (the log vectors) from
> +the transaction commit record, but tracking this requires us to have a
> +per-checkpoint context that travels through the log write process through to
> +checkpoint completion.
> +
> +Hence a checkpoint has a context that tracks the state of the current
> +checkpoint from initiation to checkpoint completion. A new context is initiated
> +at the same time a checkpoint transaction is started. That is, when we remove
> +all the current items from the CIL during a checkpoint operation, we move all
> +those changes into the current checkpoint context. We then initialise a new
> +context and attach that to the CIL for aggregation of new transactions.
> +
> +This allows us to unlock the CIL immediately after transfer of all the
> +committed items and effectively allow new transactions to be issued while we
> +are formatting the checkpoint into the log. It also allows concurrent
> +checkpoints to be written into the log buffers in the case of log force heavy
> +workloads, just like the existing transaction commit code does. This, however,
> +requires that we strictly order the commit records in the log so that
> +checkpoint sequence order is maintained during log replay.
> +
> +To ensure that we can be writing an item into a checkpoint transaction at
> +the same time another transaction modifies the item and inserts the log item
> +into the new CIL, then checkpoint transaction commit code cannot use log items
> +to store the list of log vectors that need to be written into the transaction.
> +Hence log vectors need to be able to be chained together to allow them to be
> +detached from the log items. That is, when the CIL is flushed the memory
> +buffer and log vector attached to each log item needs to be attached to the
> +checkpoint context so that the log item can be released. In diagrammatic form,
> +the CIL would look like this before the flush:
> +
> +----
> +	CIL Head
> +	   |
> +	   V
> +	Log Item <-> log vector 1	-> memory buffer
> +	   |				-> vector array
> +	   V
> +	Log Item <-> log vector 2	-> memory buffer
> +	   |				-> vector array
> +	   V
> +	......
> +	   |
> +	   V
> +	Log Item <-> log vector N-1	-> memory buffer
> +	   |				-> vector array
> +	   V
> +	Log Item <-> log vector N	-> memory buffer
> +					-> vector array
> +----
> +
> +And after the flush the CIL head is empty, and the checkpoint context log
> +vector list would look like:
> +
> +----
> +	Checkpoint Context
> +	   |
> +	   V
> +	log vector 1	-> memory buffer
> +	   |		-> vector array
> +	   |		-> Log Item
> +	   V
> +	log vector 2	-> memory buffer
> +	   |		-> vector array
> +	   |		-> Log Item
> +	   V
> +	......
> +	   |
> +	   V
> +	log vector N-1	-> memory buffer
> +	   |		-> vector array
> +	   |		-> Log Item
> +	   V
> +	log vector N	-> memory buffer
> +			-> vector array
> +			-> Log Item
> +----
> +
> +Once this transfer is done, the CIL can be unlocked and new transactions can
> +start, while the checkpoint flush code works over the log vector chain to
> +commit the checkpoint.
> +
> +Once the checkpoint is written into the log buffers, the checkpoint context is
> +attached to the log buffer that the commit record was written to along with a
> +completion callback. Log IO completion will call that callback, which can then
> +run transaction committed processing for the log items (i.e. insert into AIL
> +and unpin) in the log vector chain and then free the log vector chain and
> +checkpoint context.
> +
> +Discussion Point: I am uncertain as to whether the log item is the most
> +efficient way to track vectors, even though it seems like the natural way to do
> +it. The fact that we walk the log items (in the CIL) just to chain the log
> +vectors and break the link between the log item and the log vector means that
> +we take a cache line hit for the log item list modification, then another for
> +the log vector chaining. If we track by the log vectors, then we only need to
> +break the link between the log item and the log vector, which means we should
> +dirty only the log item cachelines. Normally I wouldn't be concerned about one
> +vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
> +vectors in one checkpoint transaction. I'd guess this is a "measure and
> +compare" situation that can be done after a working and reviewed implementation
> +is in the dev tree....
> +
> +=== Checkpoint Sequencing
> +
> +One of the key aspects of the XFS transaction subsystem is that it tags
> +committed transactions with the log sequence number of the transaction commit.
> +This allows transactions to be issued asynchronously even though there may be
> +future operations that cannot be completed until that transaction is fully
> +committed to the log. In the rare case that a dependent operation occurs (e.g.
> +re-using a freed metadata extent for a data extent), a special, optimised log
> +force can be issued to force the dependent transaction to disk immediately.
> +
> +To do this, transactions need to record the LSN of the commit record of the
> +transaction. This LSN comes directly from the log buffer the transaction is
> +written into. While this works just fine for the existing transaction
> +mechanism, it does not work for delayed logging because transactions are not
> +written directly into the log buffers. Hence some other method of sequencing
> +transactions is required.
> +
> +As discussed in the checkpoint section, delayed logging uses per-checkpoint
> +contexts, and as such it is simple to assign a sequence number to each
> +checkpoint. Because the switching of checkpoint contexts must be done
> +atomically, it is simple to ensure that each new context has a monotonically
> +increasing sequence number assigned to it without the need for an external
> +atomic counter - we can just take the current context sequence number and add
> +one to it for the new context.
> +
> +Then, instead of assigning a log buffer LSN to the transaction commit LSN
> +during the commit, we can assign the current checkpoint sequence. This allows
> +operations that track transactions that have not yet completed know what
> +checkpoint sequence needs to be committed before they can continue. As a
> +result, the code that forces the log to a specific LSN now needs to ensure that
> +the log forces to a specific checkpoint.
> +
> +To ensure that we can do this, we need to track all the checkpoint contexts
> +that are currently committing to the log. When we flush a checkpoint, the
> +context gets added to a "committing" list which can be searched. When a
> +checkpoint commit completes, it is removed from the committing list. Because
> +the checkpoint context records the LSN of the commit record for the checkpoint,
> +we can also wait on the log buffer that contains the commit record, thereby
> +using the existing log force mechanisms to execute synchronous forces.
> +
> +It should be noted that the synchronous forces may need to be extended with
> +mitigation algorithms similar to the current log buffer code to allow
> +aggregation of multiple synchronous transactions if there are already
> +synchronous transactions being flushed. Investigation of the performance of the
> +current design is needed before making any decisions here.
> +
> +The main concern with log forces is to ensure that all the previous checkpoints
> +are also committed to disk before the one we need to wait for. Therefore we
> +need to check that all the prior contexts in the committing list are also
> +complete before waiting on the one we need to complete. We do this
> +synchronisation in the log force code so that we don't need to wait anywhere
> +else for such serialisation - it only matters when we do a log force.
> +
> +The only remaining complexity is that a log force now also has to handle the
> +case where the forcing sequence number is the same as the current context. That
> +is, we need to flush the CIL and potentially wait for it to complete. This is a
> +simple addition to the existing log forcing code to check the sequence numbers
> +and push if required. Indeed, placing the current sequence checkpoint flush in
> +the log force code enables the current mechanism for issuing synchronous
> +transactions to remain untouched (i.e. commit an asynchronous transaction, then
> +force the log at the LSN of that transaction) and so the higher level code
> +behaves the same regardless of whether delayed logging is being used or not.
> +
> +=== Checkpoint Log Space Accounting
> +
> +The big issue for a checkpoint transaction is the log space reservation for the
> +transaction. We don't know how big a checkpoint transaction is going to be
> +ahead of time, nor how many log buffers it will take to write out, nor the
> +number of split log vector regions are going to be used. We can track the
> +amount of log space required as we add items to the commit item list, but we
> +still need to reserve the space in the log for the checkpoint.
> +
> +A typical transaction reserves enough space in the log for the worst case space
> +usage of the transaction. The reservation accounts for log record headers,
> +transaction and region headers, headers for split regions, buffer tail padding,
> +etc. as well as the actual space for all the changed metadata in the
> +transaction. While some of this is fixed overhead, much of it is dependent on
> +the size of the transaction and the number of regions being logged (the number
> +of log vectors in the transaction).
> +
> +An example of the differences would be logging directory changes versus logging
> +inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
> +there are lots of transactions that only contain an inode core and an inode log
> +format structure. That is, two vectors totaling roughly 150 bytes. If we modify
> +10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
> +vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
> +comparison, if we are logging full directory buffers, they are typically 4KB
> +each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a
> +buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
> +space.  From this, it should be obvious that a static log space reservation is
> +not particularly flexible and is difficult to select the "optimal value" for
> +all workloads.
> +
> +Further, if we are going to use a static reservation, which bit of the entire
> +reservation does it cover? We account for space used by the transaction
> +reservation by tracking the space currently used by the object in the CIL and
> +then calculating the increase or decrease in space used as the object is
> +relogged. This allows for a checkpoint reservation to only have to account for
> +log buffer metadata used such as log header records.
> +
> +However, even using a static reservation for just the log metadata is
> +problematic. Typically log record headers use at least 16KB of log space per
> +1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
> +large enough to handle arbitrary sized checkpoint transactions. This
> +reservation needs to be made before the checkpoint is started, and we need to
> +be able to reserve the space without sleeping.  For a 8MB checkpoint, we need a
> +reservation of around 150KB, which is a non-trivial amount of space.
> +
> +A static reservation needs to manipulate the log grant counters - we can take a
> +permanent reservation on the space, but we still need to make sure we refresh
> +the write reservation (the actual space available to the transaction) after
> +every checkpoint transaction completion. Unfortunately, if this space is not
> +available when required, then the regrant code will sleep waiting for it.
> +
> +The problem with this is that it can lead to deadlocks as we may need to commit
> +checkpoints to be able to free up log space (refer back to the description of
> +rolling transactions for an example of this).  Hence we *must* always have
> +space available in the log if we are to use static reservations, and that is
> +very difficult and complex to arrange. It is possible to do, but there is a
> +simpler way.
> +
> +The simpler way of doing this is tracking the entire log space used by the
> +items in the CIL and using this to dynamically calculate the amount of log
> +space required by the log metadata. If this log metadata space changes as a
> +result of a transaction commit inserting a new memory buffer into the CIL, then
> +the difference in space required is removed from the transaction that causes
> +the change. Transactions at this level will *always* have enough space
> +available in their reservation for this as they have already reserved the
> +maximal amount of log metadata space they require, and such a delta reservation
> +will always be less than or equal to the maximal amount in the reservation.
> +
> +Hence we can grow the checkpoint transaction reservation dynamically as items
> +are added to the CIL and avoid the need for reserving and regranting log space
> +up front. This avoids deadlocks and removes a blocking point from the
> +checkpoint flush code.
> +
> +As mentioned early, transactions can't grow to more than half the size of the
> +log. Hence as part of the reservation growing, we need to also check the size
> +of the reservation against the maximum allowed transaction size. If we reach
> +the maximum threshold, we need to push the CIL to the log. This is effectively
> +a "background flush" and is done on demand. This is identical to
> +a CIL push triggered by a log force, only that there is no waiting for the
> +checkpoint commit to complete. This background push is checked and executed by
> +transaction commit code.
> +
> +If the transaction subsystem goes idle while we still have items in the CIL,
> +they will be flushed by the periodic log force issued by the xfssyncd. This log
> +force will push the CIL to disk, and if the transaction subsystem stays idle,
> +allow the idle log to be covered (effectively marked clean) in exactly the same
> +manner that is done for the existing logging method. A discussion point is
> +whether this log force needs to be done more frequently than the current rate
> +which is once every 30s.
> +
> +
> +=== Log Item Pinning
> +
> +Currently log items are pinned during transaction commit while the items are
> +still locked. This happens just after the items are formatted, though it could
> +be done any time before the items are unlocked. The result of this mechanism is
> +that items get pinned once for every transaction that is committed to the log
> +buffers. Hence items that are relogged in the log buffers will have a pin count
> +for every outstanding transaction they were dirtied in. When each of these
> +transactions is completed, they will unpin the item once. As a result, the item
> +only becomes unpinned when all the transactions complete and there are no
> +pending transactions. Thus the pinning and unpinning of a log item is symmetric
> +as there is a 1:1 relationship with transaction commit and log item completion.
> +
> +For delayed logging, however, we have an asymmetric transaction commit to
> +completion relationship. Every time an object is relogged in the CIL it goes
> +through the commit process without a corresponding completion being registered.
> +That is, we now have a many-to-one relationship between transaction commit and
> +log item completion. The result of this is that pinning and unpinning of the
> +log items becomes unbalanced if we retain the "pin on transaction commit, unpin
> +on transaction completion" model.
> +
> +To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
> +insertion into the CIL, unpin on checkpoint completion". In other words, the
> +pinning and unpinning becomes symmetric around a checkpoint context. We have to
> +pin the object the first time it is inserted into the CIL - if it is already in
> +the CIL during a transaction commit, then we do not pin it again. Because there
> +can be multiple outstanding checkpoint contexts, we can still see elevated pin
> +counts, but as each checkpoint completes the pin count will retain the correct
> +value according to it's context.
> +
> +Just to make matters more slightly more complex, this checkpoint level context
> +for the pin count means that the pinning of an item must take place under the
> +CIL commit/flush lock. If we pin the object outside this lock, we cannot
> +guarantee which context the pin count is associated with. This is because of
> +the fact pinning the item is dependent on whether the item is present in the
> +current CIL or not. If we don't pin the CIL first before we check and pin the
> +object, we have a race with CIL being flushed between the check and the pin
> +(or not pinning, as the case may be). Hence we must hold the CIL flush/commit
> +lock to guarantee that we pin the items correctly.
> +
> +=== Concurrent Scalability
> +
> +A fundamental requirement for the CIL is that accesses through transaction
> +commits must scale to many concurrent commits. The current transaction commit
> +code does not break down even when there are transactions coming from 2048
> +processors at once. The current transaction code does not go any faster than if
> +there was only one CPU using it, but it does not slow down either.
> +
> +As a result, the delayed logging transaction commit code needs to be designed
> +for concurrency from the ground up. It is obvious that there are serialisation
> +points in the design - the three important ones are:
> +
> +	. Locking out new transaction commits while flushing the CIL
> +	. Adding items to the CIL and updating item space accounting
> +	. Checkpoint commit ordering
> +
> +Looking at the transaction commit and CIL flushing interactions, it is clear
> +that we have a many-to-one interaction here. That is, the only restriction on
> +the number of concurrent transactions that can be trying to commit at once is
> +the amount of space available in the log for their reservations. The practical
> +limit here is in the order of several hundred concurrent transactions for a
> +128MB log, which means that it is generally one per CPU in a machine.
> +
> +The amount of time a transaction commit needs to hold out a flush is a
> +relatively long period of time - the pinning of log items needs to be done
> +while we are holding out a CIL flush, so at the moment that means it is held
> +across the formatting of the objects into memory buffers (i.e. while memcpy()s
> +are in progress). Ultimately a two pass algorithm where the formatting is done
> +separately to the pinning of objects could be used to reduce the hold time of
> +the transaction commit side.
> +
> +Because of the number of potential transaction commit side holders, the lock
> +really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
> +want every other CPU in the machine spinning on the CIL lock. Given that
> +flushing the CIL could involve walking a list of tens of thousands of log
> +items, it will get held for a significant time and so spin contention is a
> +significant concern. Preventing lots of CPUs spinning doing nothing is the
> +main reason for choosing a sleeping lock even though nothing in either the
> +transaction commit or CIL flush side sleeps with the lock held.
> +
> +It should also be noted that CIL flushing is also a relatively rare operation
> +compared to transaction commit for asynchronous transaction workloads - only
> +time will tell if using a read-write semaphore for exclusion will limit
> +transaction commit concurrency due to cache line bouncing of the lock on the
> +read side.
> +
> +The second serialisation point is on the transaction commit side where items
> +are inserted into the CIL. Because transactions can enter this code
> +concurrently, the CIL needs to be protected separately from the above
> +commit/flush exclusion. It also needs to be an exclusive lock but it is only
> +held for a very short time and so a spin lock is appropriate here. It is
> +possible that this lock will become a contention point, but given the short
> +hold time once per transaction I think that contention is unlikely.
> +
> +The final serialisation point is the checkpoint commit record ordering code
> +that is run as part of the checkpoint commit and log force sequencing. The code
> +path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
> +an ordering loop after writing all the log vectors into the log buffers but
> +before writing the commit record. This loop walks the list of committing
> +checkpoints and needs to block waiting for checkpoints to complete their commit
> +record write. As a result it needs a lock and a wait variable. Log force
> +sequencing also requires the same lock, list walk, and blocking mechanism to
> +ensure completion of checkpoints.
> +
> +These two sequencing operations can use the mechanism even though the
> +events they are waiting for are different. The checkpoint commit record
> +sequencing needs to wait until checkpoint contexts contain a commit LSN
> +(obtained through completion of a commit record write) while log force
> +sequencing needs to wait until previous checkpoint contexts are removed from
> +the committing list (i.e. they've completed). A simple wait variable and
> +broadcast wakeups (thundering herds) has been used to implement these two
> +serialisation queues. They use the same lock as the CIL, too. If we see too
> +much contention on the CIL lock, or too many context switches as a result of
> +the broadcast wakeups these operations can be put under a new spinlock and
> +given separate wait lists to reduce lock contention and the number of processes
> +woken by the wrong event.
> +
> +
> +=== Lifecycle Changes
> +
> +The existing log item life cycle is as follows:
> +
> +----
> +	1. Transaction allocate
> +	2. Transaction reserve
> +	3. Lock item
> +	4. Join item to transaction
> +		If not already attached,
> +			Allocate log item
> +			Attach log item to owner item
> +		Attach log item to transaction
> +	5. Modify item
> +		Record modifications in log item
> +	6. Transaction commit
> +		Pin item in memory
> +		Format item into log buffer
> +		Write commit LSN into transaction
> +		Unlock item
> +		Attach transaction to log buffer
> +
> +	<log buffer IO dispatched>
> +	<log buffer IO completes>
> +
> +	7. Transaction completion
> +		Mark log item committed
> +		Insert log item into AIL
> +			Write commit LSN into log item
> +		Unpin log item
> +	8. AIL traversal
> +		Lock item
> +		Mark log item clean
> +		Flush item to disk
> +
> +	<item IO completion>
> +
> +	9. Log item removed from AIL
> +		Moves log tail
> +		Item unlocked
> +----
> +
> +Essentially, steps 1-6 operate independently from step 7, which is also
> +independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
> +at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
> +at the same time. If the log item is in the AIL or between steps 6 and 7
> +and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
> +are entered and completed is the object considered clean.
> +
> +With delayed logging, there are new steps inserted into the life cycle:
> +
> +----
> +	1. Transaction allocate
> +	2. Transaction reserve
> +	3. Lock item
> +	4. Join item to transaction
> +		If not already attached,
> +			Allocate log item
> +			Attach log item to owner item
> +		Attach log item to transaction
> +	5. Modify item
> +		Record modifications in log item
> +	6. Transaction commit
> +		Pin item in memory if not pinned in CIL
> +		Format item into log vector + buffer
> +		Attach log vector and buffer to log item
> +		Insert log item into CIL
> +		Write CIL context sequence into transaction
> +		Unlock item
> +
> +	<next log force>
> +
> +	7. CIL push
> +		lock CIL flush
> +		Chain log vectors and buffers together
> +		Remove items from CIL
> +		unlock CIL flush
> +		write log vectors into log
> +		sequence commit records
> +		attach checkpoint context to log buffer
> +
> +	<log buffer IO dispatched>
> +	<log buffer IO completes>
> +
> +	8. Checkpoint completion
> +		Mark log item committed
> +		Insert item into AIL
> +			Write commit LSN into log item
> +		Unpin log item
> +	9. AIL traversal
> +		Lock item
> +		Mark log item clean
> +		Flush item to disk
> +	<item IO completion>
> +	10. Log item removed from AIL
> +		Moves log tail
> +		Item unlocked
> +----
> +
> +From this, it can be seen that the only life cycle differences between the two
> +logging methods are in the middle of the life cycle - they still have the same
> +beginning and end and execution constraints. The only differences are in the
> +committing of the log items to the log itself and the completion processing.
> +Hence delayed logging should not introduce any constraints on log item
> +behaviour, allocation or freeing that don't already exist.
> +
> +As a result of this zero-impact "insertion" of delayed logging infrastructure
> +and the design of the internal structures to avoid on disk format changes, we
> +can basically switch between delayed logging and the existing mechanism with a
> +mount option. Fundamentally, there is no reason why the log manager would not
> +be able to swap methods automatically and transparently depending on load
> +characteristics, but this should not be necessary if delayed logging works as
> +designed.
> +
> +EOF.
> diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
> index e13d705..558a04c 100644
> --- a/design/XFS_Filesystem_Structure/docinfo.xml
> +++ b/design/XFS_Filesystem_Structure/docinfo.xml
> @@ -169,4 +169,18 @@
>   			</simplelist>
>   		</revdescription>
>   	</revision>
> +	<revision>
> +		<revnumber>3.141592</revnumber>
> +		<date>May 2018</date>
> +		<author>
> +			<firstname>Darrick</firstname>
> +			<surname>Wong</surname>
> +			<email>darrick.wong@oracle.com</email>
> +		</author>
> +		<revdescription>
> +			<simplelist>
> +				<member>Incorporate Dave Chinner's log design document.</member>
> +			</simplelist>
> +		</revdescription>
> +	</revision>
>   </revhistory>
> diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> index 8d54935..7bdfade 100644
> --- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> +++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> @@ -48,6 +48,8 @@ include::overview.asciidoc[]
>   
>   include::metadata_integrity.asciidoc[]
>   
> +include::delayed_logging.asciidoc[]
> +
>   include::reflink.asciidoc[]
>   
>   include::reconstruction.asciidoc[]
> diff --git a/design/xfs-delayed-logging-design.asciidoc b/design/xfs-delayed-logging-design.asciidoc
> deleted file mode 100644
> index e54e786..0000000
> --- a/design/xfs-delayed-logging-design.asciidoc
> +++ /dev/null
> @@ -1,810 +0,0 @@
> -= XFS Delayed Logging Design
> -Dave Chinner, <dchinner@redhat.com>
> -v1.0, Feb 2014: Initial conversion to asciidoc
> -
> -== Introduction to Re-logging in XFS
> -
> -XFS logging is a combination of logical and physical logging. Some objects,
> -such as inodes and dquots, are logged in logical format where the details
> -logged are made up of the changes to in-core structures rather than on-disk
> -structures. Other objects - typically buffers - have their physical changes
> -logged. The reason for these differences is to reduce the amount of log space
> -required for objects that are frequently logged. Some parts of inodes are more
> -frequently logged than others, and inodes are typically more frequently logged
> -than any other object (except maybe the superblock buffer) so keeping the
> -amount of metadata logged low is of prime importance.
> -
> -The reason that this is such a concern is that XFS allows multiple separate
> -modifications to a single object to be carried in the log at any given time.
> -This allows the log to avoid needing to flush each change to disk before
> -recording a new change to the object. XFS does this via a method called
> -"re-logging". Conceptually, this is quite simple - all it requires is that any
> -new change to the object is recorded with a *new copy* of all the existing
> -changes in the new transaction that is written to the log.
> -
> -That is, if we have a sequence of changes A through to F, and the object was
> -written to disk after change D, we would see in the log the following series
> -of transactions, their contents and the log sequence number (LSN) of the
> -transaction:
> -
> -....
> -	Transaction		Contents	LSN
> -	   A			   A		   X
> -	   B			  A+B		  X+n
> -	   C			 A+B+C		 X+n+m
> -	   D			A+B+C+D		X+n+m+o
> -	    <object written to disk>
> -	   E			   E		   Y (> X+n+m+o)
> -	   F			  E+F		  Y+p
> -....
> -
> -In other words, each time an object is relogged, the new transaction contains
> -the aggregation of all the previous changes currently held only in the log.
> -
> -This relogging technique also allows objects to be moved forward in the log so
> -that an object being relogged does not prevent the tail of the log from ever
> -moving forward.  This can be seen in the table above by the changing
> -(increasing) LSN of each subsequent transaction - the LSN is effectively a
> -direct encoding of the location in the log of the transaction.
> -
> -This relogging is also used to implement long-running, multiple-commit
> -transactions.  These transaction are known as rolling transactions, and require
> -a special log reservation known as a permanent transaction reservation. A
> -typical example of a rolling transaction is the removal of extents from an
> -inode which can only be done at a rate of two extents per transaction because
> -of reservation size limitations. Hence a rolling extent removal transaction
> -keeps relogging the inode and btree buffers as they get modified in each
> -removal operation. This keeps them moving forward in the log as the operation
> -progresses, ensuring that current operation never gets blocked by itself if the
> -log wraps around.
> -
> -Hence it can be seen that the relogging operation is fundamental to the correct
> -working of the XFS journalling subsystem. From the above description, most
> -people should be able to see why the XFS metadata operations writes so much to
> -the log - repeated operations to the same objects write the same changes to
> -the log over and over again. Worse is the fact that objects tend to get
> -dirtier as they get relogged, so each subsequent transaction is writing more
> -metadata into the log.
> -
> -Another feature of the XFS transaction subsystem is that most transactions are
> -asynchronous. That is, they don't commit to disk until either a log buffer is
> -filled (a log buffer can hold multiple transactions) or a synchronous operation
> -forces the log buffers holding the transactions to disk. This means that XFS is
> -doing aggregation of transactions in memory - batching them, if you like - to
> -minimise the impact of the log IO on transaction throughput.
> -
> -The limitation on asynchronous transaction throughput is the number and size of
> -log buffers made available by the log manager. By default there are 8 log
> -buffers available and the size of each is 32kB - the size can be increased up
> -to 256kB by use of a mount option.
> -
> -Effectively, this gives us the maximum bound of outstanding metadata changes
> -that can be made to the filesystem at any point in time - if all the log
> -buffers are full and under IO, then no more transactions can be committed until
> -the current batch completes. It is now common for a single current CPU core to
> -be to able to issue enough transactions to keep the log buffers full and under
> -IO permanently. Hence the XFS journalling subsystem can be considered to be IO
> -bound.
> -
> -== Delayed Logging Concepts
> -
> -The key thing to note about the asynchronous logging combined with the
> -relogging technique XFS uses is that we can be relogging changed objects
> -multiple times before they are committed to disk in the log buffers. If we
> -return to the previous relogging example, it is entirely possible that
> -transactions A through D are committed to disk in the same log buffer.
> -
> -That is, a single log buffer may contain multiple copies of the same object,
> -but only one of those copies needs to be there - the last one "D", as it
> -contains all the changes from the previous changes. In other words, we have one
> -necessary copy in the log buffer, and three stale copies that are simply
> -wasting space. When we are doing repeated operations on the same set of
> -objects, these "stale objects" can be over 90% of the space used in the log
> -buffers. It is clear that reducing the number of stale objects written to the
> -log would greatly reduce the amount of metadata we write to the log, and this
> -is the fundamental goal of delayed logging.
> -
> -From a conceptual point of view, XFS is already doing relogging in memory (where
> -memory == log buffer), only it is doing it extremely inefficiently. It is using
> -logical to physical formatting to do the relogging because there is no
> -infrastructure to keep track of logical changes in memory prior to physically
> -formatting the changes in a transaction to the log buffer. Hence we cannot avoid
> -accumulating stale objects in the log buffers.
> -
> -Delayed logging is the name we've given to keeping and tracking transactional
> -changes to objects in memory outside the log buffer infrastructure. Because of
> -the relogging concept fundamental to the XFS journalling subsystem, this is
> -actually relatively easy to do - all the changes to logged items are already
> -tracked in the current infrastructure. The big problem is how to accumulate
> -them and get them to the log in a consistent, recoverable manner.
> -Describing the problems and how they have been solved is the focus of this
> -document.
> -
> -One of the key changes that delayed logging makes to the operation of the
> -journalling subsystem is that it disassociates the amount of outstanding
> -metadata changes from the size and number of log buffers available. In other
> -words, instead of there only being a maximum of 2MB of transaction changes not
> -written to the log at any point in time, there may be a much greater amount
> -being accumulated in memory. Hence the potential for loss of metadata on a
> -crash is much greater than for the existing logging mechanism.
> -
> -It should be noted that this does not change the guarantee that log recovery
> -will result in a consistent filesystem. What it does mean is that as far as the
> -recovered filesystem is concerned, there may be many thousands of transactions
> -that simply did not occur as a result of the crash. This makes it even more
> -important that applications that care about their data use fsync() where they
> -need to ensure application level data integrity is maintained.
> -
> -It should be noted that delayed logging is not an innovative new concept that
> -warrants rigorous proofs to determine whether it is correct or not. The method
> -of accumulating changes in memory for some period before writing them to the
> -log is used effectively in many filesystems including ext3 and ext4. Hence
> -no time is spent in this document trying to convince the reader that the
> -concept is sound. Instead it is simply considered a "solved problem" and as
> -such implementing it in XFS is purely an exercise in software engineering.
> -
> -The fundamental requirements for delayed logging in XFS are simple:
> -
> -	. Reduce the amount of metadata written to the log by at least
> -	   an order of magnitude.
> -	. Supply sufficient statistics to validate Requirement #1.
> -	. Supply sufficient new tracing infrastructure to be able to debug
> -	   problems with the new code.
> -	. No on-disk format change (metadata or log format).
> -	. Enable and disable with a mount option.
> -	. No performance regressions for synchronous transaction workloads.
> -
> -== Delayed Logging Design
> -
> -=== Storing Changes
> -
> -The problem with accumulating changes at a logical level (i.e. just using the
> -existing log item dirty region tracking) is that when it comes to writing the
> -changes to the log buffers, we need to ensure that the object we are formatting
> -is not changing while we do this. This requires locking the object to prevent
> -concurrent modification. Hence flushing the logical changes to the log would
> -require us to lock every object, format them, and then unlock them again.
> -
> -This introduces lots of scope for deadlocks with transactions that are already
> -running. For example, a transaction has object A locked and modified, but needs
> -the delayed logging tracking lock to commit the transaction. However, the
> -flushing thread has the delayed logging tracking lock already held, and is
> -trying to get the lock on object A to flush it to the log buffer. This appears
> -to be an unsolvable deadlock condition, and it was solving this problem that
> -was the barrier to implementing delayed logging for so long.
> -
> -The solution is relatively simple - it just took a long time to recognise it.
> -Put simply, the current logging code formats the changes to each item into an
> -vector array that points to the changed regions in the item. The log write code
> -simply copies the memory these vectors point to into the log buffer during
> -transaction commit while the item is locked in the transaction. Instead of
> -using the log buffer as the destination of the formatting code, we can use an
> -allocated memory buffer big enough to fit the formatted vector.
> -
> -If we then copy the vector into the memory buffer and rewrite the vector to
> -point to the memory buffer rather than the object itself, we now have a copy of
> -the changes in a format that is compatible with the log buffer writing code.
> -that does not require us to lock the item to access. This formatting and
> -rewriting can all be done while the object is locked during transaction commit,
> -resulting in a vector that is transactionally consistent and can be accessed
> -without needing to lock the owning item.
> -
> -Hence we avoid the need to lock items when we need to flush outstanding
> -asynchronous transactions to the log. The differences between the existing
> -formatting method and the delayed logging formatting can be seen in the
> -diagram below.
> -
> -Current format log vector:
> -....
> -Object    +---------------------------------------------+
> -Vector 1      +----+
> -Vector 2                    +----+
> -Vector 3                                   +----------+
> -....
> -
> -After formatting:
> -
> -....
> -Log Buffer    +-V1-+-V2-+----V3----+
> -....
> -
> -Delayed logging vector:
> -
> -....
> -Object    +---------------------------------------------+
> -Vector 1      +----+
> -Vector 2                    +----+
> -Vector 3                                   +----------+
> -....
> -
> -After formatting:
> -
> -....
> -Memory Buffer +-V1-+-V2-+----V3----+
> -Vector 1      +----+
> -Vector 2           +----+
> -Vector 3                +----------+
> -....
> -
> -The memory buffer and associated vector need to be passed as a single object,
> -but still need to be associated with the parent object so if the object is
> -relogged we can replace the current memory buffer with a new memory buffer that
> -contains the latest changes.
> -
> -The reason for keeping the vector around after we've formatted the memory
> -buffer is to support splitting vectors across log buffer boundaries correctly.
> -If we don't keep the vector around, we do not know where the region boundaries
> -are in the item, so we'd need a new encapsulation method for regions in the log
> -buffer writing (i.e. double encapsulation). This would be an on-disk format
> -change and as such is not desirable.  It also means we'd have to write the log
> -region headers in the formatting stage, which is problematic as there is per
> -region state that needs to be placed into the headers during the log write.
> -
> -Hence we need to keep the vector, but by attaching the memory buffer to it and
> -rewriting the vector addresses to point at the memory buffer we end up with a
> -self-describing object that can be passed to the log buffer write code to be
> -handled in exactly the same manner as the existing log vectors are handled.
> -Hence we avoid needing a new on-disk format to handle items that have been
> -relogged in memory.
> -
> -
> -=== Tracking Changes
> -
> -Now that we can record transactional changes in memory in a form that allows
> -them to be used without limitations, we need to be able to track and accumulate
> -them so that they can be written to the log at some later point in time.  The
> -log item is the natural place to store this vector and buffer, and also makes sense
> -to be the object that is used to track committed objects as it will always
> -exist once the object has been included in a transaction.
> -
> -The log item is already used to track the log items that have been written to
> -the log but not yet written to disk. Such log items are considered "active"
> -and as such are stored in the Active Item List (AIL) which is a LSN-ordered
> -double linked list. Items are inserted into this list during log buffer IO
> -completion, after which they are unpinned and can be written to disk. An object
> -that is in the AIL can be relogged, which causes the object to be pinned again
> -and then moved forward in the AIL when the log buffer IO completes for that
> -transaction.
> -
> -Essentially, this shows that an item that is in the AIL can still be modified
> -and relogged, so any tracking must be separate to the AIL infrastructure. As
> -such, we cannot reuse the AIL list pointers for tracking committed items, nor
> -can we store state in any field that is protected by the AIL lock. Hence the
> -committed item tracking needs it's own locks, lists and state fields in the log
> -item.
> -
> -Similar to the AIL, tracking of committed items is done through a new list
> -called the Committed Item List (CIL).  The list tracks log items that have been
> -committed and have formatted memory buffers attached to them. It tracks objects
> -in transaction commit order, so when an object is relogged it is removed from
> -it's place in the list and re-inserted at the tail. This is entirely arbitrary
> -and done to make it easy for debugging - the last items in the list are the
> -ones that are most recently modified. Ordering of the CIL is not necessary for
> -transactional integrity (as discussed in the next section) so the ordering is
> -done for convenience/sanity of the developers.
> -
> -
> -=== Checkpoints
> -
> -When we have a log synchronisation event, commonly known as a "log force",
> -all the items in the CIL must be written into the log via the log buffers.
> -We need to write these items in the order that they exist in the CIL, and they
> -need to be written as an atomic transaction. The need for all the objects to be
> -written as an atomic transaction comes from the requirements of relogging and
> -log replay - all the changes in all the objects in a given transaction must
> -either be completely replayed during log recovery, or not replayed at all. If
> -a transaction is not replayed because it is not complete in the log, then
> -no later transactions should be replayed, either.
> -
> -To fulfill this requirement, we need to write the entire CIL in a single log
> -transaction. Fortunately, the XFS log code has no fixed limit on the size of a
> -transaction, nor does the log replay code. The only fundamental limit is that
> -the transaction cannot be larger than just under half the size of the log.  The
> -reason for this limit is that to find the head and tail of the log, there must
> -be at least one complete transaction in the log at any given time. If a
> -transaction is larger than half the log, then there is the possibility that a
> -crash during the write of a such a transaction could partially overwrite the
> -only complete previous transaction in the log. This will result in a recovery
> -failure and an inconsistent filesystem and hence we must enforce the maximum
> -size of a checkpoint to be slightly less than a half the log.
> -
> -Apart from this size requirement, a checkpoint transaction looks no different
> -to any other transaction - it contains a transaction header, a series of
> -formatted log items and a commit record at the tail. From a recovery
> -perspective, the checkpoint transaction is also no different - just a lot
> -bigger with a lot more items in it. The worst case effect of this is that we
> -might need to tune the recovery transaction object hash size.
> -
> -Because the checkpoint is just another transaction and all the changes to log
> -items are stored as log vectors, we can use the existing log buffer writing
> -code to write the changes into the log. To do this efficiently, we need to
> -minimise the time we hold the CIL locked while writing the checkpoint
> -transaction. The current log write code enables us to do this easily with the
> -way it separates the writing of the transaction contents (the log vectors) from
> -the transaction commit record, but tracking this requires us to have a
> -per-checkpoint context that travels through the log write process through to
> -checkpoint completion.
> -
> -Hence a checkpoint has a context that tracks the state of the current
> -checkpoint from initiation to checkpoint completion. A new context is initiated
> -at the same time a checkpoint transaction is started. That is, when we remove
> -all the current items from the CIL during a checkpoint operation, we move all
> -those changes into the current checkpoint context. We then initialise a new
> -context and attach that to the CIL for aggregation of new transactions.
> -
> -This allows us to unlock the CIL immediately after transfer of all the
> -committed items and effectively allow new transactions to be issued while we
> -are formatting the checkpoint into the log. It also allows concurrent
> -checkpoints to be written into the log buffers in the case of log force heavy
> -workloads, just like the existing transaction commit code does. This, however,
> -requires that we strictly order the commit records in the log so that
> -checkpoint sequence order is maintained during log replay.
> -
> -To ensure that we can be writing an item into a checkpoint transaction at
> -the same time another transaction modifies the item and inserts the log item
> -into the new CIL, then checkpoint transaction commit code cannot use log items
> -to store the list of log vectors that need to be written into the transaction.
> -Hence log vectors need to be able to be chained together to allow them to be
> -detached from the log items. That is, when the CIL is flushed the memory
> -buffer and log vector attached to each log item needs to be attached to the
> -checkpoint context so that the log item can be released. In diagrammatic form,
> -the CIL would look like this before the flush:
> -
> -----
> -	CIL Head
> -	   |
> -	   V
> -	Log Item <-> log vector 1	-> memory buffer
> -	   |				-> vector array
> -	   V
> -	Log Item <-> log vector 2	-> memory buffer
> -	   |				-> vector array
> -	   V
> -	......
> -	   |
> -	   V
> -	Log Item <-> log vector N-1	-> memory buffer
> -	   |				-> vector array
> -	   V
> -	Log Item <-> log vector N	-> memory buffer
> -					-> vector array
> -----
> -
> -And after the flush the CIL head is empty, and the checkpoint context log
> -vector list would look like:
> -
> -----
> -	Checkpoint Context
> -	   |
> -	   V
> -	log vector 1	-> memory buffer
> -	   |		-> vector array
> -	   |		-> Log Item
> -	   V
> -	log vector 2	-> memory buffer
> -	   |		-> vector array
> -	   |		-> Log Item
> -	   V
> -	......
> -	   |
> -	   V
> -	log vector N-1	-> memory buffer
> -	   |		-> vector array
> -	   |		-> Log Item
> -	   V
> -	log vector N	-> memory buffer
> -			-> vector array
> -			-> Log Item
> -----
> -
> -Once this transfer is done, the CIL can be unlocked and new transactions can
> -start, while the checkpoint flush code works over the log vector chain to
> -commit the checkpoint.
> -
> -Once the checkpoint is written into the log buffers, the checkpoint context is
> -attached to the log buffer that the commit record was written to along with a
> -completion callback. Log IO completion will call that callback, which can then
> -run transaction committed processing for the log items (i.e. insert into AIL
> -and unpin) in the log vector chain and then free the log vector chain and
> -checkpoint context.
> -
> -Discussion Point: I am uncertain as to whether the log item is the most
> -efficient way to track vectors, even though it seems like the natural way to do
> -it. The fact that we walk the log items (in the CIL) just to chain the log
> -vectors and break the link between the log item and the log vector means that
> -we take a cache line hit for the log item list modification, then another for
> -the log vector chaining. If we track by the log vectors, then we only need to
> -break the link between the log item and the log vector, which means we should
> -dirty only the log item cachelines. Normally I wouldn't be concerned about one
> -vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
> -vectors in one checkpoint transaction. I'd guess this is a "measure and
> -compare" situation that can be done after a working and reviewed implementation
> -is in the dev tree....
> -
> -=== Checkpoint Sequencing
> -
> -One of the key aspects of the XFS transaction subsystem is that it tags
> -committed transactions with the log sequence number of the transaction commit.
> -This allows transactions to be issued asynchronously even though there may be
> -future operations that cannot be completed until that transaction is fully
> -committed to the log. In the rare case that a dependent operation occurs (e.g.
> -re-using a freed metadata extent for a data extent), a special, optimised log
> -force can be issued to force the dependent transaction to disk immediately.
> -
> -To do this, transactions need to record the LSN of the commit record of the
> -transaction. This LSN comes directly from the log buffer the transaction is
> -written into. While this works just fine for the existing transaction
> -mechanism, it does not work for delayed logging because transactions are not
> -written directly into the log buffers. Hence some other method of sequencing
> -transactions is required.
> -
> -As discussed in the checkpoint section, delayed logging uses per-checkpoint
> -contexts, and as such it is simple to assign a sequence number to each
> -checkpoint. Because the switching of checkpoint contexts must be done
> -atomically, it is simple to ensure that each new context has a monotonically
> -increasing sequence number assigned to it without the need for an external
> -atomic counter - we can just take the current context sequence number and add
> -one to it for the new context.
> -
> -Then, instead of assigning a log buffer LSN to the transaction commit LSN
> -during the commit, we can assign the current checkpoint sequence. This allows
> -operations that track transactions that have not yet completed know what
> -checkpoint sequence needs to be committed before they can continue. As a
> -result, the code that forces the log to a specific LSN now needs to ensure that
> -the log forces to a specific checkpoint.
> -
> -To ensure that we can do this, we need to track all the checkpoint contexts
> -that are currently committing to the log. When we flush a checkpoint, the
> -context gets added to a "committing" list which can be searched. When a
> -checkpoint commit completes, it is removed from the committing list. Because
> -the checkpoint context records the LSN of the commit record for the checkpoint,
> -we can also wait on the log buffer that contains the commit record, thereby
> -using the existing log force mechanisms to execute synchronous forces.
> -
> -It should be noted that the synchronous forces may need to be extended with
> -mitigation algorithms similar to the current log buffer code to allow
> -aggregation of multiple synchronous transactions if there are already
> -synchronous transactions being flushed. Investigation of the performance of the
> -current design is needed before making any decisions here.
> -
> -The main concern with log forces is to ensure that all the previous checkpoints
> -are also committed to disk before the one we need to wait for. Therefore we
> -need to check that all the prior contexts in the committing list are also
> -complete before waiting on the one we need to complete. We do this
> -synchronisation in the log force code so that we don't need to wait anywhere
> -else for such serialisation - it only matters when we do a log force.
> -
> -The only remaining complexity is that a log force now also has to handle the
> -case where the forcing sequence number is the same as the current context. That
> -is, we need to flush the CIL and potentially wait for it to complete. This is a
> -simple addition to the existing log forcing code to check the sequence numbers
> -and push if required. Indeed, placing the current sequence checkpoint flush in
> -the log force code enables the current mechanism for issuing synchronous
> -transactions to remain untouched (i.e. commit an asynchronous transaction, then
> -force the log at the LSN of that transaction) and so the higher level code
> -behaves the same regardless of whether delayed logging is being used or not.
> -
> -=== Checkpoint Log Space Accounting
> -
> -The big issue for a checkpoint transaction is the log space reservation for the
> -transaction. We don't know how big a checkpoint transaction is going to be
> -ahead of time, nor how many log buffers it will take to write out, nor the
> -number of split log vector regions are going to be used. We can track the
> -amount of log space required as we add items to the commit item list, but we
> -still need to reserve the space in the log for the checkpoint.
> -
> -A typical transaction reserves enough space in the log for the worst case space
> -usage of the transaction. The reservation accounts for log record headers,
> -transaction and region headers, headers for split regions, buffer tail padding,
> -etc. as well as the actual space for all the changed metadata in the
> -transaction. While some of this is fixed overhead, much of it is dependent on
> -the size of the transaction and the number of regions being logged (the number
> -of log vectors in the transaction).
> -
> -An example of the differences would be logging directory changes versus logging
> -inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
> -there are lots of transactions that only contain an inode core and an inode log
> -format structure. That is, two vectors totaling roughly 150 bytes. If we modify
> -10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
> -vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
> -comparison, if we are logging full directory buffers, they are typically 4KB
> -each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a
> -buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
> -space.  From this, it should be obvious that a static log space reservation is
> -not particularly flexible and is difficult to select the "optimal value" for
> -all workloads.
> -
> -Further, if we are going to use a static reservation, which bit of the entire
> -reservation does it cover? We account for space used by the transaction
> -reservation by tracking the space currently used by the object in the CIL and
> -then calculating the increase or decrease in space used as the object is
> -relogged. This allows for a checkpoint reservation to only have to account for
> -log buffer metadata used such as log header records.
> -
> -However, even using a static reservation for just the log metadata is
> -problematic. Typically log record headers use at least 16KB of log space per
> -1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
> -large enough to handle arbitrary sized checkpoint transactions. This
> -reservation needs to be made before the checkpoint is started, and we need to
> -be able to reserve the space without sleeping.  For a 8MB checkpoint, we need a
> -reservation of around 150KB, which is a non-trivial amount of space.
> -
> -A static reservation needs to manipulate the log grant counters - we can take a
> -permanent reservation on the space, but we still need to make sure we refresh
> -the write reservation (the actual space available to the transaction) after
> -every checkpoint transaction completion. Unfortunately, if this space is not
> -available when required, then the regrant code will sleep waiting for it.
> -
> -The problem with this is that it can lead to deadlocks as we may need to commit
> -checkpoints to be able to free up log space (refer back to the description of
> -rolling transactions for an example of this).  Hence we *must* always have
> -space available in the log if we are to use static reservations, and that is
> -very difficult and complex to arrange. It is possible to do, but there is a
> -simpler way.
> -
> -The simpler way of doing this is tracking the entire log space used by the
> -items in the CIL and using this to dynamically calculate the amount of log
> -space required by the log metadata. If this log metadata space changes as a
> -result of a transaction commit inserting a new memory buffer into the CIL, then
> -the difference in space required is removed from the transaction that causes
> -the change. Transactions at this level will *always* have enough space
> -available in their reservation for this as they have already reserved the
> -maximal amount of log metadata space they require, and such a delta reservation
> -will always be less than or equal to the maximal amount in the reservation.
> -
> -Hence we can grow the checkpoint transaction reservation dynamically as items
> -are added to the CIL and avoid the need for reserving and regranting log space
> -up front. This avoids deadlocks and removes a blocking point from the
> -checkpoint flush code.
> -
> -As mentioned early, transactions can't grow to more than half the size of the
> -log. Hence as part of the reservation growing, we need to also check the size
> -of the reservation against the maximum allowed transaction size. If we reach
> -the maximum threshold, we need to push the CIL to the log. This is effectively
> -a "background flush" and is done on demand. This is identical to
> -a CIL push triggered by a log force, only that there is no waiting for the
> -checkpoint commit to complete. This background push is checked and executed by
> -transaction commit code.
> -
> -If the transaction subsystem goes idle while we still have items in the CIL,
> -they will be flushed by the periodic log force issued by the xfssyncd. This log
> -force will push the CIL to disk, and if the transaction subsystem stays idle,
> -allow the idle log to be covered (effectively marked clean) in exactly the same
> -manner that is done for the existing logging method. A discussion point is
> -whether this log force needs to be done more frequently than the current rate
> -which is once every 30s.
> -
> -
> -=== Log Item Pinning
> -
> -Currently log items are pinned during transaction commit while the items are
> -still locked. This happens just after the items are formatted, though it could
> -be done any time before the items are unlocked. The result of this mechanism is
> -that items get pinned once for every transaction that is committed to the log
> -buffers. Hence items that are relogged in the log buffers will have a pin count
> -for every outstanding transaction they were dirtied in. When each of these
> -transactions is completed, they will unpin the item once. As a result, the item
> -only becomes unpinned when all the transactions complete and there are no
> -pending transactions. Thus the pinning and unpinning of a log item is symmetric
> -as there is a 1:1 relationship with transaction commit and log item completion.
> -
> -For delayed logging, however, we have an asymmetric transaction commit to
> -completion relationship. Every time an object is relogged in the CIL it goes
> -through the commit process without a corresponding completion being registered.
> -That is, we now have a many-to-one relationship between transaction commit and
> -log item completion. The result of this is that pinning and unpinning of the
> -log items becomes unbalanced if we retain the "pin on transaction commit, unpin
> -on transaction completion" model.
> -
> -To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
> -insertion into the CIL, unpin on checkpoint completion". In other words, the
> -pinning and unpinning becomes symmetric around a checkpoint context. We have to
> -pin the object the first time it is inserted into the CIL - if it is already in
> -the CIL during a transaction commit, then we do not pin it again. Because there
> -can be multiple outstanding checkpoint contexts, we can still see elevated pin
> -counts, but as each checkpoint completes the pin count will retain the correct
> -value according to it's context.
> -
> -Just to make matters more slightly more complex, this checkpoint level context
> -for the pin count means that the pinning of an item must take place under the
> -CIL commit/flush lock. If we pin the object outside this lock, we cannot
> -guarantee which context the pin count is associated with. This is because of
> -the fact pinning the item is dependent on whether the item is present in the
> -current CIL or not. If we don't pin the CIL first before we check and pin the
> -object, we have a race with CIL being flushed between the check and the pin
> -(or not pinning, as the case may be). Hence we must hold the CIL flush/commit
> -lock to guarantee that we pin the items correctly.
> -
> -=== Concurrent Scalability
> -
> -A fundamental requirement for the CIL is that accesses through transaction
> -commits must scale to many concurrent commits. The current transaction commit
> -code does not break down even when there are transactions coming from 2048
> -processors at once. The current transaction code does not go any faster than if
> -there was only one CPU using it, but it does not slow down either.
> -
> -As a result, the delayed logging transaction commit code needs to be designed
> -for concurrency from the ground up. It is obvious that there are serialisation
> -points in the design - the three important ones are:
> -
> -	. Locking out new transaction commits while flushing the CIL
> -	. Adding items to the CIL and updating item space accounting
> -	. Checkpoint commit ordering
> -
> -Looking at the transaction commit and CIL flushing interactions, it is clear
> -that we have a many-to-one interaction here. That is, the only restriction on
> -the number of concurrent transactions that can be trying to commit at once is
> -the amount of space available in the log for their reservations. The practical
> -limit here is in the order of several hundred concurrent transactions for a
> -128MB log, which means that it is generally one per CPU in a machine.
> -
> -The amount of time a transaction commit needs to hold out a flush is a
> -relatively long period of time - the pinning of log items needs to be done
> -while we are holding out a CIL flush, so at the moment that means it is held
> -across the formatting of the objects into memory buffers (i.e. while memcpy()s
> -are in progress). Ultimately a two pass algorithm where the formatting is done
> -separately to the pinning of objects could be used to reduce the hold time of
> -the transaction commit side.
> -
> -Because of the number of potential transaction commit side holders, the lock
> -really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
> -want every other CPU in the machine spinning on the CIL lock. Given that
> -flushing the CIL could involve walking a list of tens of thousands of log
> -items, it will get held for a significant time and so spin contention is a
> -significant concern. Preventing lots of CPUs spinning doing nothing is the
> -main reason for choosing a sleeping lock even though nothing in either the
> -transaction commit or CIL flush side sleeps with the lock held.
> -
> -It should also be noted that CIL flushing is also a relatively rare operation
> -compared to transaction commit for asynchronous transaction workloads - only
> -time will tell if using a read-write semaphore for exclusion will limit
> -transaction commit concurrency due to cache line bouncing of the lock on the
> -read side.
> -
> -The second serialisation point is on the transaction commit side where items
> -are inserted into the CIL. Because transactions can enter this code
> -concurrently, the CIL needs to be protected separately from the above
> -commit/flush exclusion. It also needs to be an exclusive lock but it is only
> -held for a very short time and so a spin lock is appropriate here. It is
> -possible that this lock will become a contention point, but given the short
> -hold time once per transaction I think that contention is unlikely.
> -
> -The final serialisation point is the checkpoint commit record ordering code
> -that is run as part of the checkpoint commit and log force sequencing. The code
> -path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
> -an ordering loop after writing all the log vectors into the log buffers but
> -before writing the commit record. This loop walks the list of committing
> -checkpoints and needs to block waiting for checkpoints to complete their commit
> -record write. As a result it needs a lock and a wait variable. Log force
> -sequencing also requires the same lock, list walk, and blocking mechanism to
> -ensure completion of checkpoints.
> -
> -These two sequencing operations can use the mechanism even though the
> -events they are waiting for are different. The checkpoint commit record
> -sequencing needs to wait until checkpoint contexts contain a commit LSN
> -(obtained through completion of a commit record write) while log force
> -sequencing needs to wait until previous checkpoint contexts are removed from
> -the committing list (i.e. they've completed). A simple wait variable and
> -broadcast wakeups (thundering herds) has been used to implement these two
> -serialisation queues. They use the same lock as the CIL, too. If we see too
> -much contention on the CIL lock, or too many context switches as a result of
> -the broadcast wakeups these operations can be put under a new spinlock and
> -given separate wait lists to reduce lock contention and the number of processes
> -woken by the wrong event.
> -
> -
> -=== Lifecycle Changes
> -
> -The existing log item life cycle is as follows:
> -
> -----
> -	1. Transaction allocate
> -	2. Transaction reserve
> -	3. Lock item
> -	4. Join item to transaction
> -		If not already attached,
> -			Allocate log item
> -			Attach log item to owner item
> -		Attach log item to transaction
> -	5. Modify item
> -		Record modifications in log item
> -	6. Transaction commit
> -		Pin item in memory
> -		Format item into log buffer
> -		Write commit LSN into transaction
> -		Unlock item
> -		Attach transaction to log buffer
> -
> -	<log buffer IO dispatched>
> -	<log buffer IO completes>
> -
> -	7. Transaction completion
> -		Mark log item committed
> -		Insert log item into AIL
> -			Write commit LSN into log item
> -		Unpin log item
> -	8. AIL traversal
> -		Lock item
> -		Mark log item clean
> -		Flush item to disk
> -
> -	<item IO completion>
> -
> -	9. Log item removed from AIL
> -		Moves log tail
> -		Item unlocked
> -----
> -
> -Essentially, steps 1-6 operate independently from step 7, which is also
> -independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
> -at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
> -at the same time. If the log item is in the AIL or between steps 6 and 7
> -and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
> -are entered and completed is the object considered clean.
> -
> -With delayed logging, there are new steps inserted into the life cycle:
> -
> -----
> -	1. Transaction allocate
> -	2. Transaction reserve
> -	3. Lock item
> -	4. Join item to transaction
> -		If not already attached,
> -			Allocate log item
> -			Attach log item to owner item
> -		Attach log item to transaction
> -	5. Modify item
> -		Record modifications in log item
> -	6. Transaction commit
> -		Pin item in memory if not pinned in CIL
> -		Format item into log vector + buffer
> -		Attach log vector and buffer to log item
> -		Insert log item into CIL
> -		Write CIL context sequence into transaction
> -		Unlock item
> -
> -	<next log force>
> -
> -	7. CIL push
> -		lock CIL flush
> -		Chain log vectors and buffers together
> -		Remove items from CIL
> -		unlock CIL flush
> -		write log vectors into log
> -		sequence commit records
> -		attach checkpoint context to log buffer
> -
> -	<log buffer IO dispatched>
> -	<log buffer IO completes>
> -
> -	8. Checkpoint completion
> -		Mark log item committed
> -		Insert item into AIL
> -			Write commit LSN into log item
> -		Unpin log item
> -	9. AIL traversal
> -		Lock item
> -		Mark log item clean
> -		Flush item to disk
> -	<item IO completion>
> -	10. Log item removed from AIL
> -		Moves log tail
> -		Item unlocked
> -----
> -
> -From this, it can be seen that the only life cycle differences between the two
> -logging methods are in the middle of the life cycle - they still have the same
> -beginning and end and execution constraints. The only differences are in the
> -committing of the log items to the log itself and the completion processing.
> -Hence delayed logging should not introduce any constraints on log item
> -behaviour, allocation or freeing that don't already exist.
> -
> -As a result of this zero-impact "insertion" of delayed logging infrastructure
> -and the design of the internal structures to avoid on disk format changes, we
> -can basically switch between delayed logging and the existing mechanism with a
> -mount option. Fundamentally, there is no reason why the log manager would not
> -be able to swap methods automatically and transparently depending on load
> -characteristics, but this should not be necessary if delayed logging works as
> -designed.
> -
> -EOF.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=Nf8WUxufLEmKZuk2gKNHK8pZ4RMwAPFrtbhGK9U84W8&s=jCdh7D6v75E1G6nLbTu4rv3rhsIoKPJUPx5KvBMwjWk&e=

next prev parent reply	other threads:[~2018-05-14  2:29 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-13 17:00 [PATCH 0/4] xfs-docs: incorporate design documents Darrick J. Wong
2018-05-13 17:00 ` [PATCH 1/4] design: rename disk structures book Darrick J. Wong
2018-05-14  2:28   ` Allison Henderson
2018-05-13 17:00 ` [PATCH 2/4] design: incorporate the delayed logging design doc Darrick J. Wong
2018-05-14  2:29   ` Allison Henderson [this message]
2018-05-13 17:00 ` [PATCH 3/4] design: incorporate self-describing metadata " Darrick J. Wong
2018-05-14  2:30   ` Allison Henderson
2018-05-13 17:00 ` [PATCH 4/4] design: merge both metadata integrity chapters Darrick J. Wong
2018-05-14  2:30   ` Allison Henderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2d96508e-8c89-7a33-71f1-576f49fafe0c@oracle.com \
    --to=allison.henderson@oracle.com \
    --cc=azarus@posteo.net \
    --cc=darrick.wong@oracle.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).