[RFD 00/17] xfs: inode management development direction

public inbox for linux-xfs@vger.kernel.org

* [RFD 00/17] xfs: inode management development direction
@ 2013-08-12 13:19 Dave Chinner
  2013-08-12 13:19 ` [RFD 01/17] xfs: inode allocation tickets Dave Chinner
                   ` (16 more replies)
  0 siblings, 17 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

Hi folks,

Call this a 'request for discussion' or a 'request for developers',
however you want to look at it. I know there are people asking me
for bits of work that can be done, so I spent a little bit of spare
time I had sitting around in waiting rooms documenting the inode
development direction I'm planning to take for inode allocation and
freeing over the next few months.

There are a bunch of things that have led to this:

	- allocation speed when free inodes are sparse
	- free space fragmentation preventing inode allocation
	- inability to cluster large numbers of inodes close
	  together
	- inode allocation transaction reservations being larger
	  than they need to be for 98% of allocations
	- unlinked list processing having a high cost when we have
	  lots of unlinked inodes waiting for reclaim
	- inode freeing being tied to VFS inode cache eviction
	- inability to recycle free inodes directly from the
	  unlinked list
	- support for O_TMPFILE

Basically, this started from me looking at what O_TMPFILE needed
to be supported, and grew from there. O_TMPFILE needs separation of
the inode allocation from the namespace operations, and linkat()
needs to be able to remove an inode from the unlinked list and link
it to the namespace.

That leads to inode allocation having very distinct operations that
are currently commingled by the transaction subsystem and the need
to guarantee enough log space for inode allocation and namespace
modification to happen atomically. Breaking this all up leads to a
bunch of optimisations that center around either avoiding
unnecessary work or being able to do it in batches asynchronously to
the foreground context that is running.

There's a lot of work here, some is dependent on other bits, and
some is completely separate. If anyone wants to pick up one (or
more) of the pieces and work on it, then I'm happy to help people
work through the changes and test them. I'll be slowly peeling off
pieces of this myself even if nobody else does.

Note that a good deal of these changes are only ever going to work
effectively on v5 filesystems e.g. atomic multi-chunk inode
allocation and incore inode unlinked lists and logging. Hence I've
only really focussed on optimisations and modifications that make
sense from a v5 filesystem POV. 

Comments, flames and volunteers welcome.

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [RFD 01/17] xfs: inode allocation tickets
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-12 13:19 ` [RFD 02/17] xfs: separate inode chunk allocation from free inode allocation Dave Chinner
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

If we want to split inode allocation up into a background chunk allocator and an
individual free inode allocator, then we need to be able to guarantee that a free
inode will be available before we take a log reservation for the free inode
allocation.

If we don't guarantee that we can allocate an inode before we reserve log space
for the individual inode allocation, then we may reserve all the remaining log
space for the free inode allocation and then not be able to reserve space for
a new inode chunk allocation in the log. This will cause an inode allocation
deadlock.

To avoid this deadlock, use a ticket system to guarantee an allocation has a
reserved free inode before it proceeds to the transaction reservation for the
allocation. This allows the free inode allocation to block waiting for
background allocation to allocate more inode chunks in a sane and rational
manner.

The ticket system needs to be a per-allocation group ticket, as inodes are
allocated and tracked at a per-AG granularity. Hence we need to restructure the
inode allocation code to select an AG for the new inode as early as possible
and then take a ticket on that AG. It is entirely possible that we can then get
an ENOSPC error for the inode chunk allocation, so we must be able to fall all
the way back to this ticket allocation loop to try another AG in this case.
Hauling this AG selection loop out of the internal free inode allocation code
might be sufficiently complex to warrant multiple setup patches by itself....
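
The ticket loop described above can be sketched in userspace C. This is only an
illustration under stated assumptions: struct ag_state, struct ialloc_ticket,
ag_take_ticket() and ialloc_ticket_get() are hypothetical names and do not exist
in XFS, and chunk allocation is modelled as a simple flag that either succeeds
or fails with ENOSPC.

```c
#include <errno.h>
#include <stdbool.h>

#define INODES_PER_CHUNK 64

/* Hypothetical per-AG state for the sketch. */
struct ag_state {
	unsigned int free_inodes;	/* free inodes that exist in this AG */
	unsigned int tickets;		/* free inodes already promised to tickets */
	bool can_alloc_chunk;		/* false models ENOSPC on chunk allocation */
};

/* Hypothetical ticket: remembers which AG holds the reserved inode. */
struct ialloc_ticket {
	int agno;
};

/* Try to reserve one free inode in a single AG. */
static bool ag_take_ticket(struct ag_state *ag)
{
	if (ag->free_inodes == ag->tickets) {
		/* Every free inode is spoken for, so a new chunk is needed. */
		if (!ag->can_alloc_chunk)
			return false;
		ag->free_inodes += INODES_PER_CHUNK;
	}
	ag->tickets++;
	return true;
}

/*
 * AG selection loop: start at the preferred AG and fall back to the
 * next one whenever the ticket cannot be satisfied there.
 */
static int ialloc_ticket_get(struct ag_state *ags, int agcount, int start_ag,
			     struct ialloc_ticket *ticket)
{
	for (int i = 0; i < agcount; i++) {
		int agno = (start_ag + i) % agcount;

		if (ag_take_ticket(&ags[agno])) {
			ticket->agno = agno;
			return 0;
		}
	}
	return -ENOSPC;
}
```

The point of the sketch is the ordering: the ticket is obtained, with AG
fallback, before any log reservation would be taken, so a failed chunk
allocation never strands reserved log space.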

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ialloc.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_ialloc.h b/fs/xfs/xfs_ialloc.h
index 68c0732..1da16f5 100644
--- a/fs/xfs/xfs_ialloc.h
+++ b/fs/xfs/xfs_ialloc.h
@@ -24,6 +24,8 @@ struct xfs_imap;
 struct xfs_mount;
 struct xfs_trans;

+struct xfs_ialloc_ticket;
+
 /*
  * Allocation parameters for inode allocation.
  */
-- 
1.8.3.2


* [RFD 02/17] xfs: separate inode chunk allocation from free inode allocation
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
  2013-08-12 13:19 ` [RFD 01/17] xfs: inode allocation tickets Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-12 13:19 ` [RFD 03/17] xfs: move inode chunk allocation into a workqueue Dave Chinner
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we can guarantee the availability of a free inode for an
inode allocation transaction, split the inode chunk allocation out
of the inode allocation transaction completely.  Instead, if the
inode ticket reservation detects no free inodes available, do the
inode chunk allocation immediately from this context.

This means that we need to split the create/mkdir/mknod transaction
reservations apart - we use the inode chunk allocation reservation
part for inode chunk allocation transactions, and the free inode
allocation and directory modification part for the
create/mkdir/mknod operation.

At this point, we have effectively decoupled free inode allocation
from inode chunk allocation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 2b49463..f469e72 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -109,7 +109,8 @@ typedef struct xfs_trans_header {
 #define	XFS_TRANS_SB_COUNT		41
 #define	XFS_TRANS_CHECKPOINT		42
 #define	XFS_TRANS_ICREATE		43
-#define	XFS_TRANS_TYPE_MAX		43
+#define	XFS_TRANS_IALLOC_CHUNK		44
+#define	XFS_TRANS_TYPE_MAX		44
 /* new transaction types need to be reflected in xfs_logprint(8) */

 #define XFS_TRANS_TYPES \
-- 
1.8.3.2


* [RFD 03/17] xfs: move inode chunk allocation into a workqueue
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
  2013-08-12 13:19 ` [RFD 01/17] xfs: inode allocation tickets Dave Chinner
  2013-08-12 13:19 ` [RFD 02/17] xfs: separate inode chunk allocation from free inode allocation Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-12 13:19 ` [RFD 04/17] xfs: optimise background inode chunk allocation Dave Chinner
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that inode chunk allocation has been separated from free inode allocation,
we no longer need to do it in-line with the high level operation that requires
a free inode. As such, we can move inode chunk allocation into a workqueue and
trigger it to run asynchronously.

The inode allocation ticket will need to block the foreground operation if there
aren't sufficient free inodes available. It also needs to kick the inode
chunk allocation worker into action to ensure that free inodes will become
available in the not-too-distant future, and then wait for free inodes to become
available.
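
The block-and-kick behaviour can be modelled without real threads. The sketch
below is a single-threaded state model, not kernel code: struct ag_pool,
ticket_try_get() and chunk_worker_run() are hypothetical names, and "waiting"
is represented by a waiter count rather than an actual sleeping task.

```c
#include <stdbool.h>

#define INODES_PER_CHUNK 64

/* Hypothetical per-AG pool of free inodes for the sketch. */
struct ag_pool {
	unsigned int free_inodes;	/* inodes available to hand out */
	unsigned int waiters;		/* foreground operations blocked on a ticket */
	bool worker_queued;		/* chunk allocation worker has been kicked */
};

enum ticket_result { TICKET_GRANTED, TICKET_MUST_WAIT };

/* Foreground side: take a free inode, or kick the worker and wait. */
static enum ticket_result ticket_try_get(struct ag_pool *ag)
{
	if (ag->free_inodes > 0) {
		ag->free_inodes--;
		return TICKET_GRANTED;
	}
	ag->worker_queued = true;
	ag->waiters++;
	return TICKET_MUST_WAIT;
}

/*
 * Background side: allocate nchunks chunks, then hand inodes to the
 * blocked waiters. Returns how many waiters were woken with a ticket.
 */
static unsigned int chunk_worker_run(struct ag_pool *ag, unsigned int nchunks)
{
	unsigned int woken = 0;

	ag->worker_queued = false;
	ag->free_inodes += nchunks * INODES_PER_CHUNK;
	while (ag->waiters > 0 && ag->free_inodes > 0) {
		ag->waiters--;
		ag->free_inodes--;
		woken++;
	}
	return woken;
}
```

In a real implementation the waiter count would presumably be a wait queue and
the worker a workqueue item; the model only shows who kicks whom and when a
blocked ticket is satisfied.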

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ialloc.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_ialloc.h b/fs/xfs/xfs_ialloc.h
index 1da16f5..0494855 100644
--- a/fs/xfs/xfs_ialloc.h
+++ b/fs/xfs/xfs_ialloc.h
@@ -24,6 +24,7 @@ struct xfs_imap;
 struct xfs_mount;
 struct xfs_trans;
 
+/* blocks when empty */
 struct xfs_ialloc_ticket;
 
 /*
-- 
1.8.3.2


* [RFD 04/17] xfs: optimise background inode chunk allocation
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (2 preceding siblings ...)
  2013-08-12 13:19 ` [RFD 03/17] xfs: move inode chunk allocation into a workqueue Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-12 13:19 ` [RFD 05/17] xfs: introduce a free inode allocation btree Dave Chinner
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that physical inode allocation is being done in the background and separated
from the high level free inode allocation operations, we can start to optimise
the way we allocate physical inode chunks based on observation of inode chunk
allocation requirements.

To start with, we need to determine the approximate rate at which we are
allocating inode chunks. This will tell us how many inode chunks we should
allocate at a time to try to minimise the amount of time free inode allocation
stalls waiting for chunk allocation to occur. Ideally we want to allocate in
large enough chunks that we rarely block free inode allocation.

Assuming a typical inode allocation rate of approximately 20,000 per second per
CPU (~2GHz Xeon CPUs run at around this rate), then we are allocating roughly
300 inode chunks per second. We can assume that this is the rate at which we can
allocate from a single AG, as inode allocation within an AG is single threaded.
Hence trying to keep a "chunks allocated per second" measure probably has
sufficient resolution to provide a stable rate which we can use to allocate an
appropriate number of chunks ahead of time.

This would also allow us to determine a low watermark that the inode
allocation ticket subsystem can use to kick chunk allocation before we run out
of inodes and force free inode allocation to block.

Once we have a determined rate, we can use that to allocate a number of inode
chunks in a single execution of the worker. Ideally, we want the worker to
allocate enough inode chunks so that it only needs to run a couple of times a
second, and to be able to do this allocation in a manner that results in
large contiguous regions of inode chunks.
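
The arithmetic behind these numbers is easy to check. The sketch below assumes
64 inodes per chunk (the maximum free count mentioned elsewhere in this series)
and uses hypothetical helper names; the worker latency used for the low
watermark is an illustrative parameter, not a measured value.

```c
#define INODES_PER_CHUNK 64

/* Chunks per second needed to sustain a given inode allocation rate. */
static unsigned int chunk_rate(unsigned int inodes_per_sec)
{
	return (inodes_per_sec + INODES_PER_CHUNK - 1) / INODES_PER_CHUNK;
}

/* Chunks the worker should allocate per run to hit a target run frequency. */
static unsigned int chunks_per_run(unsigned int chunks_per_sec,
				   unsigned int runs_per_sec)
{
	unsigned int n;

	if (runs_per_sec == 0)
		runs_per_sec = 1;
	n = (chunks_per_sec + runs_per_sec - 1) / runs_per_sec;
	return n ? n : 1;
}

/*
 * Free inode count at which the worker should be kicked so that inodes
 * do not run out during the (assumed) latency of one worker run.
 */
static unsigned int low_watermark(unsigned int inodes_per_sec,
				  unsigned int latency_ms)
{
	unsigned long long need;

	need = (unsigned long long)inodes_per_sec * latency_ms;
	return (unsigned int)((need + 999) / 1000);
}
```

At 20,000 inodes per second this gives 313 chunks per second, which matches the
"roughly 300" figure above, and about 157 chunks per run if the worker runs
twice a second.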

For v4 superblocks, just iterate the existing inode chunk allocation transaction
to allocate a chunk at a time. For v5 superblocks, we have the logical inode
create transaction which allows us to initialise an arbitrary number of inode
chunks at a time.

The limit of chunks we can support right now with the current transaction
reservation is the maximum number of sequential records we can insert into the
inode btree while guaranteeing only a single leaf to root split will occur. This
will probably require a special new btree operation for a bulk record insert
with a single index path update once the split and insert is done. This is
probably sufficiently complex that it will require a series of several patches
to do.

Once we can allocate multiple inode chunks in a single operation, we can
optimise inode chunk layout for stripe unit/width extremely well. i.e. we should
allocate a fully aligned stripe unit at a time, and potentially larger if
allowed by the limits of a bulk record insert.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ag.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 317aa86..eb25689 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -249,6 +249,8 @@ typedef struct xfs_perag {
 	xfs_agino_t	pagi_freecount;	/* number of free inodes */
 	xfs_agino_t	pagi_count;	/* number of allocated inodes */

+	int		pagi_chunk_alloc_rate;
+
 	/*
 	 * Inode allocation search lookup optimisation.
 	 * If the pagino matches, the search for new inodes
-- 
1.8.3.2


* [RFD 05/17] xfs: introduce a free inode allocation btree
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (3 preceding siblings ...)
  2013-08-12 13:19 ` [RFD 04/17] xfs: optimise background inode chunk allocation Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-12 13:19 ` [RFD 06/17] xfs: partial inode chunk allocation Dave Chinner
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

One of the biggest problems with inode allocation performance right
now is that searching for a free inode requires an exhaustive scan
of the inode btree to find a record with a free inode in it. IOWs,
the inode btree indexes inode chunks, not free inodes.

To speed up the search for a free inode, introduce a new per-AG
btree rooted in the AGI that tracks records with free inodes in
them. This requires an inode chunk allocation to insert a record
into two AGI btrees - one for the allocated inode chunk, and one
for the free inodes record.

When we allocate a free inode, we now will need to modify two
records - one in each tree - and potentially remove a record from
the free inode btree. That is, if a record has no free inodes, then
it is removed from the btree. This means we have to ensure that the
transaction reservation for a free inode modification has enough
space in it for an inode btree merge.

Finally, it means that freeing an inode can insert a record into the
free inode btree. This can cause a split of the tree and hence we
need to ensure that the transaction reservation takes this into
account.

This structure means that the free inode btree only tracks inode
chunks with free inodes in them and hence will always provide
extremely fast lookup of the closest free inode to the allocation
target. When the free inode btree exists, we will no longer use the
allocated inode chunk btree for allocation lookups - only the free
inode btree will be used.

This functionality requires that we use a read-only compatible
feature flag - older kernels can still read the filesystem structure
just fine, but they aren't allowed to modify it as that will result
in the new free inode btree not being updated correctly.

Another advantage of the second btree is that we now have some
redundant metadata pointing to inode chunks. It's not complete, but
it certainly will help determining if an inode is supposed to be
free or not when corruptions occur. i.e. it is no longer a single
bit of data in a single btree record.
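
The key invariant, that the free inode index only ever holds records with at
least one free inode, can be shown with a toy model. This sketch uses a small
sorted array in place of a btree and a linear scan in place of a btree lookup;
struct finobt, finobt_nearest() and finobt_alloc() are hypothetical names for
illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: one inode chunk that still has free inodes. */
struct finobt_rec {
	uint32_t startino;	/* first inode number of the chunk */
	uint8_t freecount;	/* free inodes remaining, always > 0 */
};

/* Toy stand-in for the btree: records sorted by startino. */
struct finobt {
	struct finobt_rec recs[16];
	size_t nrecs;
};

/* Index of the record closest to target, or -1 if the index is empty. */
static int finobt_nearest(const struct finobt *t, uint32_t target)
{
	int best = -1;
	uint32_t best_dist = 0;

	for (size_t i = 0; i < t->nrecs; i++) {
		uint32_t s = t->recs[i].startino;
		uint32_t d = s > target ? s - target : target - s;

		if (best < 0 || d < best_dist) {
			best = (int)i;
			best_dist = d;
		}
	}
	return best;
}

/*
 * Allocate one inode near target. The record is removed from the index
 * as soon as its last free inode is used, so every remaining record is
 * guaranteed to satisfy an allocation.
 */
static int finobt_alloc(struct finobt *t, uint32_t target, uint32_t *startino)
{
	int i = finobt_nearest(t, target);

	if (i < 0)
		return -1;
	*startino = t->recs[i].startino;
	if (--t->recs[i].freecount == 0) {
		memmove(&t->recs[i], &t->recs[i + 1],
			(t->nrecs - (size_t)i - 1) * sizeof(t->recs[0]));
		t->nrecs--;
	}
	return 0;
}
```

Because full chunks never appear in the index, a lookup never has to skip over
records that cannot satisfy the allocation, which is where the exhaustive scan
of the existing inode btree goes away.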

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ag.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index eb25689..1a97646 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -166,6 +166,9 @@ typedef struct xfs_agi {
 	__be32		agi_pad32;
 	__be64		agi_lsn;	/* last write sequence */

+	__be32		agi_free_root;
+	__be32		agi_free_level;
+
 	/* structure must be padded to 64 bit alignment */
 } xfs_agi_t;

-- 
1.8.3.2


* [RFD 06/17] xfs: partial inode chunk allocation
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (4 preceding siblings ...)
  2013-08-12 13:19 ` [RFD 05/17] xfs: introduce a free inode allocation btree Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-13 22:07   ` Brian Foster
  2013-08-12 13:19 ` [RFD 07/17] xfs: separate inode chunk freeing from inode freeing Dave Chinner
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When a filesystem ages or when certain workloads dominate the storage capacity
of the filesystem, it can become difficult to find contiguous free space in the
filesystem and hence inode allocation can fail long before the filesystem is out
of space.

To avoid this problem, we need to be able to use smaller extents in the
filesystem to hold inodes than the size needed to hold a full chunk. To enable
this, we need to keep track of the region of the inode chunk that has actually
been allocated in the inode allocation record itself. The inobt record contains
a free inode count field that uses 32 bits of space, but has a maximum possible
value of 64. Hence there are many bits in the field that we can repurpose for
an "allocated regions" mask.

To simplify the implementation and checking of the field, split the 32 bit field
into an 8 bit count variable in the same location as the existing count (i.e.
the LSB of the 32 bit variable, remembering that XFS is big endian on disk), an 8
bit pad field and a 16 bit mask field that contains the allocated extent
tracking.
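
The layout compatibility can be demonstrated by encoding the split fields as
four on-disk bytes and reading them back as the old 32 bit big endian count.
This is a userspace sketch with hypothetical names (struct inobt_split,
split_encode(), split_decode(), legacy_decode()); it only mirrors the field
order shown in the diff below.

```c
#include <stdint.h>

/* In-memory view of the split field (hypothetical name). */
struct inobt_split {
	uint16_t alloc_mask;	/* 16 bit allocated-region mask */
	uint8_t pad;		/* 8 bit pad */
	uint8_t freecount;	/* 8 bit free inode count */
};

/* Write the split fields in on-disk (big endian) byte order. */
static void split_encode(const struct inobt_split *s, uint8_t disk[4])
{
	disk[0] = (uint8_t)(s->alloc_mask >> 8);
	disk[1] = (uint8_t)(s->alloc_mask & 0xff);
	disk[2] = s->pad;
	disk[3] = s->freecount;	/* LSB of the old 32 bit field */
}

/* Read the split fields back from the on-disk bytes. */
static struct inobt_split split_decode(const uint8_t disk[4])
{
	struct inobt_split s;

	s.alloc_mask = (uint16_t)((disk[0] << 8) | disk[1]);
	s.pad = disk[2];
	s.freecount = disk[3];
	return s;
}

/* How code that predates the split would read the same four bytes. */
static uint32_t legacy_decode(const uint8_t disk[4])
{
	return ((uint32_t)disk[0] << 24) | ((uint32_t)disk[1] << 16) |
	       ((uint32_t)disk[2] << 8) | (uint32_t)disk[3];
}
```

With a zero mask and pad, the old reader sees exactly the free count. Once any
mask bit is set, the old reader sees a value far above 64, which is why
unmodified code cannot interpret such a record.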

As we have 16 bits in the mask, each bit represents 4 inodes and hence that
defines the minimum allocation size we can support. In all cases, this will
limit the largest contiguous allocation required to 2 blocks for a new as the
minimum filesystem block size is limited by mkfs to being twice the inode size.
In most common configurations, a single block will contain more than 4
inodes and so this isn't a major limitation at all.

Hence during extent allocation for the inode chunk, if we cannot find an aligned
and contiguous extent, we can settle for something that is as large as possible
and mask off the region that we weren't able to allocate. When freeing the
chunk, we'll also know what extent we need to free. And for untrusted inode
number lookup, we can determine if the inode number falls into the invalid part
of the chunk.

Further, to avoid needing to do multiple extent allocations for "sparse" inode
chunks, if we allocate an extent that overlaps an existing partial inode chunk,
we can simply update the mask and free count to indicate that there are multiple
valid extents in the chunk. This gives us a potential route for partial inode
chunks to be made whole via ongoing filesystem modification or a forced scan
once space has been made available.

To make this as close to transparent as possible, use a value of 0 to indicate
that there are valid inodes in this location, and a value of 1 to indicate that
it is an invalid region. This means that the filesystem will be backwards
compatible with existing kernels and userspace up until the first partial chunk
is allocated. At that point, we need to set an incompatible feature flag as
older kernels and userspace are unable to interpret the value in the "free
inodes" field correctly. This also means that if we scan the inode btrees and
determine that there are no partial inode chunks, we can remove the feature
bit...
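
A sketch of how the mask might be interpreted, using the convention above that
a 0 bit means valid inodes and a 1 bit means an invalid region. The mapping of
bit i to inodes 4i..4i+3 within the chunk is an assumption made for the
example, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define INODES_PER_CHUNK	64
#define INODES_PER_MASK_BIT	4	/* 64 inodes / 16 mask bits */

/*
 * Is the inode at this index within the chunk backed by an allocated
 * extent? Assumes mask bit i covers inodes 4i..4i+3; a set bit marks
 * the region as invalid.
 */
static bool chunk_inode_valid(uint16_t alloc_mask, unsigned int index)
{
	if (index >= INODES_PER_CHUNK)
		return false;
	return !(alloc_mask & (1u << (index / INODES_PER_MASK_BIT)));
}

/* Number of inodes in the chunk that physically exist. */
static unsigned int chunk_valid_inodes(uint16_t alloc_mask)
{
	unsigned int n = 0;

	for (unsigned int bit = 0; bit < 16; bit++)
		if (!(alloc_mask & (1u << bit)))
			n += INODES_PER_MASK_BIT;
	return n;
}

/* A chunk is partial, and needs the feature flag, if any bit is set. */
static bool chunk_is_partial(uint16_t alloc_mask)
{
	return alloc_mask != 0;
}
```

A check like chunk_inode_valid() is what an untrusted inode number lookup would
need, and a scan that finds chunk_is_partial() false for every record is the
condition under which the feature bit could be cleared.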

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ialloc_btree.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index 3ac36b76..75ee794 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -48,7 +48,9 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
  */
 typedef struct xfs_inobt_rec {
 	__be32		ir_startino;	/* starting inode number */
-	__be32		ir_freecount;	/* count of free inodes (set bits) */
+	__be16		ir_alloc_mask;
+	__u8		ir_pad;
+	__u8		ir_freecount;
 	__be64		ir_free;	/* free inode mask */
 } xfs_inobt_rec_t;
 
-- 
1.8.3.2


* [RFD 07/17] xfs: separate inode chunk freeing from inode freeing
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (5 preceding siblings ...)
  2013-08-12 13:19 ` [RFD 06/17] xfs: partial inode chunk allocation Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-12 13:19 ` [RFD 08/17] xfs: inode chunk freeing in the background Dave Chinner
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Currently inode chunk freeing is done when the last inode in the
inode chunk is marked as free. This results in immediate inode chunk
removal, which for some workloads is not desirable as they allocate
more inodes almost immediately (e.g. workloads with lots of
temporary or short-term files).

There are other reasons this behaviour is undesirable - if we are
allocating inode chunks in regions of stripe units or larger, we
don't want to punch holes in the middle of the inode regions as
inodes are freed - we want to keep them contiguous until we can free
them an entire stripe unit at a time. This minimises the free space
fragmentation that inode chunk removal can cause, and also prevents
interleaving of inode chunks with other data and/or metadata over
time.

Hence we should separate the inode chunk freeing from individual
inode freeing. The process is similar to the separation of the inode
chunk allocation - the first step is to separate out the inode btree
transaction and log reservation from the "inactive" transaction so
we can run a separate transaction after the inactive transaction has
been committed to free the inode chunk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f469e72..4c80c84 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -110,7 +110,8 @@ typedef struct xfs_trans_header {
 #define	XFS_TRANS_CHECKPOINT		42
 #define	XFS_TRANS_ICREATE		43
 #define	XFS_TRANS_IALLOC_CHUNK		44
-#define	XFS_TRANS_TYPE_MAX		44
+#define	XFS_TRANS_IFREE_CHUNK		45
+#define	XFS_TRANS_TYPE_MAX		45
 /* new transaction types need to be reflected in xfs_logprint(8) */

 #define XFS_TRANS_TYPES \
-- 
1.8.3.2


* [RFD 08/17] xfs: inode chunk freeing in the background
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (6 preceding siblings ...)
  2013-08-12 13:19 ` [RFD 07/17] xfs: separate inode chunk freeing from inode freeing Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-12 13:19 ` [RFD 09/17] xfs: optimise inode chunk freeing Dave Chinner
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that inode chunk freeing has been separated from freeing individual inodes,
we no longer need to do it in-line with the high level unlink inode operation.
As such, we can move inode chunk freeing into a workqueue and trigger it to run
asynchronously.

Moving the chunk freeing to the background allows us to delay the decision to
free the inode chunk and further optimise inode chunk freeing according to the
current workload.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ag.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 1a97646..b34f641 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -253,6 +253,7 @@ typedef struct xfs_perag {
 	xfs_agino_t	pagi_count;	/* number of allocated inodes */
 
 	int		pagi_chunk_alloc_rate;
+	int		pagi_chunk_free_rate;
 
 	/*
 	 * Inode allocation search lookup optimisation.
-- 
1.8.3.2


* [RFD 09/17] xfs: optimise inode chunk freeing
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (7 preceding siblings ...)
  2013-08-12 13:19 ` [RFD 08/17] xfs: inode chunk freeing in the background Dave Chinner
@ 2013-08-12 13:19 ` Dave Chinner
  2013-08-12 13:20 ` [RFD 10/17] xfs: swap extents operations for CRC filesystems Dave Chinner
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:19 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that the inode chunk freeing is done asynchronously, we can make
more intelligent decisions about freeing inode chunks. As we have an
inode btree that tracks free inodes, we can quickly find out whether
the adjacent inode chunks are free. We can then match inode chunk
freeing patterns to the allocation patterns that are in use.

We can also track the rate at which we are freeing inode chunks and
compare that to the rate at which we are allocating inode chunks. If
we are both allocating and freeing inode chunks, then we should slow
down the rate at which we are freeing inode chunks so that
allocation can occur directly from the empty inode chunks rather
than forcing them to be reallocated shortly after they have been
freed.
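
One possible throttling heuristic, purely as an illustration: hold back as many
empty chunks as the allocator is expected to consume in the next second, and
free only the surplus. Neither the policy nor the name chunks_to_free() comes
from this series; both are assumptions made for the example.

```c
/*
 * Decide how many empty chunks the background worker may free in this
 * run. Hypothetical policy: keep alloc_rate (chunks per second) empty
 * chunks in reserve so allocation can reuse them, free the rest.
 */
static unsigned int chunks_to_free(unsigned int empty_chunks,
				   unsigned int alloc_rate)
{
	if (empty_chunks <= alloc_rate)
		return 0;
	return empty_chunks - alloc_rate;
}
```

With no allocation in progress (alloc_rate of 0) every empty chunk is freed,
while under concurrent allocation the reserve grows with the allocation rate,
which is the slowdown described above.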

Further, for sequential chunks we should be able to implement bulk
removal of the records from the inode btrees as long as we can
guarantee that it only results in a single merge operation. The
constraints and processes would be similar to the bulk insert
operation proposed for inode allocation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ag.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index b34f641..c423191 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -254,6 +254,7 @@ typedef struct xfs_perag {

 	int		pagi_chunk_alloc_rate;
 	int		pagi_chunk_free_rate;
+	xfs_agino_t	pagi_free_chunk;

 	/*
 	 * Inode allocation search lookup optimisation.
-- 
1.8.3.2


* [RFD 10/17] xfs: swap extents operations for CRC filesystems
  2013-08-12 13:19 [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (8 preceding siblings ...)
  2013-08-12 13:19 ` [RFD 09/17] xfs: optimise inode chunk freeing Dave Chinner
@ 2013-08-12 13:20 ` Dave Chinner
  2013-08-12 13:20 ` [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE Dave Chinner
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:20 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

For CRC enabled filesystems, we can't just swap inode forks from one inode to
another when defragmenting a file - the blocks in the inode fork bmap btree
contain pointers back to the owner inode. Hence if we are to swap the inode
forks we have to atomically modify every block in the btree during the
transaction.

There are two approaches to doing this. Firstly, if we are doing an entire fork
swap, we could create a new transaction item type that indicates we are changing
the owner of a certain structure from one value to another, and then use ordered
buffer logging to modify all the buffers in the tree without needing to log
them. This would then require log recovery to perform the modification of the
owner information of the objects/structures in question.

This does introduce some interesting ordering details into recovery - we have to
make sure that the owner change replay occurs after the change that moves the
objects is made, not before. Hence we can't use a separate log item for this as
we have no guarantee of strict ordering between multiple items in the log due to
the relogging action of asynchronous transaction commits. Hence there is no
"generic" method we can use for changing the ownership of arbitrary metadata
structures.

For inode forks, however, there is a simple method of communicating that the
fork contents need the owner rewritten - we can pass an inode log format flag
for the fork for the transaction that does a fork swap. This flag will then
follow the inode fork through relogging actions so when the swap actually gets
replayed the ownership can be changed immediately by log recovery. So that gives
us a simple method of "whole fork" exchange between two inodes.

This is relatively simple to implement, so it makes sense to do this as an
initial implementation to support xfs_fsr on CRC enabled filesystems in the same
manner as we do on existing filesystems.

The second approach is to implement a proper extent swap transaction which moves
an arbitrary range of a fork from one inode to another. This would need to be
done as a permanent rolling transaction that moves a fixed number of extents at
a time between the two inode forks. The local/extent format implementation is
trivial - we only need to modify the inode forks and log the inodes to implement
it - but the btree implementation is much, much harder.

The first thing to note is that the two inodes that are being swapped do not
necessarily contain the same data, and hence we cannot assume that we are making
a symmetrical modification. Hence we have to involve an intermediate inode fork
to stage the movement of extents. That is, we move extents from the source to
the intermediate record, move the extents on the target to the source, and then
move the intermediate record extents to the target. Because of the nature of the
movement, we want all three movements in a single transaction but we do not want
the intermediate record to show up in any transactions.
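
The three-step staging described above can be sketched in plain C as follows.
This is only an illustration under simplifying assumptions: the fork is a small
fixed-size array of extent records, all names are hypothetical, and the
intermediate record is a local variable that never escapes the function, which
loosely models the requirement that it never shows up in a transaction.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SWAP_MAX_EXTENTS 8

/* Hypothetical extent record: file offset, start block and length. */
struct sketch_swap_extent {
	unsigned long long startoff;
	unsigned long long startblock;
	unsigned long long len;
};

struct sketch_swap_fork {
	struct sketch_swap_extent ext[SKETCH_SWAP_MAX_EXTENTS];
	size_t count;
};

/* Move every extent from src into an empty dst, leaving src empty. */
void sketch_move_extents(struct sketch_swap_fork *dst,
			 struct sketch_swap_fork *src)
{
	assert(dst->count == 0);
	*dst = *src;
	src->count = 0;
}

/* source -> intermediate, target -> source, intermediate -> target */
void sketch_swap_forks(struct sketch_swap_fork *source,
		       struct sketch_swap_fork *target)
{
	struct sketch_swap_fork staging = { .count = 0 };

	sketch_move_extents(&staging, source);
	sketch_move_extents(source, target);
	sketch_move_extents(target, &staging);
}
```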

This is made complex due to the fact that the extents being swapped might be of
different offsets and lengths, and hence the movement per transaction may
require swapping of partial extent ranges on one side where one inode has a
large contiguous extent and the other has lots of small extents in the same
range. This means that the number of transactions we need to do the swap is
not clearly defined before we start the operation. This is very similar to the
problem truncate has - it has to string multiple extent manipulation operations
together into a single atomic operation.

The extent freeing code does this via a pair of intent/done items that wrap the
entire operation - the EFI/EFD items. To do a co-ordinated, atomic extent swap,
we are going to need an equivalent intent/done pair of log items to indicate
that the upcoming stream of extent manipulations needs to be replayed
completely. This is necessary as the individual extent movement transactions can
result in bmbt blocks being allocated and freed, and hence can be rolling
transactions themselves made atomic via EFI/EFD intents in xfs_bmap_finish().
Hence, at minimum, we need to ensure that each extent that is swapped is fully
and correctly replayed, and to do that we need an Extent Swap Intent (ESI) and
Extent Swap Done (ESD) pair of log items.

Like the EFI/EFD items, however, these intents can record multiple extents to be
swapped at a time, and hence this allows us some flexibility in determining how
to batch up modifications for efficiency purposes. The ESI would record the
exact extent records being swapped between inodes and be committed, after which
we can then swap in a multi-transaction loop (to handle bmap btree
allocation/free operations during insert/remove operations) that updates the ESD
after each extent range in the ESI is swapped successfully.

As a result, recovery would be very similar to EFI/EFD recovery - as each ESD
is seen, it cancels the completed range of the related ESI, and when all ranges
are cancelled the ESI/ESD are removed from the recovery list. If there are ESIs
left at the end of the recovery pass, we then need to run a loop that completes
them and so leaves the inodes in a known correct state.
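
The cancellation scheme above could look roughly like the following userspace
sketch. The structure layout and function names are assumptions made for
illustration; no ESI/ESD log item format is defined in this RFD. Each ESD
cancels one range of its ESI, and whatever is left uncancelled at the end of
the recovery pass is completed by a final loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ESI_MAX_RANGES 4

/* Hypothetical in-memory view of an ESI found during log recovery. */
struct sketch_esi {
	unsigned long long id;             /* ties ESD items to this ESI */
	size_t nranges;                    /* extent ranges recorded in the ESI */
	bool done[SKETCH_ESI_MAX_RANGES];  /* ranges cancelled by an ESD */
};

/*
 * An ESD was seen for one range of this ESI: cancel that range.
 * Returns true once every range is cancelled, i.e. the ESI can be
 * dropped from the recovery list.
 */
bool sketch_esd_cancel(struct sketch_esi *esi, size_t range)
{
	assert(range < esi->nranges);
	esi->done[range] = true;

	for (size_t i = 0; i < esi->nranges; i++) {
		if (!esi->done[i])
			return false;
	}
	return true;
}

/*
 * End of the recovery pass: complete every range that no ESD cancelled.
 * A real implementation would redo the extent swap here; the sketch only
 * marks the range done. Returns the number of ranges it had to complete.
 */
size_t sketch_esi_complete_pending(struct sketch_esi *esi)
{
	size_t completed = 0;

	for (size_t i = 0; i < esi->nranges; i++) {
		if (!esi->done[i]) {
			esi->done[i] = true;
			completed++;
		}
	}
	return completed;
}
```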

This is, overall, much more complex than what is currently needed for xfs_fsr
support, so this is more documentation of how we would implement generic ranged
extent swap support for XFS.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_dfrag.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_dfrag.h b/fs/xfs/xfs_dfrag.h
index 20bdd93..ad688fd 100644
--- a/fs/xfs/xfs_dfrag.h
+++ b/fs/xfs/xfs_dfrag.h
@@ -19,7 +19,7 @@
 #define	__XFS_DFRAG_H__

 /*
- * Structure passed to xfs_swapext
+ * Structure passed to xfs_swapext, currently only supports full file
  */

 typedef struct xfs_swapext
-- 
1.8.3.2

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* ***** SUSPECTED SPAM ***** [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE
  2013-08-12 13:19 ***** SUSPECTED SPAM ***** [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (9 preceding siblings ...)
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 10/17] xfs: swap extents operations for CRC filesystems Dave Chinner
@ 2013-08-12 13:20 ` Dave Chinner
  2013-08-20  8:16   ` Zhi Yong Wu
  2013-11-06 11:21   ` Christoph Hellwig
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 12/17] xfs: add tmpfile methods Dave Chinner
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:20 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

O_TMPFILE support requires allocating an inode that is not attached to the
current namespace - it's anonymous. The current inode allocation code runs
through xfs_create() which requires a parent inode and a name to be passed to
it. For O_TMPFILE, we do not have a parent inode or a name so we cannot use
the same calling conventions as xfs_create() to allocate an inode.

In this case, the inode is anonymous, so it is a property of the allocation
group it is allocated to, not the namespace. Hence all we really need to pass
from the VFS is a struct xfs_mount and the struct xfs_inode pointer that we
return the allocated inode in.

The allocation of the inode requires a different log reservation to mkdir/create
as there is no directory modification taking place, though we still need to
reserve/account quotas appropriately. We do not need to check if we can add the
entry to the directory, either.

Hence the majority of the inode allocation code is similar to that in
xfs_create, and so can be factored out of xfs_create() and reused.

The fact that a parent inode does not exist follows into xfs_dir_ialloc() and
xfs_ialloc(), too. xfs_dir_ialloc() does not actually use the parent inode, just
passes it through to xfs_ialloc(). xfs_ialloc() can handle a null parent inode,
but it results in a target inode number of 0 and so allocation will always
target AG 0. This will effectively serialise O_TMPFILE allocation and removal.

Hence we should separate the parent inode from the allocation target inode all
the way down to xfs_dialloc() while factoring this code. This will allow us to
use a separate AG rotor to direct allocation of temporary files around different
AGs, allowing them to be allocated and removed concurrently.
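
One way a separate AG rotor might work is a simple atomic round-robin counter,
sketched below in userspace C. The structure and function names are
hypothetical and the real selection policy is not specified in this RFD; the
point is only that consecutive O_TMPFILE allocations land in different AGs
instead of all targeting AG 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-mount state; only the fields the rotor needs. */
struct sketch_rotor_mount {
	uint32_t agcount;           /* number of allocation groups */
	atomic_uint tmpfile_rotor;  /* next AG to target for O_TMPFILE */
};

void sketch_rotor_init(struct sketch_rotor_mount *mp, uint32_t agcount)
{
	mp->agcount = agcount;
	atomic_init(&mp->tmpfile_rotor, 0);
}

/*
 * Pick the target AG for an anonymous inode. The atomic increment lets
 * concurrent callers each get a different slot without taking a lock.
 */
uint32_t sketch_tmpfile_pick_ag(struct sketch_rotor_mount *mp)
{
	unsigned int slot = atomic_fetch_add(&mp->tmpfile_rotor, 1);

	return slot % mp->agcount;
}
```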

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 96dda62..9c20a2c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -112,6 +112,7 @@ xfs_cleanup_inode(
 	iput(inode);
 }
 
+/* how much of this does tmpfile need? */
 STATIC int
 xfs_vn_mknod(
 	struct inode	*dir,
-- 
1.8.3.2


* ***** SUSPECTED SPAM *****  [RFD 12/17] xfs: add tmpfile methods
  2013-08-12 13:19 ***** SUSPECTED SPAM ***** [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (10 preceding siblings ...)
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE Dave Chinner
@ 2013-08-12 13:20 ` Dave Chinner
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 13/17] xfs: allow linkat() on O_TMPFILE files Dave Chinner
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:20 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

To add O_TMPFILE support, we need to add a new method to the VFS
interface.  This method does the dentry manipulation and drives the
transactional inode allocation and links it to the unlinked inode
list.

The inode allocation needs new transaction reservations that cover
only the free inode allocation modifications - it does not need the
reservations that cover directory entry additions. With these
reservations, we can call the newly factored inode allocation
function to get an allocated inode.

The final step is to link the inode into the unlinked list on the
AGI so that the inode will be freed and reclaimed when it is no
longer referenced - that will remove it from the unlinked list at
that time. If we crash with the O_TMPFILE still open, log recovery
will process the unlinked lists and free it at that point in time.

Once the inode is added to the unlinked list, we can commit the
transaction and link the inode to the provided dentry and finish
initialising the VFS part of the dentry.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 9c20a2c..82ea957 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1082,6 +1082,7 @@ static const struct inode_operations xfs_dir_inode_operations = {
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
 	.update_time		= xfs_vn_update_time,
+	//.tmpfile		= xfs_vn_tmpfile,
 };

 static const struct inode_operations xfs_dir_ci_inode_operations = {
@@ -1108,6 +1109,7 @@ static const struct inode_operations xfs_dir_ci_inode_operations = {
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
 	.update_time		= xfs_vn_update_time,
+	//.tmpfile		= xfs_vn_tmpfile,
 };

 static const struct inode_operations xfs_symlink_inode_operations = {
-- 
1.8.3.2


* ***** SUSPECTED SPAM ***** [RFD 13/17] xfs: allow linkat() on O_TMPFILE files
  2013-08-12 13:19 ***** SUSPECTED SPAM ***** [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (11 preceding siblings ...)
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 12/17] xfs: add tmpfile methods Dave Chinner
@ 2013-08-12 13:20 ` Dave Chinner
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 14/17] xfs: separate inode freeing from inactivation Dave Chinner
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:20 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The VFS allows an anonymous temporary file to be named at a later
time via a linkat() syscall. Inodes that are created as O_TMPFILE
are marked with a special flag I_LINKABLE that tells the VFS that it
is OK to add a link to the inode even though it currently has a zero
link count.

To support this in XFS, we need to have xfs_link() detect that this
flag is set and behave appropriately. When it is set, we have to
ensure that the transaction reservation takes into account the
additional overhead of removing the inode from the unlinked list.

Once we have determined that we can add the directory entry for the
new link, we can remove the inode from the unlinked list and then
add the directory entry. We do the operation in this order so that
the AGI locking versus directory block allocation (and hence AGF
locking) is the same as xfs_create() does.

Finally, when we bump the link count, we need to ensure we don't
assert fail on the zero link count that the O_TMPFILE inode has.
This is the only case where incrementing the link count when it is
zero is now valid, so we need to make sure that xfs_bumplink()
checks this precisely. The VFS should ensure that I_LINKABLE is
removed on the first link of an anonymous file so care is needed to
ensure that the checks capture this behaviour, too.
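
A minimal sketch of the precise check described above, written as userspace C
with stand-in types. SKETCH_I_LINKABLE models the VFS I_LINKABLE flag, and
clearing it inside the helper is an assumption made here to model the VFS
removing the flag on the first link; the error return value is likewise
arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_I_LINKABLE (1u << 0) /* stand-in for the VFS I_LINKABLE flag */

/* Stand-in inode: link count plus a VFS-style state word. */
struct sketch_link_inode {
	uint32_t nlink;
	unsigned int state;
};

/* A zero link count is only acceptable for a linkable O_TMPFILE inode. */
bool sketch_can_bumplink(const struct sketch_link_inode *ip)
{
	if (ip->nlink > 0)
		return true;
	return (ip->state & SKETCH_I_LINKABLE) != 0;
}

/* Returns 0 on success, -1 if the inode must not gain a link. */
int sketch_bumplink(struct sketch_link_inode *ip)
{
	if (!sketch_can_bumplink(ip))
		return -1;

	ip->nlink++;
	/* Models the flag being dropped once the file has a name. */
	ip->state &= ~SKETCH_I_LINKABLE;
	return 0;
}
```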

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_utils.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_utils.c b/fs/xfs/xfs_utils.c
index 0025c78..fb2fa6c 100644
--- a/fs/xfs/xfs_utils.c
+++ b/fs/xfs/xfs_utils.c
@@ -293,6 +293,7 @@ xfs_bumplink(
 {
 	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);

+	/* tmpfiles won't like this */
 	ASSERT(ip->i_d.di_nlink > 0);
 	ip->i_d.di_nlink++;
 	inc_nlink(VFS_I(ip));
-- 
1.8.3.2


* ***** SUSPECTED SPAM ***** [RFD 14/17] xfs: separate inode freeing from inactivation
  2013-08-12 13:19 ***** SUSPECTED SPAM ***** [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (12 preceding siblings ...)
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 13/17] xfs: allow linkat() on O_TMPFILE files Dave Chinner
@ 2013-08-12 13:20 ` Dave Chinner
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 15/17] xfs: introduce a method vector for unlinked list operations Dave Chinner
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:20 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Inode freeing and unlinked list processing are done as part of the
inactivation transaction when the last reference goes away from the
VFS inode. While it is advantageous to truncate away all the extents
allocated to the inode at this point, it is not necessarily in our
best interests to free the inode immediately.

While the inode is on the unlinked list and there are no more VFS
references to the inode, it is effectively a free inode - the
unlinked list reference tells us this rather than the inode btree
marking the inode free.

If we separate the actual freeing of the inode from the VFS
references, we have an inode that we can reallocate for use without
needing to pass it through the inode allocation btree. That is, we
can allocate directly from the unlinked list in the AG. We already
have the ability to do this for the O_TMPFILE/linkat(2) case where
we allocate directly to the unlinked list and then later link the
referenced inode to a directory and remove it from the unlinked
list.

In this case, if we have an unreferenced inode on the unlinked list,
we can allocate it directly simply by removing it from the unlinked
list. Further, O_TMPFILE allocations can be made effectively without
any transactions being issued at all if there are already free,
unreferenced inodes on the unlinked list.

Hence we need a method of finding inodes that are unreferenced but
on the unlinked list and so available for allocation. A simple method
for doing this is using an inode cache radix tree tag on the inodes that
are unlinked and unreferenced but still on the unlinked list. A
simple tag check can tell us if there are any available for this
method of allocation, so there's no overhead to determine what
method to use.

Further, by using a radix tree tag we can use an inode cache
iterator function to run a periodic worker to remove inodes from the
unlinked list and mark them free in the inode btree. The
advantage of doing the inode freeing in the background is that we do
not have to worry about how quickly we can remove inodes from the
unlinked list as it is no longer in the fast path. This enables us
to use trylock semantics for freeing the inodes and so we can skip
inodes we'd otherwise block on.

Alternatively, we can use the presence of the radix tree tag to
indicate that we need to walk the unlinked inode lists freeing
inodes from them. This may seem appealing until we realise that each
inode on an unlinked list belongs to a different inode chunk due
to the hashing function used. Hence every inode we free will modify
a different btree record and so there is no locality of modification
in the inode btree structures and inode backing buffers.

If we use a radix tree walk, we will process all the free inodes in
a chunk and hence keep good CPU cache locality for all the data
structures that we need to modify for freeing those inodes. This
will be more CPU efficient as the data cache footprint of the walk
will be much smaller and hence we'll stall the CPU a lot less
waiting for cache lines to be loaded from memory.

This background freeing process allows us to make further changes to
the unlinked lists that avoid unsolvable deadlocks. For example, if
we cannot lock inodes on the unlinked list, we can simply have the
freeing of the inode retried again at some point in the future
automatically.

Finally, we need an inode flag to indicate that the inode is in this
special unlinked, unreferenced state when lockless cache lookups are
done. This ensures that we can safely avoid these inodes as lookup
circumstances allow and work correctly with the inode reclaim state
machine. For example, for allocation optimisations we want to be
able to find these inodes, but for all other lookups we want ENOENT
to be returned.
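
The lookup rule in the paragraph above can be sketched as a small decision
function. Both flag names below are hypothetical: one marks an inode as being
in the unlinked, unreferenced state, the other marks a cache lookup as coming
from the allocation path. This is an illustration of the intended behaviour,
not the real inode cache lookup code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical inode state flag: unlinked, unreferenced, still on the list. */
#define SKETCH_IUNLINKED_FREE (1u << 0)
/* Hypothetical lookup flag: the caller is the inode allocation path. */
#define SKETCH_LOOKUP_ALLOC   (1u << 0)

struct sketch_cache_inode {
	unsigned long long ino;
	unsigned int flags;
};

/*
 * Decide what a cache hit means for this lookup. Returns 0 if the inode
 * may be returned to the caller, or -ENOENT if it must be treated as if
 * it did not exist.
 */
int sketch_cache_hit(const struct sketch_cache_inode *ip,
		     unsigned int lookup_flags)
{
	if (!(ip->flags & SKETCH_IUNLINKED_FREE))
		return 0;	/* ordinary in-use inode */
	if (lookup_flags & SKETCH_LOOKUP_ALLOC)
		return 0;	/* allocation wants exactly these inodes */
	return -ENOENT;		/* everyone else must not see it */
}
```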

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_vnodeops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index dc730ac..db712fb 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -374,6 +374,8 @@ xfs_inactive(

 	ASSERT(ip->i_d.di_anextents == 0);

+	/* this is where we need to split inactivation and inode freeing */
+
 	/*
 	 * Free the inode.
 	 */
-- 
1.8.3.2


* ***** SUSPECTED SPAM ***** [RFD 15/17] xfs: introduce a method vector for unlinked list operations
  2013-08-12 13:19 ***** SUSPECTED SPAM ***** [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (13 preceding siblings ...)
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 14/17] xfs: separate inode freeing from inactivation Dave Chinner
@ 2013-08-12 13:20 ` Dave Chinner
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 16/17] xfs: add in-core unlinked list for v3 inodes Dave Chinner
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 17/17] xfs: log unlinked list modifications in the incore v3 inode Dave Chinner
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:20 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Filesystems with V3 inodes can log unlinked inode list modifications
as part of the inode core without needing to use the inode buffers
to log the list modifications or walk the list. However, this
requires a very different method of implementing the unlinked lists,
and so it makes sense to factor out the unlinked list implementation
into a pair of vectored operations for adding and removing the inode
from the current unlinked list.

Add an operations vector to the struct xfs_inode and hook it up so
that all inodes use it to call the current unlinked list manipulation
functions.
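
A rough sketch of what such an operations vector could look like, in userspace
C. The method name iunlink_remove appears later in this series; iunlink_add is
assumed by analogy, and the structure layouts and the trivial "current"
implementation are hypothetical placeholders for the existing buffer-based
list code.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_vec_inode;

/* The pair of vectored unlinked list operations. */
struct sketch_iunlink_ops {
	int (*iunlink_add)(struct sketch_vec_inode *ip);
	int (*iunlink_remove)(struct sketch_vec_inode *ip);
};

struct sketch_vec_inode {
	const struct sketch_iunlink_ops *ops;
	bool on_unlinked_list;
};

/* Placeholder for the existing on-disk list implementation. */
int sketch_v2_iunlink_add(struct sketch_vec_inode *ip)
{
	ip->on_unlinked_list = true;
	return 0;
}

int sketch_v2_iunlink_remove(struct sketch_vec_inode *ip)
{
	ip->on_unlinked_list = false;
	return 0;
}

const struct sketch_iunlink_ops sketch_v2_iunlink_ops = {
	.iunlink_add	= sketch_v2_iunlink_add,
	.iunlink_remove	= sketch_v2_iunlink_remove,
};

/* Callers go through the vector and never name an implementation. */
int sketch_iunlink(struct sketch_vec_inode *ip)
{
	return ip->ops->iunlink_add(ip);
}

int sketch_iunlink_remove(struct sketch_vec_inode *ip)
{
	return ip->ops->iunlink_remove(ip);
}
```

With this shape, a v3 inode implementation would only need to supply a second
ops structure; the callers stay unchanged.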

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b55fd34..2bb7060 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -22,6 +22,8 @@ struct posix_acl;
 struct xfs_dinode;
 struct xfs_inode;
 
+struct xfs_iops;
+
 /*
  * Fork identifiers.
  */
-- 
1.8.3.2


* ***** SUSPECTED SPAM ***** [RFD 16/17] xfs: add in-core unlinked list for v3 inodes
  2013-08-12 13:19 ***** SUSPECTED SPAM ***** [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (14 preceding siblings ...)
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 15/17] xfs: introduce a method vector for unlinked list operations Dave Chinner
@ 2013-08-12 13:20 ` Dave Chinner
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 17/17] xfs: log unlinked list modifications in the incore v3 inode Dave Chinner
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:20 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The first step in optimising unlinked list processing for v3 inodes
is to avoid having to walk the inode buffers to find the next inode
in the list. By definition, inodes on the unlinked list must be in
the inode cache and so we can add a set of incore hash lists to the
struct xfs_perag that match the on disk structure.

This allows us to use in-core locks rather than buffer locks to
provide exclusion for walking and manipulating the list. It also
allows us to use double linked lists so we don't need to walk the
list to find the adjacent inode on the list that needs to be
modified when we remove an inode from the unlinked list.

Further, it allows us to keep track of the tail of the unlinked list
so we can add inodes to the tail of the list instead of the head.
The combination of incore locking and adding inodes to the tail of
the unlinked lists means we can avoid locking the AGI during
additions to the unlinked list (except when the list is empty),
meaning we can execute concurrent list modifications safely when the
AGI itself doesn't need direct modification as part of the
operation.  This removes a significant serialisation point from the
unlink(2) path.
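
The in-core list behaviour described above can be illustrated with a plain
doubly linked list that tracks both head and tail. This is a userspace sketch
with hypothetical names and no locking; each function returns whether the head
of the list changed, which in this model corresponds to the cases where the
AGI bucket pointer would need to be modified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One in-core unlinked list node, embedded in the inode in a real design. */
struct sketch_unode {
	struct sketch_unode *prev;
	struct sketch_unode *next;
};

/* One hash bucket list, tracking the tail as well as the head. */
struct sketch_ulist {
	struct sketch_unode *head;
	struct sketch_unode *tail;
};

/*
 * Add at the tail. Returns true only if the list was empty, which is
 * the one case where the head (and so the AGI bucket) changes.
 */
bool sketch_ulist_add_tail(struct sketch_ulist *list, struct sketch_unode *node)
{
	node->next = NULL;
	node->prev = list->tail;

	if (list->tail) {
		list->tail->next = node;
		list->tail = node;
		return false;
	}
	list->head = node;
	list->tail = node;
	return true;
}

/*
 * Remove without walking the list: both neighbours are reachable from
 * the node itself. Returns true if the head changed.
 */
bool sketch_ulist_remove(struct sketch_ulist *list, struct sketch_unode *node)
{
	bool head_changed = (list->head == node);

	if (node->prev)
		node->prev->next = node->next;
	else
		list->head = node->next;

	if (node->next)
		node->next->prev = node->prev;
	else
		list->tail = node->prev;

	node->prev = NULL;
	node->next = NULL;
	return head_changed;
}
```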

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ag.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index c423191..17f7d43 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -235,6 +235,8 @@ typedef struct xfs_agfl {
  */
 #define XFS_PAGB_NUM_SLOTS	128

+struct xfs_iunlink_list;
+
 typedef struct xfs_perag {
 	struct xfs_mount *pag_mount;	/* owner filesystem */
 	xfs_agnumber_t	pag_agno;	/* AG this structure belongs to */
-- 
1.8.3.2


* ***** SUSPECTED SPAM ***** [RFD 17/17] xfs: log unlinked list modifications in the incore v3 inode
  2013-08-12 13:19 ***** SUSPECTED SPAM ***** [RFD 00/17] xfs: inode management development direction Dave Chinner
                   ` (15 preceding siblings ...)
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 16/17] xfs: add in-core unlinked list for v3 inodes Dave Chinner
@ 2013-08-12 13:20 ` Dave Chinner
  16 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2013-08-12 13:20 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have incore unlinked lists and try-lock capability for
unlinked list removal operations, we can now switch the v3 inodes to
use transactions that directly modify and log the in-core inode
unlinked list pointers.

To do this, we need to lock the inode that points to the current
inode and update its unlinked list pointer and log it. With that
modification, the current inode has been removed from the unlinked
list. If the current inode is at the head of the unlinked list, then
instead of an inode modification we need to modify the AGI unlinked
bucket pointer.

This can all be contained within the .iunlink_remove() method for v3
inodes, but we have to be careful about locking the previous inode -
it needs to use trylock semantics so we don't introduce deadlock
problems, and that means we need to ensure that the xfs_ifree path
handles EAGAIN errors correctly and passes them back to the caller so
that the operation can be retried at a later time.
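
The trylock-and-retry behaviour can be sketched as below. This is a userspace
model only: a C11 atomic_flag stands in for the inode lock, in-memory pointers
stand in for the logged unlinked list pointers, and all names are hypothetical.
The head-of-list case, where the AGI bucket is modified instead, is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_try_inode {
	atomic_flag lock;                        /* stand-in for the inode lock */
	struct sketch_try_inode *next_unlinked;  /* stand-in for the list pointer */
};

/*
 * Unlink ip from behind prev. The previous inode is only ever trylocked,
 * so a held lock produces -EAGAIN for the caller to retry later rather
 * than a potential deadlock.
 */
int sketch_try_iunlink_remove(struct sketch_try_inode *prev,
			      struct sketch_try_inode *ip)
{
	if (atomic_flag_test_and_set(&prev->lock))
		return -EAGAIN;	/* someone else holds prev: do not block */

	/* Point the previous inode past the one being removed. */
	prev->next_unlinked = ip->next_unlinked;
	ip->next_unlinked = NULL;

	atomic_flag_clear(&prev->lock);
	return 0;
}
```

A background freeing worker could simply skip an inode on -EAGAIN and pick it
up again on its next pass.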

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 2bb7060..4c10fa9 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -253,6 +253,7 @@ typedef struct xfs_inode {
 	struct xfs_dquot	*i_udquot;	/* user dquot */
 	struct xfs_dquot	*i_gdquot;	/* group dquot */
 	struct xfs_dquot	*i_pdquot;	/* project dquot */
+	struct list_head	i_unlink_list;

 	/* Inode location stuff */
 	xfs_ino_t		i_ino;		/* inode number (agno/agino)*/
-- 
1.8.3.2


* Re: ***** SUSPECTED SPAM ***** [RFD 06/17] xfs: partial inode chunk allocation
  2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 06/17] xfs: partial inode chunk allocation Dave Chinner
@ 2013-08-13 22:07   ` Brian Foster
  0 siblings, 0 replies; 22+ messages in thread
From: Brian Foster @ 2013-08-13 22:07 UTC (permalink / raw)
  To: xfs

On 08/12/2013 09:19 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a filesystem ages or when certain workloads dominate the storage capacity
> of the filesystem, it can become difficult to find contiguous free space in the
> filesystem and hence inode allocation can fail long before the filesystem is out
> of space.
> 
...
> ---

The issue outlined above is something that was observed with workloads
running against swift (object storage) on top of gluster (distributed
storage) on top of XFS with larger than default inode sizes (512b, 1k).
Unfortunately, I don't have any specific data to describe the workload.
If I recall correctly, the end result was free space fragmentation
leading to premature ENOSPC on inode allocation due to unavailability of
sufficiently sized extents for inode chunks.

After a brief irc conversation, Dave suggested that the immediately
previous item:

[RFD 05/17] xfs: introduce a free inode allocation btree

... tie in with and precede this partial chunk allocation work, so I'm
going to try to pick off this new free inode btree and partial chunk
allocation work. Just a heads up to the list to try and avoid any
duplicate effort. :)

Thanks again for writing this up, Dave.

Brian

>  fs/xfs/xfs_ialloc_btree.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
> index 3ac36b76..75ee794 100644
> --- a/fs/xfs/xfs_ialloc_btree.h
> +++ b/fs/xfs/xfs_ialloc_btree.h
> @@ -48,7 +48,9 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
>   */
>  typedef struct xfs_inobt_rec {
>  	__be32		ir_startino;	/* starting inode number */
> -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> +	__be16		ir_alloc_mask;
> +	__u8		ir_pad;
> +	__u8		ir_freecount;
>  	__be64		ir_free;	/* free inode mask */
>  } xfs_inobt_rec_t;
>  
> 


* Re: ***** SUSPECTED SPAM ***** [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE Dave Chinner
@ 2013-08-20  8:16   ` Zhi Yong Wu
  2013-11-06 11:20     ` Christoph Hellwig
  2013-11-06 11:21   ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Zhi Yong Wu @ 2013-08-20  8:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfstests

HI,

I'd like to pick off this item and its following 2 items, please avoid
the duplicated work, thanks.

[RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE,
[RFD 12/17] xfs: add tmpfile methods
[RFD 13/17] xfs: allow linkat() on O_TMPFILE files

On Mon, Aug 12, 2013 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> O_TMPFILE support requires allocating an inode that is not attached to the
> a current namespace - it's anonymous. The current inode allocation code runs
> through xfs_create() which requires a parent inode and a name to be passed to
> it. for O_TMPFILE, we do not have a parent inode or a name so we cannot use
> the same calling conventions as xfs_create() to allocate a inode.
>
> In this case, the inode is anonymous, so it is a property of the allocation
> group it is allocated to, not the namespace. Hence all we really need to pass
> from the VFS is a struct xfs_mount and the struct xfs_inode pointer that we
> return the allocated inode in.
>
> The allocation of the inode requires a different log reservation to mkdir/create
> as there is no directory modification taking place, though we still need to
> reserve/account quotas appropriately. We do not need to check if we can add the
> entry to the directory, either.
>
> Hence the majority of the inode allocation code is similar to that in
> xfs_create, and so can be factored out of xfs_create() and reused.
>
> The fact that a parent inode does not exist follows into xfs_dir_ialloc() and
> xfs_ialloc(), too. xfs_dir_ialloc() does not actually use the parent inode, just
> passes it through to xfs_ialloc(). xfs_ialloc() can handle a null parent inode,
> but it results in a target inode number of 0 and so allocation will always
> target AG 0, This will effectively serialise O_TMPFILE allocation and removal.
>
> Hence we should separate the parent inode from the allocation target inode all
> the way down to xfs_dialloc() while factoring this code. This will allow us to
> use a separate AG rotor to direct allocation of temporary files around different
> AGs, allowing them to the allocated and removed concurrently.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iops.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 96dda62..9c20a2c 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -112,6 +112,7 @@ xfs_cleanup_inode(
>         iput(inode);
>  }
>
> +/* how much of this does tmpfile need? */
>  STATIC int
>  xfs_vn_mknod(
>         struct inode    *dir,
> --
> 1.8.3.2
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs



-- 
Regards,

Zhi Yong Wu


* Re: ***** SUSPECTED SPAM ***** [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE
  2013-08-20  8:16   ` Zhi Yong Wu
@ 2013-11-06 11:20     ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2013-11-06 11:20 UTC (permalink / raw)
  To: Zhi Yong Wu; +Cc: xfstests

On Tue, Aug 20, 2013 at 04:16:59PM +0800, Zhi Yong Wu wrote:
> HI,
> 
> I'd like to pick off this item and its following 2 items, please avoid
> the duplicated work, thanks.
> 
> [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE,
> [RFD 12/17] xfs: add tmpfile methods
> [RFD 13/17] xfs: allow linkat() on O_TMPFILE files

Did you make any progress on this?


* Re: ***** SUSPECTED SPAM ***** [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE
  2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE Dave Chinner
  2013-08-20  8:16   ` Zhi Yong Wu
@ 2013-11-06 11:21   ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2013-11-06 11:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Zhi Yong Wu, xfs

On Mon, Aug 12, 2013 at 11:20:01PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> O_TMPFILE support requires allocating an inode that is not attached to the
> a current namespace - it's anonymous. The current inode allocation code runs
> through xfs_create() which requires a parent inode and a name to be passed to
> it. for O_TMPFILE, we do not have a parent inode or a name so we cannot use
> the same calling conventions as xfs_create() to allocate a inode.

Btw, O_TMPFILE does get a path passed, which the existing
implementations treat as an invisible parent.  I would suggest we follow
that path for our allocation decisions for now.
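
For anyone following along, here is roughly what that looks like from
userspace. This is only a sketch based on the open(2) and linkat(2)
interfaces, not the XFS kernel code under discussion; the helper names
create_anonymous_file() and publish_file() are made up for illustration.
The directory passed to open() names no file; it only selects the
filesystem (and, per the suggestion above, could act as the allocation
hint).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unnamed regular file on the filesystem
 * that holds 'dir'. No directory entry is created; 'dir' is the
 * "invisible parent". Returns a file descriptor, or -1 with errno set
 * (e.g. EOPNOTSUPP if the filesystem lacks O_TMPFILE support).
 */
static int create_anonymous_file(const char *dir)
{
	return open(dir, O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR);
}

/*
 * Hypothetical helper: give the anonymous inode a name later on by
 * linking it through /proc/self/fd, as described in open(2).
 * Requires /proc to be mounted. Returns 0 on success, -1 on error.
 */
static int publish_file(int fd, const char *newpath)
{
	char procpath[64];

	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
	return linkat(AT_FDCWD, procpath, AT_FDCWD, newpath,
		      AT_SYMLINK_FOLLOW);
}
```

If the descriptor is closed without publish_file() ever being called,
the file simply goes away, which is why the inode has to sit on the
unlinked list in the meantime.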


end of thread, other threads:[~2013-11-06 11:21 UTC | newest]

Thread overview: 22+ messages
2013-08-12 13:19 ***** SUSPECTED SPAM ***** [RFD 00/17] xfs: inode management development direction Dave Chinner
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 01/17] xfs: inode allocation tickets Dave Chinner
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 02/17] xfs: separate inode chunk allocation from free inode allocation Dave Chinner
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 03/17] xfs: move inode chunk allocation into a workqueue Dave Chinner
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 04/17] xfs: optimise background inode chunk allocation Dave Chinner
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 05/17] xfs: introduce a free inode allocation btree Dave Chinner
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 06/17] xfs: partial inode chunk allocation Dave Chinner
2013-08-13 22:07   ` Brian Foster
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 07/17] xfs: separate inode chunk freeing from inode freeing Dave Chinner
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 08/17] xfs: inode chunk freeing in the background Dave Chinner
2013-08-12 13:19 ` ***** SUSPECTED SPAM ***** [RFD 09/17] xfs: optimise inode chunk freeing Dave Chinner
2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 10/17] xfs: swap extents operations for CRC filesystems Dave Chinner
2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 11/17] xfs: factor xfs_create to prepare for O_TMPFILE Dave Chinner
2013-08-20  8:16   ` Zhi Yong Wu
2013-11-06 11:20     ` Christoph Hellwig
2013-11-06 11:21   ` Christoph Hellwig
2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 12/17] xfs: add tmpfile methods Dave Chinner
2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 13/17] xfs: allow linkat() on O_TMPFILE files Dave Chinner
2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 14/17] xfs: separate inode freeing from inactivation Dave Chinner
2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 15/17] xfs: introduce a method vector for unlinked list operations Dave Chinner
2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 16/17] xfs: add in-core unlinked list for v3 inodes Dave Chinner
2013-08-12 13:20 ` ***** SUSPECTED SPAM ***** [RFD 17/17] xfs: log unlinked list modifications in the incore v3 inode Dave Chinner
