* [PATCH 0/3] xfs: regression fixes for 3.16-rc5
@ 2014-07-10 23:26 Dave Chinner
2014-07-10 23:26 ` [PATCH 1/3] Revert "xfs: block allocation work needs to be kswapd aware" Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Dave Chinner @ 2014-07-10 23:26 UTC (permalink / raw)
To: xfs
Hi folks,
These three patches are fixes for recent regressions. The first two
clean up the mess caused by making the allocation workqueue kswapd
aware. The first reverts the original patch, and the second moves
the stack switch to the problematic double btree split path rather
than doing it for all writeback allocation. Because such splits are
now relatively rare, the stack switch can be done unconditionally
for all allocations, and the kswapd state can be passed through
safely because we aren't going to swamp memory reclaim with
hundreds of concurrent allocation requests from "kswapd" context.
The last patch fixes an older regression, one that was tripped
over recently while fixing up v3.2.0 of xfs_repair. It only affects
people who switch from newer kernels to older kernels and have only
user quotas enabled. That combination shouldn't be too common - the
fact that the bug has been there since 3.12 indicates that this,
indeed, isn't a frequent occurrence. Still, it needs to be fixed.
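The offload pattern the second patch relies on - package the
arguments, hand them to a worker, and wait for completion so the
deep call chain runs on a fresh stack - can be sketched in
userspace. This is a simplified, synchronous stand-in (the struct
and function names here are invented for illustration; the kernel
queues the work item to the xfs_alloc_wq workqueue and blocks on a
struct completion):

```c
#include <assert.h>

/* Arguments marshalled for the worker, loosely mirroring
 * struct xfs_btree_split_args in the patch below. */
struct split_args {
	int level;	/* input */
	int result;	/* output: 0 on success, negative on error */
	int done;	/* stands in for a struct completion */
};

/* The deep work that would otherwise run near the bottom of the
 * caller's already-deep stack. */
static void split_worker(struct split_args *args)
{
	args->result = (args->level >= 0) ? 0 : -1;
	args->done = 1;			/* complete(args->done) */
}

/*
 * Wrapper: marshal the arguments, run the worker, wait for the
 * completion and return the worker's result. In the kernel the
 * call to split_worker() happens on a workqueue thread with a
 * fresh stack; here it is called directly for simplicity.
 */
static int split(int level)
{
	struct split_args args = { .level = level, .result = -1, .done = 0 };

	split_worker(&args);	/* kernel: queue_work() + wait_for_completion() */
	assert(args.done);
	return args.result;
}
```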
Comments, thoughts, and testing all welcome...
-Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 1/3] Revert "xfs: block allocation work needs to be kswapd aware"
2014-07-10 23:26 [PATCH 0/3] xfs: regression fixes for 3.16-rc5 Dave Chinner
@ 2014-07-10 23:26 ` Dave Chinner
2014-07-11 12:32 ` Brian Foster
2014-07-10 23:26 ` [PATCH 2/3] xfs: refine the allocation stack switch Dave Chinner
2014-07-10 23:26 ` [PATCH 3/3] xfs: null unused quota inodes when quota is on Dave Chinner
2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2014-07-10 23:26 UTC (permalink / raw)
To: xfs
This reverts commit 1f6d64829db78a7e1d63e15c9f48f0a5d2b5a679.
This commit caused performance regressions in low memory
situations where kswapd was doing writeback of delayed allocation
blocks. It resulted in significant parallelism of the kswapd work,
and combined with the special kswapd flags it meant that hundreds
of active allocations could dip into kswapd-specific memory
reserves and avoid being throttled. This caused a large amount of
performance variation, as well as random OOM-killer invocations
that didn't previously exist.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_bmap_util.c | 16 +++-------------
fs/xfs/xfs_bmap_util.h | 13 ++++++-------
2 files changed, 9 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 703b3ec..057f671 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -258,23 +258,14 @@ xfs_bmapi_allocate_worker(
struct xfs_bmalloca *args = container_of(work,
struct xfs_bmalloca, work);
unsigned long pflags;
- unsigned long new_pflags = PF_FSTRANS;
- /*
- * we are in a transaction context here, but may also be doing work
- * in kswapd context, and hence we may need to inherit that state
- * temporarily to ensure that we don't block waiting for memory reclaim
- * in any way.
- */
- if (args->kswapd)
- new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
-
- current_set_flags_nested(&pflags, new_pflags);
+ /* we are in a transaction context here */
+ current_set_flags_nested(&pflags, PF_FSTRANS);
args->result = __xfs_bmapi_allocate(args);
complete(args->done);
- current_restore_flags_nested(&pflags, new_pflags);
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
}
/*
@@ -293,7 +284,6 @@ xfs_bmapi_allocate(
args->done = &done;
- args->kswapd = current_is_kswapd();
INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker);
queue_work(xfs_alloc_wq, &args->work);
wait_for_completion(&done);
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 075f722..935ed2b 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -50,13 +50,12 @@ struct xfs_bmalloca {
xfs_extlen_t total; /* total blocks needed for xaction */
xfs_extlen_t minlen; /* minimum allocation size (blocks) */
xfs_extlen_t minleft; /* amount must be left after alloc */
- bool eof; /* set if allocating past last extent */
- bool wasdel; /* replacing a delayed allocation */
- bool userdata;/* set if is user data */
- bool aeof; /* allocated space at eof */
- bool conv; /* overwriting unwritten extents */
- bool stack_switch;
- bool kswapd; /* allocation in kswapd context */
+ char eof; /* set if allocating past last extent */
+ char wasdel; /* replacing a delayed allocation */
+ char userdata;/* set if is user data */
+ char aeof; /* allocated space at eof */
+ char conv; /* overwriting unwritten extents */
+ char stack_switch;
int flags;
struct completion *done;
struct work_struct work;
--
2.0.0
* [PATCH 2/3] xfs: refine the allocation stack switch
2014-07-10 23:26 [PATCH 0/3] xfs: regression fixes for 3.16-rc5 Dave Chinner
2014-07-10 23:26 ` [PATCH 1/3] Revert "xfs: block allocation work needs to be kswapd aware" Dave Chinner
@ 2014-07-10 23:26 ` Dave Chinner
2014-07-11 12:33 ` Brian Foster
2014-07-10 23:26 ` [PATCH 3/3] xfs: null unused quota inodes when quota is on Dave Chinner
2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2014-07-10 23:26 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The allocation stack switch at xfs_bmapi_allocate() has served its
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.
Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.
This is important because we still have the issue of older kernels
having smaller stacks and that they are still supported and are
demonstrating a wide range of different stack overflows. Red Hat
has several open bugs for allocation based stack overflows from
directory modifications and direct IO block allocation and these
problems still need to be solved. If we can solve them upstream,
then distros won't need to bake their own unique solutions.
To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.
A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.
Finally, we have to move the kswapd state to the allocation workqueue
work when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.
Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performance overhead caused by switching stacks. The worst-case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.
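The kswapd state handling described above amounts to a small, pure
decision: the worker always marks itself as being in transaction
context, and additionally inherits the reclaim-related flags when
the task that queued the work was kswapd. A sketch, using
illustrative bit values rather than the kernel's real PF_* flag
definitions:

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's per-task flag bits;
 * the real values live in include/linux/sched.h. */
enum {
	PF_FSTRANS	= 1 << 0,	/* inside a fs transaction */
	PF_MEMALLOC	= 1 << 1,	/* may dip into memory reserves */
	PF_SWAPWRITE	= 1 << 2,	/* allowed to write to swap */
	PF_KSWAPD	= 1 << 3,	/* running on behalf of kswapd */
};

/*
 * Compute the flags the split worker should run with: always
 * PF_FSTRANS, plus the reclaim flags when the queueing task was
 * kswapd (mirrors the logic in xfs_btree_split_worker() below).
 */
static unsigned long worker_pflags(int queued_by_kswapd)
{
	unsigned long new_pflags = PF_FSTRANS;

	if (queued_by_kswapd)
		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
	return new_pflags;
}
```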
The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:
37) 1768 64 kmem_cache_alloc+0x13b/0x170
about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.
The amount of stack being used is much less than I've previously been
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.
Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_bmap.c | 4 +--
fs/xfs/xfs_bmap_util.c | 43 --------------------------
fs/xfs/xfs_bmap_util.h | 15 +++++----
fs/xfs/xfs_btree.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 91 insertions(+), 55 deletions(-)
diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 96175df..32bc49c 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -4298,8 +4298,8 @@ xfs_bmapi_delay(
}
-int
-__xfs_bmapi_allocate(
+static int
+xfs_bmapi_allocate(
struct xfs_bmalloca *bma)
{
struct xfs_mount *mp = bma->ip->i_mount;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 057f671..64731ef 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -249,49 +249,6 @@ xfs_bmap_rtalloc(
}
/*
- * Stack switching interfaces for allocation
- */
-static void
-xfs_bmapi_allocate_worker(
- struct work_struct *work)
-{
- struct xfs_bmalloca *args = container_of(work,
- struct xfs_bmalloca, work);
- unsigned long pflags;
-
- /* we are in a transaction context here */
- current_set_flags_nested(&pflags, PF_FSTRANS);
-
- args->result = __xfs_bmapi_allocate(args);
- complete(args->done);
-
- current_restore_flags_nested(&pflags, PF_FSTRANS);
-}
-
-/*
- * Some allocation requests often come in with little stack to work on. Push
- * them off to a worker thread so there is lots of stack to use. Otherwise just
- * call directly to avoid the context switch overhead here.
- */
-int
-xfs_bmapi_allocate(
- struct xfs_bmalloca *args)
-{
- DECLARE_COMPLETION_ONSTACK(done);
-
- if (!args->stack_switch)
- return __xfs_bmapi_allocate(args);
-
-
- args->done = &done;
- INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker);
- queue_work(xfs_alloc_wq, &args->work);
- wait_for_completion(&done);
- destroy_work_on_stack(&args->work);
- return args->result;
-}
-
-/*
* Check if the endoff is outside the last extent. If so the caller will grow
* the allocation to a stripe unit boundary. All offsets are considered outside
* the end of file for an empty fork, so 1 is returned in *eof in that case.
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 935ed2b..91df8e9 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -50,12 +50,13 @@ struct xfs_bmalloca {
xfs_extlen_t total; /* total blocks needed for xaction */
xfs_extlen_t minlen; /* minimum allocation size (blocks) */
xfs_extlen_t minleft; /* amount must be left after alloc */
- char eof; /* set if allocating past last extent */
- char wasdel; /* replacing a delayed allocation */
- char userdata;/* set if is user data */
- char aeof; /* allocated space at eof */
- char conv; /* overwriting unwritten extents */
- char stack_switch;
+ bool eof; /* set if allocating past last extent */
+ bool wasdel; /* replacing a delayed allocation */
+ bool userdata;/* set if is user data */
+ bool aeof; /* allocated space at eof */
+ bool conv; /* overwriting unwritten extents */
+ bool stack_switch;
+ bool kswapd; /* work being done for kswapd */
int flags;
struct completion *done;
struct work_struct work;
@@ -65,8 +66,6 @@ struct xfs_bmalloca {
int xfs_bmap_finish(struct xfs_trans **tp, struct xfs_bmap_free *flist,
int *committed);
int xfs_bmap_rtalloc(struct xfs_bmalloca *ap);
-int xfs_bmapi_allocate(struct xfs_bmalloca *args);
-int __xfs_bmapi_allocate(struct xfs_bmalloca *args);
int xfs_bmap_eof(struct xfs_inode *ip, xfs_fileoff_t endoff,
int whichfork, int *eof);
int xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index bf810c6..61168e3 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -33,6 +33,7 @@
#include "xfs_error.h"
#include "xfs_trace.h"
#include "xfs_cksum.h"
+#include "xfs_alloc.h"
/*
* Cursor allocation zone.
@@ -2322,8 +2323,8 @@ error1:
* Return new block number and the key to its first
* record (to be inserted into parent).
*/
-STATIC int /* error */
-xfs_btree_split(
+int /* error */
+__xfs_btree_split(
struct xfs_btree_cur *cur,
int level,
union xfs_btree_ptr *ptrp,
@@ -2503,6 +2504,85 @@ error0:
return error;
}
+struct xfs_btree_split_args {
+ struct xfs_btree_cur *cur;
+ int level;
+ union xfs_btree_ptr *ptrp;
+ union xfs_btree_key *key;
+ struct xfs_btree_cur **curp;
+ int *stat; /* success/failure */
+ int result;
+ bool kswapd; /* allocation in kswapd context */
+ struct completion *done;
+ struct work_struct work;
+};
+
+/*
+ * Stack switching interfaces for allocation
+ */
+static void
+xfs_btree_split_worker(
+ struct work_struct *work)
+{
+ struct xfs_btree_split_args *args = container_of(work,
+ struct xfs_btree_split_args, work);
+ unsigned long pflags;
+ unsigned long new_pflags = PF_FSTRANS;
+
+ /*
+ * we are in a transaction context here, but may also be doing work
+ * in kswapd context, and hence we may need to inherit that state
+ * temporarily to ensure that we don't block waiting for memory reclaim
+ * in any way.
+ */
+ if (args->kswapd)
+ new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+
+ current_set_flags_nested(&pflags, new_pflags);
+
+ args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
+ args->key, args->curp, args->stat);
+ complete(args->done);
+
+ current_restore_flags_nested(&pflags, new_pflags);
+}
+
+/*
+ * BMBT split requests often come in with little stack to work on. Push
+ * them off to a worker thread so there is lots of stack to use. For the other
+ * btree types, just call directly to avoid the context switch overhead here.
+ */
+STATIC int /* error */
+xfs_btree_split(
+ struct xfs_btree_cur *cur,
+ int level,
+ union xfs_btree_ptr *ptrp,
+ union xfs_btree_key *key,
+ struct xfs_btree_cur **curp,
+ int *stat) /* success/failure */
+{
+ struct xfs_btree_split_args args;
+ DECLARE_COMPLETION_ONSTACK(done);
+
+ if (cur->bc_btnum != XFS_BTNUM_BMAP)
+ return __xfs_btree_split(cur, level, ptrp, key, curp, stat);
+
+ args.cur = cur;
+ args.level = level;
+ args.ptrp = ptrp;
+ args.key = key;
+ args.curp = curp;
+ args.stat = stat;
+ args.done = &done;
+ args.kswapd = current_is_kswapd();
+ INIT_WORK_ONSTACK(&args.work, xfs_btree_split_worker);
+ queue_work(xfs_alloc_wq, &args.work);
+ wait_for_completion(&done);
+ destroy_work_on_stack(&args.work);
+ return args.result;
+}
+
+
/*
* Copy the old inode root contents into a real block and make the
* broot point to it.
--
2.0.0
* [PATCH 3/3] xfs: null unused quota inodes when quota is on
2014-07-10 23:26 [PATCH 0/3] xfs: regression fixes for 3.16-rc5 Dave Chinner
2014-07-10 23:26 ` [PATCH 1/3] Revert "xfs: block allocation work needs to be kswapd aware" Dave Chinner
2014-07-10 23:26 ` [PATCH 2/3] xfs: refine the allocation stack switch Dave Chinner
@ 2014-07-10 23:26 ` Dave Chinner
2014-07-11 13:15 ` Brian Foster
2014-07-11 15:22 ` Arkadiusz Miśkiewicz
2 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2014-07-10 23:26 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
When quota is on, it is expected that unused quota inodes have a
value of NULLFSINO. The changes to support a separate project quota
inode in 3.12 broke this rule for filesystems that do not have a
project quota inode enabled, as the code now refuses to write the
group quota inode if neither group nor project quotas are enabled.
This regression was introduced by commit d892d58 ("xfs: Start using
pquotaino from the superblock").
In this case, we should be writing NULLFSINO rather than nothing to
ensure that we leave the group quota inode in a valid state while
quotas are enabled.
Failure to do so doesn't cause a current kernel to break - the
separate project quota inode work introduced translation code to
always treat a zero inode as NULLFSINO. This was introduced by
commit 0102629 ("xfs: Initialize all quota inodes to be
NULLFSINO"), which is also in 3.12, but older kernels do not do
this and hence taking a filesystem back to an older kernel can
result in quotas failing initialisation at mount time. When that
happens, we see this in dmesg:
[ 1649.215390] XFS (sdb): Mounting Filesystem
[ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
[ 1649.316902] XFS (sdb): Ending clean mount
By ensuring that we write NULLFSINO to quota inodes that aren't
active, we avoid this problem.
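The fixed selection logic can be modelled as a pure function: write
the group quota inode if group quota is active, the project quota
inode (which shares the same on-disk slot on pre-pquotino
superblocks) if project quota is active, and NULLFSINO otherwise.
The flag bits and inode numbers below are simplified stand-ins for
the real on-disk format, for illustration only:

```c
#include <assert.h>
#include <stdint.h>

#define NULLFSINO	((uint64_t)-1)	/* "no inode" marker */

/* Simplified stand-ins for the kernel's quota accounting flags. */
enum {
	XFS_GQUOTA_ACCT = 1 << 0,	/* group quota accounting on */
	XFS_PQUOTA_ACCT = 1 << 1,	/* project quota accounting on */
};

/*
 * Pick the value to write into the on-disk sb_gquotino slot of a
 * pre-pquotino superblock (mirrors the fixed branch in
 * xfs_sb_quota_to_disk()): never leave the slot as 0 when the
 * quota feature bit is set.
 */
static uint64_t gquotino_to_disk(unsigned int qflags,
				 uint64_t gquotino, uint64_t pquotino)
{
	if (qflags & XFS_GQUOTA_ACCT)
		return gquotino;	/* group quota active */
	if (qflags & XFS_PQUOTA_ACCT)
		return pquotino;	/* project quota shares the slot */
	return NULLFSINO;		/* neither active: write NULLFSINO */
}
```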
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_sb.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_sb.c b/fs/xfs/xfs_sb.c
index c3453b1..9a58699 100644
--- a/fs/xfs/xfs_sb.c
+++ b/fs/xfs/xfs_sb.c
@@ -483,10 +483,16 @@ xfs_sb_quota_to_disk(
}
/*
- * GQUOTINO and PQUOTINO cannot be used together in versions
- * of superblock that do not have pquotino. from->sb_flags
- * tells us which quota is active and should be copied to
- * disk.
+ * GQUOTINO and PQUOTINO cannot be used together in versions of
+ * superblock that do not have pquotino. from->sb_flags tells us which
+ * quota is active and should be copied to disk. If neither are active,
+ * make sure we write NULLFSINO to the sb_gquotino field as a quota
+ * inode value of "0" is invalid when the XFS_SB_VERSION_QUOTA feature
+ * bit is set.
+ *
+ * Note that we don't need to handle the sb_uquotino or sb_pquotino here
+ * as they do not require any translation. Hence the main sb field loop
+ * will write them appropriately from the in-core superblock.
*/
if ((*fields & XFS_SB_GQUOTINO) &&
(from->sb_qflags & XFS_GQUOTA_ACCT))
@@ -494,6 +500,8 @@ xfs_sb_quota_to_disk(
else if ((*fields & XFS_SB_PQUOTINO) &&
(from->sb_qflags & XFS_PQUOTA_ACCT))
to->sb_gquotino = cpu_to_be64(from->sb_pquotino);
+ else
+ to->sb_gquotino = cpu_to_be64(NULLFSINO);
*fields &= ~(XFS_SB_PQUOTINO | XFS_SB_GQUOTINO);
}
--
2.0.0
* Re: [PATCH 1/3] Revert "xfs: block allocation work needs to be kswapd aware"
2014-07-10 23:26 ` [PATCH 1/3] Revert "xfs: block allocation work needs to be kswapd aware" Dave Chinner
@ 2014-07-11 12:32 ` Brian Foster
0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2014-07-11 12:32 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Fri, Jul 11, 2014 at 09:26:17AM +1000, Dave Chinner wrote:
> This reverts commit 1f6d64829db78a7e1d63e15c9f48f0a5d2b5a679.
>
> This commit resulted in regressions in performance in low
> memory situations where kswapd was doing writeback of delayed
> allocation blocks. It resulted in significant parallelism of the
> kswapd work and with the special kswapd flags meant that hundreds of
> active allocation could dip into kswapd specific memory reserves and
> avoid being throttled. This cause a large amount of performance
> variation, as well as random OOM-killer invocations that didn't
> previously exist.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
I was going to suggest to keep the bool types, but that's fixed up in
the subsequent patch. ;) Looks good...
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/xfs_bmap_util.c | 16 +++-------------
> fs/xfs/xfs_bmap_util.h | 13 ++++++-------
> 2 files changed, 9 insertions(+), 20 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 703b3ec..057f671 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -258,23 +258,14 @@ xfs_bmapi_allocate_worker(
> struct xfs_bmalloca *args = container_of(work,
> struct xfs_bmalloca, work);
> unsigned long pflags;
> - unsigned long new_pflags = PF_FSTRANS;
>
> - /*
> - * we are in a transaction context here, but may also be doing work
> - * in kswapd context, and hence we may need to inherit that state
> - * temporarily to ensure that we don't block waiting for memory reclaim
> - * in any way.
> - */
> - if (args->kswapd)
> - new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> -
> - current_set_flags_nested(&pflags, new_pflags);
> + /* we are in a transaction context here */
> + current_set_flags_nested(&pflags, PF_FSTRANS);
>
> args->result = __xfs_bmapi_allocate(args);
> complete(args->done);
>
> - current_restore_flags_nested(&pflags, new_pflags);
> + current_restore_flags_nested(&pflags, PF_FSTRANS);
> }
>
> /*
> @@ -293,7 +284,6 @@ xfs_bmapi_allocate(
>
>
> args->done = &done;
> - args->kswapd = current_is_kswapd();
> INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker);
> queue_work(xfs_alloc_wq, &args->work);
> wait_for_completion(&done);
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 075f722..935ed2b 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -50,13 +50,12 @@ struct xfs_bmalloca {
> xfs_extlen_t total; /* total blocks needed for xaction */
> xfs_extlen_t minlen; /* minimum allocation size (blocks) */
> xfs_extlen_t minleft; /* amount must be left after alloc */
> - bool eof; /* set if allocating past last extent */
> - bool wasdel; /* replacing a delayed allocation */
> - bool userdata;/* set if is user data */
> - bool aeof; /* allocated space at eof */
> - bool conv; /* overwriting unwritten extents */
> - bool stack_switch;
> - bool kswapd; /* allocation in kswapd context */
> + char eof; /* set if allocating past last extent */
> + char wasdel; /* replacing a delayed allocation */
> + char userdata;/* set if is user data */
> + char aeof; /* allocated space at eof */
> + char conv; /* overwriting unwritten extents */
> + char stack_switch;
> int flags;
> struct completion *done;
> struct work_struct work;
> --
> 2.0.0
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 2/3] xfs: refine the allocation stack switch
2014-07-10 23:26 ` [PATCH 2/3] xfs: refine the allocation stack switch Dave Chinner
@ 2014-07-11 12:33 ` Brian Foster
2014-07-11 21:57 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2014-07-11 12:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Fri, Jul 11, 2014 at 09:26:18AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
...
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_bmap.c | 4 +--
> fs/xfs/xfs_bmap_util.c | 43 --------------------------
> fs/xfs/xfs_bmap_util.h | 15 +++++----
> fs/xfs/xfs_btree.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++--
> 4 files changed, 91 insertions(+), 55 deletions(-)
>
...
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 935ed2b..91df8e9 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -50,12 +50,13 @@ struct xfs_bmalloca {
> xfs_extlen_t total; /* total blocks needed for xaction */
> xfs_extlen_t minlen; /* minimum allocation size (blocks) */
> xfs_extlen_t minleft; /* amount must be left after alloc */
> - char eof; /* set if allocating past last extent */
> - char wasdel; /* replacing a delayed allocation */
> - char userdata;/* set if is user data */
> - char aeof; /* allocated space at eof */
> - char conv; /* overwriting unwritten extents */
> - char stack_switch;
> + bool eof; /* set if allocating past last extent */
> + bool wasdel; /* replacing a delayed allocation */
> + bool userdata;/* set if is user data */
> + bool aeof; /* allocated space at eof */
> + bool conv; /* overwriting unwritten extents */
> + bool stack_switch;
> + bool kswapd; /* work being done for kswapd */
Neither stack_switch nor kswapd are used any longer. Removal of
stack_switch means that the XFS_BMAPI_STACK_SWITCH flag can go away as
well.
> int flags;
> struct completion *done;
> struct work_struct work;
> @@ -65,8 +66,6 @@ struct xfs_bmalloca {
> int xfs_bmap_finish(struct xfs_trans **tp, struct xfs_bmap_free *flist,
> int *committed);
> int xfs_bmap_rtalloc(struct xfs_bmalloca *ap);
> -int xfs_bmapi_allocate(struct xfs_bmalloca *args);
> -int __xfs_bmapi_allocate(struct xfs_bmalloca *args);
> int xfs_bmap_eof(struct xfs_inode *ip, xfs_fileoff_t endoff,
> int whichfork, int *eof);
> int xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
> diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
> index bf810c6..61168e3 100644
> --- a/fs/xfs/xfs_btree.c
> +++ b/fs/xfs/xfs_btree.c
> @@ -33,6 +33,7 @@
> #include "xfs_error.h"
> #include "xfs_trace.h"
> #include "xfs_cksum.h"
> +#include "xfs_alloc.h"
>
> /*
> * Cursor allocation zone.
> @@ -2322,8 +2323,8 @@ error1:
> * Return new block number and the key to its first
> * record (to be inserted into parent).
> */
> -STATIC int /* error */
> -xfs_btree_split(
> +int /* error */
> +__xfs_btree_split(
Looks like this can remain static. The rest looks Ok to me, but I'll run
some tests too.
Brian
> struct xfs_btree_cur *cur,
> int level,
> union xfs_btree_ptr *ptrp,
> @@ -2503,6 +2504,85 @@ error0:
> return error;
> }
>
> +struct xfs_btree_split_args {
> + struct xfs_btree_cur *cur;
> + int level;
> + union xfs_btree_ptr *ptrp;
> + union xfs_btree_key *key;
> + struct xfs_btree_cur **curp;
> + int *stat; /* success/failure */
> + int result;
> + bool kswapd; /* allocation in kswapd context */
> + struct completion *done;
> + struct work_struct work;
> +};
> +
> +/*
> + * Stack switching interfaces for allocation
> + */
> +static void
> +xfs_btree_split_worker(
> + struct work_struct *work)
> +{
> + struct xfs_btree_split_args *args = container_of(work,
> + struct xfs_btree_split_args, work);
> + unsigned long pflags;
> + unsigned long new_pflags = PF_FSTRANS;
> +
> + /*
> + * we are in a transaction context here, but may also be doing work
> + * in kswapd context, and hence we may need to inherit that state
> + * temporarily to ensure that we don't block waiting for memory reclaim
> + * in any way.
> + */
> + if (args->kswapd)
> + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> +
> + current_set_flags_nested(&pflags, new_pflags);
> +
> + args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
> + args->key, args->curp, args->stat);
> + complete(args->done);
> +
> + current_restore_flags_nested(&pflags, new_pflags);
> +}
> +
> +/*
> + * BMBT split requests often come in with little stack to work on. Push
> + * them off to a worker thread so there is lots of stack to use. For the other
> + * btree types, just call directly to avoid the context switch overhead here.
> + */
> +STATIC int /* error */
> +xfs_btree_split(
> + struct xfs_btree_cur *cur,
> + int level,
> + union xfs_btree_ptr *ptrp,
> + union xfs_btree_key *key,
> + struct xfs_btree_cur **curp,
> + int *stat) /* success/failure */
> +{
> + struct xfs_btree_split_args args;
> + DECLARE_COMPLETION_ONSTACK(done);
> +
> + if (cur->bc_btnum != XFS_BTNUM_BMAP)
> + return __xfs_btree_split(cur, level, ptrp, key, curp, stat);
> +
> + args.cur = cur;
> + args.level = level;
> + args.ptrp = ptrp;
> + args.key = key;
> + args.curp = curp;
> + args.stat = stat;
> + args.done = &done;
> + args.kswapd = current_is_kswapd();
> + INIT_WORK_ONSTACK(&args.work, xfs_btree_split_worker);
> + queue_work(xfs_alloc_wq, &args.work);
> + wait_for_completion(&done);
> + destroy_work_on_stack(&args.work);
> + return args.result;
> +}
> +
> +
> /*
> * Copy the old inode root contents into a real block and make the
> * broot point to it.
> --
> 2.0.0
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 3/3] xfs: null unused quota inodes when quota is on
2014-07-10 23:26 ` [PATCH 3/3] xfs: null unused quota inodes when quota is on Dave Chinner
@ 2014-07-11 13:15 ` Brian Foster
2014-07-11 22:00 ` Dave Chinner
2014-07-11 15:22 ` Arkadiusz Miśkiewicz
1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2014-07-11 13:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Fri, Jul 11, 2014 at 09:26:19AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When quota is on, it is expected that unused quota inodes have a
> value of NULLFSINO. The changes to support a separate project quota
> in 3.12 broken this rule for non-project quota inode enabled
> filesystem, as the code now refuses to write the group quota inode
> if neither group or project quotas are enabled. This regression was
> introduced by commit d892d58 ("xfs: Start using pquotaino from the
> superblock").
>
> In this case, we should be writing NULLFSINO rather than nothing to
> ensure that we leave the group quota inode in a valid state while
> quotas are enabled.
>
> Failure to do so doesn't cause a current kernel to break - the
> separate project quota inodes introduced translation code to always
> treat a zero inode as NULLFSINO. This was introduced by commit
> 0102629 ("xfs: Initialize all quota inodes to be NULLFSINO") with is
> also in 3.12 but older kernels do not do this and hence taking a
> filesystem back to an older kernel can result in quotas failing
> initialisation at mount time. When that happens, we see this in
> dmesg:
>
> [ 1649.215390] XFS (sdb): Mounting Filesystem
> [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
> [ 1649.316902] XFS (sdb): Ending clean mount
>
> By ensuring that we write NULLFSINO to quota inodes that aren't
> active, we avoid this problem.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_sb.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_sb.c b/fs/xfs/xfs_sb.c
> index c3453b1..9a58699 100644
> --- a/fs/xfs/xfs_sb.c
> +++ b/fs/xfs/xfs_sb.c
> @@ -483,10 +483,16 @@ xfs_sb_quota_to_disk(
> }
>
> /*
> - * GQUOTINO and PQUOTINO cannot be used together in versions
> - * of superblock that do not have pquotino. from->sb_flags
> - * tells us which quota is active and should be copied to
> - * disk.
> + * GQUOTINO and PQUOTINO cannot be used together in versions of
> + * superblock that do not have pquotino. from->sb_flags tells us which
> + * quota is active and should be copied to disk. If neither are active,
> + * make sure we write NULLFSINO to the sb_gquotino field as a quota
> + * inode value of "0" is invalid when the XFS_SB_VERSION_QUOTA feature
> + * bit is set.
> + *
> + * Note that we don't need to handle the sb_uquotino or sb_pquotino here
> + * as they do not require any translation. Hence the main sb field loop
> + * will write them appropriately from the in-core superblock.
> */
sb_uquotino does not require any translation, but sb_pquotino simply
doesn't exist on older sb's, right? Is that what you mean to say here?
E.g., we clear XFS_SB_PQUOTINO from fields below, so the field loop
isn't going to write it (and I take it that's the right thing to do).
> if ((*fields & XFS_SB_GQUOTINO) &&
> (from->sb_qflags & XFS_GQUOTA_ACCT))
> @@ -494,6 +500,8 @@ xfs_sb_quota_to_disk(
> else if ((*fields & XFS_SB_PQUOTINO) &&
> (from->sb_qflags & XFS_PQUOTA_ACCT))
> to->sb_gquotino = cpu_to_be64(from->sb_pquotino);
> + else
> + to->sb_gquotino = cpu_to_be64(NULLFSINO);
>
We update the field manually and clear the flag bit so the loop doesn't
handle it. Seems Ok, despite my minor confusion over the above comment:
Reviewed-by: Brian Foster <bfoster@redhat.com>
> *fields &= ~(XFS_SB_PQUOTINO | XFS_SB_GQUOTINO);
> }
> --
> 2.0.0
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 3/3] xfs: null unused quota inodes when quota is on
2014-07-10 23:26 ` [PATCH 3/3] xfs: null unused quota inodes when quota is on Dave Chinner
2014-07-11 13:15 ` Brian Foster
@ 2014-07-11 15:22 ` Arkadiusz Miśkiewicz
2014-07-11 15:30 ` Arkadiusz Miśkiewicz
1 sibling, 1 reply; 14+ messages in thread
From: Arkadiusz Miśkiewicz @ 2014-07-11 15:22 UTC (permalink / raw)
To: xfs
On Friday 11 of July 2014, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When quota is on, it is expected that unused quota inodes have a
> value of NULLFSINO. The changes to support a separate project quota
> inode in 3.12 broke this rule for filesystems without an enabled
> project quota inode, as the code now refuses to write the group quota
> inode if neither group nor project quotas are enabled. This regression was
> introduced by commit d892d58 ("xfs: Start using pquotaino from the
> superblock").
>
> In this case, we should be writing NULLFSINO rather than nothing to
> ensure that we leave the group quota inode in a valid state while
> quotas are enabled.
>
> Failure to do so doesn't cause a current kernel to break - the
> separate project quota inode work introduced translation code to
> always treat a zero inode as NULLFSINO. This was introduced by commit
> 0102629 ("xfs: Initialize all quota inodes to be NULLFSINO"), which is
> also in 3.12, but older kernels do not do this and hence taking a
> filesystem back to an older kernel can result in quotas failing
> initialisation at mount time. When that happens, we see this in
> dmesg:
>
> [ 1649.215390] XFS (sdb): Mounting Filesystem
> [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
> [ 1649.316902] XFS (sdb): Ending clean mount
>
> By ensuring that we write NULLFSINO to quota inodes that aren't
> active, we avoid this problem.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_sb.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_sb.c b/fs/xfs/xfs_sb.c
> index c3453b1..9a58699 100644
> --- a/fs/xfs/xfs_sb.c
> +++ b/fs/xfs/xfs_sb.c
> @@ -483,10 +483,16 @@ xfs_sb_quota_to_disk(
> }
>
> /*
> - * GQUOTINO and PQUOTINO cannot be used together in versions
> - * of superblock that do not have pquotino. from->sb_flags
> - * tells us which quota is active and should be copied to
> - * disk.
> + * GQUOTINO and PQUOTINO cannot be used together in versions of
> + * superblock that do not have pquotino. from->sb_flags tells us which
> + * quota is active and should be copied to disk. If neither are active,
> + * make sure we write NULLFSINO to the sb_gquotino field as a quota
> + * inode value of "0" is invalid when the XFS_SB_VERSION_QUOTA feature
> + * bit is set.
> + *
> + * Note that we don't need to handle the sb_uquotino or sb_pquotino here
> + * as they do not require any translation. Hence the main sb field loop
> + * will write them appropriately from the in-core superblock.
> */
> if ((*fields & XFS_SB_GQUOTINO) &&
> (from->sb_qflags & XFS_GQUOTA_ACCT))
> @@ -494,6 +500,8 @@ xfs_sb_quota_to_disk(
> else if ((*fields & XFS_SB_PQUOTINO) &&
> (from->sb_qflags & XFS_PQUOTA_ACCT))
> to->sb_gquotino = cpu_to_be64(from->sb_pquotino);
> + else
> + to->sb_gquotino = cpu_to_be64(NULLFSINO);
>
> *fields &= ~(XFS_SB_PQUOTINO | XFS_SB_GQUOTINO);
> }
Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
3.16.0-rc4-00120-g85d90fa + patch
$ truncate -s 50M 50M-image
$ mkfs.xfs -f 50M-image
meta-data=50M-image isize=256 agcount=2, agsize=6400 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=12800, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=853, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
$ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
uquotino = 0
gquotino = 0
pquotino = 0
$ sudo mount 50M-image /media/floppy/ -o usrquota
[sudo] password for arekm:
$ dmesg|tail -n 4
[ 98.413877] XFS (loop0): Mounting V4 Filesystem
[ 98.445950] XFS (loop0): Ending clean mount
[ 98.445987] XFS (loop0): Quotacheck needed: Please wait.
[ 98.469666] XFS (loop0): Quotacheck: Done.
$ sudo umount /media/floppy/
$ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
uquotino = 131
gquotino = null
pquotino = 0
and it properly mounted on 3.10.46 with usrquota,grpquota
--
Arkadiusz Miśkiewicz, arekm / maven.pl
* Re: [PATCH 3/3] xfs: null unused quota inodes when quota is on
2014-07-11 15:22 ` Arkadiusz Miśkiewicz
@ 2014-07-11 15:30 ` Arkadiusz Miśkiewicz
2014-07-14 0:58 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Arkadiusz Miśkiewicz @ 2014-07-11 15:30 UTC (permalink / raw)
To: xfs
On Friday 11 of July 2014, Arkadiusz Miśkiewicz wrote:
> On Friday 11 of July 2014, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > When quota is on, it is expected that unused quota inodes have a
> > value of NULLFSINO. The changes to support a separate project quota
> > inode in 3.12 broke this rule for filesystems without an enabled
> > project quota inode, as the code now refuses to write the group quota
> > inode if neither group nor project quotas are enabled. This regression was
> > introduced by commit d892d58 ("xfs: Start using pquotaino from the
> > superblock").
> >
> > In this case, we should be writing NULLFSINO rather than nothing to
> > ensure that we leave the group quota inode in a valid state while
> > quotas are enabled.
> >
> > Failure to do so doesn't cause a current kernel to break - the
> > separate project quota inode work introduced translation code to
> > always treat a zero inode as NULLFSINO. This was introduced by commit
> > 0102629 ("xfs: Initialize all quota inodes to be NULLFSINO"), which is
> > also in 3.12, but older kernels do not do this and hence taking a
> > filesystem back to an older kernel can result in quotas failing
> > initialisation at mount time. When that happens, we see this in
> > dmesg:
> >
> > [ 1649.215390] XFS (sdb): Mounting Filesystem
> > [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
> > [ 1649.316902] XFS (sdb): Ending clean mount
> >
> > By ensuring that we write NULLFSINO to quota inodes that aren't
> > active, we avoid this problem.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >
> > fs/xfs/xfs_sb.c | 16 ++++++++++++----
> > 1 file changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_sb.c b/fs/xfs/xfs_sb.c
> > index c3453b1..9a58699 100644
> > --- a/fs/xfs/xfs_sb.c
> > +++ b/fs/xfs/xfs_sb.c
> > @@ -483,10 +483,16 @@ xfs_sb_quota_to_disk(
> > }
> >
> > /*
> > - * GQUOTINO and PQUOTINO cannot be used together in versions
> > - * of superblock that do not have pquotino. from->sb_flags
> > - * tells us which quota is active and should be copied to
> > - * disk.
> > + * GQUOTINO and PQUOTINO cannot be used together in versions of
> > + * superblock that do not have pquotino. from->sb_flags tells us which
> > + * quota is active and should be copied to disk. If neither are active,
> > + * make sure we write NULLFSINO to the sb_gquotino field as a quota
> > + * inode value of "0" is invalid when the XFS_SB_VERSION_QUOTA feature
> > + * bit is set.
> > + *
> > + * Note that we don't need to handle the sb_uquotino or sb_pquotino here
> > + * as they do not require any translation. Hence the main sb field loop
> > + * will write them appropriately from the in-core superblock.
> > */
> > if ((*fields & XFS_SB_GQUOTINO) &&
> > (from->sb_qflags & XFS_GQUOTA_ACCT))
> > @@ -494,6 +500,8 @@ xfs_sb_quota_to_disk(
> > else if ((*fields & XFS_SB_PQUOTINO) &&
> > (from->sb_qflags & XFS_PQUOTA_ACCT))
> > to->sb_gquotino = cpu_to_be64(from->sb_pquotino);
> > + else
> > + to->sb_gquotino = cpu_to_be64(NULLFSINO);
> >
> > *fields &= ~(XFS_SB_PQUOTINO | XFS_SB_GQUOTINO);
> > }
>
> Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
>
> 3.16.0-rc4-00120-g85d90fa + patch
>
> $ truncate -s 50M 50M-image
> $ mkfs.xfs -f 50M-image
> meta-data=50M-image isize=256 agcount=2, agsize=6400 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=0 finobt=0
> data = bsize=4096 blocks=12800, imaxpct=25
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal log bsize=4096 blocks=853, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> $ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
> uquotino = 0
> gquotino = 0
> pquotino = 0
> $ sudo mount 50M-image /media/floppy/ -o usrquota
> [sudo] password for arekm:
> $ dmesg|tail -n 4
> [ 98.413877] XFS (loop0): Mounting V4 Filesystem
> [ 98.445950] XFS (loop0): Ending clean mount
> [ 98.445987] XFS (loop0): Quotacheck needed: Please wait.
> [ 98.469666] XFS (loop0): Quotacheck: Done.
> $ sudo umount /media/floppy/
> $ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
> uquotino = 131
> gquotino = null
> pquotino = 0
>
> and it properly mounted on 3.10.46 with usrquota,grpquota
Actually there is a problem with different options, so likely more fixes
are needed. The gquota inode doesn't get allocated if mounted with grpquota:
3.16git+patch:
[arekm@t400 test]$ rm 50M-image
[arekm@t400 test]$ truncate -s 50M 50M-image
[arekm@t400 test]$ mkfs.xfs -f 50M-image
meta-data=50M-image isize=256 agcount=2, agsize=6400 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=12800, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=853, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
$ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
uquotino = 0
gquotino = 0
pquotino = 0
$ sudo mount 50M-image /media/floppy/ -o grpquota
$ sudo umount /media/floppy/
$ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
uquotino = null
gquotino = null
pquotino = 0
but on 3.10.46 gquotino is allocated:
$ rm 50M-image
$ truncate -s 50M 50M-image
$ mkfs.xfs 50M-image
meta-data=50M-image isize=256 agcount=2, agsize=6400 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=12800, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=853, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
$ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
uquotino = 0
gquotino = 0
pquotino = 0
$ sudo mount 50M-image /media/floppy/ -o grpquota
$ sudo umount /media/floppy/
$ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
uquotino = null
gquotino = 131
pquotino = 0
--
Arkadiusz Miśkiewicz, arekm / maven.pl
* Re: [PATCH 2/3] xfs: refine the allocation stack switch
2014-07-11 12:33 ` Brian Foster
@ 2014-07-11 21:57 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2014-07-11 21:57 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Fri, Jul 11, 2014 at 08:33:53AM -0400, Brian Foster wrote:
> On Fri, Jul 11, 2014 at 09:26:18AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> ...
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/xfs_bmap.c | 4 +--
> > fs/xfs/xfs_bmap_util.c | 43 --------------------------
> > fs/xfs/xfs_bmap_util.h | 15 +++++----
> > fs/xfs/xfs_btree.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++--
> > 4 files changed, 91 insertions(+), 55 deletions(-)
> >
> ...
> > diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> > index 935ed2b..91df8e9 100644
> > --- a/fs/xfs/xfs_bmap_util.h
> > +++ b/fs/xfs/xfs_bmap_util.h
> > @@ -50,12 +50,13 @@ struct xfs_bmalloca {
> > xfs_extlen_t total; /* total blocks needed for xaction */
> > xfs_extlen_t minlen; /* minimum allocation size (blocks) */
> > xfs_extlen_t minleft; /* amount must be left after alloc */
> > - char eof; /* set if allocating past last extent */
> > - char wasdel; /* replacing a delayed allocation */
> > - char userdata;/* set if is user data */
> > - char aeof; /* allocated space at eof */
> > - char conv; /* overwriting unwritten extents */
> > - char stack_switch;
> > + bool eof; /* set if allocating past last extent */
> > + bool wasdel; /* replacing a delayed allocation */
> > + bool userdata;/* set if is user data */
> > + bool aeof; /* allocated space at eof */
> > + bool conv; /* overwriting unwritten extents */
> > + bool stack_switch;
> > + bool kswapd; /* work being done for kswapd */
>
> Neither stack_switch nor kswapd are used any longer. Removal of
> stack_switch means that the XFS_BMAPI_STACK_SWITCH flag can go away as
> well.
Yup, I'll fix that.
> > -STATIC int /* error */
> > -xfs_btree_split(
> > +int /* error */
> > +__xfs_btree_split(
>
> Looks like this can remain static. The rest looks Ok to me, but I'll run
> some tests too.
I'll fix that too.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 3/3] xfs: null unused quota inodes when quota is on
2014-07-11 13:15 ` Brian Foster
@ 2014-07-11 22:00 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2014-07-11 22:00 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Fri, Jul 11, 2014 at 09:15:15AM -0400, Brian Foster wrote:
> On Fri, Jul 11, 2014 at 09:26:19AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > When quota is on, it is expected that unused quota inodes have a
> > value of NULLFSINO. The changes to support a separate project quota
> > inode in 3.12 broke this rule for filesystems without an enabled
> > project quota inode, as the code now refuses to write the group quota
> > inode if neither group nor project quotas are enabled. This regression was
> > introduced by commit d892d58 ("xfs: Start using pquotaino from the
> > superblock").
> >
> > In this case, we should be writing NULLFSINO rather than nothing to
> > ensure that we leave the group quota inode in a valid state while
> > quotas are enabled.
> >
> > Failure to do so doesn't cause a current kernel to break - the
> > separate project quota inode work introduced translation code to
> > always treat a zero inode as NULLFSINO. This was introduced by commit
> > 0102629 ("xfs: Initialize all quota inodes to be NULLFSINO"), which is
> > also in 3.12, but older kernels do not do this and hence taking a
> > filesystem back to an older kernel can result in quotas failing
> > initialisation at mount time. When that happens, we see this in
> > dmesg:
> >
> > [ 1649.215390] XFS (sdb): Mounting Filesystem
> > [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
> > [ 1649.316902] XFS (sdb): Ending clean mount
> >
> > By ensuring that we write NULLFSINO to quota inodes that aren't
> > active, we avoid this problem.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/xfs_sb.c | 16 ++++++++++++----
> > 1 file changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_sb.c b/fs/xfs/xfs_sb.c
> > index c3453b1..9a58699 100644
> > --- a/fs/xfs/xfs_sb.c
> > +++ b/fs/xfs/xfs_sb.c
> > @@ -483,10 +483,16 @@ xfs_sb_quota_to_disk(
> > }
> >
> > /*
> > - * GQUOTINO and PQUOTINO cannot be used together in versions
> > - * of superblock that do not have pquotino. from->sb_flags
> > - * tells us which quota is active and should be copied to
> > - * disk.
> > + * GQUOTINO and PQUOTINO cannot be used together in versions of
> > + * superblock that do not have pquotino. from->sb_flags tells us which
> > + * quota is active and should be copied to disk. If neither are active,
> > + * make sure we write NULLFSINO to the sb_gquotino field as a quota
> > + * inode value of "0" is invalid when the XFS_SB_VERSION_QUOTA feature
> > + * bit is set.
> > + *
> > + * Note that we don't need to handle the sb_uquotino or sb_pquotino here
> > + * as they do not require any translation. Hence the main sb field loop
> > + * will write them appropriately from the in-core superblock.
> > */
>
> sb_uquotino does not require any translation, but sb_pquotino simply
> doesn't exist on older sb's, right? Is that what you mean to say here?
> E.g., we clear XFS_SB_PQUOTINO from fields below, so the field loop
> isn't going to write it (and I take it that's the right thing to do).
Correct.
> > if ((*fields & XFS_SB_GQUOTINO) &&
> > (from->sb_qflags & XFS_GQUOTA_ACCT))
> > @@ -494,6 +500,8 @@ xfs_sb_quota_to_disk(
> > else if ((*fields & XFS_SB_PQUOTINO) &&
> > (from->sb_qflags & XFS_PQUOTA_ACCT))
> > to->sb_gquotino = cpu_to_be64(from->sb_pquotino);
> > + else
> > + to->sb_gquotino = cpu_to_be64(NULLFSINO);
> >
>
> We update the field manually and clear the flag bit so the loop doesn't
> handle it. Seems Ok, despite my minor confusion over the above comment:
Yeah, it's a bit messy. And I'm really thinking that given how
rarely we change the superblock, we should just convert it to the
same "convert everything" mechanism we use for all the other on-disk
formatting functions. i.e. get rid of all the bitfield crap....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 3/3] xfs: null unused quota inodes when quota is on
2014-07-11 15:30 ` Arkadiusz Miśkiewicz
@ 2014-07-14 0:58 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2014-07-14 0:58 UTC (permalink / raw)
To: Arkadiusz Miśkiewicz; +Cc: xfs
On Fri, Jul 11, 2014 at 05:30:10PM +0200, Arkadiusz Miśkiewicz wrote:
> On Friday 11 of July 2014, Arkadiusz Miśkiewicz wrote:
> > On Friday 11 of July 2014, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > When quota is on, it is expected that unused quota inodes have a
> > > value of NULLFSINO. The changes to support a separate project quota
> > > inode in 3.12 broke this rule for filesystems without an enabled
> > > project quota inode, as the code now refuses to write the group quota
> > > inode if neither group nor project quotas are enabled. This regression was
> > > introduced by commit d892d58 ("xfs: Start using pquotaino from the
> > > superblock").
> > >
> > > In this case, we should be writing NULLFSINO rather than nothing to
> > > ensure that we leave the group quota inode in a valid state while
> > > quotas are enabled.
> > >
> > > Failure to do so doesn't cause a current kernel to break - the
> > > separate project quota inode work introduced translation code to
> > > always treat a zero inode as NULLFSINO. This was introduced by commit
> > > 0102629 ("xfs: Initialize all quota inodes to be NULLFSINO"), which is
> > > also in 3.12, but older kernels do not do this and hence taking a
> > > filesystem back to an older kernel can result in quotas failing
> > > initialisation at mount time. When that happens, we see this in
> > > dmesg:
> > >
> > > [ 1649.215390] XFS (sdb): Mounting Filesystem
> > > [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
> > > [ 1649.316902] XFS (sdb): Ending clean mount
> > >
> > > By ensuring that we write NULLFSINO to quota inodes that aren't
> > > active, we avoid this problem.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
.....
> Actually there is a problem with different options, so likely more fixes
> are needed. The gquota inode doesn't get allocated if mounted with grpquota:
>
> 3.16git+patch:
> [arekm@t400 test]$ rm 50M-image
> [arekm@t400 test]$ truncate -s 50M 50M-image
> [arekm@t400 test]$ mkfs.xfs -f 50M-image
> meta-data=50M-image isize=256 agcount=2, agsize=6400 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=0 finobt=0
> data = bsize=4096 blocks=12800, imaxpct=25
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal log bsize=4096 blocks=853, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> $ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
> uquotino = 0
> gquotino = 0
> pquotino = 0
> $ sudo mount 50M-image /media/floppy/ -o grpquota
> $ sudo umount /media/floppy/
> $ xfs_db 50M-image -c "sb 0" -c "print" |grep quot
> uquotino = null
> gquotino = null
> pquotino = 0
Ok, I can reproduce that.
What a freakin' mess.
We can't unconditionally write NULLFSINO to the field because of the
way the updates and logging of the superblock buffer work - it only
updates the fields that are changed, so most updates (which occur
during mount or unmount) don't set the quota inode fields at all.
In fact, the whole "only write the specific sb fields" code seems
like an optimisation that is no longer needed, given that we never
actually update or log the superblock buffer in a fast path anymore.
So, for the dev tree, I'm just going to rip all this crap out. As to
a small, targeted regression fix, well, I haven't worked that out
yet....
-Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 2/3] xfs: refine the allocation stack switch
2014-07-14 4:48 [PATCH 0/3, V2] xfs; fixes for 3.16-rc5 Dave Chinner
@ 2014-07-14 4:48 ` Dave Chinner
2014-07-14 12:02 ` Brian Foster
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2014-07-14 4:48 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The allocation stack switch at xfs_bmapi_allocate() has served its
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.
Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.
This is important because we still have the issue of older kernels
having smaller stacks and that they are still supported and are
demonstrating a wide range of different stack overflows. Red Hat
has several open bugs for allocation based stack overflows from
directory modifications and direct IO block allocation and these
problems still need to be solved. If we can solve them upstream,
then distro's won't need to bake their own unique solutions.
To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.
A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.
Finally, we have to move the kswapd state to the allocation workqueue
work when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.
Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performance overhead caused by switching stacks. The worst case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.
The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:
37) 1768 64 kmem_cache_alloc+0x13b/0x170
about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.
The amount of stack being used is much less than I've previously been
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.
Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_bmap.c | 7 ++---
fs/xfs/xfs_bmap.h | 4 +--
fs/xfs/xfs_bmap_util.c | 43 --------------------------
fs/xfs/xfs_bmap_util.h | 13 +++-----
fs/xfs/xfs_btree.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_iomap.c | 3 +-
6 files changed, 90 insertions(+), 62 deletions(-)
diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 96175df..75c3fe5 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -4298,8 +4298,8 @@ xfs_bmapi_delay(
}
-int
-__xfs_bmapi_allocate(
+static int
+xfs_bmapi_allocate(
struct xfs_bmalloca *bma)
{
struct xfs_mount *mp = bma->ip->i_mount;
@@ -4578,9 +4578,6 @@ xfs_bmapi_write(
bma.flist = flist;
bma.firstblock = firstblock;
- if (flags & XFS_BMAPI_STACK_SWITCH)
- bma.stack_switch = 1;
-
while (bno < end && n < *nmap) {
inhole = eof || bma.got.br_startoff > bno;
wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
diff --git a/fs/xfs/xfs_bmap.h b/fs/xfs/xfs_bmap.h
index 38ba36e..b879ca5 100644
--- a/fs/xfs/xfs_bmap.h
+++ b/fs/xfs/xfs_bmap.h
@@ -77,7 +77,6 @@ typedef struct xfs_bmap_free
* from written to unwritten, otherwise convert from unwritten to written.
*/
#define XFS_BMAPI_CONVERT 0x040
-#define XFS_BMAPI_STACK_SWITCH 0x080
#define XFS_BMAPI_FLAGS \
{ XFS_BMAPI_ENTIRE, "ENTIRE" }, \
@@ -86,8 +85,7 @@ typedef struct xfs_bmap_free
{ XFS_BMAPI_PREALLOC, "PREALLOC" }, \
{ XFS_BMAPI_IGSTATE, "IGSTATE" }, \
{ XFS_BMAPI_CONTIG, "CONTIG" }, \
- { XFS_BMAPI_CONVERT, "CONVERT" }, \
- { XFS_BMAPI_STACK_SWITCH, "STACK_SWITCH" }
+ { XFS_BMAPI_CONVERT, "CONVERT" }
static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 057f671..64731ef 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -249,49 +249,6 @@ xfs_bmap_rtalloc(
}
/*
- * Stack switching interfaces for allocation
- */
-static void
-xfs_bmapi_allocate_worker(
- struct work_struct *work)
-{
- struct xfs_bmalloca *args = container_of(work,
- struct xfs_bmalloca, work);
- unsigned long pflags;
-
- /* we are in a transaction context here */
- current_set_flags_nested(&pflags, PF_FSTRANS);
-
- args->result = __xfs_bmapi_allocate(args);
- complete(args->done);
-
- current_restore_flags_nested(&pflags, PF_FSTRANS);
-}
-
-/*
- * Some allocation requests often come in with little stack to work on. Push
- * them off to a worker thread so there is lots of stack to use. Otherwise just
- * call directly to avoid the context switch overhead here.
- */
-int
-xfs_bmapi_allocate(
- struct xfs_bmalloca *args)
-{
- DECLARE_COMPLETION_ONSTACK(done);
-
- if (!args->stack_switch)
- return __xfs_bmapi_allocate(args);
-
-
- args->done = &done;
- INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker);
- queue_work(xfs_alloc_wq, &args->work);
- wait_for_completion(&done);
- destroy_work_on_stack(&args->work);
- return args->result;
-}
-
-/*
* Check if the endoff is outside the last extent. If so the caller will grow
* the allocation to a stripe unit boundary. All offsets are considered outside
* the end of file for an empty fork, so 1 is returned in *eof in that case.
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 935ed2b..2fdb72d 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -50,12 +50,11 @@ struct xfs_bmalloca {
xfs_extlen_t total; /* total blocks needed for xaction */
xfs_extlen_t minlen; /* minimum allocation size (blocks) */
xfs_extlen_t minleft; /* amount must be left after alloc */
- char eof; /* set if allocating past last extent */
- char wasdel; /* replacing a delayed allocation */
- char userdata;/* set if is user data */
- char aeof; /* allocated space at eof */
- char conv; /* overwriting unwritten extents */
- char stack_switch;
+ bool eof; /* set if allocating past last extent */
+ bool wasdel; /* replacing a delayed allocation */
+ bool userdata;/* set if is user data */
+ bool aeof; /* allocated space at eof */
+ bool conv; /* overwriting unwritten extents */
int flags;
struct completion *done;
struct work_struct work;
@@ -65,8 +64,6 @@ struct xfs_bmalloca {
int xfs_bmap_finish(struct xfs_trans **tp, struct xfs_bmap_free *flist,
int *committed);
int xfs_bmap_rtalloc(struct xfs_bmalloca *ap);
-int xfs_bmapi_allocate(struct xfs_bmalloca *args);
-int __xfs_bmapi_allocate(struct xfs_bmalloca *args);
int xfs_bmap_eof(struct xfs_inode *ip, xfs_fileoff_t endoff,
int whichfork, int *eof);
int xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index bf810c6..cf893bc 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -33,6 +33,7 @@
#include "xfs_error.h"
#include "xfs_trace.h"
#include "xfs_cksum.h"
+#include "xfs_alloc.h"
/*
* Cursor allocation zone.
@@ -2323,7 +2324,7 @@ error1:
* record (to be inserted into parent).
*/
STATIC int /* error */
-xfs_btree_split(
+__xfs_btree_split(
struct xfs_btree_cur *cur,
int level,
union xfs_btree_ptr *ptrp,
@@ -2503,6 +2504,85 @@ error0:
return error;
}
+struct xfs_btree_split_args {
+ struct xfs_btree_cur *cur;
+ int level;
+ union xfs_btree_ptr *ptrp;
+ union xfs_btree_key *key;
+ struct xfs_btree_cur **curp;
+ int *stat; /* success/failure */
+ int result;
+ bool kswapd; /* allocation in kswapd context */
+ struct completion *done;
+ struct work_struct work;
+};
+
+/*
+ * Stack switching interfaces for allocation
+ */
+static void
+xfs_btree_split_worker(
+ struct work_struct *work)
+{
+ struct xfs_btree_split_args *args = container_of(work,
+ struct xfs_btree_split_args, work);
+ unsigned long pflags;
+ unsigned long new_pflags = PF_FSTRANS;
+
+ /*
+ * we are in a transaction context here, but may also be doing work
+ * in kswapd context, and hence we may need to inherit that state
+ * temporarily to ensure that we don't block waiting for memory reclaim
+ * in any way.
+ */
+ if (args->kswapd)
+ new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+
+ current_set_flags_nested(&pflags, new_pflags);
+
+ args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
+ args->key, args->curp, args->stat);
+ complete(args->done);
+
+ current_restore_flags_nested(&pflags, new_pflags);
+}
+
+/*
+ * BMBT split requests often come in with little stack to work on. Push
+ * them off to a worker thread so there is lots of stack to use. For the other
+ * btree types, just call directly to avoid the context switch overhead here.
+ */
+STATIC int /* error */
+xfs_btree_split(
+ struct xfs_btree_cur *cur,
+ int level,
+ union xfs_btree_ptr *ptrp,
+ union xfs_btree_key *key,
+ struct xfs_btree_cur **curp,
+ int *stat) /* success/failure */
+{
+ struct xfs_btree_split_args args;
+ DECLARE_COMPLETION_ONSTACK(done);
+
+ if (cur->bc_btnum != XFS_BTNUM_BMAP)
+ return __xfs_btree_split(cur, level, ptrp, key, curp, stat);
+
+ args.cur = cur;
+ args.level = level;
+ args.ptrp = ptrp;
+ args.key = key;
+ args.curp = curp;
+ args.stat = stat;
+ args.done = &done;
+ args.kswapd = current_is_kswapd();
+ INIT_WORK_ONSTACK(&args.work, xfs_btree_split_worker);
+ queue_work(xfs_alloc_wq, &args.work);
+ wait_for_completion(&done);
+ destroy_work_on_stack(&args.work);
+ return args.result;
+}
+
+
/*
* Copy the old inode root contents into a real block and make the
* broot point to it.
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 6c5eb4c..6d3ec2b 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -749,8 +749,7 @@ xfs_iomap_write_allocate(
* pointer that the caller gave to us.
*/
error = xfs_bmapi_write(tp, ip, map_start_fsb,
- count_fsb,
- XFS_BMAPI_STACK_SWITCH,
+ count_fsb, 0,
&first_block, 1,
imap, &nimaps, &free_list);
if (error)
--
2.0.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 2/3] xfs: refine the allocation stack switch
2014-07-14 4:48 ` [PATCH 2/3] xfs: refine the allocation stack switch Dave Chinner
@ 2014-07-14 12:02 ` Brian Foster
0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2014-07-14 12:02 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Jul 14, 2014 at 02:48:58PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
...
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Looks good...
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/xfs_bmap.c | 7 ++---
> fs/xfs/xfs_bmap.h | 4 +--
> fs/xfs/xfs_bmap_util.c | 43 --------------------------
> fs/xfs/xfs_bmap_util.h | 13 +++-----
> fs/xfs/xfs_btree.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/xfs/xfs_iomap.c | 3 +-
> 6 files changed, 90 insertions(+), 62 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> index 96175df..75c3fe5 100644
> --- a/fs/xfs/xfs_bmap.c
> +++ b/fs/xfs/xfs_bmap.c
> @@ -4298,8 +4298,8 @@ xfs_bmapi_delay(
> }
>
>
> -int
> -__xfs_bmapi_allocate(
> +static int
> +xfs_bmapi_allocate(
> struct xfs_bmalloca *bma)
> {
> struct xfs_mount *mp = bma->ip->i_mount;
> @@ -4578,9 +4578,6 @@ xfs_bmapi_write(
> bma.flist = flist;
> bma.firstblock = firstblock;
>
> - if (flags & XFS_BMAPI_STACK_SWITCH)
> - bma.stack_switch = 1;
> -
> while (bno < end && n < *nmap) {
> inhole = eof || bma.got.br_startoff > bno;
> wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
> diff --git a/fs/xfs/xfs_bmap.h b/fs/xfs/xfs_bmap.h
> index 38ba36e..b879ca5 100644
> --- a/fs/xfs/xfs_bmap.h
> +++ b/fs/xfs/xfs_bmap.h
> @@ -77,7 +77,6 @@ typedef struct xfs_bmap_free
> * from written to unwritten, otherwise convert from unwritten to written.
> */
> #define XFS_BMAPI_CONVERT 0x040
> -#define XFS_BMAPI_STACK_SWITCH 0x080
>
> #define XFS_BMAPI_FLAGS \
> { XFS_BMAPI_ENTIRE, "ENTIRE" }, \
> @@ -86,8 +85,7 @@ typedef struct xfs_bmap_free
> { XFS_BMAPI_PREALLOC, "PREALLOC" }, \
> { XFS_BMAPI_IGSTATE, "IGSTATE" }, \
> { XFS_BMAPI_CONTIG, "CONTIG" }, \
> - { XFS_BMAPI_CONVERT, "CONVERT" }, \
> - { XFS_BMAPI_STACK_SWITCH, "STACK_SWITCH" }
> + { XFS_BMAPI_CONVERT, "CONVERT" }
>
>
> static inline int xfs_bmapi_aflag(int w)
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 057f671..64731ef 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -249,49 +249,6 @@ xfs_bmap_rtalloc(
> }
>
> /*
> - * Stack switching interfaces for allocation
> - */
> -static void
> -xfs_bmapi_allocate_worker(
> - struct work_struct *work)
> -{
> - struct xfs_bmalloca *args = container_of(work,
> - struct xfs_bmalloca, work);
> - unsigned long pflags;
> -
> - /* we are in a transaction context here */
> - current_set_flags_nested(&pflags, PF_FSTRANS);
> -
> - args->result = __xfs_bmapi_allocate(args);
> - complete(args->done);
> -
> - current_restore_flags_nested(&pflags, PF_FSTRANS);
> -}
> -
> -/*
> - * Some allocation requests often come in with little stack to work on. Push
> - * them off to a worker thread so there is lots of stack to use. Otherwise just
> - * call directly to avoid the context switch overhead here.
> - */
> -int
> -xfs_bmapi_allocate(
> - struct xfs_bmalloca *args)
> -{
> - DECLARE_COMPLETION_ONSTACK(done);
> -
> - if (!args->stack_switch)
> - return __xfs_bmapi_allocate(args);
> -
> -
> - args->done = &done;
> - INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker);
> - queue_work(xfs_alloc_wq, &args->work);
> - wait_for_completion(&done);
> - destroy_work_on_stack(&args->work);
> - return args->result;
> -}
> -
> -/*
> * Check if the endoff is outside the last extent. If so the caller will grow
> * the allocation to a stripe unit boundary. All offsets are considered outside
> * the end of file for an empty fork, so 1 is returned in *eof in that case.
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 935ed2b..2fdb72d 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -50,12 +50,11 @@ struct xfs_bmalloca {
> xfs_extlen_t total; /* total blocks needed for xaction */
> xfs_extlen_t minlen; /* minimum allocation size (blocks) */
> xfs_extlen_t minleft; /* amount must be left after alloc */
> - char eof; /* set if allocating past last extent */
> - char wasdel; /* replacing a delayed allocation */
> - char userdata;/* set if is user data */
> - char aeof; /* allocated space at eof */
> - char conv; /* overwriting unwritten extents */
> - char stack_switch;
> + bool eof; /* set if allocating past last extent */
> + bool wasdel; /* replacing a delayed allocation */
> + bool userdata;/* set if is user data */
> + bool aeof; /* allocated space at eof */
> + bool conv; /* overwriting unwritten extents */
> int flags;
> struct completion *done;
> struct work_struct work;
> @@ -65,8 +64,6 @@ struct xfs_bmalloca {
> int xfs_bmap_finish(struct xfs_trans **tp, struct xfs_bmap_free *flist,
> int *committed);
> int xfs_bmap_rtalloc(struct xfs_bmalloca *ap);
> -int xfs_bmapi_allocate(struct xfs_bmalloca *args);
> -int __xfs_bmapi_allocate(struct xfs_bmalloca *args);
> int xfs_bmap_eof(struct xfs_inode *ip, xfs_fileoff_t endoff,
> int whichfork, int *eof);
> int xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
> diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
> index bf810c6..cf893bc 100644
> --- a/fs/xfs/xfs_btree.c
> +++ b/fs/xfs/xfs_btree.c
> @@ -33,6 +33,7 @@
> #include "xfs_error.h"
> #include "xfs_trace.h"
> #include "xfs_cksum.h"
> +#include "xfs_alloc.h"
>
> /*
> * Cursor allocation zone.
> @@ -2323,7 +2324,7 @@ error1:
> * record (to be inserted into parent).
> */
> STATIC int /* error */
> -xfs_btree_split(
> +__xfs_btree_split(
> struct xfs_btree_cur *cur,
> int level,
> union xfs_btree_ptr *ptrp,
> @@ -2503,6 +2504,85 @@ error0:
> return error;
> }
>
> +struct xfs_btree_split_args {
> + struct xfs_btree_cur *cur;
> + int level;
> + union xfs_btree_ptr *ptrp;
> + union xfs_btree_key *key;
> + struct xfs_btree_cur **curp;
> + int *stat; /* success/failure */
> + int result;
> + bool kswapd; /* allocation in kswapd context */
> + struct completion *done;
> + struct work_struct work;
> +};
> +
> +/*
> + * Stack switching interfaces for allocation
> + */
> +static void
> +xfs_btree_split_worker(
> + struct work_struct *work)
> +{
> + struct xfs_btree_split_args *args = container_of(work,
> + struct xfs_btree_split_args, work);
> + unsigned long pflags;
> + unsigned long new_pflags = PF_FSTRANS;
> +
> + /*
> + * we are in a transaction context here, but may also be doing work
> + * in kswapd context, and hence we may need to inherit that state
> + * temporarily to ensure that we don't block waiting for memory reclaim
> + * in any way.
> + */
> + if (args->kswapd)
> + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> +
> + current_set_flags_nested(&pflags, new_pflags);
> +
> + args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
> + args->key, args->curp, args->stat);
> + complete(args->done);
> +
> + current_restore_flags_nested(&pflags, new_pflags);
> +}
> +
> +/*
> + * BMBT split requests often come in with little stack to work on. Push
> + * them off to a worker thread so there is lots of stack to use. For the other
> + * btree types, just call directly to avoid the context switch overhead here.
> + */
> +STATIC int /* error */
> +xfs_btree_split(
> + struct xfs_btree_cur *cur,
> + int level,
> + union xfs_btree_ptr *ptrp,
> + union xfs_btree_key *key,
> + struct xfs_btree_cur **curp,
> + int *stat) /* success/failure */
> +{
> + struct xfs_btree_split_args args;
> + DECLARE_COMPLETION_ONSTACK(done);
> +
> + if (cur->bc_btnum != XFS_BTNUM_BMAP)
> + return __xfs_btree_split(cur, level, ptrp, key, curp, stat);
> +
> + args.cur = cur;
> + args.level = level;
> + args.ptrp = ptrp;
> + args.key = key;
> + args.curp = curp;
> + args.stat = stat;
> + args.done = &done;
> + args.kswapd = current_is_kswapd();
> + INIT_WORK_ONSTACK(&args.work, xfs_btree_split_worker);
> + queue_work(xfs_alloc_wq, &args.work);
> + wait_for_completion(&done);
> + destroy_work_on_stack(&args.work);
> + return args.result;
> +}
> +
> +
> /*
> * Copy the old inode root contents into a real block and make the
> * broot point to it.
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 6c5eb4c..6d3ec2b 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -749,8 +749,7 @@ xfs_iomap_write_allocate(
> * pointer that the caller gave to us.
> */
> error = xfs_bmapi_write(tp, ip, map_start_fsb,
> - count_fsb,
> - XFS_BMAPI_STACK_SWITCH,
> + count_fsb, 0,
> &first_block, 1,
> imap, &nimaps, &free_list);
> if (error)
> --
> 2.0.0
>
Thread overview: 14+ messages
2014-07-10 23:26 [PATCH 0/3] xfs: regression fixes for 3.16-rc5 Dave Chinner
2014-07-10 23:26 ` [PATCH 1/3] Revert "xfs: block allocation work needs to be kswapd aware" Dave Chinner
2014-07-11 12:32 ` Brian Foster
2014-07-10 23:26 ` [PATCH 2/3] xfs: refine the allocation stack switch Dave Chinner
2014-07-11 12:33 ` Brian Foster
2014-07-11 21:57 ` Dave Chinner
2014-07-10 23:26 ` [PATCH 3/3] xfs: null unused quota inodes when quota is on Dave Chinner
2014-07-11 13:15 ` Brian Foster
2014-07-11 22:00 ` Dave Chinner
2014-07-11 15:22 ` Arkadiusz Miśkiewicz
2014-07-11 15:30 ` Arkadiusz Miśkiewicz
2014-07-14 0:58 ` Dave Chinner
-- strict thread matches above, loose matches on Subject: below --
2014-07-14 4:48 [PATCH 0/3, V2] xfs; fixes for 3.16-rc5 Dave Chinner
2014-07-14 4:48 ` [PATCH 2/3] xfs: refine the allocation stack switch Dave Chinner
2014-07-14 12:02 ` Brian Foster