public inbox for linux-xfs@vger.kernel.org
* [PATCH v2 0/3] xfs: run eofblocks scan on ENOSPC
@ 2014-05-23 11:52 Brian Foster
  2014-05-23 11:52 ` [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Brian Foster @ 2014-05-23 11:52 UTC (permalink / raw)
  To: xfs

Hi all,

Here's v2 of the eofblocks scan on ENOSPC series, incorporating feedback
from v1:

http://oss.sgi.com/archives/xfs/2014-03/msg00388.html

The major change here is to simplify the error checking logic and tie
the eofblocks scan to the inode flush in the ENOSPC scenario. I've done
some high-level testing that doesn't seem to elicit any sort of
pathological behavior given the circumstances (i.e., performance will
never be ideal as we head into ENOSPC).
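
The retry logic described here can be reduced to a small userspace model
(names simplified; `quota_scan_freed_space` stands in for the result of the
filtered per-quota scan — this is a sketch, not the kernel code itself):

```c
/*
 * Userspace model of the modified buffered write error handling: retry
 * the write at most once after attempting to free lingering
 * preallocated space. Not kernel code; names simplified.
 */
#include <errno.h>

/* stand-in for whether the filtered per-quota eofblocks scan freed space */
static int quota_scan_freed_space;

static int should_retry_write(int ret, int *enospc)
{
	if (ret == -EDQUOT && !*enospc) {
		/* run a filtered quota scan; retry only if it freed space */
		*enospc = quota_scan_freed_space;
		return *enospc;
	}
	if (ret == -ENOSPC && !*enospc) {
		/* flush dirty inodes, run a sync eofblocks scan, retry once */
		*enospc = 1;
		return 1;
	}
	return 0;	/* second failure: give up and return the error */
}
```

The single-retry guard (`enospc`) is what keeps a thread from looping
indefinitely between failed writes and scans.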

I tested on a hacked filesystem that makes preallocation persistent (no
trim on close), disables preallocation throttling and sets the background
scanner to a high value to create worst case conditions. I ran an
fs_mark workload to create 64k 1MB files. Then, started 16x8GB
sequential dd writers expected to hit ENOSPC. This is on a 16xcpu box
with 32GB RAM and a 200GB fs (with agcounts of 32 and 1024).

Via tracepoints, I generally observe that the inode flush acts as a
filter to prevent many threads from entering into eofblocks scans at
once. E.g., by the time the first handful of threads make it through a
scan, they and/or others have dirtied more data for the remaining queued
up inode flushers to work with. I notice some occasional spikes in
kworkers or rcu processing, but nothing for longer than a couple seconds
or so.

A downside I've noticed with this logic is that once one thread runs a
scan and makes it through this retry sequence, it has a better chance to
allocate more of the recently freed space than the others, all of which
might have queued on the inode flush lock by the time the first
flush/scan completes.

This leads to what one might consider "unfair" allocation across the set
of writers when we enter this scenario. E.g., I saw tests where some
threads were able to complete the 8GB write while others only made it to
2-3GB before the filesystem completely ran out of space. Given the
benefit of the series, I think this is something that can be potentially
enhanced incrementally if it turns out to be a problem in practice.

I also have an xfstests test I'm planning to post soon that verifies
lingering preallocations can be reclaimed in a reasonable manner before
returning ENOSPC.

Thoughts, reviews, flames appreciated.

Brian

v2:
- Drop flush mechanism during eofblocks scan (along with prereq patch).
- Simplify scan logic on ENOSPC. Separate EDQUOT from ENOSPC and tie
  ENOSPC scan to inode flush.
- Eliminate unnecessary project quota handling from
  xfs_inode_free_quota_eofblocks() (ENOSPC is a separate path).

Brian Foster (3):
  xfs: add scan owner field to xfs_eofblocks
  xfs: run an eofblocks scan on ENOSPC/EDQUOT
  xfs: squash prealloc while over quota free space as well

 fs/xfs/xfs_dquot.h  | 15 ++++++++++++++
 fs/xfs/xfs_file.c   | 23 +++++++++++++++++----
 fs/xfs/xfs_icache.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h |  3 +++
 fs/xfs/xfs_iomap.c  | 20 ++++++++++++------
 5 files changed, 109 insertions(+), 11 deletions(-)

-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks
  2014-05-23 11:52 [PATCH v2 0/3] xfs: run eofblocks scan on ENOSPC Brian Foster
@ 2014-05-23 11:52 ` Brian Foster
  2014-05-26 22:49   ` Dave Chinner
  2014-05-27 10:44   ` Christoph Hellwig
  2014-05-23 11:52 ` [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT Brian Foster
  2014-05-23 11:52 ` [PATCH v2 3/3] xfs: squash prealloc while over quota free space as well Brian Foster
  2 siblings, 2 replies; 15+ messages in thread
From: Brian Foster @ 2014-05-23 11:52 UTC (permalink / raw)
  To: xfs

The scan owner field represents an optional inode number that is
responsible for the current scan. The purpose is to identify that an
inode is under iolock and as such, the iolock shouldn't be attempted
when trimming eofblocks. This is an internal only field.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 12 +++++++++++-
 fs/xfs/xfs_icache.h |  2 ++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index c48df5f..f4191f6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1211,6 +1211,7 @@ xfs_inode_free_eofblocks(
 {
 	int ret;
 	struct xfs_eofblocks *eofb = args;
+	bool need_iolock = true;
 
 	if (!xfs_can_free_eofblocks(ip, false)) {
 		/* inode could be preallocated or append-only */
@@ -1235,9 +1236,18 @@ xfs_inode_free_eofblocks(
 		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
 		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
 			return 0;
+
+		/*
+		 * A scan owner implies we already hold the iolock. Skip it in
+		 * xfs_free_eofblocks() to avoid deadlock. This also eliminates
+		 * the possibility of EAGAIN being returned.
+		 */
+		if (eofb->eof_scan_owner != NULLFSINO &&
+		    eofb->eof_scan_owner == ip->i_ino)
+			need_iolock = false;
 	}
 
-	ret = xfs_free_eofblocks(ip->i_mount, ip, true);
+	ret = xfs_free_eofblocks(ip->i_mount, ip, need_iolock);
 
 	/* don't revisit the inode if we're not waiting */
 	if (ret == EAGAIN && !(flags & SYNC_WAIT))
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 9cf017b..152757f 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -27,6 +27,7 @@ struct xfs_eofblocks {
 	kgid_t		eof_gid;
 	prid_t		eof_prid;
 	__u64		eof_min_file_size;
+	xfs_ino_t	eof_scan_owner;
 };
 
 #define SYNC_WAIT		0x0001	/* wait for i/o to complete */
@@ -84,6 +85,7 @@ xfs_fs_eofblocks_from_user(
 	dst->eof_flags = src->eof_flags;
 	dst->eof_prid = src->eof_prid;
 	dst->eof_min_file_size = src->eof_min_file_size;
+	dst->eof_scan_owner = NULLFSINO;
 
 	dst->eof_uid = INVALID_UID;
 	if (src->eof_flags & XFS_EOF_FLAGS_UID) {
-- 
1.8.3.1

* [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-05-23 11:52 [PATCH v2 0/3] xfs: run eofblocks scan on ENOSPC Brian Foster
  2014-05-23 11:52 ` [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks Brian Foster
@ 2014-05-23 11:52 ` Brian Foster
  2014-05-26 22:57   ` Dave Chinner
  2014-05-23 11:52 ` [PATCH v2 3/3] xfs: squash prealloc while over quota free space as well Brian Foster
  2 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-05-23 11:52 UTC (permalink / raw)
  To: xfs

Speculative preallocation and the associated throttling metrics
assume we're working with large files on large filesystems. Users have
reported inefficiencies in these mechanisms when we happen to be dealing
with large files on smaller filesystems. This can occur because while
prealloc throttling is aggressive under low free space conditions, it is
not active until we reach 5% free space or less.

For example, a 40GB filesystem has enough space for several files large
enough to have multi-GB preallocations at any given time. If those files
are slow growing, they might reserve preallocation for long periods of
time as well as avoid the background scanner due to frequent
modification. If a new file is written under these conditions, said file
has no access to this already reserved space and premature ENOSPC is
imminent.
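
For a rough sense of the numbers in this example (illustrative arithmetic
only — the 5% figure is the low free space threshold mentioned above):

```c
/*
 * Illustrative arithmetic only: prealloc throttling keys off free
 * space dropping below roughly 5% of the filesystem, so on a 40GB
 * filesystem the throttle window is only about 2GB.
 */
#include <stdint.h>

static uint64_t throttle_threshold_bytes(uint64_t fs_bytes)
{
	return fs_bytes * 5 / 100;	/* ~5% low space threshold */
}
```

A single multi-GB preallocation can exceed that entire window, which is
why throttling alone does not help here.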

To handle this scenario, modify the buffered write ENOSPC handling and
retry sequence to invoke an eofblocks scan. In the smaller filesystem
scenario, the eofblocks scan resets the usage of preallocation such that
when the 5% free space threshold is met, throttling effectively takes
over to provide fair and efficient preallocation until legitimate
ENOSPC.

The eofblocks scan is selective based on the nature of the failure. For
example, an EDQUOT failure in a particular quota will use a filtered
scan for that quota. Because we don't know which quota might have caused
an allocation failure at any given time, we run a scan against each
applicable quota determined to be under low free space conditions.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_dquot.h  | 15 +++++++++++++++
 fs/xfs/xfs_file.c   | 23 +++++++++++++++++++----
 fs/xfs/xfs_icache.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.h |  1 +
 4 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index 68a68f7..c24c67e 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -139,6 +139,21 @@ static inline xfs_dquot_t *xfs_inode_dquot(struct xfs_inode *ip, int type)
 	}
 }
 
+/*
+ * Check whether a dquot is under low free space conditions. We assume the quota
+ * is enabled and enforced.
+ */
+static inline bool xfs_dquot_lowsp(struct xfs_dquot *dqp)
+{
+	int64_t freesp;
+
+	freesp = be64_to_cpu(dqp->q_core.d_blk_hardlimit) - dqp->q_res_bcount;
+	if (freesp < dqp->q_low_space[XFS_QLOWSP_1_PCNT])
+		return true;
+
+	return false;
+}
+
 #define XFS_DQ_IS_LOCKED(dqp)	(mutex_is_locked(&((dqp)->q_qlock)))
 #define XFS_DQ_IS_DIRTY(dqp)	((dqp)->dq_flags & XFS_DQ_DIRTY)
 #define XFS_QM_ISUDQ(dqp)	((dqp)->dq_flags & XFS_DQ_USER)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1b8160d..2e0e73b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -38,6 +38,7 @@
 #include "xfs_trace.h"
 #include "xfs_log.h"
 #include "xfs_dinode.h"
+#include "xfs_icache.h"
 
 #include <linux/aio.h>
 #include <linux/dcache.h>
@@ -741,14 +742,28 @@ write_retry:
 	ret = generic_perform_write(file, &from, pos);
 	if (likely(ret >= 0))
 		iocb->ki_pos = pos + ret;
+
 	/*
-	 * If we just got an ENOSPC, try to write back all dirty inodes to
-	 * convert delalloc space to free up some of the excess reserved
-	 * metadata space.
+	 * If we hit a space limit, try to free up some lingering preallocated
+	 * space before returning an error. In the case of ENOSPC, first try to
+	 * write back all dirty inodes to free up some of the excess reserved
+	 * metadata space. This reduces the chances that the eofblocks scan
+	 * waits on dirty mappings. Since xfs_flush_inodes() is serialized, this
+	 * also behaves as a filter to prevent too many eofblocks scans from
+	 * running at the same time.
 	 */
-	if (ret == -ENOSPC && !enospc) {
+	if (ret == -EDQUOT && !enospc) {
+		enospc = xfs_inode_free_quota_eofblocks(ip);
+		if (enospc)
+			goto write_retry;
+	} else if (ret == -ENOSPC && !enospc) {
+		struct xfs_eofblocks eofb = {0};
+
 		enospc = 1;
 		xfs_flush_inodes(ip->i_mount);
+		eofb.eof_scan_owner = ip->i_ino; /* for locking */
+		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
+		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
 		goto write_retry;
 	}
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f4191f6..3cceb1b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -33,6 +33,9 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_bmap_util.h"
+#include "xfs_quota.h"
+#include "xfs_dquot_item.h"
+#include "xfs_dquot.h"
 
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -1270,6 +1273,50 @@ xfs_icache_free_eofblocks(
 					 eofb, XFS_ICI_EOFBLOCKS_TAG);
 }
 
+/*
+ * Run eofblocks scans on the quotas applicable to the inode. For inodes with
+ * multiple quotas, we don't know exactly which quota caused an allocation
+ * failure. We make a best effort by running scans for each quota considered
+ * to be under low free space conditions (less than 1% available free space).
+ */
+int
+xfs_inode_free_quota_eofblocks(
+	struct xfs_inode *ip)
+{
+	int scanned = 0;
+	struct xfs_eofblocks eofb = {0,};
+	struct xfs_dquot *dq;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
+
+	/* set the scan owner to avoid potential livelock */
+	eofb.eof_scan_owner = ip->i_ino;
+
+	if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) {
+		dq = xfs_inode_dquot(ip, XFS_DQ_USER);
+		if (dq && xfs_dquot_lowsp(dq)) {
+			eofb.eof_uid = VFS_I(ip)->i_uid;
+			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
+					 XFS_EOF_FLAGS_UID;
+			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
+			scanned = 1;
+		}
+	}
+
+	if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) {
+		dq = xfs_inode_dquot(ip, XFS_DQ_GROUP);
+		if (dq && xfs_dquot_lowsp(dq)) {
+			eofb.eof_gid = VFS_I(ip)->i_gid;
+			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
+					 XFS_EOF_FLAGS_GID;
+			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
+			scanned = 1;
+		}
+	}
+
+	return scanned;
+}
+
 void
 xfs_inode_set_eofblocks_tag(
 	xfs_inode_t	*ip)
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 152757f..b432d83 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -58,6 +58,7 @@ void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip);
 void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip);
 int xfs_icache_free_eofblocks(struct xfs_mount *, struct xfs_eofblocks *);
+int xfs_inode_free_quota_eofblocks(struct xfs_inode *ip);
 void xfs_eofblocks_worker(struct work_struct *);
 
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
-- 
1.8.3.1

* [PATCH v2 3/3] xfs: squash prealloc while over quota free space as well
  2014-05-23 11:52 [PATCH v2 0/3] xfs: run eofblocks scan on ENOSPC Brian Foster
  2014-05-23 11:52 ` [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks Brian Foster
  2014-05-23 11:52 ` [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT Brian Foster
@ 2014-05-23 11:52 ` Brian Foster
  2014-05-26 23:00   ` Dave Chinner
  2 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-05-23 11:52 UTC (permalink / raw)
  To: xfs

Commit 4d559a3b introduced heavy prealloc squashing to catch the case
of requesting too large a prealloc on smaller filesystems, leading to
repeated flush and retry cycles that occur on ENOSPC. Now that we issue
eofblocks scans on EDQUOT/ENOSPC, squash the prealloc against the
minimum available free space across all applicable quotas as well to
avoid a similar problem of repeated eofblocks scans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_iomap.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 6c5eb4c..28629e9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -397,7 +397,8 @@ xfs_quota_calc_throttle(
 	struct xfs_inode *ip,
 	int type,
 	xfs_fsblock_t *qblocks,
-	int *qshift)
+	int *qshift,
+	int64_t	*qfreesp)
 {
 	int64_t freesp;
 	int shift = 0;
@@ -406,6 +407,7 @@ xfs_quota_calc_throttle(
 	/* over hi wmark, squash the prealloc completely */
 	if (dq->q_res_bcount >= dq->q_prealloc_hi_wmark) {
 		*qblocks = 0;
+		*qfreesp = 0;
 		return;
 	}
 
@@ -418,6 +420,9 @@ xfs_quota_calc_throttle(
 			shift += 2;
 	}
 
+	if (freesp < *qfreesp)
+		*qfreesp = freesp;
+
 	/* only overwrite the throttle values if we are more aggressive */
 	if ((freesp >> shift) < (*qblocks >> *qshift)) {
 		*qblocks = freesp;
@@ -476,15 +481,18 @@ xfs_iomap_prealloc_size(
 	}
 
 	/*
-	 * Check each quota to cap the prealloc size and provide a shift
-	 * value to throttle with.
+	 * Check each quota to cap the prealloc size, provide a shift value to
+	 * throttle with and adjust amount of available space.
 	 */
 	if (xfs_quota_need_throttle(ip, XFS_DQ_USER, alloc_blocks))
-		xfs_quota_calc_throttle(ip, XFS_DQ_USER, &qblocks, &qshift);
+		xfs_quota_calc_throttle(ip, XFS_DQ_USER, &qblocks, &qshift,
+					&freesp);
 	if (xfs_quota_need_throttle(ip, XFS_DQ_GROUP, alloc_blocks))
-		xfs_quota_calc_throttle(ip, XFS_DQ_GROUP, &qblocks, &qshift);
+		xfs_quota_calc_throttle(ip, XFS_DQ_GROUP, &qblocks, &qshift,
+					&freesp);
 	if (xfs_quota_need_throttle(ip, XFS_DQ_PROJ, alloc_blocks))
-		xfs_quota_calc_throttle(ip, XFS_DQ_PROJ, &qblocks, &qshift);
+		xfs_quota_calc_throttle(ip, XFS_DQ_PROJ, &qblocks, &qshift,
+					&freesp);
 
 	/*
 	 * The final prealloc size is set to the minimum of free space available
-- 
1.8.3.1

* Re: [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks
  2014-05-23 11:52 ` [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks Brian Foster
@ 2014-05-26 22:49   ` Dave Chinner
  2014-05-27 10:44   ` Christoph Hellwig
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-05-26 22:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, May 23, 2014 at 07:52:28AM -0400, Brian Foster wrote:
> The scan owner field represents an optional inode number that is
> responsible for the current scan. The purpose is to identify that an
> inode is under iolock and as such, the iolock shouldn't be attempted
> when trimming eofblocks. This is an internal only field.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 12 +++++++++++-
>  fs/xfs/xfs_icache.h |  2 ++
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index c48df5f..f4191f6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1211,6 +1211,7 @@ xfs_inode_free_eofblocks(
>  {
>  	int ret;
>  	struct xfs_eofblocks *eofb = args;
> +	bool need_iolock = true;
>  
>  	if (!xfs_can_free_eofblocks(ip, false)) {
>  		/* inode could be preallocated or append-only */
> @@ -1235,9 +1236,18 @@ xfs_inode_free_eofblocks(
>  		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
>  		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
>  			return 0;
> +
> +		/*
> +		 * A scan owner implies we already hold the iolock. Skip it in
> +		 * xfs_free_eofblocks() to avoid deadlock. This also eliminates
> +		 * the possibility of EAGAIN being returned.
> +		 */
> +		if (eofb->eof_scan_owner != NULLFSINO &&
> +		    eofb->eof_scan_owner == ip->i_ino)
> +			need_iolock = false;

No need to check against NULLFSINO there. ip->i_ino can never be
NULLFSINO, so just checking eofb->eof_scan_owner == ip->i_ino is
sufficient.

What might be an idea is adding a

	ASSERT(eofb->eof_scan_owner != 0);

to the start of the function to catch anyone who does not initialise
it appropriately. The inode number can never be zero (that would
translate to block 0 - the primary superblock) and so this would be
sufficient to ensure callers are doing the right thing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-05-23 11:52 ` [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT Brian Foster
@ 2014-05-26 22:57   ` Dave Chinner
  2014-05-27 12:47     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-05-26 22:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, May 23, 2014 at 07:52:29AM -0400, Brian Foster wrote:
> Speculative preallocation and the associated throttling metrics
> assume we're working with large files on large filesystems. Users have
> reported inefficiencies in these mechanisms when we happen to be dealing
> with large files on smaller filesystems. This can occur because while
> prealloc throttling is aggressive under low free space conditions, it is
> not active until we reach 5% free space or less.
> 
> For example, a 40GB filesystem has enough space for several files large
> enough to have multi-GB preallocations at any given time. If those files
> are slow growing, they might reserve preallocation for long periods of
> time as well as avoid the background scanner due to frequent
> modification. If a new file is written under these conditions, said file
> has no access to this already reserved space and premature ENOSPC is
> imminent.
> 
> To handle this scenario, modify the buffered write ENOSPC handling and
> retry sequence to invoke an eofblocks scan. In the smaller filesystem
> scenario, the eofblocks scan resets the usage of preallocation such that
> when the 5% free space threshold is met, throttling effectively takes
> over to provide fair and efficient preallocation until legitimate
> ENOSPC.
> 
> The eofblocks scan is selective based on the nature of the failure. For
> example, an EDQUOT failure in a particular quota will use a filtered
> scan for that quota. Because we don't know which quota might have caused
> an allocation failure at any given time, we run a scan against each
> applicable quota determined to be under low free space conditions.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_dquot.h  | 15 +++++++++++++++
>  fs/xfs/xfs_file.c   | 23 +++++++++++++++++++----
>  fs/xfs/xfs_icache.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_icache.h |  1 +
>  4 files changed, 82 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> index 68a68f7..c24c67e 100644
> --- a/fs/xfs/xfs_dquot.h
> +++ b/fs/xfs/xfs_dquot.h
> @@ -139,6 +139,21 @@ static inline xfs_dquot_t *xfs_inode_dquot(struct xfs_inode *ip, int type)
>  	}
>  }
>  
> +/*
> + * Check whether a dquot is under low free space conditions. We assume the quota
> + * is enabled and enforced.
> + */
> +static inline bool xfs_dquot_lowsp(struct xfs_dquot *dqp)
> +{
> +	int64_t freesp;
> +
> +	freesp = be64_to_cpu(dqp->q_core.d_blk_hardlimit) - dqp->q_res_bcount;
> +	if (freesp < dqp->q_low_space[XFS_QLOWSP_1_PCNT])
> +		return true;
> +
> +	return false;
> +}
> +
>  #define XFS_DQ_IS_LOCKED(dqp)	(mutex_is_locked(&((dqp)->q_qlock)))
>  #define XFS_DQ_IS_DIRTY(dqp)	((dqp)->dq_flags & XFS_DQ_DIRTY)
>  #define XFS_QM_ISUDQ(dqp)	((dqp)->dq_flags & XFS_DQ_USER)
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1b8160d..2e0e73b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -38,6 +38,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
>  #include "xfs_dinode.h"
> +#include "xfs_icache.h"
>  
>  #include <linux/aio.h>
>  #include <linux/dcache.h>
> @@ -741,14 +742,28 @@ write_retry:
>  	ret = generic_perform_write(file, &from, pos);
>  	if (likely(ret >= 0))
>  		iocb->ki_pos = pos + ret;
> +
>  	/*
> -	 * If we just got an ENOSPC, try to write back all dirty inodes to
> -	 * convert delalloc space to free up some of the excess reserved
> -	 * metadata space.
> +	 * If we hit a space limit, try to free up some lingering preallocated
> +	 * space before returning an error. In the case of ENOSPC, first try to
> +	 * write back all dirty inodes to free up some of the excess reserved
> +	 * metadata space. This reduces the chances that the eofblocks scan
> +	 * waits on dirty mappings. Since xfs_flush_inodes() is serialized, this
> +	 * also behaves as a filter to prevent too many eofblocks scans from
> +	 * running at the same time.
>  	 */
> -	if (ret == -ENOSPC && !enospc) {
> +	if (ret == -EDQUOT && !enospc) {
> +		enospc = xfs_inode_free_quota_eofblocks(ip);
> +		if (enospc)
> +			goto write_retry;
> +	} else if (ret == -ENOSPC && !enospc) {
> +		struct xfs_eofblocks eofb = {0};
> +
>  		enospc = 1;
>  		xfs_flush_inodes(ip->i_mount);
> +		eofb.eof_scan_owner = ip->i_ino; /* for locking */
> +		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
> +		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
>  		goto write_retry;
>  	}
>  
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index f4191f6..3cceb1b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -33,6 +33,9 @@
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
>  #include "xfs_bmap_util.h"
> +#include "xfs_quota.h"
> +#include "xfs_dquot_item.h"
> +#include "xfs_dquot.h"
>  
>  #include <linux/kthread.h>
>  #include <linux/freezer.h>
> @@ -1270,6 +1273,50 @@ xfs_icache_free_eofblocks(
>  					 eofb, XFS_ICI_EOFBLOCKS_TAG);
>  }
>  
> +/*
> + * Run eofblocks scans on the quotas applicable to the inode. For inodes with
> + * multiple quotas, we don't know exactly which quota caused an allocation
> + * failure. We make a best effort by running scans for each quota considered
> + * to be under low free space conditions (less than 1% available free space).
> + */
> +int
> +xfs_inode_free_quota_eofblocks(
> +	struct xfs_inode *ip)
> +{
> +	int scanned = 0;
> +	struct xfs_eofblocks eofb = {0,};
> +	struct xfs_dquot *dq;
> +
> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> +
> +	/* set the scan owner to avoid potential livelock */
> +	eofb.eof_scan_owner = ip->i_ino;
> +
> +	if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) {
> +		dq = xfs_inode_dquot(ip, XFS_DQ_USER);
> +		if (dq && xfs_dquot_lowsp(dq)) {
> +			eofb.eof_uid = VFS_I(ip)->i_uid;
> +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> +					 XFS_EOF_FLAGS_UID;
> +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +			scanned = 1;
> +		}
> +	}
> +
> +	if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) {
> +		dq = xfs_inode_dquot(ip, XFS_DQ_GROUP);
> +		if (dq && xfs_dquot_lowsp(dq)) {
> +			eofb.eof_gid = VFS_I(ip)->i_gid;
> +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> +					 XFS_EOF_FLAGS_GID;
> +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +			scanned = 1;
> +		}
> +	}

Rather that doing two scans here, wouldn't it be more efficient
to do:

	eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
	scan = false;
	if (uquota is low) {
		eofb.eof_uid = VFS_I(ip)->i_uid;
		eofb.eof_flags |= XFS_EOF_FLAGS_UID;
		scan = true;
	}
	if (gquota is low) {
		eofb.eof_gid = VFS_I(ip)->i_gid;
		eofb.eof_flags |= XFS_EOF_FLAGS_GID;
		scan = true;
	}
	if (scan)
		xfs_icache_free_eofblocks(ip->i_mount, &eofb);

and change xfs_inode_match_id() to be able to check against multiple
flags on a single inode? That way we only scan the inode cache
once, regardless of the number of quota types that are enabled and
are tracking low space thresholds.
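
The suggested single-pass setup can be modeled as a small userspace
function (the flag values below are hypothetical stand-ins for the real
XFS_EOF_FLAGS_* constants; only the accumulation logic is the point):

```c
/*
 * Userspace model of the single-pass scan setup: accumulate a filter
 * flag for each quota under low free space and issue one scan. Flag
 * values are hypothetical stand-ins, not the real XFS constants.
 */
#include <stdbool.h>
#include <stdint.h>

#define EOF_FLAGS_SYNC	(1U << 0)
#define EOF_FLAGS_UID	(1U << 1)
#define EOF_FLAGS_GID	(1U << 2)

/* returns 0 when no quota is low, i.e. no scan should be issued */
static uint32_t build_scan_flags(bool uquota_low, bool gquota_low)
{
	uint32_t flags;

	if (!uquota_low && !gquota_low)
		return 0;
	flags = EOF_FLAGS_SYNC;
	if (uquota_low)
		flags |= EOF_FLAGS_UID;	/* caller also sets eofb.eof_uid */
	if (gquota_low)
		flags |= EOF_FLAGS_GID;	/* caller also sets eofb.eof_gid */
	return flags;
}
```

The matching change on the scan side would be teaching the inode match
helper to check all set ID flags against a single inode in one pass.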

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v2 3/3] xfs: squash prealloc while over quota free space as well
  2014-05-23 11:52 ` [PATCH v2 3/3] xfs: squash prealloc while over quota free space as well Brian Foster
@ 2014-05-26 23:00   ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-05-26 23:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, May 23, 2014 at 07:52:30AM -0400, Brian Foster wrote:
> Commit 4d559a3b introduced heavy prealloc squashing to catch the case
> of requesting too large a prealloc on smaller filesystems, leading to
> repeated flush and retry cycles that occur on ENOSPC. Now that we issue
> eofblocks scans on EDQUOT/ENOSPC, squash the prealloc against the
> minimum available free space across all applicable quotas as well to
> avoid a similar problem of repeated eofblocks scans.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Good idea - thrashing is something worth avoiding....

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks
  2014-05-23 11:52 ` [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks Brian Foster
  2014-05-26 22:49   ` Dave Chinner
@ 2014-05-27 10:44   ` Christoph Hellwig
  2014-05-27 12:18     ` Brian Foster
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2014-05-27 10:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, May 23, 2014 at 07:52:28AM -0400, Brian Foster wrote:
> The scan owner field represents an optional inode number that is
> responsible for the current scan. The purpose is to identify that an
> inode is under iolock and as such, the iolock shouldn't be attempted
> when trimming eofblocks. This is an internal only field.

xfs_free_eofblocks already does a trylock, and without that calling it
from one iolock holding process to another would be a deadlock waiting
to happen.

I have to say I'm still not very easy with iolock nesting, even if it's
a trylock.

* Re: [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks
  2014-05-27 10:44   ` Christoph Hellwig
@ 2014-05-27 12:18     ` Brian Foster
  2014-05-27 21:26       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-05-27 12:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, May 27, 2014 at 03:44:28AM -0700, Christoph Hellwig wrote:
> On Fri, May 23, 2014 at 07:52:28AM -0400, Brian Foster wrote:
> > The scan owner field represents an optional inode number that is
> > responsible for the current scan. The purpose is to identify that an
> > inode is under iolock and as such, the iolock shouldn't be attempted
> > when trimming eofblocks. This is an internal only field.
> 
> xfs_free_eofblocks already does a trylock, and without that calling it
> from one iolock holding process to another would be a deadlock waiting
> to happen.
> 
> I have to say I'm still not very easy with iolock nesting, even if it's
> a trylock.
> 

Right... maybe I'm not parsing your point. The purpose here is to avoid
the trylock entirely. E.g., indicate that we have already acquired the
lock and can proceed with xfs_free_eofblocks(), rather than fail a
trylock and skip (which appears to be a potential infinite loop scenario
here due to how the AG walking code handles EAGAIN).
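
The owner check being discussed can be modeled as a standalone function
(types and the NULLFSINO sentinel are simplified stand-ins for
illustration, not the kernel definitions):

```c
/*
 * Standalone model of the scan owner check from patch 1: a matching
 * owner means the caller already holds the iolock, so the scan must
 * not attempt (or trylock) it again. Types simplified.
 */
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NULLFSINO	((uint64_t)-1)	/* "no scan owner" sentinel */

static bool scan_needs_iolock(uint64_t scan_owner, uint64_t ino)
{
	if (scan_owner != MODEL_NULLFSINO && scan_owner == ino)
		return false;	/* owner already holds the iolock */
	return true;
}
```

Skipping the lock for the owner inode is what removes the EAGAIN (and
hence the potential retry-forever) case from this path.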

Brian

* Re: [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-05-26 22:57   ` Dave Chinner
@ 2014-05-27 12:47     ` Brian Foster
  2014-05-27 21:14       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-05-27 12:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, May 27, 2014 at 08:57:55AM +1000, Dave Chinner wrote:
> On Fri, May 23, 2014 at 07:52:29AM -0400, Brian Foster wrote:
> > Speculative preallocation and and the associated throttling metrics
> > assume we're working with large files on large filesystems. Users have
> > reported inefficiencies in these mechanisms when we happen to be dealing
> > with large files on smaller filesystems. This can occur because while
> > prealloc throttling is aggressive under low free space conditions, it is
> > not active until we reach 5% free space or less.
> > 
> > For example, a 40GB filesystem has enough space for several files large
> > enough to have multi-GB preallocations at any given time. If those files
> > are slow growing, they might reserve preallocation for long periods of
> > time as well as avoid the background scanner due to frequent
> > modification. If a new file is written under these conditions, said file
> > has no access to this already reserved space and premature ENOSPC is
> > imminent.
> > 
> > To handle this scenario, modify the buffered write ENOSPC handling and
> > retry sequence to invoke an eofblocks scan. In the smaller filesystem
> > scenario, the eofblocks scan resets the usage of preallocation such that
> > when the 5% free space threshold is met, throttling effectively takes
> > over to provide fair and efficient preallocation until legitimate
> > ENOSPC.
> > 
> > The eofblocks scan is selective based on the nature of the failure. For
> > example, an EDQUOT failure in a particular quota will use a filtered
> > scan for that quota. Because we don't know which quota might have caused
> > an allocation failure at any given time, we run a scan against each
> > applicable quota determined to be under low free space conditions.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_dquot.h  | 15 +++++++++++++++
> >  fs/xfs/xfs_file.c   | 23 +++++++++++++++++++----
> >  fs/xfs/xfs_icache.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_icache.h |  1 +
> >  4 files changed, 82 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> > index 68a68f7..c24c67e 100644
> > --- a/fs/xfs/xfs_dquot.h
> > +++ b/fs/xfs/xfs_dquot.h
> > @@ -139,6 +139,21 @@ static inline xfs_dquot_t *xfs_inode_dquot(struct xfs_inode *ip, int type)
> >  	}
> >  }
> >  
> > +/*
> > + * Check whether a dquot is under low free space conditions. We assume the quota
> > + * is enabled and enforced.
> > + */
> > +static inline bool xfs_dquot_lowsp(struct xfs_dquot *dqp)
> > +{
> > +	int64_t freesp;
> > +
> > +	freesp = be64_to_cpu(dqp->q_core.d_blk_hardlimit) - dqp->q_res_bcount;
> > +	if (freesp < dqp->q_low_space[XFS_QLOWSP_1_PCNT])
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >  #define XFS_DQ_IS_LOCKED(dqp)	(mutex_is_locked(&((dqp)->q_qlock)))
> >  #define XFS_DQ_IS_DIRTY(dqp)	((dqp)->dq_flags & XFS_DQ_DIRTY)
> >  #define XFS_QM_ISUDQ(dqp)	((dqp)->dq_flags & XFS_DQ_USER)
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 1b8160d..2e0e73b 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -38,6 +38,7 @@
> >  #include "xfs_trace.h"
> >  #include "xfs_log.h"
> >  #include "xfs_dinode.h"
> > +#include "xfs_icache.h"
> >  
> >  #include <linux/aio.h>
> >  #include <linux/dcache.h>
> > @@ -741,14 +742,28 @@ write_retry:
> >  	ret = generic_perform_write(file, &from, pos);
> >  	if (likely(ret >= 0))
> >  		iocb->ki_pos = pos + ret;
> > +
> >  	/*
> > -	 * If we just got an ENOSPC, try to write back all dirty inodes to
> > -	 * convert delalloc space to free up some of the excess reserved
> > -	 * metadata space.
> > +	 * If we hit a space limit, try to free up some lingering preallocated
> > +	 * space before returning an error. In the case of ENOSPC, first try to
> > +	 * write back all dirty inodes to free up some of the excess reserved
> > +	 * metadata space. This reduces the chances that the eofblocks scan
> > +	 * waits on dirty mappings. Since xfs_flush_inodes() is serialized, this
> > +	 * also behaves as a filter to prevent too many eofblocks scans from
> > +	 * running at the same time.
> >  	 */
> > -	if (ret == -ENOSPC && !enospc) {
> > +	if (ret == -EDQUOT && !enospc) {
> > +		enospc = xfs_inode_free_quota_eofblocks(ip);
> > +		if (enospc)
> > +			goto write_retry;
> > +	} else if (ret == -ENOSPC && !enospc) {
> > +		struct xfs_eofblocks eofb = {0};
> > +
> >  		enospc = 1;
> >  		xfs_flush_inodes(ip->i_mount);
> > +		eofb.eof_scan_owner = ip->i_ino; /* for locking */
> > +		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
> > +		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> >  		goto write_retry;
> >  	}
> >  
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index f4191f6..3cceb1b 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -33,6 +33,9 @@
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> >  #include "xfs_bmap_util.h"
> > +#include "xfs_quota.h"
> > +#include "xfs_dquot_item.h"
> > +#include "xfs_dquot.h"
> >  
> >  #include <linux/kthread.h>
> >  #include <linux/freezer.h>
> > @@ -1270,6 +1273,50 @@ xfs_icache_free_eofblocks(
> >  					 eofb, XFS_ICI_EOFBLOCKS_TAG);
> >  }
> >  
> > +/*
> > + * Run eofblocks scans on the quotas applicable to the inode. For inodes with
> > + * multiple quotas, we don't know exactly which quota caused an allocation
> > + * failure. We make a best effort by running scans for each quota considered
> > + * to be under low free space conditions (less than 1% available free space).
> > + */
> > +int
> > +xfs_inode_free_quota_eofblocks(
> > +	struct xfs_inode *ip)
> > +{
> > +	int scanned = 0;
> > +	struct xfs_eofblocks eofb = {0,};
> > +	struct xfs_dquot *dq;
> > +
> > +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> > +
> > +	/* set the scan owner to avoid potential livelock */
> > +	eofb.eof_scan_owner = ip->i_ino;
> > +
> > +	if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) {
> > +		dq = xfs_inode_dquot(ip, XFS_DQ_USER);
> > +		if (dq && xfs_dquot_lowsp(dq)) {
> > +			eofb.eof_uid = VFS_I(ip)->i_uid;
> > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > +					 XFS_EOF_FLAGS_UID;
> > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > +			scanned = 1;
> > +		}
> > +	}
> > +
> > +	if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) {
> > +		dq = xfs_inode_dquot(ip, XFS_DQ_GROUP);
> > +		if (dq && xfs_dquot_lowsp(dq)) {
> > +			eofb.eof_gid = VFS_I(ip)->i_gid;
> > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > +					 XFS_EOF_FLAGS_GID;
> > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > +			scanned = 1;
> > +		}
> > +	}
> 
> Rather than doing two scans here, wouldn't it be more efficient
> to do:
> 
> 	eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
> 	scan = false;
> 	if (uquota is low) {
> 		eofb.eof_uid = VFS_I(ip)->i_uid;
> 		eofb.eof_flags |= XFS_EOF_FLAGS_UID;
> 		scan = true;
> 	}
> 	if (gquota is low) {
> 		eofb.eof_gid = VFS_I(ip)->i_gid;
> 		eofb.eof_flags |= XFS_EOF_FLAGS_GID;
> 		scan = true;
> 	}
> 	if (scan)
> 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> 
> and change xfs_inode_match_id() to be able to check against multiple
> flags on a single inode? That way we only scan the inode cache
> once, regardless of the number of quota types that are enabled and
> are tracking low space thresholds.
> 

Yeah, that would certainly be better from this perspective. We don't
care so much about the characteristics of the inode as much as the
quotas that are associated with it. If I recall, I was somewhat on the
fence about this behavior when we first added the userspace interface
here. IOWs, should the combination of flags define an intersection of
the set of inodes to scan or a union? The more I think about it, I think
the interface kind of suggests the former (from an interface/aesthetic
perspective). E.g., I probably wouldn't expect to add a GID flag to a
UID flag and have my scan become more broad, rather than more
restrictive. Otherwise, the existence of a uid, gid and prid in the
request structure seems kind of arbitrary (as opposed to a list/set of
generic IDs, for example).

I'm not against union behavior in general (and still probably not 100%
against switching the default). I suppose another option could be to add
a set of union/intersection flags that control the behavior here. I'd
be slightly concerned about making this interface too convoluted, but it
is a relatively low level thing, I suppose, without much generic use. We
could also decide not to expose those extra controls to userspace for
the time being.

I need to think about this some more. Thoughts on any of that?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-05-27 12:47     ` Brian Foster
@ 2014-05-27 21:14       ` Dave Chinner
  2014-05-28 12:42         ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-05-27 21:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, May 27, 2014 at 08:47:55AM -0400, Brian Foster wrote:
> On Tue, May 27, 2014 at 08:57:55AM +1000, Dave Chinner wrote:
> > On Fri, May 23, 2014 at 07:52:29AM -0400, Brian Foster wrote:
> > > +/*
> > > + * Run eofblocks scans on the quotas applicable to the inode. For inodes with
> > > + * multiple quotas, we don't know exactly which quota caused an allocation
> > > + * failure. We make a best effort by running scans for each quota considered
> > > + * to be under low free space conditions (less than 1% available free space).
> > > + */
> > > +int
> > > +xfs_inode_free_quota_eofblocks(
> > > +	struct xfs_inode *ip)
> > > +{
> > > +	int scanned = 0;
> > > +	struct xfs_eofblocks eofb = {0,};
> > > +	struct xfs_dquot *dq;
> > > +
> > > +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> > > +
> > > +	/* set the scan owner to avoid potential livelock */
> > > +	eofb.eof_scan_owner = ip->i_ino;
> > > +
> > > +	if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) {
> > > +		dq = xfs_inode_dquot(ip, XFS_DQ_USER);
> > > +		if (dq && xfs_dquot_lowsp(dq)) {
> > > +			eofb.eof_uid = VFS_I(ip)->i_uid;
> > > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > > +					 XFS_EOF_FLAGS_UID;
> > > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > +			scanned = 1;
> > > +		}
> > > +	}
> > > +
> > > +	if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) {
> > > +		dq = xfs_inode_dquot(ip, XFS_DQ_GROUP);
> > > +		if (dq && xfs_dquot_lowsp(dq)) {
> > > +			eofb.eof_gid = VFS_I(ip)->i_gid;
> > > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > > +					 XFS_EOF_FLAGS_GID;
> > > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > +			scanned = 1;
> > > +		}
> > > +	}
> > 
> > Rather than doing two scans here, wouldn't it be more efficient
> > to do:
> > 
> > 	eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
> > 	scan = false;
> > 	if (uquota is low) {
> > 		eofb.eof_uid = VFS_I(ip)->i_uid;
> > 		eofb.eof_flags |= XFS_EOF_FLAGS_UID;
> > 		scan = true;
> > 	}
> > 	if (gquota is low) {
> > 		eofb.eof_gid = VFS_I(ip)->i_gid;
> > 		eofb.eof_flags |= XFS_EOF_FLAGS_GID;
> > 		scan = true;
> > 	}
> > 	if (scan)
> > 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > 
> > and change xfs_inode_match_id() to be able to check against multiple
> > flags on a single inode? That way we only scan the inode cache
> > once, regardless of the number of quota types that are enabled and
> > are tracking low space thresholds.
> > 
> 
> Yeah, that would certainly be better from this perspective. We don't
> care so much about the characteristics of the inode as much as the
> quotas that are associated with it. If I recall, I was somewhat on the
> fence about this behavior when we first added the userspace interface
> here. IOWs, should the combination of flags define an intersection of
> the set of inodes to scan or a union? The more I think about it, I think
> the interface kind of suggests the former (from an interface/aesthetic
> perspective). E.g., I probably wouldn't expect to add a GID flag to a
> UID flag and have my scan become more broad, rather than more
> restrictive. Otherwise, the existence of a uid, gid and prid in the
> request structure seems kind of arbitrary (as opposed to a list/set of
> generic IDs, for example).
> 
> I'm not against union behavior in general (and still probably not 100%
> against switching the default). I suppose another option could be to add
> a set of union/intersection flags that control the behavior here. I'd
> be slightly concerned about making this interface too convoluted, but it
> is a relatively low level thing, I suppose, without much generic use. We
> could also decide not to expose those extra controls to userspace for
> the time being.
> 
> I need to think about this some more. Thoughts on any of that?

What we expose to userspace is orthogonal to what we implement
internally. It makes sense to limit the userspace interface to a
single type at a time, but when we are doing internal cleaner work
it makes sense to match all criteria in a single cache pass.

i.e. Restrict the capability of the user interface at the input
layer rather than restricting the capability of the infrastructure
to do work efficiently...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks
  2014-05-27 12:18     ` Brian Foster
@ 2014-05-27 21:26       ` Dave Chinner
  2014-05-28  5:30         ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-05-27 21:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, xfs

On Tue, May 27, 2014 at 08:18:11AM -0400, Brian Foster wrote:
> On Tue, May 27, 2014 at 03:44:28AM -0700, Christoph Hellwig wrote:
> > On Fri, May 23, 2014 at 07:52:28AM -0400, Brian Foster wrote:
> > > The scan owner field represents an optional inode number that is
> > > responsible for the current scan. The purpose is to identify that an
> > > inode is under iolock and as such, the iolock shouldn't be attempted
> > > when trimming eofblocks. This is an internal only field.
> > 
> > xfs_free_eofblocks already does a trylock, and without that calling it
> > from one iolock holding process to another would be a deadlock waiting
> > to happen.
> > 
> > I have to say I'm still not very easy with iolock nesting, even if it's
> > a trylock.
> > 
> 
> Right... maybe I'm not parsing your point. The purpose here is to avoid
> the trylock entirely. E.g., Indicate that we have already acquired the
> lock and can proceed with xfs_free_eofblocks(), rather than fail a
> trylock and skip (which appears to be a potential infinite loop scenario
> here due to how the AG walking code handles EAGAIN).

I think Christoph's concern here is that we are calling a function
that can take the iolock while we already hold the iolock. i.e. the
reason we have to add the anti-deadlock code in the first place.  To
address that, can we restructure xfs_file_buffered_aio_write() such
that the ENOSPC/EDQUOT flush is done outside the iolock?

From a quick check, I don't think there is any problem with dropping
the iolock, doing the flushes and then going all the way back to the
start of the function again, but closer examination and testing is
warranted...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks
  2014-05-27 21:26       ` Dave Chinner
@ 2014-05-28  5:30         ` Christoph Hellwig
  2014-05-28 14:00           ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2014-05-28  5:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Brian Foster, xfs

On Wed, May 28, 2014 at 07:26:53AM +1000, Dave Chinner wrote:
> > Right... maybe I'm not parsing your point. The purpose here is to avoid
> > the trylock entirely. E.g., Indicate that we have already acquired the
> > lock and can proceed with xfs_free_eofblocks(), rather than fail a
> > trylock and skip (which appears to be a potential infinite loop scenario
> > here due to how the AG walking code handles EAGAIN).
> 
> I think Christoph's concern here is that we are calling a function
> that can take the iolock while we already hold the iolock. i.e. the
> reason we have to add the anti-deadlock code in the first place.

Indeed.

> To
> address that, can we restructure xfs_file_buffered_aio_write() such
> that the ENOSPC/EDQUOT flush is done outside the iolock?
> 
> From a quick check, I don't think there is any problem with dropping
> the iolock, doing the flushes and then going all the way back to the
> start of the function again, but closer examination and testing is
> warranted...

I think we'd need some form of early space reservation, otherwise we'd
get non-atomic writes.  Time to get those batched write patches out
again...

* Re: [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-05-27 21:14       ` Dave Chinner
@ 2014-05-28 12:42         ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2014-05-28 12:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, May 28, 2014 at 07:14:16AM +1000, Dave Chinner wrote:
> On Tue, May 27, 2014 at 08:47:55AM -0400, Brian Foster wrote:
> > On Tue, May 27, 2014 at 08:57:55AM +1000, Dave Chinner wrote:
> > > On Fri, May 23, 2014 at 07:52:29AM -0400, Brian Foster wrote:
> > > > +/*
> > > > + * Run eofblocks scans on the quotas applicable to the inode. For inodes with
> > > > + * multiple quotas, we don't know exactly which quota caused an allocation
> > > > + * failure. We make a best effort by running scans for each quota considered
> > > > + * to be under low free space conditions (less than 1% available free space).
> > > > + */
> > > > +int
> > > > +xfs_inode_free_quota_eofblocks(
> > > > +	struct xfs_inode *ip)
> > > > +{
> > > > +	int scanned = 0;
> > > > +	struct xfs_eofblocks eofb = {0,};
> > > > +	struct xfs_dquot *dq;
> > > > +
> > > > +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> > > > +
> > > > +	/* set the scan owner to avoid potential livelock */
> > > > +	eofb.eof_scan_owner = ip->i_ino;
> > > > +
> > > > +	if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) {
> > > > +		dq = xfs_inode_dquot(ip, XFS_DQ_USER);
> > > > +		if (dq && xfs_dquot_lowsp(dq)) {
> > > > +			eofb.eof_uid = VFS_I(ip)->i_uid;
> > > > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > > > +					 XFS_EOF_FLAGS_UID;
> > > > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > > +			scanned = 1;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) {
> > > > +		dq = xfs_inode_dquot(ip, XFS_DQ_GROUP);
> > > > +		if (dq && xfs_dquot_lowsp(dq)) {
> > > > +			eofb.eof_gid = VFS_I(ip)->i_gid;
> > > > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > > > +					 XFS_EOF_FLAGS_GID;
> > > > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > > +			scanned = 1;
> > > > +		}
> > > > +	}
> > > 
> > > Rather than doing two scans here, wouldn't it be more efficient
> > > to do:
> > > 
> > > 	eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
> > > 	scan = false;
> > > 	if (uquota is low) {
> > > 		eofb.eof_uid = VFS_I(ip)->i_uid;
> > > 		eofb.eof_flags |= XFS_EOF_FLAGS_UID;
> > > 		scan = true;
> > > 	}
> > > 	if (gquota is low) {
> > > 		eofb.eof_gid = VFS_I(ip)->i_gid;
> > > 		eofb.eof_flags |= XFS_EOF_FLAGS_GID;
> > > 		scan = true;
> > > 	}
> > > 	if (scan)
> > > 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > 
> > > and change xfs_inode_match_id() to be able to check against multiple
> > > flags on a single inode? That way we only scan the inode cache
> > > once, regardless of the number of quota types that are enabled and
> > > are tracking low space thresholds.
> > > 
> > 
> > Yeah, that would certainly be better from this perspective. We don't
> > care so much about the characteristics of the inode as much as the
> > quotas that are associated with it. If I recall, I was somewhat on the
> > fence about this behavior when we first added the userspace interface
> > here. IOWs, should the combination of flags define an intersection of
> > the set of inodes to scan or a union? The more I think about it, I think
> > the interface kind of suggests the former (from an interface/aesthetic
> > perspective). E.g., I probably wouldn't expect to add a GID flag to a
> > UID flag and have my scan become more broad, rather than more
> > restrictive. Otherwise, the existence of a uid, gid and prid in the
> > request structure seems kind of arbitrary (as opposed to a list/set of
> > generic IDs, for example).
> > 
> > I'm not against union behavior in general (and still probably not 100%
> > against switching the default). I suppose another option could be to add
> > a set of union/intersection flags that control the behavior here. I'd
> > be slightly concerned about making this interface too convoluted, but it
> > is a relatively low level thing, I suppose, without much generic use. We
> > could also decide not to expose those extra controls to userspace for
> > the time being.
> > 
> > I need to think about this some more. Thoughts on any of that?
> 
> What we expose to userspace is orthogonal to what we implement
> internally. It makes sense to limit the userspace interface to a
> single type at a time, but when we are doing internal cleaner work
> it makes sense to match all criteria in a single cache pass.
> 
> i.e. Restrict the capability of the user interface at the input
> layer rather than restricting the capability of the infrastructure
> to do work efficiently...
> 

Ok... so I'm thinking we can handle this with a new XFS_EOF_FLAGS_UNION
flag. This is masked off from userspace requests. It will be set for the
internal quota scan and xfs_inode_match_id() can check for it and call a
*match_id_union() variant that does the right thing. Thanks.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks
  2014-05-28  5:30         ` Christoph Hellwig
@ 2014-05-28 14:00           ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2014-05-28 14:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, May 27, 2014 at 10:30:19PM -0700, Christoph Hellwig wrote:
> On Wed, May 28, 2014 at 07:26:53AM +1000, Dave Chinner wrote:
> > > Right... maybe I'm not parsing your point. The purpose here is to avoid
> > > the trylock entirely. E.g., Indicate that we have already acquired the
> > > lock and can proceed with xfs_free_eofblocks(), rather than fail a
> > > trylock and skip (which appears to be a potential infinite loop scenario
> > > here due to how the AG walking code handles EAGAIN).
> > 
> > I think Christoph's concern here is that we are calling a function
> > that can take the iolock while we already hold the iolock. i.e. the
> > reason we have to add the anti-deadlock code in the first place.
> 
> Indeed.
> 

Ah, I didn't parse correctly then. Thanks...

> > To
> > address that, can we restructure xfs_file_buffered_aio_write() such
> > that the ENOSPC/EDQUOT flush is done outside the iolock?
> > 
> > From a quick check, I don't think there is any problem with dropping
> > the iolock, doing the flushes and then going all the way back to the
> > start of the function again, but closer examination and testing is
> > warranted...
> 

I considered this briefly early on, but wasn't sure about whether we
should run through the write_checks() bits more than once (e.g.,
potentially do the eof zeroing, etc., multiple times?).

> I think we'd need some form of early space reservation, otherwise we'd
> get non-atomic writes.  Time to get those batches write patches out
> again..
> 

So the concern is that multiple writers to an overlapped range could
become interleaved? From a pass through the code, we hit
generic_perform_write(), which iterates over the iov in a
write_begin/copy_write_end loop. If we hit ENOSPC somewhere in the
middle, we'd return what we've written so far. I don't believe the
buffered_aio_write() path would see the error unless it was the first
attempt at a delayed allocation. IOW, mid-write failure will be a short
write vs. an ENOSPC error.

It seems like it _might_ be safe to drop and reacquire iolock given
these semantics (notwithstanding the write_checks() bits), but I could
certainly be missing something...

Brian

end of thread, other threads:[~2014-05-28 14:01 UTC | newest]

Thread overview: 15+ messages
2014-05-23 11:52 [PATCH v2 0/3] xfs: run eofblocks scan on ENOSPC Brian Foster
2014-05-23 11:52 ` [PATCH v2 1/3] xfs: add scan owner field to xfs_eofblocks Brian Foster
2014-05-26 22:49   ` Dave Chinner
2014-05-27 10:44   ` Christoph Hellwig
2014-05-27 12:18     ` Brian Foster
2014-05-27 21:26       ` Dave Chinner
2014-05-28  5:30         ` Christoph Hellwig
2014-05-28 14:00           ` Brian Foster
2014-05-23 11:52 ` [PATCH v2 2/3] xfs: run an eofblocks scan on ENOSPC/EDQUOT Brian Foster
2014-05-26 22:57   ` Dave Chinner
2014-05-27 12:47     ` Brian Foster
2014-05-27 21:14       ` Dave Chinner
2014-05-28 12:42         ` Brian Foster
2014-05-23 11:52 ` [PATCH v2 3/3] xfs: squash prealloc while over quota free space as well Brian Foster
2014-05-26 23:00   ` Dave Chinner
