* [PATCH] bulkstat fixups
@ 2007-11-09 5:24 Lachlan McIlroy
2007-11-09 5:35 ` Vlad Apostolov
2007-11-11 21:48 ` David Chinner
0 siblings, 2 replies; 10+ messages in thread
From: Lachlan McIlroy @ 2007-11-09 5:24 UTC (permalink / raw)
To: xfs-dev, xfs-oss
[-- Attachment #1: Type: text/plain, Size: 2374 bytes --]
Here's a collection of fixups for bulkstat for all the remaining issues.
- sanity check for NULL user buffer in xfs_ioc_bulkstat[_compat]()
- remove the special case for XFS_IOC_FSBULKSTAT with count == 1. This special
case causes bulkstat to fail because the special case uses xfs_bulkstat_single()
instead of xfs_bulkstat() and the two functions have different semantics.
xfs_bulkstat() will return the next inode after the one supplied while skipping
internal inodes (ie quota inodes). xfs_bulkstate_single() will only lookup the
inode supplied and return an error if it is an internal inode.
- in xfs_bulkstat(), need to initialise 'lastino' to the inode supplied so in cases
were we return without examining any inodes the scan wont restart back at zero.
- sanity check for valid *ubcountp values. Cannot sanity check for valid ubuffer
here because some users of xfs_bulkstat() don't supply a buffer.
- checks against 'ubleft' (the space left in the user's buffer) should be against
'statstruct_size' which is the supplied minimum object size. The mixture of
checks against statstruct_size and 0 was one of the reasons we were skipping
inodes.
- if the formatter function returns BULKSTAT_RV_NOTHING and an error and the error
is not ENOENT or EINVAL then we need to abort the scan. ENOENT is for inodes that
are no longer valid and we just skip them. EINVAL is returned if we try to lookup
an internal inode so we skip them too. For a DMF scan if the inode and DMF
attribute cannot fit into the space left in the user's buffer it would return
ERANGE. We didn't handle this error and skipped the inode. We would continue to
skip inodes until one fitted into the user's buffer or we completed the scan.
- put back the recalculation of agino (that got removed with the last fix) at the
end of the while loop. This is because the code at the start of the loop expects
agino to be the last inode examined if it is non-zero.
- if we found some inodes but then encountered an error, return success this time
and the error next time. If the formatter aborted with ENOMEM we will now return
this error but only if we couldn't read any inodes. Previously if we encountered
ENOMEM without reading any inodes we returned a zero count and no error which
falsely indicated the scan was complete.
[-- Attachment #2: bulkstat.diff --]
[-- Type: text/x-patch, Size: 4322 bytes --]
--- fs/xfs/linux-2.6/xfs_ioctl.c_1.158 2007-11-09 15:51:03.000000000 +1100
+++ fs/xfs/linux-2.6/xfs_ioctl.c 2007-11-09 11:57:50.000000000 +1100
@@ -1024,24 +1024,20 @@ xfs_ioc_bulkstat(
if ((count = bulkreq.icount) <= 0)
return -XFS_ERROR(EINVAL);
+ if (bulkreq.ubuffer == NULL)
+ return -XFS_ERROR(EINVAL);
+
if (cmd == XFS_IOC_FSINUMBERS)
error = xfs_inumbers(mp, &inlast, &count,
bulkreq.ubuffer, xfs_inumbers_fmt);
else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE)
error = xfs_bulkstat_single(mp, &inlast,
bulkreq.ubuffer, &done);
- else { /* XFS_IOC_FSBULKSTAT */
- if (count == 1 && inlast != 0) {
- inlast++;
- error = xfs_bulkstat_single(mp, &inlast,
- bulkreq.ubuffer, &done);
- } else {
- error = xfs_bulkstat(mp, &inlast, &count,
- (bulkstat_one_pf)xfs_bulkstat_one, NULL,
- sizeof(xfs_bstat_t), bulkreq.ubuffer,
- BULKSTAT_FG_QUICK, &done);
- }
- }
+ else /* XFS_IOC_FSBULKSTAT */
+ error = xfs_bulkstat(mp, &inlast, &count,
+ (bulkstat_one_pf)xfs_bulkstat_one, NULL,
+ sizeof(xfs_bstat_t), bulkreq.ubuffer,
+ BULKSTAT_FG_QUICK, &done);
if (error)
return -error;
--- fs/xfs/linux-2.6/xfs_ioctl32.c_1.23 2007-11-02 14:27:11.000000000 +1100
+++ fs/xfs/linux-2.6/xfs_ioctl32.c 2007-11-02 14:13:27.000000000 +1100
@@ -292,6 +292,9 @@ xfs_ioc_bulkstat_compat(
if ((count = bulkreq.icount) <= 0)
return -XFS_ERROR(EINVAL);
+ if (bulkreq.ubuffer == NULL)
+ return -XFS_ERROR(EINVAL);
+
if (cmd == XFS_IOC_FSINUMBERS)
error = xfs_inumbers(mp, &inlast, &count,
bulkreq.ubuffer, xfs_inumbers_fmt_compat);
--- fs/xfs/xfs_itable.c_1.157 2007-10-25 17:22:09.000000000 +1000
+++ fs/xfs/xfs_itable.c 2007-11-01 17:22:28.000000000 +1100
@@ -353,7 +353,7 @@ xfs_bulkstat(
xfs_inobt_rec_incore_t *irbp; /* current irec buffer pointer */
xfs_inobt_rec_incore_t *irbuf; /* start of irec buffer */
xfs_inobt_rec_incore_t *irbufend; /* end of good irec buffer entries */
- xfs_ino_t lastino=0; /* last inode number returned */
+ xfs_ino_t lastino; /* last inode number returned */
int nbcluster; /* # of blocks in a cluster */
int nicluster; /* # of inodes in a cluster */
int nimask; /* mask for inode clusters */
@@ -373,6 +373,7 @@ xfs_bulkstat(
* Get the last inode value, see if there's nothing to do.
*/
ino = (xfs_ino_t)*lastinop;
+ lastino = ino;
dip = NULL;
agno = XFS_INO_TO_AGNO(mp, ino);
agino = XFS_INO_TO_AGINO(mp, ino);
@@ -382,6 +383,9 @@ xfs_bulkstat(
*ubcountp = 0;
return 0;
}
+ if (!ubcountp || *ubcountp <= 0) {
+ return EINVAL;
+ }
ubcount = *ubcountp; /* statstruct's */
ubleft = ubcount * statstruct_size; /* bytes */
*ubcountp = ubelem = 0;
@@ -560,7 +564,7 @@ xfs_bulkstat(
* Now process this chunk of inodes.
*/
for (agino = irbp->ir_startino, chunkidx = clustidx = 0;
- ubleft > 0 &&
+ ubleft >= statstruct_size &&
irbp->ir_freecount < XFS_INODES_PER_CHUNK;
chunkidx++, clustidx++, agino++) {
ASSERT(chunkidx < XFS_INODES_PER_CHUNK);
@@ -663,15 +667,13 @@ xfs_bulkstat(
ubleft, private_data,
bno, &ubused, dip, &fmterror);
if (fmterror == BULKSTAT_RV_NOTHING) {
- if (error == EFAULT) {
- ubleft = 0;
- rval = error;
- break;
- }
- else if (error == ENOMEM)
+ if (error && error != ENOENT &&
+ error != EINVAL) {
ubleft = 0;
- else
- lastino = ino;
+ rval = error;
+ break;
+ }
+ lastino = ino;
continue;
}
if (fmterror == BULKSTAT_RV_GIVEUP) {
@@ -694,11 +696,12 @@ xfs_bulkstat(
/*
* Set up for the next loop iteration.
*/
- if (ubleft > 0) {
+ if (ubleft >= statstruct_size) {
if (end_of_ag) {
agno++;
agino = 0;
- }
+ } else
+ agino = XFS_INO_TO_AGINO(mp, lastino);
} else
break;
}
@@ -707,6 +710,11 @@ xfs_bulkstat(
*/
kmem_free(irbuf, irbsize);
*ubcountp = ubelem;
+ /*
+ * Found some inodes, return them now and return the error next time.
+ */
+ if (ubelem)
+ rval = 0;
if (agno >= mp->m_sb.sb_agcount) {
/*
* If we ran out of filesystem, mark lastino as off
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-09 5:24 [PATCH] bulkstat fixups Lachlan McIlroy
@ 2007-11-09 5:35 ` Vlad Apostolov
2007-11-11 21:48 ` David Chinner
1 sibling, 0 replies; 10+ messages in thread
From: Vlad Apostolov @ 2007-11-09 5:35 UTC (permalink / raw)
To: lachlan; +Cc: xfs-dev, xfs-oss
It is looking good Lachlan.
I also verified the patch with XFS QA and Judith reported that
it fixed the xfs_bulkstat() problem - skipping inodes in the last AG.
Regards,
Vlad
Lachlan McIlroy wrote:
> Here's a collection of fixups for bulkstat for all the remaining issues.
>
> - sanity check for NULL user buffer in xfs_ioc_bulkstat[_compat]()
>
> - remove the special case for XFS_IOC_FSBULKSTAT with count == 1.
> This special
> case causes bulkstat to fail because the special case uses
> xfs_bulkstat_single()
> instead of xfs_bulkstat() and the two functions have different
> semantics.
> xfs_bulkstat() will return the next inode after the one supplied
> while skipping
> internal inodes (ie quota inodes). xfs_bulkstate_single() will only
> lookup the
> inode supplied and return an error if it is an internal inode.
>
> - in xfs_bulkstat(), need to initialise 'lastino' to the inode
> supplied so in cases
> were we return without examining any inodes the scan wont restart
> back at zero.
>
> - sanity check for valid *ubcountp values. Cannot sanity check for
> valid ubuffer
> here because some users of xfs_bulkstat() don't supply a buffer.
>
> - checks against 'ubleft' (the space left in the user's buffer) should
> be against
> 'statstruct_size' which is the supplied minimum object size. The
> mixture of
> checks against statstruct_size and 0 was one of the reasons we were
> skipping
> inodes.
>
> - if the formatter function returns BULKSTAT_RV_NOTHING and an error
> and the error
> is not ENOENT or EINVAL then we need to abort the scan. ENOENT is
> for inodes that
> are no longer valid and we just skip them. EINVAL is returned if we
> try to lookup
> an internal inode so we skip them too. For a DMF scan if the inode
> and DMF
> attribute cannot fit into the space left in the user's buffer it
> would return
> ERANGE. We didn't handle this error and skipped the inode. We
> would continue to
> skip inodes until one fitted into the user's buffer or we completed
> the scan.
>
> - put back the recalculation of agino (that got removed with the last
> fix) at the
> end of the while loop. This is because the code at the start of the
> loop expects
> agino to be the last inode examined if it is non-zero.
>
> - if we found some inodes but then encountered an error, return
> success this time
> and the error next time. If the formatter aborted with ENOMEM we
> will now return
> this error but only if we couldn't read any inodes. Previously if
> we encountered
> ENOMEM without reading any inodes we returned a zero count and no
> error which
> falsely indicated the scan was complete.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-09 5:24 [PATCH] bulkstat fixups Lachlan McIlroy
2007-11-09 5:35 ` Vlad Apostolov
@ 2007-11-11 21:48 ` David Chinner
2007-11-12 2:57 ` Lachlan McIlroy
1 sibling, 1 reply; 10+ messages in thread
From: David Chinner @ 2007-11-11 21:48 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss
[Lachlan, can you wrap your email text at 72 columns for ease of quoting?]
On Fri, Nov 09, 2007 at 04:24:02PM +1100, Lachlan McIlroy wrote:
> Here's a collection of fixups for bulkstat for all the remaining issues.
>
> - sanity check for NULL user buffer in xfs_ioc_bulkstat[_compat]()
OK.
> - remove the special case for XFS_IOC_FSBULKSTAT with count == 1. This
> special
> case causes bulkstat to fail because the special case uses
> xfs_bulkstat_single()
> instead of xfs_bulkstat() and the two functions have different semantics.
> xfs_bulkstat() will return the next inode after the one supplied while
> skipping
> internal inodes (ie quota inodes). xfs_bulkstate_single() will only
> lookup the
> inode supplied and return an error if it is an internal inode.
Userspace visile change. What applications do we have that rely on this
behaviour that will be broken by this change?
> - in xfs_bulkstat(), need to initialise 'lastino' to the inode supplied so
> in cases
> were we return without examining any inodes the scan wont restart back at
> zero.
Ok.
> - sanity check for valid *ubcountp values. Cannot sanity check for valid
> ubuffer
> here because some users of xfs_bulkstat() don't supply a buffer.
Ok - the cases that supply a buffer are caught by the first checks you
added, right?
> - checks against 'ubleft' (the space left in the user's buffer) should be
> against
> 'statstruct_size' which is the supplied minimum object size. The mixture
> of
> checks against statstruct_size and 0 was one of the reasons we were
> skipping
> inodes.
Can you wrap these checks in a static inline function so that it is obvious
what the correct way to check is and we don't reintroduce this porblem? i.e.
static inline int
xfs_bulkstat_ubuffer_large_enough(ssize_t space)
{
return (space > sizeof(struct blah));
}
That will also remove a stack variable....
> - if the formatter function returns BULKSTAT_RV_NOTHING and an error and
> the error
> is not ENOENT or EINVAL then we need to abort the scan. ENOENT is for
> inodes that
> are no longer valid and we just skip them. EINVAL is returned if we try
> to lookup
> an internal inode so we skip them too. For a DMF scan if the inode and
> DMF
> attribute cannot fit into the space left in the user's buffer it would
> return
> ERANGE. We didn't handle this error and skipped the inode. We would
> continue to
> skip inodes until one fitted into the user's buffer or we completed the
> scan.
ok.
> - put back the recalculation of agino (that got removed with the last fix)
> at the
> end of the while loop. This is because the code at the start of the loop
> expects
> agino to be the last inode examined if it is non-zero.
ok.
> - if we found some inodes but then encountered an error, return success
> this time
> and the error next time. If the formatter aborted with ENOMEM we will
> now return
> this error but only if we couldn't read any inodes. Previously if we
> encountered
> ENOMEM without reading any inodes we returned a zero count and no error
> which
> falsely indicated the scan was complete.
ok.
FWIW - missing from this set of patches - cpu_relax() in the loops. In the case
where no I/O is required to do the scan, we can hold the cpu for a long time
and that will hold off I/O completion, etc for the cpu bulkstat is running on.
Hence after every cluster we scan we should cpu_relax() to allow other
processes cpu time on that cpu.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-11 21:48 ` David Chinner
@ 2007-11-12 2:57 ` Lachlan McIlroy
2007-11-12 4:11 ` David Chinner
0 siblings, 1 reply; 10+ messages in thread
From: Lachlan McIlroy @ 2007-11-12 2:57 UTC (permalink / raw)
To: David Chinner; +Cc: xfs-dev, xfs-oss
David Chinner wrote:
> [Lachlan, can you wrap your email text at 72 columns for ease of quoting?]
>
> On Fri, Nov 09, 2007 at 04:24:02PM +1100, Lachlan McIlroy wrote:
>> Here's a collection of fixups for bulkstat for all the remaining issues.
>>
>> - sanity check for NULL user buffer in xfs_ioc_bulkstat[_compat]()
>
> OK.
>
>> - remove the special case for XFS_IOC_FSBULKSTAT with count == 1. This
>> special
>> case causes bulkstat to fail because the special case uses
>> xfs_bulkstat_single()
>> instead of xfs_bulkstat() and the two functions have different semantics.
>> xfs_bulkstat() will return the next inode after the one supplied while
>> skipping
>> internal inodes (ie quota inodes). xfs_bulkstate_single() will only
>> lookup the
>> inode supplied and return an error if it is an internal inode.
>
> Userspace visile change. What applications do we have that rely on this
> behaviour that will be broken by this change?
Any apps that rely on the existing behaviour are probably broken. If an app
wants to call xfs_bulkstat_single() it should use XFS_IOC_FSBULKSTAT_SINGLE.
Alex Elder reported he has a test program that exploited this bug to compare
the results of xfs_bulkstat_one (does an iget) with the results of an
xfs_bulkstat (copying the inode from the cluster). That should be easy to fix.
Other than that it doesn't make sense for an app to use bulkstat to get one
inode at a time but if they do it will now work correctly.
>
>> - in xfs_bulkstat(), need to initialise 'lastino' to the inode supplied so
>> in cases
>> were we return without examining any inodes the scan wont restart back at
>> zero.
>
> Ok.
>
>> - sanity check for valid *ubcountp values. Cannot sanity check for valid
>> ubuffer
>> here because some users of xfs_bulkstat() don't supply a buffer.
>
> Ok - the cases that supply a buffer are caught by the first checks you
> added, right?
>
>> - checks against 'ubleft' (the space left in the user's buffer) should be
>> against
>> 'statstruct_size' which is the supplied minimum object size. The mixture
>> of
>> checks against statstruct_size and 0 was one of the reasons we were
>> skipping
>> inodes.
>
> Can you wrap these checks in a static inline function so that it is obvious
> what the correct way to check is and we don't reintroduce this porblem? i.e.
>
> static inline int
> xfs_bulkstat_ubuffer_large_enough(ssize_t space)
> {
> return (space > sizeof(struct blah));
> }
>
> That will also remove a stack variable....
That won't work - statstruct_size is passed into xfs_bulkstat() so we don't
know what 'blah' is. Maybe a macro would be easier.
#define XFS_BULKSTAT_UBLEFT (ubleft >= statstruct_size)
>
>> - if the formatter function returns BULKSTAT_RV_NOTHING and an error and
>> the error
>> is not ENOENT or EINVAL then we need to abort the scan. ENOENT is for
>> inodes that
>> are no longer valid and we just skip them. EINVAL is returned if we try
>> to lookup
>> an internal inode so we skip them too. For a DMF scan if the inode and
>> DMF
>> attribute cannot fit into the space left in the user's buffer it would
>> return
>> ERANGE. We didn't handle this error and skipped the inode. We would
>> continue to
>> skip inodes until one fitted into the user's buffer or we completed the
>> scan.
>
> ok.
>
>> - put back the recalculation of agino (that got removed with the last fix)
>> at the
>> end of the while loop. This is because the code at the start of the loop
>> expects
>> agino to be the last inode examined if it is non-zero.
>
> ok.
>
>> - if we found some inodes but then encountered an error, return success
>> this time
>> and the error next time. If the formatter aborted with ENOMEM we will
>> now return
>> this error but only if we couldn't read any inodes. Previously if we
>> encountered
>> ENOMEM without reading any inodes we returned a zero count and no error
>> which
>> falsely indicated the scan was complete.
>
> ok.
>
> FWIW - missing from this set of patches - cpu_relax() in the loops. In the case
> where no I/O is required to do the scan, we can hold the cpu for a long time
> and that will hold off I/O completion, etc for the cpu bulkstat is running on.
> Hence after every cluster we scan we should cpu_relax() to allow other
> processes cpu time on that cpu.
>
I don't get how cpu_relax() works. I see that it is called at times with a
spinlock held so it wont trigger a context switch. Does it give interrupts a
chance to run?
It appears to be used where a minor delay is needed - I don't think we have any
cases in xfs_bulkstat() where we need to wait for an event that isn't I/O.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-12 2:57 ` Lachlan McIlroy
@ 2007-11-12 4:11 ` David Chinner
2007-11-16 4:34 ` Lachlan McIlroy
0 siblings, 1 reply; 10+ messages in thread
From: David Chinner @ 2007-11-12 4:11 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss
On Mon, Nov 12, 2007 at 01:57:33PM +1100, Lachlan McIlroy wrote:
> David Chinner wrote:
> >[Lachlan, can you wrap your email text at 72 columns for ease of quoting?]
> >
> >On Fri, Nov 09, 2007 at 04:24:02PM +1100, Lachlan McIlroy wrote:
> >>Here's a collection of fixups for bulkstat for all the remaining issues.
> >>
> >>- sanity check for NULL user buffer in xfs_ioc_bulkstat[_compat]()
> >
> >OK.
> >
> >>- remove the special case for XFS_IOC_FSBULKSTAT with count == 1. This
> >>special
> >> case causes bulkstat to fail because the special case uses
> >> xfs_bulkstat_single()
> >> instead of xfs_bulkstat() and the two functions have different
> >> semantics.
> >> xfs_bulkstat() will return the next inode after the one supplied while
> >> skipping
> >> internal inodes (ie quota inodes). xfs_bulkstate_single() will only
> >> lookup the
> >> inode supplied and return an error if it is an internal inode.
> >
> >Userspace visile change. What applications do we have that rely on this
> >behaviour that will be broken by this change?
>
> Any apps that rely on the existing behaviour are probably broken. If an app
> wants to call xfs_bulkstat_single() it should use XFS_IOC_FSBULKSTAT_SINGLE.
Perhaps, but we can't arbitrarily decide that those apps will now break on
a new kernel with this change. At minimum we need to audit all of the code
we have that uses bulkstat for such breakage (including DMF!) before we make a
change like this.
> >>- checks against 'ubleft' (the space left in the user's buffer) should be
> >>against
> >> 'statstruct_size' which is the supplied minimum object size. The
> >> mixture of
> >> checks against statstruct_size and 0 was one of the reasons we were
> >> skipping
> >> inodes.
> >
> >Can you wrap these checks in a static inline function so that it is obvious
> >what the correct way to check is and we don't reintroduce this porblem?
> >i.e.
> >
> >static inline int
> >xfs_bulkstat_ubuffer_large_enough(ssize_t space)
> >{
> > return (space > sizeof(struct blah));
> >}
> >
> >That will also remove a stack variable....
>
> That won't work - statstruct_size is passed into xfs_bulkstat() so we don't
> know what 'blah' is. Maybe a macro would be easier.
>
> #define XFS_BULKSTAT_UBLEFT (ubleft >= statstruct_size)
Yeah, something like that, but I don't like macros with no parameters used
like that....
> >FWIW - missing from this set of patches - cpu_relax() in the loops. In the
> >case
> >where no I/O is required to do the scan, we can hold the cpu for a long
> >time
> >and that will hold off I/O completion, etc for the cpu bulkstat is running
> >on.
> >Hence after every cluster we scan we should cpu_relax() to allow other
> >processes cpu time on that cpu.
> >
>
> I don't get how cpu_relax() works. I see that it is called at times with a
> spinlock held so it wont trigger a context switch. Does it give interrupts
> a chance to run?
Sorry, my mistake - confused cpu_relax() with cond_resched(). take the above
paragraph and s/cpu_relax/cond_resched/g
> It appears to be used where a minor delay is needed - I don't think we have
> any
> cases in xfs_bulkstat() where we need to wait for an event that isn't I/O.
The issue is when we're hitting cached buffers and we never end up waiting
for I/O - we will then monopolise the cpu we are running on and hold off
all other processing. It's antisocial and leads to high latencies for other
code.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-12 4:11 ` David Chinner
@ 2007-11-16 4:34 ` Lachlan McIlroy
2007-11-16 4:42 ` Lachlan McIlroy
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Lachlan McIlroy @ 2007-11-16 4:34 UTC (permalink / raw)
To: David Chinner; +Cc: xfs-dev, xfs-oss
[-- Attachment #1: Type: text/plain, Size: 4039 bytes --]
Updated patch - I added cond_resched() calls into each loop - for loops that
have a 'continue' somewhere in them I added the cond_resched() at the start,
otherwise I put it at the end.
David Chinner wrote:
> On Mon, Nov 12, 2007 at 01:57:33PM +1100, Lachlan McIlroy wrote:
>> David Chinner wrote:
>>> [Lachlan, can you wrap your email text at 72 columns for ease of quoting?]
>>>
>>> On Fri, Nov 09, 2007 at 04:24:02PM +1100, Lachlan McIlroy wrote:
>>>> Here's a collection of fixups for bulkstat for all the remaining issues.
>>>>
>>>> - sanity check for NULL user buffer in xfs_ioc_bulkstat[_compat]()
>>> OK.
>>>
>>>> - remove the special case for XFS_IOC_FSBULKSTAT with count == 1. This
>>>> special
>>>> case causes bulkstat to fail because the special case uses
>>>> xfs_bulkstat_single()
>>>> instead of xfs_bulkstat() and the two functions have different
>>>> semantics.
>>>> xfs_bulkstat() will return the next inode after the one supplied while
>>>> skipping
>>>> internal inodes (ie quota inodes). xfs_bulkstate_single() will only
>>>> lookup the
>>>> inode supplied and return an error if it is an internal inode.
>>> Userspace visile change. What applications do we have that rely on this
>>> behaviour that will be broken by this change?
>> Any apps that rely on the existing behaviour are probably broken. If an app
>> wants to call xfs_bulkstat_single() it should use XFS_IOC_FSBULKSTAT_SINGLE.
>
> Perhaps, but we can't arbitrarily decide that those apps will now break on
> a new kernel with this change. At minimum we need to audit all of the code
> we have that uses bulkstat for such breakage (including DMF!) before we make a
> change like this.
I've looked through everything we have in xfs-cmds and nothing relies on
this bug being present. Vlad helped me with the DMF side - DMF does not
use the XFS_IOC_FSBULKSTAT ioctl, it has it's own interface into the kernel
which calls xfs_bulkstat() directly so it wont be affected by this change.
>
>>>> - checks against 'ubleft' (the space left in the user's buffer) should be
>>>> against
>>>> 'statstruct_size' which is the supplied minimum object size. The
>>>> mixture of
>>>> checks against statstruct_size and 0 was one of the reasons we were
>>>> skipping
>>>> inodes.
>>> Can you wrap these checks in a static inline function so that it is obvious
>>> what the correct way to check is and we don't reintroduce this porblem?
>>> i.e.
>>>
>>> static inline int
>>> xfs_bulkstat_ubuffer_large_enough(ssize_t space)
>>> {
>>> return (space > sizeof(struct blah));
>>> }
>>>
>>> That will also remove a stack variable....
>> That won't work - statstruct_size is passed into xfs_bulkstat() so we don't
>> know what 'blah' is. Maybe a macro would be easier.
>>
>> #define XFS_BULKSTAT_UBLEFT (ubleft >= statstruct_size)
>
> Yeah, something like that, but I don't like macros with no parameters used
> like that....
>
>>> FWIW - missing from this set of patches - cpu_relax() in the loops. In the
>>> case
>>> where no I/O is required to do the scan, we can hold the cpu for a long
>>> time
>>> and that will hold off I/O completion, etc for the cpu bulkstat is running
>>> on.
>>> Hence after every cluster we scan we should cpu_relax() to allow other
>>> processes cpu time on that cpu.
>>>
>> I don't get how cpu_relax() works. I see that it is called at times with a
>> spinlock held so it wont trigger a context switch. Does it give interrupts
>> a chance to run?
>
> Sorry, my mistake - confused cpu_relax() with cond_resched(). take the above
> paragraph and s/cpu_relax/cond_resched/g
>
>> It appears to be used where a minor delay is needed - I don't think we have
>> any
>> cases in xfs_bulkstat() where we need to wait for an event that isn't I/O.
>
> The issue is when we're hitting cached buffers and we never end up waiting
> for I/O - we will then monopolise the cpu we are running on and hold off
> all other processing. It's antisocial and leads to high latencies for other
> code.
>
> Cheers,
>
> Dave.
[-- Attachment #2: bulkstat.diff --]
[-- Type: text/x-patch, Size: 4144 bytes --]
--- fs/xfs/xfs_itable.c_1.157 2007-10-25 17:22:09.000000000 +1000
+++ fs/xfs/xfs_itable.c 2007-11-16 14:52:31.000000000 +1100
@@ -316,6 +316,8 @@ xfs_bulkstat_use_dinode(
return 1;
}
+#define XFS_BULKSTAT_UBLEFT(ubleft) ((ubleft) >= statstruct_size)
+
/*
* Return stat information in bulk (by-inode) for the filesystem.
*/
@@ -353,7 +355,7 @@ xfs_bulkstat(
xfs_inobt_rec_incore_t *irbp; /* current irec buffer pointer */
xfs_inobt_rec_incore_t *irbuf; /* start of irec buffer */
xfs_inobt_rec_incore_t *irbufend; /* end of good irec buffer entries */
- xfs_ino_t lastino=0; /* last inode number returned */
+ xfs_ino_t lastino; /* last inode number returned */
int nbcluster; /* # of blocks in a cluster */
int nicluster; /* # of inodes in a cluster */
int nimask; /* mask for inode clusters */
@@ -373,6 +375,7 @@ xfs_bulkstat(
* Get the last inode value, see if there's nothing to do.
*/
ino = (xfs_ino_t)*lastinop;
+ lastino = ino;
dip = NULL;
agno = XFS_INO_TO_AGNO(mp, ino);
agino = XFS_INO_TO_AGINO(mp, ino);
@@ -382,6 +385,9 @@ xfs_bulkstat(
*ubcountp = 0;
return 0;
}
+ if (!ubcountp || *ubcountp <= 0) {
+ return EINVAL;
+ }
ubcount = *ubcountp; /* statstruct's */
ubleft = ubcount * statstruct_size; /* bytes */
*ubcountp = ubelem = 0;
@@ -402,7 +408,8 @@ xfs_bulkstat(
* inode returned; 0 means start of the allocation group.
*/
rval = 0;
- while (ubleft >= statstruct_size && agno < mp->m_sb.sb_agcount) {
+ while (XFS_BULKSTAT_UBLEFT(ubleft) && agno < mp->m_sb.sb_agcount) {
+ cond_resched();
bp = NULL;
down_read(&mp->m_peraglock);
error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
@@ -499,6 +506,7 @@ xfs_bulkstat(
break;
error = xfs_inobt_lookup_ge(cur, agino, 0, 0,
&tmp);
+ cond_resched();
}
/*
* If ran off the end of the ag either with an error,
@@ -542,6 +550,7 @@ xfs_bulkstat(
*/
agino = gino + XFS_INODES_PER_CHUNK;
error = xfs_inobt_increment(cur, 0, &tmp);
+ cond_resched();
}
/*
* Drop the btree buffers and the agi buffer.
@@ -555,15 +564,16 @@ xfs_bulkstat(
*/
irbufend = irbp;
for (irbp = irbuf;
- irbp < irbufend && ubleft >= statstruct_size; irbp++) {
+ irbp < irbufend && XFS_BULKSTAT_UBLEFT(ubleft); irbp++) {
/*
* Now process this chunk of inodes.
*/
for (agino = irbp->ir_startino, chunkidx = clustidx = 0;
- ubleft > 0 &&
+ XFS_BULKSTAT_UBLEFT(ubleft) &&
irbp->ir_freecount < XFS_INODES_PER_CHUNK;
chunkidx++, clustidx++, agino++) {
ASSERT(chunkidx < XFS_INODES_PER_CHUNK);
+ cond_resched();
/*
* Recompute agbno if this is the
* first inode of the cluster.
@@ -663,15 +673,13 @@ xfs_bulkstat(
ubleft, private_data,
bno, &ubused, dip, &fmterror);
if (fmterror == BULKSTAT_RV_NOTHING) {
- if (error == EFAULT) {
- ubleft = 0;
- rval = error;
- break;
- }
- else if (error == ENOMEM)
+ if (error && error != ENOENT &&
+ error != EINVAL) {
ubleft = 0;
- else
- lastino = ino;
+ rval = error;
+ break;
+ }
+ lastino = ino;
continue;
}
if (fmterror == BULKSTAT_RV_GIVEUP) {
@@ -686,6 +694,8 @@ xfs_bulkstat(
ubelem++;
lastino = ino;
}
+
+ cond_resched();
}
if (bp)
@@ -694,11 +704,12 @@ xfs_bulkstat(
/*
* Set up for the next loop iteration.
*/
- if (ubleft > 0) {
+ if (XFS_BULKSTAT_UBLEFT(ubleft)) {
if (end_of_ag) {
agno++;
agino = 0;
- }
+ } else
+ agino = XFS_INO_TO_AGINO(mp, lastino);
} else
break;
}
@@ -707,6 +718,11 @@ xfs_bulkstat(
*/
kmem_free(irbuf, irbsize);
*ubcountp = ubelem;
+ /*
+ * Found some inodes, return them now and return the error next time.
+ */
+ if (ubelem)
+ rval = 0;
if (agno >= mp->m_sb.sb_agcount) {
/*
* If we ran out of filesystem, mark lastino as off
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-16 4:34 ` Lachlan McIlroy
@ 2007-11-16 4:42 ` Lachlan McIlroy
2007-11-19 3:02 ` David Chinner
2007-11-21 15:17 ` Christoph Hellwig
2 siblings, 0 replies; 10+ messages in thread
From: Lachlan McIlroy @ 2007-11-16 4:42 UTC (permalink / raw)
To: lachlan; +Cc: David Chinner, xfs-dev, xfs-oss
Forgot to mention - this patch is just fs/xfs/xfs_itable.c. That's the
only file that has been updated since the last patch.
Lachlan McIlroy wrote:
> Updated patch - I added cond_resched() calls into each loop - for loops
> that
> have a 'continue' somewhere in them I added the cond_resched() at the
> start,
> otherwise I put it at the end.
>
> David Chinner wrote:
>> On Mon, Nov 12, 2007 at 01:57:33PM +1100, Lachlan McIlroy wrote:
>>> David Chinner wrote:
>>>> [Lachlan, can you wrap your email text at 72 columns for ease of
>>>> quoting?]
>>>>
>>>> On Fri, Nov 09, 2007 at 04:24:02PM +1100, Lachlan McIlroy wrote:
>>>>> Here's a collection of fixups for bulkstat for all the remaining
>>>>> issues.
>>>>>
>>>>> - sanity check for NULL user buffer in xfs_ioc_bulkstat[_compat]()
>>>> OK.
>>>>
>>>>> - remove the special case for XFS_IOC_FSBULKSTAT with count == 1.
>>>>> This special
>>>>> case causes bulkstat to fail because the special case uses
>>>>> xfs_bulkstat_single()
>>>>> instead of xfs_bulkstat() and the two functions have different
>>>>> semantics.
>>>>> xfs_bulkstat() will return the next inode after the one supplied
>>>>> while skipping
>>>>> internal inodes (ie quota inodes). xfs_bulkstate_single() will
>>>>> only lookup the
>>>>> inode supplied and return an error if it is an internal inode.
>>>> Userspace visile change. What applications do we have that rely on this
>>>> behaviour that will be broken by this change?
>>> Any apps that rely on the existing behaviour are probably broken. If
>>> an app
>>> wants to call xfs_bulkstat_single() it should use
>>> XFS_IOC_FSBULKSTAT_SINGLE.
>>
>> Perhaps, but we can't arbitrarily decide that those apps will now
>> break on
>> a new kernel with this change. At minimum we need to audit all of the
>> code
>> we have that uses bulkstat for such breakage (including DMF!) before
>> we make a
>> change like this.
>
> I've looked through everything we have in xfs-cmds and nothing relies on
> this bug being present. Vlad helped me with the DMF side - DMF does not
> use the XFS_IOC_FSBULKSTAT ioctl, it has it's own interface into the kernel
> which calls xfs_bulkstat() directly so it wont be affected by this change.
>
>>
>>>>> - checks against 'ubleft' (the space left in the user's buffer)
>>>>> should be against
>>>>> 'statstruct_size' which is the supplied minimum object size. The
>>>>> mixture of
>>>>> checks against statstruct_size and 0 was one of the reasons we
>>>>> were skipping
>>>>> inodes.
>>>> Can you wrap these checks in a static inline function so that it is
>>>> obvious
>>>> what the correct way to check is and we don't reintroduce this
>>>> porblem? i.e.
>>>>
>>>> static inline int
>>>> xfs_bulkstat_ubuffer_large_enough(ssize_t space)
>>>> {
>>>> return (space > sizeof(struct blah));
>>>> }
>>>>
>>>> That will also remove a stack variable....
>>> That won't work - statstruct_size is passed into xfs_bulkstat() so we
>>> don't
>>> know what 'blah' is. Maybe a macro would be easier.
>>>
>>> #define XFS_BULKSTAT_UBLEFT (ubleft >= statstruct_size)
>>
>> Yeah, something like that, but I don't like macros with no parameters
>> used
>> like that....
>>
>>>> FWIW - missing from this set of patches - cpu_relax() in the loops.
>>>> In the case
>>>> where no I/O is required to do the scan, we can hold the cpu for a
>>>> long time
>>>> and that will hold off I/O completion, etc for the cpu bulkstat is
>>>> running on.
>>>> Hence after every cluster we scan we should cpu_relax() to allow other
>>>> processes cpu time on that cpu.
>>>>
>>> I don't get how cpu_relax() works. I see that it is called at times
>>> with a
>>> spinlock held so it wont trigger a context switch. Does it give
>>> interrupts a chance to run?
>>
>> Sorry, my mistake - confused cpu_relax() with cond_resched(). take the
>> above
>> paragraph and s/cpu_relax/cond_resched/g
>>
>>> It appears to be used where a minor delay is needed - I don't think
>>> we have any
>>> cases in xfs_bulkstat() where we need to wait for an event that isn't
>>> I/O.
>>
>> The issue is when we're hitting cached buffers and we never end up
>> waiting
>> for I/O - we will then monopolise the cpu we are running on and hold off
>> all other processing. It's antisocial and leads to high latencies for
>> other
>> code.
>>
>> Cheers,
>>
>> Dave.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-16 4:34 ` Lachlan McIlroy
2007-11-16 4:42 ` Lachlan McIlroy
@ 2007-11-19 3:02 ` David Chinner
2007-11-21 15:17 ` Christoph Hellwig
2 siblings, 0 replies; 10+ messages in thread
From: David Chinner @ 2007-11-19 3:02 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss
On Fri, Nov 16, 2007 at 03:34:40PM +1100, Lachlan McIlroy wrote:
> Updated patch - I added cond_resched() calls into each loop - for loops that
> have a 'continue' somewhere in them I added the cond_resched() at the start,
> otherwise I put it at the end.
You probably don't need the call in the innermost loop (the walking across
the inode cluster).
> >>>Userspace visile change. What applications do we have that rely on this
> >>>behaviour that will be broken by this change?
> >>Any apps that rely on the existing behaviour are probably broken. If an
> >>app
> >>wants to call xfs_bulkstat_single() it should use
> >>XFS_IOC_FSBULKSTAT_SINGLE.
> >
> >Perhaps, but we can't arbitrarily decide that those apps will now break on
> >a new kernel with this change. At minimum we need to audit all of the code
> >we have that uses bulkstat for such breakage (including DMF!) before we
> >make a
> >change like this.
>
> I've looked through everything we have in xfs-cmds and nothing relies on
> this bug being present. Vlad helped me with the DMF side - DMF does not
> use the XFS_IOC_FSBULKSTAT ioctl, it has it's own interface into the kernel
> which calls xfs_bulkstat() directly so it wont be affected by this change.
Sounds like it really is a bug as nothing is trying to exploit that
behaviour. Ok, seems fair to fix it.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-16 4:34 ` Lachlan McIlroy
2007-11-16 4:42 ` Lachlan McIlroy
2007-11-19 3:02 ` David Chinner
@ 2007-11-21 15:17 ` Christoph Hellwig
2007-11-21 21:31 ` David Chinner
2 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2007-11-21 15:17 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: David Chinner, xfs-dev, xfs-oss
+#define XFS_BULKSTAT_UBLEFT(ubleft) ((ubleft) >= statstruct_size)
+
I don't think this macro is really helpful. An inline would
have been useful if statstruct_size was constant, but this
way it's much better to just write out the comparism the four
times it's used.
+ if (!ubcountp || *ubcountp <= 0) {
+ return EINVAL;
+ }
No need for the braces here.
I also must say I don't like the cond_resched() calls very much. They
look entirely random to me. We really should only need cond_resched
when it's absolutely needed, and it deserves a comment why it's needed
then.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bulkstat fixups
2007-11-21 15:17 ` Christoph Hellwig
@ 2007-11-21 21:31 ` David Chinner
0 siblings, 0 replies; 10+ messages in thread
From: David Chinner @ 2007-11-21 21:31 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Lachlan McIlroy, David Chinner, xfs-dev, xfs-oss
On Wed, Nov 21, 2007 at 03:17:47PM +0000, Christoph Hellwig wrote:
> +#define XFS_BULKSTAT_UBLEFT(ubleft) ((ubleft) >= statstruct_size)
> +
>
> I don't think this macro is really helpful. An inline would
> have been useful if statstruct_size was constant, but this
> way it's much better to just write out the comparism the four
> times it's used.
>
> + if (!ubcountp || *ubcountp <= 0) {
> + return EINVAL;
> + }
>
> No need for the braces here.
>
>
> I also must say I don't like the cond_resched() calls very much. They
> look entirely random to me. We really should only need cond_resched
> when it's absolutely needed, and it deserves a comment why it's needed
> then.
I think I mentioned that we don't need them in the innermost loop.
The reason for adding them is that if the inode clusters are in cache,
bulkstat will not yield the cpu at all and so holds off other
things from operating on that CPU. And when bulkstat has got itself
stuck in a loop, if it's running on the same CPU as I/O completion
events are running on (i.e. disk interrupts delivered to) it basically
hangs the filesystem. If that CPU is taking interrupts for your root
filesystem....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2007-11-21 21:31 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-11-09 5:24 [PATCH] bulkstat fixups Lachlan McIlroy
2007-11-09 5:35 ` Vlad Apostolov
2007-11-11 21:48 ` David Chinner
2007-11-12 2:57 ` Lachlan McIlroy
2007-11-12 4:11 ` David Chinner
2007-11-16 4:34 ` Lachlan McIlroy
2007-11-16 4:42 ` Lachlan McIlroy
2007-11-19 3:02 ` David Chinner
2007-11-21 15:17 ` Christoph Hellwig
2007-11-21 21:31 ` David Chinner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox