* XFS slow when project quota exceeded (again)
@ 2013-02-14 13:14 Jan Kara
  2013-02-15  7:16 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread

From: Jan Kara @ 2013-02-14 13:14 UTC (permalink / raw)
To: xfs; +Cc: dchinner

  Hi,

  this is a follow-up on a discussion started here:
http://www.spinics.net/lists/xfs/msg14999.html

  To quickly sum up the issue: when project quota gets exceeded, XFS ends
up flushing inodes using sync_inodes_sb(). I've tested (in 3.8-rc4) that
if one writes 200 MB to a directory with a 100 MB project quota like:

	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	for (i = 0; i < 50000; i++)
		pwrite(fd, buf, 4096, i * 4096);

it takes about 3 s to finish, which is OK. But when there are lots of
inodes cached (I've tried with 10000 inodes cached on the fs), the same
test program runs for ~140 s. This is because sync_inodes_sb() iterates
over all inodes in the superblock and waits for IO, and this iteration
eats CPU cycles.

  One can argue that sync_inodes_sb() should be optimized, but that isn't
easy because it is a data integrity operation: we have to be sure that
even IO which was submitted before sync_inodes_sb() was called has
finished by the time it returns. So another point of view is that it is
overkill to call a data integrity operation in a place which just needs
to make sure IO is submitted so that delalloc uncertainty is reduced.

  The comment before xfs_flush_inodes() says that sync_inodes_sb() is
used to throttle multiple callers to the rate at which IO is completing.
I wonder - why is that needed? Delalloc blocks are allocated already at
IO submission time, so it should be enough to wait for IO submission to
happen - and writeback_inodes_sb() does that. The changelog of the
commit changing xfs_flush_inodes() to use sync_inodes_sb() claims that
premature ENOSPC can happen if writeback_inodes_sb() is used. But I
wonder why that is the case when writeback_inodes_sb() waits for IO
submission to happen, and thus for extent conversion? Isn't
sync_inodes_sb() just papering over some other problem?

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
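[Editor's note: the write loop quoted above can be expanded into a minimal
self-contained reproducer. This is a sketch - the function name, buffer
contents, and error handling are invented here, and the slowdown itself only
manifests when the target file lives in an XFS directory whose project quota
is already exhausted; on an ordinary filesystem this simply writes the file.]

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write 'nblocks' 4 KiB blocks to 'path' with pwrite(), mirroring the test
 * program in the report. Returns total bytes written, or -1 if the file
 * cannot be opened. Note that, like the original test, it keeps issuing
 * writes even after one fails (e.g. with EDQUOT once the project quota is
 * hit) - it is exactly those failing calls that trigger the repeated
 * inode-cache flushes being discussed. */
static long write_blocks(const char *path, long nblocks)
{
	char buf[4096];
	long i, total = 0;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < nblocks; i++) {
		ssize_t n = pwrite(fd, buf, sizeof(buf), i * 4096);
		if (n > 0)
			total += n;
	}
	close(fd);
	return total;
}
```

With nblocks = 50000 (~200 MB) against a 100 MB project quota, roughly half
the pwrite() calls fail with EDQUOT, matching Dave's "~25,000 ENOSPC flushes"
estimate in the reply below.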
* Re: XFS slow when project quota exceeded (again)
  2013-02-14 13:14 XFS slow when project quota exceeded (again) Jan Kara
@ 2013-02-15  7:16 ` Dave Chinner
  2013-02-15 10:38   ` Jan Kara
  0 siblings, 1 reply; 4+ messages in thread

From: Dave Chinner @ 2013-02-15 7:16 UTC (permalink / raw)
To: Jan Kara; +Cc: dchinner, xfs

On Thu, Feb 14, 2013 at 02:14:52PM +0100, Jan Kara wrote:
>   Hi,
>
>   this is a follow-up on a discussion started here:
> http://www.spinics.net/lists/xfs/msg14999.html
>
>   To quickly sum up the issue: when project quota gets exceeded, XFS ends
> up flushing inodes using sync_inodes_sb(). I've tested (in 3.8-rc4) that
> if one writes 200 MB to a directory with a 100 MB project quota like:
> 	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
> 	for (i = 0; i < 50000; i++)
> 		pwrite(fd, buf, 4096, i * 4096);
> it takes about 3 s to finish, which is OK. But when there are lots of
> inodes cached (I've tried with 10000 inodes cached on the fs), the same
> test program runs for ~140 s.

So, you're testing the overhead of ~25,000 ENOSPC flushes. I could
brush this off and say "stupid application" but I won't....

> This is because sync_inodes_sb() iterates over all inodes in the
> superblock and waits for IO, and this iteration eats CPU cycles.

Yup, exactly what I said here:

http://www.spinics.net/lists/xfs/msg15198.html

Iterating inodes takes a lot of CPU.

I think the difference between the old method and the current one is
that we only do one inode cache iteration per write(), not one per
get_blocks() call. Hence we've removed the per-page overhead of
flushing, and now we just have the inode cache iteration overhead.

The fix to that problem is mentioned here:

http://www.spinics.net/lists/xfs/msg15186.html

Which is to:

	a) throttle speculative allocation as EDQUOT approaches; and
	b) efficiently track speculative preallocation for all the
	   inodes in the given project, and write back and trim those
	   inodes on ENOSPC.

Both of those are still a work in progress. I was hoping that we'd
have a) in 3.9, but that doesn't seem likely now that the merge window
is just about upon us....

Essentially, the fix for your problem is b). It will turn a whole
superblock inode iteration into an "inodes with speculative
preallocation" iteration that is project quota aware. It will have
lower per-flush overhead than a superblock scan, but it will still
have measurable overhead.

> One can argue that sync_inodes_sb() should be optimized, but that isn't
> easy because it is a data integrity operation: we have to be sure that
> even IO which was submitted before sync_inodes_sb() was called has
> finished by the time it returns.

Right, but the IO wait is exactly why I chose it....

> So another point of view is that it is overkill to call a data integrity
> operation in a place which just needs to make sure IO is submitted so
> that delalloc uncertainty is reduced.

... and not because of anything to do with how we handle delalloc
metadata reservations.

> The comment before xfs_flush_inodes() says that sync_inodes_sb() is
> used to throttle multiple callers to the rate at which IO is completing.
> I wonder - why is that needed?

It's trying to prevent the filesystem from falling into worst case IO
patterns at ENOSPC when there are hundreds of threads banging on the
filesystem and essentially locking up the system. i.e. we have to
throttle the IO submission rate from ENOSPC flushing artificially -
we really only need one thread doing the submission work, so we need
to throttle all concurrent callers while we are doing that work.
sync_inodes_sb() does that. And then we need to prevent individual
callers from trying to allocate space too frequently, which we do by
waiting for IO that was submitted in the flush. sync_inodes_sb()
does that, too.

> Delalloc blocks are allocated already at IO submission time, so it
> should be enough to wait for IO submission to happen - and
> writeback_inodes_sb() does that. The changelog of the commit changing
> xfs_flush_inodes() to use sync_inodes_sb() claims that premature ENOSPC
> can happen if writeback_inodes_sb() is used. But I wonder why that is
> the case when writeback_inodes_sb() waits for IO submission to happen,
> and thus for extent conversion? Isn't sync_inodes_sb() just papering
> over some other problem?

It was never intended to create or solve EDQUOT issues related to
project quota. What you are seeing is an unfortunate side effect of
fixing other, more serious architectural problems.

We know it's a problem and we have a fix in the works. Just be
patient....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
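[Editor's note: the throttling behaviour Dave describes - one thread doing
the flush work while every concurrent ENOSPC caller blocks until that flush
completes - can be sketched in userspace. This is a hypothetical analogy
with invented names, not XFS code; in the kernel, sync_inodes_sb() gets this
serialisation from the writeback machinery rather than an explicit mutex.]

```c
#include <pthread.h>

/* One flusher at a time: the first caller through the mutex performs the
 * expensive flush; callers that arrive while it runs block on the mutex
 * and, once inside, observe a newer generation and return without
 * repeating the work. This both serialises submission to a single thread
 * and paces each caller to the rate at which flushes complete. */
static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long flush_gen;     /* completed-flush generation counter */
static unsigned long flushes_done;  /* how many real flushes actually ran */

static void do_expensive_flush(void)
{
	flushes_done++;             /* stand-in for submit-and-wait IO */
}

/* Called on ENOSPC/EDQUOT; loosely analogous to xfs_flush_inodes()
 * funnelling all callers through sync_inodes_sb(). */
static void flush_for_enospc(void)
{
	unsigned long gen = flush_gen;

	pthread_mutex_lock(&flush_lock);
	if (gen == flush_gen) {     /* nobody flushed while we waited */
		do_expensive_flush();
		flush_gen++;
	}
	pthread_mutex_unlock(&flush_lock);
}
```

The generation check is what coalesces a burst of concurrent callers into a
single flush: a thread that blocked behind an in-progress flush sees the
bumped counter and skips straight to retrying its allocation.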
* Re: XFS slow when project quota exceeded (again)
  2013-02-15  7:16 ` Dave Chinner
@ 2013-02-15 10:38   ` Jan Kara
  2013-02-15 13:13     ` Brian Foster
  0 siblings, 1 reply; 4+ messages in thread

From: Jan Kara @ 2013-02-15 10:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: dchinner, Jan Kara, xfs

On Fri 15-02-13 18:16:07, Dave Chinner wrote:
> On Thu, Feb 14, 2013 at 02:14:52PM +0100, Jan Kara wrote:
> >   Hi,
> >
> >   this is a follow-up on a discussion started here:
> > http://www.spinics.net/lists/xfs/msg14999.html
> >
> >   To quickly sum up the issue: when project quota gets exceeded, XFS ends
> > up flushing inodes using sync_inodes_sb(). I've tested (in 3.8-rc4) that
> > if one writes 200 MB to a directory with a 100 MB project quota like:
> > 	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
> > 	for (i = 0; i < 50000; i++)
> > 		pwrite(fd, buf, 4096, i * 4096);
> > it takes about 3 s to finish, which is OK. But when there are lots of
> > inodes cached (I've tried with 10000 inodes cached on the fs), the same
> > test program runs for ~140 s.
>
> So, you're testing the overhead of ~25,000 ENOSPC flushes. I could
> brush this off and say "stupid application" but I won't....

  Yes, stupid... NFS. This is what happens when an NFS client writes to a
directory over project quota. So as much as I agree the workload is
braindamaged, we don't have the option of fixing the client, and we didn't
find a reasonable fix on the NFS server side either. That's why we ended
up with XFS changes.

> > This is because sync_inodes_sb() iterates over all inodes in the
> > superblock and waits for IO, and this iteration eats CPU cycles.
>
> Yup, exactly what I said here:
>
> http://www.spinics.net/lists/xfs/msg15198.html
>
> Iterating inodes takes a lot of CPU.
>
> I think the difference between the old method and the current one is
> that we only do one inode cache iteration per write(), not one per
> get_blocks() call. Hence we've removed the per-page overhead of
> flushing, and now we just have the inode cache iteration overhead.
>
> The fix to that problem is mentioned here:
>
> http://www.spinics.net/lists/xfs/msg15186.html
>
> Which is to:
>
> 	a) throttle speculative allocation as EDQUOT approaches; and
> 	b) efficiently track speculative preallocation for all the
> 	   inodes in the given project, and write back and trim those
> 	   inodes on ENOSPC.
>
> Both of those are still a work in progress. I was hoping that we'd
> have a) in 3.9, but that doesn't seem likely now that the merge window
> is just about upon us....

  Yeah, I know someone is working on a better solution. I was mostly
wondering why writeback_inodes_sb() isn't enough - i.e. why you really
need to wait for IO completion. And you explained that below. Thanks!

> It's trying to prevent the filesystem from falling into worst case IO
> patterns at ENOSPC when there are hundreds of threads banging on the
> filesystem and essentially locking up the system. i.e. we have to
> throttle the IO submission rate from ENOSPC flushing artificially -
> we really only need one thread doing the submission work, so we need
> to throttle all concurrent callers while we are doing that work.
> sync_inodes_sb() does that. And then we need to prevent individual
> callers from trying to allocate space too frequently, which we do by
> waiting for IO that was submitted in the flush. sync_inodes_sb()
> does that, too.

  I see, so the waiting for IO isn't really a correctness thing, but just
a way to slow down processes bringing the fs out of space so that they
don't lock up the system. Thanks for the explanation!

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
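[Editor's note: the distinction Jan settles here - waiting for IO
*submission* versus waiting for IO *completion* - has a loose userspace
analogue in the two flag combinations of Linux's sync_file_range(2). This
is an illustrative sketch with invented helper names, not XFS code, and the
mapping is only approximate: in XFS it is submission that resolves delalloc,
and the extra completion wait is what paces repeated ENOSPC flushers.]

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Start writeback for the file's dirty pages and return once the IO is
 * submitted, without waiting for it to finish - roughly the guarantee
 * writeback_inodes_sb() gives, which is already enough for delalloc
 * blocks to be allocated. */
static int flush_submit_only(int fd)
{
	return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}

/* Start writeback and also wait for the submitted IO to complete -
 * roughly the extra wait sync_inodes_sb() adds, which throttles each
 * caller to the rate at which IO actually finishes. */
static int flush_and_wait(int fd)
{
	return sync_file_range(fd, 0, 0,
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER);
}
```

The second variant is strictly slower per call, and that slowness is the
point: it is the throttle, not a correctness requirement, exactly as Dave
explained above.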
* Re: XFS slow when project quota exceeded (again)
  2013-02-15 10:38   ` Jan Kara
@ 2013-02-15 13:13     ` Brian Foster
  0 siblings, 0 replies; 4+ messages in thread

From: Brian Foster @ 2013-02-15 13:13 UTC (permalink / raw)
To: Jan Kara; +Cc: xfs, dchinner

On 02/15/2013 05:38 AM, Jan Kara wrote:
> On Fri 15-02-13 18:16:07, Dave Chinner wrote:
>> On Thu, Feb 14, 2013 at 02:14:52PM +0100, Jan Kara wrote:
...
>> The fix to that problem is mentioned here:
>>
>> http://www.spinics.net/lists/xfs/msg15186.html
>>
>> Which is to:
>>
>> 	a) throttle speculative allocation as EDQUOT approaches; and
>> 	b) efficiently track speculative preallocation for all the
>> 	   inodes in the given project, and write back and trim those
>> 	   inodes on ENOSPC.
>>
>> Both of those are still a work in progress. I was hoping that we'd
>> have a) in 3.9, but that doesn't seem likely now that the merge window
>> is just about upon us....
>
>   Yeah, I know someone is working on a better solution. I was mostly
> wondering why writeback_inodes_sb() isn't enough - i.e. why you really
> need to wait for IO completion. And you explained that below. Thanks!

Sorry, I think the work for a) is pretty close [1]. We've had a bug fix
or two in this area and the speculative prealloc algorithm update to
handle sparse files since v2, so I have to rebase against those changes
and continue some sanity testing. I'll try to have something available
over the next week or so.

The last bits required for b) (i.e., the pquota ENOSPC trigger) are a
bit further out, but FWIW I have an RFC posted [2] to get some early
design feedback on a solution.

Brian

[1] - v2 - http://oss.sgi.com/archives/xfs/2013-01/msg00006.html
[2] - http://oss.sgi.com/archives/xfs/2012-12/msg00112.html

>> It's trying to prevent the filesystem from falling into worst case IO
>> patterns at ENOSPC when there are hundreds of threads banging on the
>> filesystem and essentially locking up the system. i.e. we have to
>> throttle the IO submission rate from ENOSPC flushing artificially -
>> we really only need one thread doing the submission work, so we need
>> to throttle all concurrent callers while we are doing that work.
>> sync_inodes_sb() does that. And then we need to prevent individual
>> callers from trying to allocate space too frequently, which we do by
>> waiting for IO that was submitted in the flush. sync_inodes_sb()
>> does that, too.
>
>   I see, so the waiting for IO isn't really a correctness thing, but just
> a way to slow down processes bringing the fs out of space so that they
> don't lock up the system. Thanks for the explanation!
>
> 								Honza