* [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation @ 2013-07-26 23:28 Paul E. McKenney 2013-07-27 0:29 ` Linus Torvalds 2013-07-27 2:57 ` Dave Chinner 0 siblings, 2 replies; 9+ messages in thread From: Paul E. McKenney @ 2013-07-26 23:28 UTC (permalink / raw) To: davej; +Cc: linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe, linux-fsdevel Dave Jones reported RCU stalls, overly long hrtimer interrupts, and amazingly long NMI handlers from a trinity-induced workload involving lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369). There are any number of things that one might do to make sync() behave better under high levels of contention, but it is also the case that multiple concurrent sync() system calls can be satisfied by a single sys_sync() invocation. Given that this situation is reminiscent of rcu_barrier(), this commit applies the rcu_barrier() approach to sys_sync(). This approach uses a global mutex and a sequence counter. The mutex is held across the sync() operation, which eliminates contention between concurrent sync() operations. The counter is incremented at the beginning and end of each sync() operation, so that it is odd while a sync() operation is in progress and even otherwise, just like sequence locks. The code that used to be in sys_sync() is now in do_sync(), and sys_sync() now handles the concurrency. The sys_sync() function first takes a snapshot of the counter, then acquires the mutex, and then takes another snapshot of the counter. If the values of the two snapshots indicate that a full do_sync() executed during the mutex acquisition, the sys_sync() function releases the mutex and returns ("Our work is done!"). Otherwise, sys_sync() increments the counter, invokes do_sync(), and increments the counter again. This approach allows a single call to do_sync() to satisfy an arbitrarily large number of sync() system calls, which should eliminate issues due to large numbers of concurrent invocations of the sync() system call. Changes since v1 (https://lkml.org/lkml/2013/7/24/683): o Add a pair of memory barriers to keep the increments from bleeding into the do_sync() code. (The failure probability is insanely low, but when you have several hundred million devices running Linux, you can expect several hundred instances of one-in-a-million failures.) o Actually CC some people who have experience in this area. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Curt Wohlgemuth <curtw@google.com> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: linux-fsdevel@vger.kernel.org b/fs/sync.c | 45 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 44 insertions(+), 1 deletion(-) diff --git a/fs/sync.c b/fs/sync.c index 905f3f6..6e851db 100644 --- a/fs/sync.c +++ b/fs/sync.c @@ -99,7 +99,7 @@ static void fdatawait_one_bdev(struct block_device *bdev, void *arg) * just write metadata (such as inodes or bitmaps) to block device page cache * and do not sync it on their own in ->sync_fs(). */ -SYSCALL_DEFINE0(sync) +static void do_sync(void) { int nowait = 0, wait = 1; @@ -111,6 +111,49 @@ SYSCALL_DEFINE0(sync) iterate_bdevs(fdatawait_one_bdev, NULL); if (unlikely(laptop_mode)) laptop_sync_completion(); + return; +} + +static DEFINE_MUTEX(sync_mutex); /* One do_sync() at a time. */ +static unsigned long sync_seq; /* Many sync()s from one do_sync(). */ + /* Overflow harmless, extra wait. */ + +/* + * Only allow one task to do sync() at a time, and further allow + * concurrent sync() calls to be satisfied by a single do_sync() + * invocation. + */ +SYSCALL_DEFINE0(sync) +{ + unsigned long snap; + unsigned long snap_done; + + snap = ACCESS_ONCE(sync_seq); + smp_mb(); /* Prevent above from bleeding into critical section. */ + mutex_lock(&sync_mutex); + snap_done = ACCESS_ONCE(sync_seq); + if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) { + /* + * A full do_sync() executed between our two fetches from + * sync_seq, so our work is done! + */ + smp_mb(); /* Order test with caller's subsequent code. */ + mutex_unlock(&sync_mutex); + return 0; + } + + /* Record the start of do_sync(). */ + ACCESS_ONCE(sync_seq)++; + WARN_ON_ONCE((sync_seq & 0x1) != 1); + smp_mb(); /* Keep prior increment out of do_sync(). */ + + do_sync(); + + /* Record the end of do_sync(). */ + smp_mb(); /* Keep subsequent increment out of do_sync(). */ + ACCESS_ONCE(sync_seq)++; + WARN_ON_ONCE((sync_seq & 0x1) != 0); + mutex_unlock(&sync_mutex); return 0; } ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation 2013-07-26 23:28 [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation Paul E. McKenney @ 2013-07-27 0:29 ` Linus Torvalds 2013-07-27 1:23 ` Paul E. McKenney 2013-07-27 2:57 ` Dave Chinner 1 sibling, 1 reply; 9+ messages in thread From: Linus Torvalds @ 2013-07-27 0:29 UTC (permalink / raw) To: Paul McKenney Cc: Dave Jones, Linux Kernel Mailing List, Al Viro, Christoph Hellwig, Jan Kara, curtw, Jens Axboe, linux-fsdevel On Fri, Jul 26, 2013 at 4:28 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > + > + snap = ACCESS_ONCE(sync_seq); > + smp_mb(); /* Prevent above from bleeding into critical section. */ > + mutex_lock(&sync_mutex); > + snap_done = ACCESS_ONCE(sync_seq); > + if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) { Ugh. I dislike this RCU'ism. It's bad code. It doesn't just look ugly and complex, it's also not even clever. It is possible that the compiler can fix up this horrible stuff and turn it into the nice clever stuff, but I dunno. The two things that make me go "Eww": - "((snap + 1) & ~0x1) + 2" just isn't the smart way of doing things. Afaik, "(snap+3)&~1" gives the same answer with a simpler arithmetic. - that ULONG_CMP_GE() macro is disgusting. What's wrong with doing it the sane way, which is how (for example) the time comparison functions do it (see time_before() and friends): Just do it ((long)(a-b) >= 0) which doesn't need large constants. And yeah, a smart compiler will hopefully do one or both of those, but what annoys me about the source code is that it actually isn't even any more readable despite being more complicated and needing more compiler tricks for good code generation. So that one line is (a) totally undocumented, (b) not obvious and (c) not very clever. I'm also not a huge believer in those two WARN_ON_ONCE's you have. The sequence count is *only* updated in this place, it is *only* updated inside a lock, and dammit, if those tests ever trigger, we have bigger problems than that piece of code. Those warnings may make sense in code when you write it the first time (because you're thinking things through), but they do *not* make sense at the point where that code is actually committed to the project. I notice that you have those warnings in the RCU code itself, and I don't really think they make sense there either. Finally, the ACCESS_ONCE() is also only correct in the one place where you do the access speculatively outside the lock. Inside the lock, there is no excuse/reason for them, since the value is stable, and you need the memory barriers anyway, so there's no way the compiler could migrate things regardless. So the other two ACCESS_ONCE calls are actually misleading and wrong, and only likely to make the compiler generate much worse code. In fact, the ACCESS_ONCE() is pretty much *guaranteed* to cause the compiler to unnecessarily generate worse code, since there is absolutely no reason why the compiler couldn't reuse the "snap_done" value it reads when it then does the "sync_seq++". There's no way the value could possible have changed from the "snap_done" value earlier, since we're inside the lock, so why force the compiler to reload it? In short, I think the code does too much. I'm sure it works, but I think it might make people believe that the extra work (like those later ACCESS_ONCE ones) is meaningful, when it isn't. It's just make-believe, afaik. But maybe I'm missing something, and there actually *is* reason for the extra work/complexity? Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation 2013-07-27 0:29 ` Linus Torvalds @ 2013-07-27 1:23 ` Paul E. McKenney 0 siblings, 0 replies; 9+ messages in thread From: Paul E. McKenney @ 2013-07-27 1:23 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Linux Kernel Mailing List, Al Viro, Christoph Hellwig, Jan Kara, curtw, Jens Axboe, linux-fsdevel On Fri, Jul 26, 2013 at 05:29:44PM -0700, Linus Torvalds wrote: > On Fri, Jul 26, 2013 at 4:28 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > + > > + snap = ACCESS_ONCE(sync_seq); > > + smp_mb(); /* Prevent above from bleeding into critical section. */ > > + mutex_lock(&sync_mutex); > > + snap_done = ACCESS_ONCE(sync_seq); > > + if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) { > > Ugh. I dislike this RCU'ism. It's bad code. It doesn't just look ugly > and complex, it's also not even clever. > > It is possible that the compiler can fix up this horrible stuff and > turn it into the nice clever stuff, but I dunno. > > The two things that make me go "Eww": > > - "((snap + 1) & ~0x1) + 2" just isn't the smart way of doing things. > Afaik, "(snap+3)&~1" gives the same answer with a simpler arithmetic. You are right, this is a better approach, and I have changed this patch to use it. I will also apply it to the similar code in RCU. > - that ULONG_CMP_GE() macro is disgusting. What's wrong with doing it > the sane way, which is how (for example) the time comparison functions > do it (see time_before() and friends): Just do it > > ((long)(a-b) >= 0) > > which doesn't need large constants. True, and I used to use this approach, but it can result in signed integer overflow, which is undefined in C. (Yes, we use -fno-strict-overflow, but there might come a day when we don't want to.) And ULONG_CMP_GE() generated exactly the same code as ((long)(a-b) >= 0) last I tried it. > And yeah, a smart compiler will hopefully do one or both of those, but > what annoys me about the source code is that it actually isn't even > any more readable despite being more complicated and needing more > compiler tricks for good code generation. > > So that one line is (a) totally undocumented, (b) not obvious and (c) > not very clever. For (a), how about I add the following comment? /* * If the value in snap is odd, we need to wait for the current * do_sync() to complete, then wait for the next one, in other * words, we need the value of snap_done to be three larger than * the value of snap. On the other hand, if the value in snap is * even, we only have to wait for the next request to complete, * in other words, we need the value of snap_done to be only two * greater than the value of snap. The "(snap + 3) & 0x1" computes * this for us. */ Hopefully, this helps with (b). For (c), I now use the expression you suggested above. Does that help, or is more needed. > I'm also not a huge believer in those two WARN_ON_ONCE's you have. The > sequence count is *only* updated in this place, it is *only* updated > inside a lock, and dammit, if those tests ever trigger, we have bigger > problems than that piece of code. Those warnings may make sense in > code when you write it the first time (because you're thinking things > through), but they do *not* make sense at the point where that code is > actually committed to the project. I notice that you have those > warnings in the RCU code itself, and I don't really think they make > sense there either. I agree that the fact that this variable is updated only in this one place in sys_sync() makes these warning less than useful in production. I will therefore remove them once this patch get beyond RFC status. However, the similar warnings in RCU have been very helpful in spotting bugs in the callers of rcu_idle_enter() and friends, so I would very much prefer to keep them. > Finally, the ACCESS_ONCE() is also only correct in the one place where > you do the access speculatively outside the lock. Inside the lock, > there is no excuse/reason for them, since the value is stable, and you > need the memory barriers anyway, so there's no way the compiler could > migrate things regardless. So the other two ACCESS_ONCE calls are > actually misleading and wrong, and only likely to make the compiler > generate much worse code. > > In fact, the ACCESS_ONCE() is pretty much *guaranteed* to cause the > compiler to unnecessarily generate worse code, since there is > absolutely no reason why the compiler couldn't reuse the "snap_done" > value it reads when it then does the "sync_seq++". There's no way the > value could possible have changed from the "snap_done" value earlier, > since we're inside the lock, so why force the compiler to reload it? Good point, I have removed the second ACCESS_ONCE(). > In short, I think the code does too much. I'm sure it works, but I > think it might make people believe that the extra work (like those > later ACCESS_ONCE ones) is meaningful, when it isn't. It's just > make-believe, afaik. > > But maybe I'm missing something, and there actually *is* reason for > the extra work/complexity? Only for the ULONG_CMP_GE(). On the rest, you are quite right, and the updated patch is as follows. But I will believe it works only if it helps Dave Jones's tests. ;-) Thanx, Paul ------------------------------------------------------------------------ Dave Jones reported RCU stalls, overly long hrtimer interrupts, and amazingly long NMI handlers from a trinity-induced workload involving lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369). There are any number of things that one might do to make sync() behave better under high levels of contention, but it is also the case that multiple concurrent sync() system calls can be satisfied by a single sys_sync() invocation. Given that this situation is reminiscent of rcu_barrier(), this commit applies the rcu_barrier() approach to sys_sync(). This approach uses a global mutex and a sequence counter. The mutex is held across the sync() operation, which eliminates contention between concurrent sync() operations. The counter is incremented at the beginning and end of each sync() operation, so that it is odd while a sync() operation is in progress and even otherwise, just like sequence locks. The code that used to be in sys_sync() is now in do_sync(), and sys_sync() now handles the concurrency. The sys_sync() function first takes a snapshot of the counter, then acquires the mutex, and then takes another snapshot of the counter. If the values of the two snapshots indicate that a full do_sync() executed during the mutex acquisition, the sys_sync() function releases the mutex and returns ("Our work is done!"). Otherwise, sys_sync() increments the counter, invokes do_sync(), and increments the counter again. This approach allows a single call to do_sync() to satisfy an arbitrarily large number of sync() system calls, which should eliminate issues due to large numbers of concurrent invocations of the sync() system call. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Curt Wohlgemuth <curtw@google.com> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: linux-fsdevel@vger.kernel.org [ paulmck: Updated based on Linus Torvalds feedback. ] diff --git a/fs/sync.c b/fs/sync.c index 905f3f6..174ad9b 100644 --- a/fs/sync.c +++ b/fs/sync.c @@ -99,7 +99,7 @@ static void fdatawait_one_bdev(struct block_device *bdev, void *arg) * just write metadata (such as inodes or bitmaps) to block device page cache * and do not sync it on their own in ->sync_fs(). */ -SYSCALL_DEFINE0(sync) +static void do_sync(void) { int nowait = 0, wait = 1; @@ -111,6 +111,60 @@ SYSCALL_DEFINE0(sync) iterate_bdevs(fdatawait_one_bdev, NULL); if (unlikely(laptop_mode)) laptop_sync_completion(); + return; +} + +static DEFINE_MUTEX(sync_mutex); /* One do_sync() at a time. */ +static unsigned long sync_seq; /* Many sync()s from one do_sync(). */ + /* Overflow harmless, extra wait. */ + +/* + * Only allow one task to do sync() at a time, and further allow + * concurrent sync() calls to be satisfied by a single do_sync() + * invocation. + */ +SYSCALL_DEFINE0(sync) +{ + unsigned long snap; + unsigned long snap_done; + + snap = ACCESS_ONCE(sync_seq); + smp_mb(); /* Prevent above from bleeding into critical section. */ + mutex_lock(&sync_mutex); + snap_done = sync_seq; + + /* + * If the value in snap is odd, we need to wait for the current + * do_sync() to complete, then wait for the next one, in other + * words, we need the value of snap_done to be three larger than + * the value of snap. On the other hand, if the value in snap is + * even, we only have to wait for the next request to complete, + * in other words, we need the value of snap_done to be only two + * greater than the value of snap. The "(snap + 3) & 0x1" computes + * this for us (thank you, Linus!). + */ + if (ULONG_CMP_GE(snap_done, (snap + 3) & ~0x1)) { + /* + * A full do_sync() executed between our two fetches from + * sync_seq, so our work is done! + */ + smp_mb(); /* Order test with caller's subsequent code. */ + mutex_unlock(&sync_mutex); + return 0; + } + + /* Record the start of do_sync(). */ + ACCESS_ONCE(sync_seq)++; + WARN_ON_ONCE((sync_seq & 0x1) != 1); + smp_mb(); /* Keep prior increment out of do_sync(). */ + + do_sync(); + + /* Record the end of do_sync(). */ + smp_mb(); /* Keep subsequent increment out of do_sync(). */ + ACCESS_ONCE(sync_seq)++; + WARN_ON_ONCE((sync_seq & 0x1) != 0); + mutex_unlock(&sync_mutex); return 0; } ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation 2013-07-26 23:28 [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation Paul E. McKenney 2013-07-27 0:29 ` Linus Torvalds @ 2013-07-27 2:57 ` Dave Chinner 2013-07-27 4:05 ` Paul E. McKenney 1 sibling, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-07-27 2:57 UTC (permalink / raw) To: Paul E. McKenney Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe, linux-fsdevel On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote: > Dave Jones reported RCU stalls, overly long hrtimer interrupts, and > amazingly long NMI handlers from a trinity-induced workload involving > lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369). > There are any number of things that one might do to make sync() behave > better under high levels of contention, but it is also the case that > multiple concurrent sync() system calls can be satisfied by a single > sys_sync() invocation. > > Given that this situation is reminiscent of rcu_barrier(), this commit > applies the rcu_barrier() approach to sys_sync(). This approach uses > a global mutex and a sequence counter. The mutex is held across the > sync() operation, which eliminates contention between concurrent sync() > operations. > > The counter is incremented at the beginning and end of > each sync() operation, so that it is odd while a sync() operation is in > progress and even otherwise, just like sequence locks. > > The code that used to be in sys_sync() is now in do_sync(), and sys_sync() > now handles the concurrency. The sys_sync() function first takes a > snapshot of the counter, then acquires the mutex, and then takes another > snapshot of the counter. If the values of the two snapshots indicate that > a full do_sync() executed during the mutex acquisition, the sys_sync() > function releases the mutex and returns ("Our work is done!"). Otherwise, > sys_sync() increments the counter, invokes do_sync(), and increments > the counter again. > > This approach allows a single call to do_sync() to satisfy an arbitrarily > large number of sync() system calls, which should eliminate issues due > to large numbers of concurrent invocations of the sync() system call. This is not addressing the problem that is causing issues during sync. Indeed, it only puts a bandaid over the currently observed trigger. Indeed, i suspect that this will significantly slow down concurrent sync operations, as it serialised sync across all superblocks rather than serialising per-superblock like is currently done. Indeed, that per-superblock serialisation is where all the lock contention problems are. And it's not sync alone that causes the contention problems - it has to be combined with other concurrent workloads that add or remove inodes from the inode cache at tha same time. I have patches to address that by removing the source of the lock contention completely, and not just for the sys_sync trigger. Those patches make the problems with concurrent sys_sync operation go away completely for me, not to mention improve performance for 8+ thread metadata workloads on XFS significantly. IOWs, I don't see that concurrent sys_sync operation is a problem at all, and it is actively desirable for systems that have multiple busy filesystems as it allows concurrent dispatch of IO across those multiple filesystems. Serialising all sys_sync work might stop the contention problems, but it will also slow down concurrent sync operations on busy systems as it only allows one thread to dispatch and wait for IO at a time. So, let's not slap a bandaid over a symptom - let's address the cause of the lock contention properly.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation 2013-07-27 2:57 ` Dave Chinner @ 2013-07-27 4:05 ` Paul E. McKenney 2013-07-27 6:21 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Paul E. McKenney @ 2013-07-27 4:05 UTC (permalink / raw) To: Dave Chinner Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe, linux-fsdevel On Sat, Jul 27, 2013 at 12:57:03PM +1000, Dave Chinner wrote: > On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote: > > Dave Jones reported RCU stalls, overly long hrtimer interrupts, and > > amazingly long NMI handlers from a trinity-induced workload involving > > lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369). > > There are any number of things that one might do to make sync() behave > > better under high levels of contention, but it is also the case that > > multiple concurrent sync() system calls can be satisfied by a single > > sys_sync() invocation. > > > > Given that this situation is reminiscent of rcu_barrier(), this commit > > applies the rcu_barrier() approach to sys_sync(). This approach uses > > a global mutex and a sequence counter. The mutex is held across the > > sync() operation, which eliminates contention between concurrent sync() > > operations. > > > > The counter is incremented at the beginning and end of > > each sync() operation, so that it is odd while a sync() operation is in > > progress and even otherwise, just like sequence locks. > > > > The code that used to be in sys_sync() is now in do_sync(), and sys_sync() > > now handles the concurrency. The sys_sync() function first takes a > > snapshot of the counter, then acquires the mutex, and then takes another > > snapshot of the counter. If the values of the two snapshots indicate that > > a full do_sync() executed during the mutex acquisition, the sys_sync() > > function releases the mutex and returns ("Our work is done!"). Otherwise, > > sys_sync() increments the counter, invokes do_sync(), and increments > > the counter again. > > > > This approach allows a single call to do_sync() to satisfy an arbitrarily > > large number of sync() system calls, which should eliminate issues due > > to large numbers of concurrent invocations of the sync() system call. > > This is not addressing the problem that is causing issues during > sync. Indeed, it only puts a bandaid over the currently observed > trigger. > > Indeed, i suspect that this will significantly slow down concurrent > sync operations, as it serialised sync across all superblocks rather > than serialising per-superblock like is currently done. Indeed, that > per-superblock serialisation is where all the lock contention > problems are. And it's not sync alone that causes the contention > problems - it has to be combined with other concurrent workloads > that add or remove inodes from the inode cache at tha same time. Seems like something along the lines of wakeup_flusher_threads() currently at the start of sys_sync() would address this. > I have patches to address that by removing the source > of the lock contention completely, and not just for the sys_sync > trigger. Those patches make the problems with concurrent > sys_sync operation go away completely for me, not to mention improve > performance for 8+ thread metadata workloads on XFS significantly. > > IOWs, I don't see that concurrent sys_sync operation is a problem at > all, and it is actively desirable for systems that have multiple > busy filesystems as it allows concurrent dispatch of IO across those > multiple filesystems. Serialising all sys_sync work might stop the > contention problems, but it will also slow down concurrent sync > operations on busy systems as it only allows one thread to dispatch > and wait for IO at a time. > > So, let's not slap a bandaid over a symptom - let's address the > cause of the lock contention properly.... Hmmm... Could you please send your patches over to Dave Jones right now? I am getting quite tired of getting RCU CPU stall warning complaints from him that turn out to be due to highly contended sync() system calls. Thanx, Paul ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation 2013-07-27 4:05 ` Paul E. McKenney @ 2013-07-27 6:21 ` Dave Chinner 2013-07-27 11:26 ` Paul E. McKenney 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-07-27 6:21 UTC (permalink / raw) To: Paul E. McKenney Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe, linux-fsdevel On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote: > On Sat, Jul 27, 2013 at 12:57:03PM +1000, Dave Chinner wrote: > > On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote: > > > Dave Jones reported RCU stalls, overly long hrtimer interrupts, and > > > amazingly long NMI handlers from a trinity-induced workload involving > > > lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369). > > > There are any number of things that one might do to make sync() behave > > > better under high levels of contention, but it is also the case that > > > multiple concurrent sync() system calls can be satisfied by a single > > > sys_sync() invocation. > > > > > > Given that this situation is reminiscent of rcu_barrier(), this commit > > > applies the rcu_barrier() approach to sys_sync(). This approach uses > > > a global mutex and a sequence counter. The mutex is held across the > > > sync() operation, which eliminates contention between concurrent sync() > > > operations. > > > > > > The counter is incremented at the beginning and end of > > > each sync() operation, so that it is odd while a sync() operation is in > > > progress and even otherwise, just like sequence locks. > > > > > > The code that used to be in sys_sync() is now in do_sync(), and sys_sync() > > > now handles the concurrency. The sys_sync() function first takes a > > > snapshot of the counter, then acquires the mutex, and then takes another > > > snapshot of the counter. If the values of the two snapshots indicate that > > > a full do_sync() executed during the mutex acquisition, the sys_sync() > > > function releases the mutex and returns ("Our work is done!"). Otherwise, > > > sys_sync() increments the counter, invokes do_sync(), and increments > > > the counter again. > > > > > > This approach allows a single call to do_sync() to satisfy an arbitrarily > > > large number of sync() system calls, which should eliminate issues due > > > to large numbers of concurrent invocations of the sync() system call. > > > > This is not addressing the problem that is causing issues during > > sync. Indeed, it only puts a bandaid over the currently observed > > trigger. > > > > Indeed, i suspect that this will significantly slow down concurrent > > sync operations, as it serialised sync across all superblocks rather > > than serialising per-superblock like is currently done. Indeed, that > > per-superblock serialisation is where all the lock contention > > problems are. And it's not sync alone that causes the contention > > problems - it has to be combined with other concurrent workloads > > that add or remove inodes from the inode cache at tha same time. > > Seems like something along the lines of wakeup_flusher_threads() > currently at the start of sys_sync() would address this. That's the first thing sync() currently does, but it doesn't solve the problem because that only triggers WB_SYNC_NONE writeback. ie. it's not data integrity writeback, so if there's any contention it will skip pages and inodes. So we still need sync to dispatch IO. Further, it requires scheduling of the work before anything will be done, and so if you are CPU loaded then there's a good chance that the flusher thread doesn't get to a CPU until sync actually queues the sync work to it and then waits for completion on it... > > I have patches to address that by removing the source > > of the lock contention completely, and not just for the sys_sync > > trigger. Those patches make the problems with concurrent > > sys_sync operation go away completely for me, not to mention improve > > performance for 8+ thread metadata workloads on XFS significantly. > > > > IOWs, I don't see that concurrent sys_sync operation is a problem at > > all, and it is actively desirable for systems that have multiple > > busy filesystems as it allows concurrent dispatch of IO across those > > multiple filesystems. Serialising all sys_sync work might stop the > > contention problems, but it will also slow down concurrent sync > > operations on busy systems as it only allows one thread to dispatch > > and wait for IO at a time. > > > > So, let's not slap a bandaid over a symptom - let's address the > > cause of the lock contention properly.... > > Hmmm... > > Could you please send your patches over to Dave Jones right now? I am > getting quite tired of getting RCU CPU stall warning complaints from > him that turn out to be due to highly contended sync() system calls. Then ignore them until the code is ready - it'll be 3.12 before the fixes are merged, anyway, because the lock contention fix requires infrastructure that is currently in mmotm that is queued for 3.12 (i.e. the per-node list infrastructure) to fix a whole bunch of other, more critical VFS lock contention problems. Seeing as a new mmotm went out last week, I should have the patches ready for review early next week. FWIW, we (as in XFS filesystem testers) regularly run tests that have hundreds of concurrent sys_sync() calls running at the same time. e.g. xfstests::xfs/297 runs a 1000 fsstress processes while freezing and unfreezing the filesystem, and that usually shows hundreds of threads running sys_sync concurrently after a short amount of runtime. So it's pretty clear that what Dave is seeing is not necessarily representative of what happens when there ar lots of sys_sync() calls run concurrently. BTW, concurrent syncfs() calls are going to have exactly the same problem as concurrent sync() calls, as is any other operation that results in a walk of the per-superblock inodes list. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation 2013-07-27 6:21 ` Dave Chinner @ 2013-07-27 11:26 ` Paul E. McKenney 2013-07-29 7:06 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Paul E. McKenney @ 2013-07-27 11:26 UTC (permalink / raw) To: Dave Chinner Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe, linux-fsdevel On Sat, Jul 27, 2013 at 04:21:01PM +1000, Dave Chinner wrote: > On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote: > > On Sat, Jul 27, 2013 at 12:57:03PM +1000, Dave Chinner wrote: > > > On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote: > > > > Dave Jones reported RCU stalls, overly long hrtimer interrupts, and > > > > amazingly long NMI handlers from a trinity-induced workload involving > > > > lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369). > > > > There are any number of things that one might do to make sync() behave > > > > better under high levels of contention, but it is also the case that > > > > multiple concurrent sync() system calls can be satisfied by a single > > > > sys_sync() invocation. > > > > > > > > Given that this situation is reminiscent of rcu_barrier(), this commit > > > > applies the rcu_barrier() approach to sys_sync(). This approach uses > > > > a global mutex and a sequence counter. The mutex is held across the > > > > sync() operation, which eliminates contention between concurrent sync() > > > > operations. > > > > > > > > The counter is incremented at the beginning and end of > > > > each sync() operation, so that it is odd while a sync() operation is in > > > > progress and even otherwise, just like sequence locks. > > > > > > > > The code that used to be in sys_sync() is now in do_sync(), and sys_sync() > > > > now handles the concurrency. The sys_sync() function first takes a > > > > snapshot of the counter, then acquires the mutex, and then takes another > > > > snapshot of the counter. If the values of the two snapshots indicate that > > > > a full do_sync() executed during the mutex acquisition, the sys_sync() > > > > function releases the mutex and returns ("Our work is done!"). Otherwise, > > > > sys_sync() increments the counter, invokes do_sync(), and increments > > > > the counter again. > > > > > > > > This approach allows a single call to do_sync() to satisfy an arbitrarily > > > > large number of sync() system calls, which should eliminate issues due > > > > to large numbers of concurrent invocations of the sync() system call. > > > > > > This is not addressing the problem that is causing issues during > > > sync. Indeed, it only puts a bandaid over the currently observed > > > trigger. > > > > > > Indeed, i suspect that this will significantly slow down concurrent > > > sync operations, as it serialised sync across all superblocks rather > > > than serialising per-superblock like is currently done. Indeed, that > > > per-superblock serialisation is where all the lock contention > > > problems are. And it's not sync alone that causes the contention > > > problems - it has to be combined with other concurrent workloads > > > that add or remove inodes from the inode cache at tha same time. > > > > Seems like something along the lines of wakeup_flusher_threads() > > currently at the start of sys_sync() would address this. > > That's the first thing sync() currently does, but it doesn't solve > the problem because that only triggers WB_SYNC_NONE writeback. ie. > it's not data integrity writeback, so if there's any contention it > will skip pages and inodes. So we still need sync to dispatch IO. > > Further, it requires scheduling of the work before anything will be > done, and so if you are CPU loaded then there's a good chance that > the flusher thread doesn't get to a CPU until sync actually queues > the sync work to it and then waits for completion on it... I could address these fairly straightforwardly, but I will hold off for a bit to see what you come up with. > > > I have patches to address that by removing the source > > > of the lock contention completely, and not just for the sys_sync > > > trigger. Those patches make the problems with concurrent > > > sys_sync operation go away completely for me, not to mention improve > > > performance for 8+ thread metadata workloads on XFS significantly. > > > > > > IOWs, I don't see that concurrent sys_sync operation is a problem at > > > all, and it is actively desirable for systems that have multiple > > > busy filesystems as it allows concurrent dispatch of IO across those > > > multiple filesystems. Serialising all sys_sync work might stop the > > > contention problems, but it will also slow down concurrent sync > > > operations on busy systems as it only allows one thread to dispatch > > > and wait for IO at a time. > > > > > > So, let's not slap a bandaid over a symptom - let's address the > > > cause of the lock contention properly.... > > > > Hmmm... > > > > Could you please send your patches over to Dave Jones right now? I am > > getting quite tired of getting RCU CPU stall warning complaints from > > him that turn out to be due to highly contended sync() system calls. > > Then ignore them until the code is ready - it'll be 3.12 before the > fixes are merged, anyway, because the lock contention fix requires > infrastructure that is currently in mmotm that is queued for 3.12 > (i.e. the per-node list infrastructure) to fix a whole bunch of > other, more critical VFS lock contention problems. Seeing as a new > mmotm went out last week, I should have the patches ready for review > early next week. > > FWIW, we (as in XFS filesystem testers) regularly run tests that > have hundreds of concurrent sys_sync() calls running at the same > time. e.g. xfstests::xfs/297 runs a 1000 fsstress processes while > freezing and unfreezing the filesystem, and that usually shows > hundreds of threads running sys_sync concurrently after a short > amount of runtime. So it's pretty clear that what Dave is seeing > is not necessarily representative of what happens when there ar lots > of sys_sync() calls run concurrently. So Dave might be finding an additional problem. ;-) > BTW, concurrent syncfs() calls are going to have exactly the same > problem as concurrent sync() calls, as is any other operation that > results in a walk of the per-superblock inodes list. Yep! Your upcoming patch addresses these as well? Thanx, Paul > Cheers, > > Dave. > > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation 2013-07-27 11:26 ` Paul E. McKenney @ 2013-07-29 7:06 ` Dave Chinner 2013-07-30 17:43 ` Paul E. McKenney 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-07-29 7:06 UTC (permalink / raw) To: Paul E. McKenney Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe, linux-fsdevel On Sat, Jul 27, 2013 at 04:26:28AM -0700, Paul E. McKenney wrote: > On Sat, Jul 27, 2013 at 04:21:01PM +1000, Dave Chinner wrote: > > On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote: > > > Could you please send your patches over to Dave Jones right now? I am > > > getting quite tired of getting RCU CPU stall warning complaints from > > > him that turn out to be due to highly contended sync() system calls. > > > > Then ignore them until the code is ready - it'll be 3.12 before the > > fixes are merged, anyway, because the lock contention fix requires > > infrastructure that is currently in mmotm that is queued for 3.12 > > (i.e. the per-node list infrastructure) to fix a whole bunch of > > other, more critical VFS lock contention problems. Seeing as a new > > mmotm went out last week, I should have the patches ready for review > > early next week. > > > > FWIW, we (as in XFS filesystem testers) regularly run tests that > > have hundreds of concurrent sys_sync() calls running at the same > > time. e.g. xfstests::xfs/297 runs a 1000 fsstress processes while > > freezing and unfreezing the filesystem, and that usually shows > > hundreds of threads running sys_sync concurrently after a short > > amount of runtime. So it's pretty clear that what Dave is seeing > > is not necessarily representative of what happens when there ar lots > > of sys_sync() calls run concurrently. > > So Dave might be finding an additional problem. ;-) Dave will always find problems. If you want something broken, give it to Dave and he'll hand it back in pieces. :) > > BTW, concurrent syncfs() calls are going to have exactly the same > > problem as concurrent sync() calls, as is any other operation that > > results in a walk of the per-superblock inodes list. > > Yep! Your upcoming patch addresses these as well? Yes, it does. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation 2013-07-29 7:06 ` Dave Chinner @ 2013-07-30 17:43 ` Paul E. McKenney 0 siblings, 0 replies; 9+ messages in thread From: Paul E. McKenney @ 2013-07-30 17:43 UTC (permalink / raw) To: Dave Chinner Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe, linux-fsdevel On Mon, Jul 29, 2013 at 05:06:43PM +1000, Dave Chinner wrote: > On Sat, Jul 27, 2013 at 04:26:28AM -0700, Paul E. McKenney wrote: > > On Sat, Jul 27, 2013 at 04:21:01PM +1000, Dave Chinner wrote: > > > On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote: > > > > Could you please send your patches over to Dave Jones right now? I am > > > > getting quite tired of getting RCU CPU stall warning complaints from > > > > him that turn out to be due to highly contended sync() system calls. > > > > > > Then ignore them until the code is ready - it'll be 3.12 before the > > > fixes are merged, anyway, because the lock contention fix requires > > > infrastructure that is currently in mmotm that is queued for 3.12 > > > (i.e. the per-node list infrastructure) to fix a whole bunch of > > > other, more critical VFS lock contention problems. Seeing as a new > > > mmotm went out last week, I should have the patches ready for review > > > early next week. > > > > > > FWIW, we (as in XFS filesystem testers) regularly run tests that > > > have hundreds of concurrent sys_sync() calls running at the same > > > time. e.g. xfstests::xfs/297 runs a 1000 fsstress processes while > > > freezing and unfreezing the filesystem, and that usually shows > > > hundreds of threads running sys_sync concurrently after a short > > > amount of runtime. So it's pretty clear that what Dave is seeing > > > is not necessarily representative of what happens when there ar lots > > > of sys_sync() calls run concurrently. > > > > So Dave might be finding an additional problem. ;-) > > Dave will always find problems. If you want something broken, give > it to Dave and he'll hand it back in pieces. :) So rather than Wreck-it Ralph, we have Destroy-it Dave? ;-) And I must hasten to add that Dave's destroy-it services, though sometimes irritating, are almost always quite valuable. > > > BTW, concurrent syncfs() calls are going to have exactly the same > > > problem as concurrent sync() calls, as is any other operation that > > > results in a walk of the per-superblock inodes list. > > > > Yep! Your upcoming patch addresses these as well? > > Yes, it does. Good to hear, looking forward to seeing them? Thanx, Paul ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2013-07-30 17:47 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-07-26 23:28 [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation Paul E. McKenney 2013-07-27 0:29 ` Linus Torvalds 2013-07-27 1:23 ` Paul E. McKenney 2013-07-27 2:57 ` Dave Chinner 2013-07-27 4:05 ` Paul E. McKenney 2013-07-27 6:21 ` Dave Chinner 2013-07-27 11:26 ` Paul E. McKenney 2013-07-29 7:06 ` Dave Chinner 2013-07-30 17:43 ` Paul E. McKenney
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).