* [PATCH 0/2] ext4: Reduce contention on s_orphan_lock @ 2014-04-29 23:32 Jan Kara 2014-04-29 23:32 ` [PATCH 1/2] ext4: Use sbi in ext4_orphan_del() Jan Kara ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Jan Kara @ 2014-04-29 23:32 UTC (permalink / raw) To: T Makphaibulchoke; +Cc: linux-ext4 Hello, so I finally got to looking into your patches for reducing contention on s_orphan_lock. AFAICT these two patches (the first one is just a small cleanup) should have the same speed gain as the patches you wrote and they are simpler. Can you give them a try please? Thanks! Honza ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/2] ext4: Use sbi in ext4_orphan_del() 2014-04-29 23:32 [PATCH 0/2] ext4: Reduce contention on s_orphan_lock Jan Kara @ 2014-04-29 23:32 ` Jan Kara 2014-04-29 23:32 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara 2014-05-02 21:56 ` [PATCH 0/2] " Thavatchai Makphaibulchoke 2 siblings, 0 replies; 25+ messages in thread From: Jan Kara @ 2014-04-29 23:32 UTC (permalink / raw) To: T Makphaibulchoke; +Cc: linux-ext4, Jan Kara Use sbi pointer consistently in ext4_orphan_del() instead of opencoding it sometimes. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/namei.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 1cb84f78909e..411957326827 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2622,22 +2622,20 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) { struct list_head *prev; struct ext4_inode_info *ei = EXT4_I(inode); - struct ext4_sb_info *sbi; + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); __u32 ino_next; struct ext4_iloc iloc; int err = 0; - if ((!EXT4_SB(inode->i_sb)->s_journal) && - !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS)) + if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) return 0; - mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock); + mutex_lock(&sbi->s_orphan_lock); if (list_empty(&ei->i_orphan)) goto out; ino_next = NEXT_ORPHAN(inode); prev = ei->i_orphan.prev; - sbi = EXT4_SB(inode->i_sb); jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino); @@ -2683,7 +2681,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) out_err: ext4_std_error(inode->i_sb, err); out: - mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock); + mutex_unlock(&sbi->s_orphan_lock); return err; out_brelse: -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-04-29 23:32 [PATCH 0/2] ext4: Reduce contention on s_orphan_lock Jan Kara 2014-04-29 23:32 ` [PATCH 1/2] ext4: Use sbi in ext4_orphan_del() Jan Kara @ 2014-04-29 23:32 ` Jan Kara 2014-05-02 21:56 ` Thavatchai Makphaibulchoke 2014-05-02 21:56 ` [PATCH 0/2] " Thavatchai Makphaibulchoke 2 siblings, 1 reply; 25+ messages in thread From: Jan Kara @ 2014-04-29 23:32 UTC (permalink / raw) To: T Makphaibulchoke; +Cc: linux-ext4, Jan Kara Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that we avoid taking global s_orphan_lock in some cases and hold it for shorter time in other cases. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/namei.c | 39 +++++++++++++++++++++++++++++---------- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 411957326827..4253df8af9ef 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode) return 1; } -/* ext4_orphan_add() links an unlinked or truncated inode into a list of +/* + * ext4_orphan_add() links an unlinked or truncated inode into a list of * such inodes, starting at the superblock, in case we crash before the * file is closed/deleted, or in case the inode truncate spans multiple * transactions and the last transaction is not recovered after a crash. * * At filesystem recovery time, we walk this list deleting unlinked * inodes and truncating linked inodes in ext4_orphan_cleanup(). + * + * Orphan list manipulation functions must be called under i_mutex unless + * we are just creating the inode or deleting it. */ int ext4_orphan_add(handle_t *handle, struct inode *inode) { @@ -2556,9 +2560,14 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) if (!EXT4_SB(sb)->s_journal) return 0; - mutex_lock(&EXT4_SB(sb)->s_orphan_lock); - if (!list_empty(&EXT4_I(inode)->i_orphan)) - goto out_unlock; + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && + !mutex_is_locked(&inode->i_mutex)); + /* + * Exit early if inode already is on orphan list. This is a big speedup + * since we don't have to contend on the global s_orphan_lock. + */ + if (!list_empty(&EXT4_I(inode)->i_orphan)) + return 0; /* * Orphan handling is only valid for files with data blocks @@ -2577,6 +2586,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) err = ext4_reserve_inode_write(handle, inode, &iloc); if (err) goto out_unlock; + + mutex_lock(&EXT4_SB(sb)->s_orphan_lock); /* * Due to previous errors inode may be already a part of on-disk * orphan list. If so skip on-disk list modification. @@ -2630,10 +2641,22 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) return 0; - mutex_lock(&sbi->s_orphan_lock); + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && + !mutex_is_locked(&inode->i_mutex)); + /* + * Do this quick and racy check before taking global s_orphan_lock. 
+ */ if (list_empty(&ei->i_orphan)) - goto out; + return 0; + /* Grab inode buffer early before taking global s_orphan_lock */ + err = ext4_reserve_inode_write(handle, inode, &iloc); + if (err) { + list_del_init(&ei->i_orphan); + return err; + } + + mutex_lock(&sbi->s_orphan_lock); ino_next = NEXT_ORPHAN(inode); prev = ei->i_orphan.prev; @@ -2648,10 +2671,6 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) if (!handle) goto out; - err = ext4_reserve_inode_write(handle, inode, &iloc); - if (err) - goto out_err; ^ permalink raw reply related [flat|nested] 25+ messages in thread
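The locking pattern the patch is moving towards -- do the cheap "already on the orphan list?" check and the expensive journal/buffer preparation before touching the global lock, so that the critical section under s_orphan_lock stays short -- can be illustrated with a small userspace analogue. The sketch below is illustrative only, using pthreads and invented names rather than the ext4 code; the unlocked check is safe here for the same reason as in the patch, because the per-item lock plays the role of i_mutex and already serializes operations on a single item.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct item {
	pthread_mutex_t owner_lock;	/* analogue of i_mutex */
	bool on_list;			/* analogue of !list_empty(&ei->i_orphan) */
	struct item *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;	/* analogue of s_orphan_lock */
static struct item *list_head;

/* Called with it->owner_lock held, like ext4_orphan_add() under i_mutex. */
static void orphan_add_analogue(struct item *it)
{
	/*
	 * Quick check without the global lock: the caller's owner_lock
	 * guarantees nobody else adds or removes this item concurrently.
	 */
	if (it->on_list)
		return;

	/* ... expensive preparation (journal credits, inode buffer) would go here ... */

	pthread_mutex_lock(&list_lock);		/* short critical section */
	it->next = list_head;
	list_head = it;
	it->on_list = true;
	pthread_mutex_unlock(&list_lock);
}

int main(void)
{
	struct item it = { .on_list = false, .next = NULL };

	pthread_mutex_init(&it.owner_lock, NULL);
	pthread_mutex_lock(&it.owner_lock);
	orphan_add_analogue(&it);
	orphan_add_analogue(&it);	/* second call returns on the quick check */
	pthread_mutex_unlock(&it.owner_lock);
	printf("on_list=%d\n", it.on_list);
	return 0;
}

Compiled with -pthread, the second call never touches list_lock, which is the contention the patch is trying to avoid for the common "already an orphan" case.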
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-04-29 23:32 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara @ 2014-05-02 21:56 ` Thavatchai Makphaibulchoke 0 siblings, 0 replies; 25+ messages in thread From: Thavatchai Makphaibulchoke @ 2014-05-02 21:56 UTC (permalink / raw) To: Jan Kara, T Makphaibulchoke; +Cc: linux-ext4 On 04/29/2014 05:32 PM, Jan Kara wrote: > Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that > we avoid taking global s_orphan_lock in some cases and hold it for > shorter time in other cases. > > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/ext4/namei.c | 39 +++++++++++++++++++++++++++++---------- > 1 file changed, 29 insertions(+), 10 deletions(-) > > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > index 411957326827..4253df8af9ef 100644 > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode) > return 1; > } > > -/* ext4_orphan_add() links an unlinked or truncated inode into a list of > +/* > + * ext4_orphan_add() links an unlinked or truncated inode into a list of > * such inodes, starting at the superblock, in case we crash before the > * file is closed/deleted, or in case the inode truncate spans multiple > * transactions and the last transaction is not recovered after a crash. > * > * At filesystem recovery time, we walk this list deleting unlinked > * inodes and truncating linked inodes in ext4_orphan_cleanup(). > + * > + * Orphan list manipulation functions must be called under i_mutex unless > + * we are just creating the inode or deleting it. > */ > int ext4_orphan_add(handle_t *handle, struct inode *inode) > { > @@ -2556,9 +2560,14 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) > if (!EXT4_SB(sb)->s_journal) > return 0; > > - mutex_lock(&EXT4_SB(sb)->s_orphan_lock); > - if (!list_empty(&EXT4_I(inode)->i_orphan)) > - goto out_unlock; > + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && > + !mutex_is_locked(&inode->i_mutex)); > + /* > + * Exit early if inode already is on orphan list. This is a big speedup > + * since we don't have to contend on the global s_orphan_lock. > + */ > + if (!list_empty(&EXT4_I(inode)->i_orphan)) > + return 0; > > /* > * Orphan handling is only valid for files with data blocks > @@ -2577,6 +2586,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) > err = ext4_reserve_inode_write(handle, inode, &iloc); > if (err) > goto out_unlock; > + > + mutex_lock(&EXT4_SB(sb)->s_orphan_lock); > /* > * Due to previous errors inode may be already a part of on-disk > * orphan list. If so skip on-disk list modification. I believe we need to also call brelse() on the iloc that we reserve earlier, if we are to only update in memory orphan list only. This is also missing in the original code. > @@ -2630,10 +2641,22 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) > if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) > return 0; > > - mutex_lock(&sbi->s_orphan_lock); > + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && > + !mutex_is_locked(&inode->i_mutex)); > + /* > + * Do this quick and racy check before taking global s_orphan_lock. > + */ > if (list_empty(&ei->i_orphan)) > - goto out; > + return 0; > > + /* Grab inode buffer early before taking global s_orphan_lock */ > + err = ext4_reserve_inode_write(handle, inode, &iloc); > + if (err) { You need to hold s_orphan_lock mutex before modify the orphan list. 
> + list_del_init(&ei->i_orphan); > + return err; > + } > + > + mutex_lock(&sbi->s_orphan_lock); > ino_next = NEXT_ORPHAN(inode); > prev = ei->i_orphan.prev; > > @@ -2648,10 +2671,6 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) > if (!handle) > goto out; In the error path, with a NULL handle, you still call ext4_reserve_inode_write(), which is not required. This may be one of the reasons you did not achieve better performance. In the original code and my patch, this path should be quick. BTW, I realized that my patch does not remove the inode from the in-memory orphan list even if we do not have write access to the inode. > > - err = ext4_reserve_inode_write(handle, inode, &iloc); > - if (err) > - goto out_err; > - > if (prev == &sbi->s_orphan) { > jbd_debug(4, "superblock will point to %u\n", ino_next); > BUFFER_TRACE(sbi->s_sbh, "get_write_access"); > In my patch, there are things I do after updating the orphan list without holding the mutex, which you do not do. Again, this may be the reason your patch does not achieve the same performance. Yes, there is some code cleanup that should have been done when I reverted back to using a single mutex. I doubt it would make much difference if we are to minimize the time we are holding the s_orphan_lock. As mentioned earlier, I still believe the hashed mutex is a good thing to have, though with the existing code it may not be required. I'll resubmit a new version of my patch with some code cleanup and a fix for the problem with orphan delete mentioned above. Thanks, Mak. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] ext4: Reduce contention on s_orphan_lock 2014-04-29 23:32 [PATCH 0/2] ext4: Reduce contention on s_orphan_lock Jan Kara 2014-04-29 23:32 ` [PATCH 1/2] ext4: Use sbi in ext4_orphan_del() Jan Kara 2014-04-29 23:32 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara @ 2014-05-02 21:56 ` Thavatchai Makphaibulchoke 2014-05-06 11:49 ` Jan Kara 2 siblings, 1 reply; 25+ messages in thread From: Thavatchai Makphaibulchoke @ 2014-05-02 21:56 UTC (permalink / raw) To: Jan Kara, T Makphaibulchoke; +Cc: linux-ext4 On 04/29/2014 05:32 PM, Jan Kara wrote: > > Hello, > > so I finally got to looking into your patches for reducing contention > on s_orphan_lock. AFAICT these two patches (the first one is just a > small cleanup) should have the same speed gain as the patches you wrote > and they are simpler. Can you give them a try please? Thanks! > > Honza > I applied your patch and ran aim7 on both the 80 and 120 core count machines. There are aim7 workloads for which your patch does show some performance improvement. Unfortunately, in general it does not have the same performance level as my original patch, especially with a high user count, 500 or more. As for the hashed mutex used in my patch to serialize orphan operations on a single inode, even though I agree with you that with the existing code it is not required, I don't believe that you can count on that in the future. I believe that is also your concern, as you also added a comment indicating the i_mutex requirement in your patch. In terms of maintainability, I do not believe simply relying on a warning in a comment is sufficient. On top of that, with this new requirement we are unnecessarily coupling the orphan operations to i_mutex, adding more contention to it. This would be likely to cause a performance regression, as my experiment in responding to your earlier comment on my patch did show some regression when using i_mutex for serialization of orphan operations on a single inode (adding code to lock the i_mutex if it is not already locked). I still believe having a different mutex for orphan operation serialization is a better and safer alternative. From my experiment so far (I'm still verifying this), it may even help improve performance by spreading out the contention on the s_orphan_mutex. Please also see my response to patch 2/2 for comments on the actual code. Thanks, Mak. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] ext4: Reduce contention on s_orphan_lock 2014-05-02 21:56 ` [PATCH 0/2] " Thavatchai Makphaibulchoke @ 2014-05-06 11:49 ` Jan Kara 2014-05-09 6:24 ` Thavatchai Makphaibulchoke 0 siblings, 1 reply; 25+ messages in thread From: Jan Kara @ 2014-05-06 11:49 UTC (permalink / raw) To: Thavatchai Makphaibulchoke; +Cc: Jan Kara, T Makphaibulchoke, linux-ext4 On Fri 02-05-14 15:56:56, Thavatchai Makphaibulchoke wrote: > On 04/29/2014 05:32 PM, Jan Kara wrote: > > > > Hello, > > > > so I finally got to looking into your patches for reducing contention > > on s_orphan_lock. AFAICT these two patches (the first one is just a > > small cleanup) should have the same speed gain as the patches you wrote > > and they are simpler. Can you give them a try please? Thanks! > > I applied your patch and ran aim7 on both the 80 and 120 core count > machines. There are aim7 workloads that your patch does show some > performance improvement. Unfortunately in general, it does not have the > same performance level as my original patch, especially with high user > count, 500 or more. Thanks for running the benchmarks! > As for the hashed mutex used in my patch to serialize orphan operation > within a single node, even if I agree with you that with existing code it > is not required, I don't believe that you can count on that in the > future. I believe that is also your concern, as you also added comment > indicating the requirement of the i_mutex in your patch. So I believe it is reasonable to require i_mutex to be held when orphan operations are called. Adding / removing inode from orphan list is needed when extending or truncating inode and these operations generally require i_mutex in ext4. I've added the assertion so that we can catch a situation when we either by mistake or intentionally grow a call site which won't hold i_mutex. If such situation happens and we have good reasons why i_mutex shouldn't be used there, then we can think about some dedicated locks (either hashed or per inode ones). > In terms of maintainability, I do not believe simply relying on warning > in a comment is sufficient. On top of that with this new requirement, we > are unnecessarily coupling the orphan operations to i_mutex, adding more > contention to it. This would likely to cause performance regression, as We aren't adding *any* contention. I didn't have to add a single additional locking of i_mutex because in all the necessary cases we are holding i_mutex over the orphan operations anyway (or the inode isn't reachable by other processes). So regarding per-inode locking, we are doing strictly less work with my patch than with your patch. However you are correct in your comments to my patch 2/2 that in your patch you handle operations under the global s_orphan_lock better and that's probably what causes the difference. I need to improve that for sure. > my experiment in responding to your earlier comment on my patch did show > some regression when using i_mutex for serialization of orphan operations > within a single node (adding code to lock the if it is not already > locked). > > I still believe having a different mutex for orphan operation > serialization is a better and safer alternative. Frankly I don't see why they would be better. They are marginally safer by keeping the locking inside the orphan functions but the WARN_ON I have added pretty much mitigates this concern and as I wrote above I actually think using i_mutex not only works but also makes logical sense. 
> From my experiment so > far (I'm still verifying this), it may even help improving the > performance by spreading out the contention on the s_orphan_mutex. So orphan operations on a single inode are serialized by i_mutex both with your and with my patch. You add additional serialization by hashed mutexes so now orphan operations for independent inodes get serialized as well. It may in theory improve the performance by effectively making the access to global s_orphan_lock less fair but for now I believe that other differences in our patches is what makes a difference. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] ext4: Reduce contention on s_orphan_lock 2014-05-06 11:49 ` Jan Kara @ 2014-05-09 6:24 ` Thavatchai Makphaibulchoke 0 siblings, 0 replies; 25+ messages in thread From: Thavatchai Makphaibulchoke @ 2014-05-09 6:24 UTC (permalink / raw) To: Jan Kara; +Cc: T Makphaibulchoke, linux-ext4 On 05/06/2014 05:49 AM, Jan Kara wrote: > On Fri 02-05-14 15:56:56, Thavatchai Makphaibulchoke wrote: > So I believe it is reasonable to require i_mutex to be held when orphan > operations are called. Adding / removing inode from orphan list is needed > when extending or truncating inode and these operations generally require > i_mutex in ext4. I've added the assertion so that we can catch a situation > when we either by mistake or intentionally grow a call site which won't > hold i_mutex. If such situation happens and we have good reasons why > i_mutex shouldn't be used there, then we can think about some dedicated > locks (either hashed or per inode ones). > In an earlier version of the patch, it was suggested that a per-inode mutex is too much overhead. In addition, adding a mutex to the ext4 in-memory inode struct would increase its size beyond the cache line size. > We aren't adding *any* contention. I didn't have to add a single > additional locking of i_mutex because in all the necessary cases we are > holding i_mutex over the orphan operations anyway (or the inode isn't > reachable by other processes). So regarding per-inode locking, we are doing > strictly less work with my patch than with your patch. However you are > correct in your comments to my patch 2/2 that in your patch you handle > operations under the global s_orphan_lock better and that's probably what > causes the difference. I need to improve that for sure. > Sorry for not being clear. Yes, I agree that with the existing code you are not adding any contention. What I'm trying to say is that with your patch making i_mutex an implicit requirement of an orphan operation, any future ext4 operation requiring an orphan add/delete will also have to acquire i_mutex regardless of whether it needs i_mutex. Though this may not be likely to happen, it could potentially add unnecessary contention to i_mutex. Again, this may not be likely to happen, but any change in i_mutex usage in the callers could have a direct impact on your change. > Frankly I don't see why they would be better. They are marginally safer > by keeping the locking inside the orphan functions but the WARN_ON I have > added pretty much mitigates this concern and as I wrote above I actually > think using i_mutex not only works but also makes logical sense. > Yes, the WARN_ON is a good safeguard against any i_mutex violation. Personally I would prefer a BUG_ON. But as mentioned above, any change in the callers or some future usage of the orphan operations may require a revisit of the WARN_ON. Though my patch is doing more work and is only marginally safer with the existing code, it would be safer with regard to any further change. In addition, since i_mutex is not required, there may be a chance (sorry, I did not have a chance to verify this against the source yet) for the caller to drop the i_mutex earlier for a possible further improvement. Despite what I've said, I would agree it makes sense, for performance reasons, to take advantage of the fact that either the caller already holds i_mutex or no other accesses to the inode are possible.
Unfortunately (it may be coincidental), the aim7 results from my patch with the hashed mutex acquire/release removed show not only no performance improvement but a significant regression in several workloads on two of the higher thread count platforms I tested. With all the performance numbers I'm getting, plus the fact that at least it is marginally safer, I still believe the hashed mutex is the way to go. Thanks, Mak. > So orphan operations on a single inode are serialized by i_mutex both > with your and with my patch. You add additional serialization by hashed > mutexes so now orphan operations for independent inodes get serialized as > well. It may in theory improve the performance by effectively making the > access to global s_orphan_lock less fair but for now I believe that other > differences in our patches is what makes a difference. > > Honza > ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 0/2 v2] Improve orphan list scaling @ 2014-05-15 20:17 Jan Kara 2014-05-15 20:17 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara 0 siblings, 1 reply; 25+ messages in thread From: Jan Kara @ 2014-05-15 20:17 UTC (permalink / raw) To: linux-ext4; +Cc: Ted Tso, Thavatchai Makphaibulchoke Hello, this is my version of patches to improve orphan list scaling by reducing the amount of work done under the global s_orphan_lock. Thavatchai and I are in disagreement about whose patches are better (see thread http://www.spinics.net/lists/linux-ext4/msg43220.html) so I guess it's up to Ted or other people on this list to decide. When running code stressing orphan list operations [1] with these patches, I see s_orphan_lock move from number 1 in the lock_stat report to unmeasurable. So with the patches there are other much more problematic locks (superblock buffer lock and bh_state lock, j_list_lock, buffer locks for inode buffers when several inodes share a block...). The average times for 10 runs of the test program on my 48-way box with ext4 on ramdisk are:

              Vanilla                 Patched
Procs      Avg        Stddev       Avg        Stddev
    1    2.769200   0.056194     2.890700   0.061727
    2    5.756500   0.313268     4.383500   0.161629
    4   11.852500   0.130221     6.542900   0.160039
   10   33.590900   0.394888    27.749500   0.615517
   20   71.035400   0.320914    76.368700   3.734557
   40  236.671100   2.856885   228.236800   2.361391

So we can see the biggest speedup was for 2, 4, and 10 threads. For higher thread counts the contention and cache bouncing prevented any significant speedup (we can even see a barely-out-of-noise performance drop for 20 threads).

Changes since v1:
* Fixed up various bugs in error handling pointed out by Thavatchai and some others as well
* Somewhat reduced critical sections under s_orphan_lock

[1] The test program runs a given number of processes; each process truncates a 4k file by 1 byte at a time until it reaches a size of 1 byte, and then the file is extended to 4k again. Honza ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-15 20:17 [PATCH 0/2 v2] Improve orphan list scaling Jan Kara @ 2014-05-15 20:17 ` Jan Kara 2014-05-20 3:23 ` Theodore Ts'o 2014-05-20 8:33 ` Thavatchai Makphaibulchoke 0 siblings, 2 replies; 25+ messages in thread From: Jan Kara @ 2014-05-15 20:17 UTC (permalink / raw) To: linux-ext4; +Cc: Ted Tso, Thavatchai Makphaibulchoke, Jan Kara Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that we avoid taking global s_orphan_lock in some cases and hold it for shorter time in other cases. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/namei.c | 109 +++++++++++++++++++++++++++++++++----------------------- 1 file changed, 65 insertions(+), 44 deletions(-) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 5fcaa85b6dc5..0486fbafb808 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode) return 1; } -/* ext4_orphan_add() links an unlinked or truncated inode into a list of +/* + * ext4_orphan_add() links an unlinked or truncated inode into a list of * such inodes, starting at the superblock, in case we crash before the * file is closed/deleted, or in case the inode truncate spans multiple * transactions and the last transaction is not recovered after a crash. * * At filesystem recovery time, we walk this list deleting unlinked * inodes and truncating linked inodes in ext4_orphan_cleanup(). + * + * Orphan list manipulation functions must be called under i_mutex unless + * we are just creating the inode or deleting it. */ int ext4_orphan_add(handle_t *handle, struct inode *inode) { @@ -2553,13 +2557,19 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_iloc iloc; int err = 0, rc; + bool dirty = false; if (!sbi->s_journal) return 0; - mutex_lock(&sbi->s_orphan_lock); + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && + !mutex_is_locked(&inode->i_mutex)); + /* + * Exit early if inode already is on orphan list. This is a big speedup + * since we don't have to contend on the global s_orphan_lock. + */ if (!list_empty(&EXT4_I(inode)->i_orphan)) - goto out_unlock; + return 0; /* * Orphan handling is only valid for files with data blocks @@ -2573,44 +2583,47 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) BUFFER_TRACE(sbi->s_sbh, "get_write_access"); err = ext4_journal_get_write_access(handle, sbi->s_sbh); if (err) - goto out_unlock; + goto out; err = ext4_reserve_inode_write(handle, inode, &iloc); if (err) - goto out_unlock; + goto out; + + mutex_lock(&sbi->s_orphan_lock); /* * Due to previous errors inode may be already a part of on-disk * orphan list. If so skip on-disk list modification. */ - if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <= - (le32_to_cpu(sbi->s_es->s_inodes_count))) - goto mem_insert; - - /* Insert this inode at the head of the on-disk orphan list... */ - NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); - sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); - err = ext4_handle_dirty_super(handle, sb); - rc = ext4_mark_iloc_dirty(handle, inode, &iloc); - if (!err) - err = rc; - - /* Only add to the head of the in-memory list if all the - * previous operations succeeded. If the orphan_add is going to - * fail (possibly taking the journal offline), we can't risk - * leaving the inode on the orphan list: stray orphan-list - * entries can cause panics at unmount time. 
- * - * This is safe: on error we're going to ignore the orphan list - * anyway on the next recovery. */ -mem_insert: - if (!err) - list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); + if (!NEXT_ORPHAN(inode) || NEXT_ORPHAN(inode) > + (le32_to_cpu(sbi->s_es->s_inodes_count))) { + /* Insert this inode at the head of the on-disk orphan list */ + NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); + sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); + dirty = true; + } + list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); + mutex_unlock(&sbi->s_orphan_lock); + if (dirty) { + err = ext4_handle_dirty_super(handle, sb); + rc = ext4_mark_iloc_dirty(handle, inode, &iloc); + if (!err) + err = rc; + if (err) { + /* + * We have to remove inode from in-memory list if + * addition to on disk orphan list failed. Stray orphan + * list entries can cause panics at unmount time. + */ + mutex_lock(&sbi->s_orphan_lock); + list_del(&EXT4_I(inode)->i_orphan); + mutex_unlock(&sbi->s_orphan_lock); + } + } jbd_debug(4, "superblock will point to %lu\n", inode->i_ino); jbd_debug(4, "orphan inode %lu will point to %d\n", inode->i_ino, NEXT_ORPHAN(inode)); -out_unlock: - mutex_unlock(&sbi->s_orphan_lock); +out: ext4_std_error(sb, err); return err; } @@ -2631,13 +2644,18 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) return 0; - mutex_lock(&sbi->s_orphan_lock); + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && + !mutex_is_locked(&inode->i_mutex)); + /* Do this quick check before taking global s_orphan_lock. */ if (list_empty(&ei->i_orphan)) - goto out; + return 0; - ino_next = NEXT_ORPHAN(inode); - prev = ei->i_orphan.prev; + if (handle) { + /* Grab inode buffer early before taking global s_orphan_lock */ + err = ext4_reserve_inode_write(handle, inode, &iloc); + } + mutex_lock(&sbi->s_orphan_lock); jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino); list_del_init(&ei->i_orphan); @@ -2646,20 +2664,23 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) * transaction handle with which to update the orphan list on * disk, but we still need to remove the inode from the linked * list in memory. */ - if (!handle) - goto out; - - err = ext4_reserve_inode_write(handle, inode, &iloc); - if (err) + if (!handle || err) { + mutex_unlock(&sbi->s_orphan_lock); goto out_err; + } + ino_next = NEXT_ORPHAN(inode); + prev = ei->i_orphan.prev; if (prev == &sbi->s_orphan) { jbd_debug(4, "superblock will point to %u\n", ino_next); BUFFER_TRACE(sbi->s_sbh, "get_write_access"); err = ext4_journal_get_write_access(handle, sbi->s_sbh); - if (err) + if (err) { + mutex_unlock(&sbi->s_orphan_lock); goto out_brelse; + } sbi->s_es->s_last_orphan = cpu_to_le32(ino_next); + mutex_unlock(&sbi->s_orphan_lock); err = ext4_handle_dirty_super(handle, inode->i_sb); } else { struct ext4_iloc iloc2; @@ -2669,20 +2690,20 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) jbd_debug(4, "orphan inode %lu will point to %u\n", i_prev->i_ino, ino_next); err = ext4_reserve_inode_write(handle, i_prev, &iloc2); - if (err) + if (err) { + mutex_unlock(&sbi->s_orphan_lock); goto out_brelse; + } NEXT_ORPHAN(i_prev) = ino_next; + mutex_unlock(&sbi->s_orphan_lock); err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2); } if (err) goto out_brelse; NEXT_ORPHAN(inode) = 0; err = ext4_mark_iloc_dirty(handle, inode, &iloc); ^ permalink raw reply related [flat|nested] 25+ messages in thread
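One detail worth noting in the v2 version of ext4_orphan_add() above: the inode is published on the in-memory orphan list while s_orphan_lock is held, the lock is dropped for the slow dirtying of the superblock and inode buffer, and only if that dirtying fails is the lock re-taken to unpublish the inode again (a stray in-memory entry would otherwise cause trouble at unmount time). A rough pthread analogue of that publish-then-roll-back shape, with invented names and purely for illustration, looks like this:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;	/* s_orphan_lock analogue */
static int list_entries;					/* stands in for the in-memory list */

/* Slow work done outside the lock; may fail, like ext4_handle_dirty_super(). */
static int persist_on_disk(bool fail)
{
	return fail ? -1 : 0;
}

static int orphan_add_analogue(bool make_persist_fail)
{
	int err;

	pthread_mutex_lock(&list_lock);
	list_entries++;				/* publish in memory */
	pthread_mutex_unlock(&list_lock);

	err = persist_on_disk(make_persist_fail);
	if (err) {
		/* Undo the in-memory publication under the re-taken lock. */
		pthread_mutex_lock(&list_lock);
		list_entries--;
		pthread_mutex_unlock(&list_lock);
	}
	return err;
}

int main(void)
{
	orphan_add_analogue(false);
	orphan_add_analogue(true);			/* fails and rolls back */
	printf("entries on list: %d\n", list_entries);	/* prints 1 */
	return 0;
}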
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-15 20:17 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara @ 2014-05-20 3:23 ` Theodore Ts'o 2014-05-20 8:33 ` Thavatchai Makphaibulchoke 0 siblings, 0 replies; 25+ messages in thread From: Theodore Ts'o @ 2014-05-20 3:23 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4, Thavatchai Makphaibulchoke On Thu, May 15, 2014 at 10:17:06PM +0200, Jan Kara wrote: > Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that > we avoid taking global s_orphan_lock in some cases and hold it for > shorter time in other cases. > > Signed-off-by: Jan Kara <jack@suse.cz> Hi Jan, With this patch applied, xfstests is hanging after ext4/001 runs: BEGIN TEST: Ext4 4k block Tue May 20 03:18:49 UTC 2014 Device: /dev/vdb mk2fs options: -q mount options: -o block_validity FSTYP -- ext4 PLATFORM -- Linux/i686 candygram 3.15.0-rc2-00052-g0f430fc MKFS_OPTIONS -- -q /dev/vdc MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc ext4/001 4s ... [03:19:05] [03:19:09] 4s It just stops here, and never runs the next test. If I revert this commit, the test run continues with ext4/002, and then on to the rest of the tests. Could you take a look? Thanks!! - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-15 20:17 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara 2014-05-20 3:23 ` Theodore Ts'o @ 2014-05-20 8:33 ` Thavatchai Makphaibulchoke 2014-05-20 9:18 ` Jan Kara 2014-05-20 13:57 ` Theodore Ts'o 1 sibling, 2 replies; 25+ messages in thread From: Thavatchai Makphaibulchoke @ 2014-05-20 8:33 UTC (permalink / raw) To: Jan Kara, linux-ext4; +Cc: Ted Tso Please see my one comment below. BTW, I've run aim7 on your before I notice what I commented below. There are workloads that my patch outperform yours and vice versa. I will have to redo it over again. On 05/15/2014 02:17 PM, Jan Kara wrote: > Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that > we avoid taking global s_orphan_lock in some cases and hold it for > shorter time in other cases. > > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/ext4/namei.c | 109 +++++++++++++++++++++++++++++++++----------------------- > 1 file changed, 65 insertions(+), 44 deletions(-) > > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > index 5fcaa85b6dc5..0486fbafb808 100644 > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode) > return 1; > } > > -/* ext4_orphan_add() links an unlinked or truncated inode into a list of > +/* > + * ext4_orphan_add() links an unlinked or truncated inode into a list of > * such inodes, starting at the superblock, in case we crash before the > * file is closed/deleted, or in case the inode truncate spans multiple > * transactions and the last transaction is not recovered after a crash. > * > * At filesystem recovery time, we walk this list deleting unlinked > * inodes and truncating linked inodes in ext4_orphan_cleanup(). > + * > + * Orphan list manipulation functions must be called under i_mutex unless > + * we are just creating the inode or deleting it. > */ > int ext4_orphan_add(handle_t *handle, struct inode *inode) > { > @@ -2553,13 +2557,19 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) > struct ext4_sb_info *sbi = EXT4_SB(sb); > struct ext4_iloc iloc; > int err = 0, rc; > + bool dirty = false; > > if (!sbi->s_journal) > return 0; > > - mutex_lock(&sbi->s_orphan_lock); > + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && > + !mutex_is_locked(&inode->i_mutex)); > + /* > + * Exit early if inode already is on orphan list. This is a big speedup > + * since we don't have to contend on the global s_orphan_lock. > + */ > if (!list_empty(&EXT4_I(inode)->i_orphan)) > - goto out_unlock; > + return 0; > > /* > * Orphan handling is only valid for files with data blocks > @@ -2573,44 +2583,47 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) > BUFFER_TRACE(sbi->s_sbh, "get_write_access"); > err = ext4_journal_get_write_access(handle, sbi->s_sbh); > if (err) > - goto out_unlock; > + goto out; > > err = ext4_reserve_inode_write(handle, inode, &iloc); > if (err) > - goto out_unlock; > + goto out; > + > + mutex_lock(&sbi->s_orphan_lock); > /* > * Due to previous errors inode may be already a part of on-disk > * orphan list. If so skip on-disk list modification. > */ > - if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <= > - (le32_to_cpu(sbi->s_es->s_inodes_count))) > - goto mem_insert; > - > - /* Insert this inode at the head of the on-disk orphan list... 
*/ > - NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); > - sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); > - err = ext4_handle_dirty_super(handle, sb); > - rc = ext4_mark_iloc_dirty(handle, inode, &iloc); > - if (!err) > - err = rc; > - > - /* Only add to the head of the in-memory list if all the > - * previous operations succeeded. If the orphan_add is going to > - * fail (possibly taking the journal offline), we can't risk > - * leaving the inode on the orphan list: stray orphan-list > - * entries can cause panics at unmount time. > - * > - * This is safe: on error we're going to ignore the orphan list > - * anyway on the next recovery. */ > -mem_insert: > - if (!err) > - list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); > + if (!NEXT_ORPHAN(inode) || NEXT_ORPHAN(inode) > > + (le32_to_cpu(sbi->s_es->s_inodes_count))) { > + /* Insert this inode at the head of the on-disk orphan list */ > + NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); > + sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); > + dirty = true; > + } > + list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); > + mutex_unlock(&sbi->s_orphan_lock); > > + if (dirty) { > + err = ext4_handle_dirty_super(handle, sb); > + rc = ext4_mark_iloc_dirty(handle, inode, &iloc); > + if (!err) > + err = rc; > + if (err) { > + /* > + * We have to remove inode from in-memory list if > + * addition to on disk orphan list failed. Stray orphan > + * list entries can cause panics at unmount time. > + */ > + mutex_lock(&sbi->s_orphan_lock); > + list_del(&EXT4_I(inode)->i_orphan); > + mutex_unlock(&sbi->s_orphan_lock); > + } > + } > jbd_debug(4, "superblock will point to %lu\n", inode->i_ino); > jbd_debug(4, "orphan inode %lu will point to %d\n", > inode->i_ino, NEXT_ORPHAN(inode)); > -out_unlock: > - mutex_unlock(&sbi->s_orphan_lock); > +out: > ext4_std_error(sb, err); > return err; > } > @@ -2631,13 +2644,18 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) > if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) > return 0; > > - mutex_lock(&sbi->s_orphan_lock); > + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && > + !mutex_is_locked(&inode->i_mutex)); > + /* Do this quick check before taking global s_orphan_lock. */ > if (list_empty(&ei->i_orphan)) > - goto out; > + return 0; > > - ino_next = NEXT_ORPHAN(inode); > - prev = ei->i_orphan.prev; > + if (handle) { > + /* Grab inode buffer early before taking global s_orphan_lock */ > + err = ext4_reserve_inode_write(handle, inode, &iloc); > + } > > + mutex_lock(&sbi->s_orphan_lock); > jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino); > Should set prev = ei->i_orphan.prev; here, instead of down below where it has already been removed from the list. Thanks, Mak. > list_del_init(&ei->i_orphan); > @@ -2646,20 +2664,23 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) > * transaction handle with which to update the orphan list on > * disk, but we still need to remove the inode from the linked > * list in memory. */ > - if (!handle) > - goto out; > - > - err = ext4_reserve_inode_write(handle, inode, &iloc); > - if (err) > + if (!handle || err) { > + mutex_unlock(&sbi->s_orphan_lock); > goto out_err; > + } > > + ino_next = NEXT_ORPHAN(inode); > + prev = ei->i_orphan.prev; ^ permalink raw reply [flat|nested] 25+ messages in thread
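The prev problem pointed out above follows directly from the semantics of list_del_init(): it unlinks the entry and then re-initializes it to point at itself, so reading ei->i_orphan.prev after the deletion no longer yields the real predecessor on the orphan list. A tiny standalone illustration of that behaviour, using a minimal re-implementation of the kernel's circular list rather than the kernel headers:

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Same effect as the kernel's list_del_init(): unlink, then self-point. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	list_init(entry);
}

int main(void)
{
	struct list_head orphan_head, a, b;
	struct list_head *prev_before, *prev_after;

	list_init(&orphan_head);
	list_add(&a, &orphan_head);		/* list: head -> a */
	list_add(&b, &orphan_head);		/* list: head -> b -> a */

	prev_before = b.prev;			/* the real predecessor (the head) */
	list_del_init(&b);
	prev_after = b.prev;			/* now just &b itself */

	printf("before del, prev == head:  %d\n", prev_before == &orphan_head);
	printf("after  del, prev == entry: %d\n", prev_after == &b);
	return 0;
}

This is why prev (and NEXT_ORPHAN()) has to be sampled before the list_del_init() call, exactly as the comment above suggests.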
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-20 8:33 ` Thavatchai Makphaibulchoke @ 2014-05-20 9:18 ` Jan Kara 2014-05-20 13:57 ` Theodore Ts'o 1 sibling, 0 replies; 25+ messages in thread From: Jan Kara @ 2014-05-20 9:18 UTC (permalink / raw) To: Thavatchai Makphaibulchoke; +Cc: Jan Kara, linux-ext4, Ted Tso On Tue 20-05-14 02:33:23, Thavatchai Makphaibulchoke wrote: > Please see my one comment below. > > BTW, I've run aim7 on your before I notice what I commented below. There > are workloads that my patch outperform yours and vice versa. I will have > to redo it over again. > > On 05/15/2014 02:17 PM, Jan Kara wrote: > > @@ -2631,13 +2644,18 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) > > if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) > > return 0; > > > > - mutex_lock(&sbi->s_orphan_lock); > > + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && > > + !mutex_is_locked(&inode->i_mutex)); > > + /* Do this quick check before taking global s_orphan_lock. */ > > if (list_empty(&ei->i_orphan)) > > - goto out; > > + return 0; > > > > - ino_next = NEXT_ORPHAN(inode); > > - prev = ei->i_orphan.prev; > > + if (handle) { > > + /* Grab inode buffer early before taking global s_orphan_lock */ > > + err = ext4_reserve_inode_write(handle, inode, &iloc); > > + } > > > > + mutex_lock(&sbi->s_orphan_lock); > > jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino); > > > > Should set prev = ei->i_orphan.prev; here, instead of down below where it > has already been removed from the list. Bah, I'm sorry for causing extra work to you. That's a really stupid mistake. Thanks for finding that! Also I've realized we cannot really reliably do ext4_mark_iloc_dirty(handle, i_prev, &iloc2); outside of s_orphan_lock because that inode may be reclaimed the moment after we drop s_orphan_lock. So I had to remove that optimization as well and move the call back under s_orphan_lock. I'm running xfstests now (updated so they include also the test Ted found failing) to verify correctness again and then I'll get my performance numbers with fixed patches. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-20 8:33 ` Thavatchai Makphaibulchoke 2014-05-20 9:18 ` Jan Kara @ 2014-05-20 13:57 ` Theodore Ts'o 2014-05-20 17:16 ` Thavatchai Makphaibulchoke 2014-06-02 17:45 ` Thavatchai Makphaibulchoke 1 sibling, 2 replies; 25+ messages in thread From: Theodore Ts'o @ 2014-05-20 13:57 UTC (permalink / raw) To: Thavatchai Makphaibulchoke; +Cc: Jan Kara, linux-ext4 On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote: > Please see my one comment below. > > BTW, I've run aim7 on your before I notice what I commented below. There are workloads that my patch outperform yours and vice versa. I will have to redo it over again. Thavatchai, it would be really great if you could do lock_stat runs with both Jan's latest patches as well as yours. We need to understand where the differences are coming from. As I understand things, there are two differences between Jan's and your approaches. The first is that Jan is using the implicit locking of i_mutex to avoid needing to keep a hashed array of mutexes to synchronize an individual inode's being added to or removed from the orphan list. The second is that you've split the orphan mutex into an on-disk mutex and an in-memory spinlock. Is it possible to split up your patch so we can measure the benefits of each of these two changes? More interestingly, is there a way we can use your second change in concert with Jan's changes? Regards, - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
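The second difference Ted describes -- a spinlock protecting the in-memory orphan list, with a separate mutex serializing only the slow on-disk update -- might look roughly like the following userspace sketch. The names (mem_list_lock, ondisk_lock) are invented for illustration and this is not the code from Mak's patch; it only shows the shape of the split.

#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t mem_list_lock;	/* guards only the in-memory orphan list */
static pthread_mutex_t ondisk_lock = PTHREAD_MUTEX_INITIALIZER;	/* serializes the on-disk update */
static int mem_list_entries;
static int ondisk_updates;

static void orphan_add_analogue(void)
{
	/* Cheap in-memory manipulation under a spinlock only. */
	pthread_spin_lock(&mem_list_lock);
	mem_list_entries++;
	pthread_spin_unlock(&mem_list_lock);

	/* Slow, sleeping work (journalled superblock/inode update) under a mutex. */
	pthread_mutex_lock(&ondisk_lock);
	ondisk_updates++;	/* stand-in for dirtying the superblock / inode buffer */
	pthread_mutex_unlock(&ondisk_lock);
}

int main(void)
{
	pthread_spin_init(&mem_list_lock, PTHREAD_PROCESS_PRIVATE);
	orphan_add_analogue();
	printf("in-memory entries: %d, on-disk updates: %d\n",
	       mem_list_entries, ondisk_updates);
	pthread_spin_destroy(&mem_list_lock);
	return 0;
}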
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-20 13:57 ` Theodore Ts'o @ 2014-05-20 17:16 ` Thavatchai Makphaibulchoke 0 siblings, 0 replies; 25+ messages in thread From: Thavatchai Makphaibulchoke @ 2014-05-20 17:16 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4 On 05/20/2014 07:57 AM, Theodore Ts'o wrote: > Thavatchai, it would be really great if you could do lock_stat runs > with both Jan's latest patches as well as yours. We need to > understand where the differences are coming from. > > As I understand things, there are two differences between Jan's and your > approaches. The first is that Jan is using the implicit locking of > i_mutex to avoid needing to keep a hashed array of mutexes to > synchronize an individual inode's being added to or removed from the orphan > list. > > The second is that you've split the orphan mutex into an on-disk mutex > and an in-memory spinlock. > > Is it possible to split up your patch so we can measure the benefits > of each of these two changes? More interestingly, is there a way we > can use your second change in concert with Jan's changes? > > Regards, > > - Ted > Ted, depending on what Jan does with my latest comment regarding the dirty optimization, Jan's patch and my patch are functionally identical, with the exception that mine uses an explicit hashed lock to serialize add/delete for a single inode. Jan's uses the fact that either i_mutex is already held or the inode is not accessible to other threads. My assumption is that with the hashed mutex, the contention for s_orphan_lock will spread out more, increasing its chance of hitting the mutex_lock fast path. Anyway, I will do lock_stat runs to get more info. Thanks, Mak. ^ permalink raw reply [flat|nested] 25+ messages in thread
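The hashed-lock scheme described above -- an array of mutexes indexed by a hash of the inode number, taken before the short section under the global lock -- could be sketched along the following lines. The array size and the modulo hash are arbitrary choices made for this illustration; this is not the code from Mak's patch.

#include <pthread.h>
#include <stdio.h>

#define ORPHAN_HASH_SZ 16	/* arbitrary bucket count, chosen only for this sketch */

static pthread_mutex_t orphan_op_mutex[ORPHAN_HASH_SZ];	/* per-bucket serialization */
static pthread_mutex_t orphan_lock = PTHREAD_MUTEX_INITIALIZER;	/* global s_orphan_lock analogue */
static int orphan_count;

static pthread_mutex_t *orphan_hash_lock(unsigned long ino)
{
	return &orphan_op_mutex[ino % ORPHAN_HASH_SZ];
}

static void orphan_add_analogue(unsigned long ino)
{
	pthread_mutex_t *hashed = orphan_hash_lock(ino);

	/* Operations on inodes that hash to the same bucket serialize here first... */
	pthread_mutex_lock(hashed);
	/* ...so at any moment fewer CPUs are queued directly on the global lock. */
	pthread_mutex_lock(&orphan_lock);
	orphan_count++;
	pthread_mutex_unlock(&orphan_lock);
	pthread_mutex_unlock(hashed);
}

int main(void)
{
	int i;

	for (i = 0; i < ORPHAN_HASH_SZ; i++)
		pthread_mutex_init(&orphan_op_mutex[i], NULL);
	orphan_add_analogue(12);
	orphan_add_analogue(12 + ORPHAN_HASH_SZ);	/* same bucket: serialized on the hashed mutex */
	orphan_add_analogue(13);			/* different bucket */
	printf("orphan_count = %d\n", orphan_count);
	return 0;
}

Whether the extra hashed layer helps or hurts is exactly the question the benchmark numbers in this thread are trying to answer: it adds a lock/unlock pair per operation, but it can reduce the number of waiters queued directly on the global lock.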
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-20 13:57 ` Theodore Ts'o 2014-05-20 17:16 ` Thavatchai Makphaibulchoke @ 2014-06-02 17:45 ` Thavatchai Makphaibulchoke 2014-06-03 8:52 ` Jan Kara 1 sibling, 1 reply; 25+ messages in thread From: Thavatchai Makphaibulchoke @ 2014-06-02 17:45 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 6713 bytes --] On 05/20/2014 07:57 AM, Theodore Ts'o wrote: > On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote: > > Thavatchai, it would be really great if you could do lock_stat runs > with both Jan's latest patches as well as yours. We need to > understand where the differences are coming from. > > As I understand things, there are two differences between Jan's and your > approaches. The first is that Jan is using the implicit locking of > i_mutex to avoid needing to keep a hashed array of mutexes to > synchronize an individual inode's being added to or removed from the orphan > list. > > The second is that you've split the orphan mutex into an on-disk mutex > and an in-memory spinlock. > > Is it possible to split up your patch so we can measure the benefits > of each of these two changes? More interestingly, is there a way we > can use your second change in concert with Jan's changes? > > Regards, > > - Ted > Thanks to Jan, who pointed out one optimization in ext4_orphan_add() that I had missed. After integrating that into my patch, I've rerun the following aim7 workloads: alltests, custom, dbase, disk, fserver, new_fserver, shared and short. Here are the results. On an 8 core (16 thread) machine, both my revised patch (with the additional optimization from Jan's ext4_orphan_add()) and version 3 of Jan's patch give about the same results for most of the workloads, except fserver and new_fserver, on which Jan's outperforms mine by about 9% and 16%, respectively.
Here are the lock_stat output for disk, Jan's patch, lock_stat version 0.4 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &sbi->s_orphan_lock: 80189 80246 3.94 464489.22 77219615.47 962.29 503289 809004 0.10 476537.44 3424587.77 4.23 Mine, &sbi->s_orphan_lock: 82215 82259 3.61 640876.19 15098561.09 183.55 541422 794254 0.10 640368.86 4425140.61 5.57 &sbi->s_orphan_op_mutex[n]: 102507 104880 4.21 1335087.21 1392773487.19 13279.69 398328 840120 0.11 1334711.17 397596137.90 473.26 For new_fserver, Jan's patch, &sbi->s_orphan_lock: 1063059 1063369 5.57 1073325.95 59535205188.94 55987.34 4525570 8446052 0.10 75625.72 10700844.58 1.27 Mine, &sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95 &sbi->s_orphan_op_mutex[n]: 2176760 2202674 3.44 633129.10 55206091750.06 25063.21 3259467 8452918 0.10 349687.82 605683982.34 71.65 On an 80 core (160 thread) machine, mine outpeforms Jan's in alltests, custom, fserver, new_fserver and shared about the same margin it did over the baseline, around 20% For all these workloads, Jan's patch does not seem to show any noticeable improvement over baseline kernel. I'm getting about the same performance with the rest of the workloads. Here are the lock_stat output for alltests, Jan;'s, &sbi->s_orphan_lock: 2762871 2763355 4.46 49043.39 1763499587.40 638.17 5878253 6475844 0.15 20508.98 70827300.79 10.94 Mine, &sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95 &sbi->s_orphan_op_mutex[n]: 783176 785840 4.95 30358.58 432279688.66 550.09 2899889 6505883 0.16 30254.12 1668330140.08 256.43 For custom, Jan's, &sbi->s_orphan_lock: 5706466 5707069 4.54 44063.38 3312864313.18 580.48 11942088 13175060 0.15 15944.34 142660367.51 10.83 Mine, &sbi->s_orphan_lock: 5518186 5518558 4.84 32040.05 2436898419.22 441.58 12290996 13175234 0.17 23160.65 141234888.88 10.72 &sbi->s_orphan_op_mutex[n]: 1565216 1569333 4.50 32527.02 788215876.94 502.26 5894074 13196979 0.16 71073.57 3128766227.92 237.08 For dbase, Jan's, &sbi->s_orphan_lock: 14453 14489 5.84 39442.57 8678179.21 598.95 119847 153686 0.17 4390.25 1406816.03 9.15 Mine, &sbi->s_orphan_lock: 13847 13868 6.23 31314.03 7982386.22 575.60 120332 153542 0.17 9354.86 1458061.28 9.50 &sbi->s_orphan_op_mutex[n]: 1700 1717 22.00 50566.24 1225749.82 713.89 85062 189435 0.16 31374.44 14476217.56 76.42 In case the line-wrap making it hard to read, I've also attached the results as a text file. The lock_stat seems to show that with my patch the s_orphan_lock performs better across the board. But on a smaller machine, the hashed mutex seems to offset out the performance gain in the s_oprhan_lock and increase the hashed mutex size likely to make it perform better. Jan, if you could send me your orphan stress test, I could run lock_stat for more performance comparison. Ted, please let me know if there is anything else you like me to experiment with. 
If you'd like I could also resubmit my revised patch for you to take a look. Thanks, Mak. [-- Attachment #2: lock_stat.txt --] [-- Type: text/plain, Size: 4912 bytes --] On an 8 core (16 thread) machine, lock_stat version 0.4 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Jan's disk, &sbi->s_orphan_lock: 80189 80246 3.94 464489.22 77219615.47 962.29 503289 809004 0.10 476537.44 3424587.77 4.23 Mine's disk, &sbi->s_orphan_lock: 82215 82259 3.61 640876.19 15098561.09 183.55 541422 794254 0.10 640368.86 4425140.61 5.57 &sbi->s_orphan_op_mutex[n]: 102507 104880 4.21 1335087.21 1392773487.19 13279.69 398328 840120 0.11 1334711.17 397596137.90 473.26 Jan's new_fserver, &sbi->s_orphan_lock: 1063059 1063369 5.57 1073325.95 59535205188.94 55987.34 4525570 8446052 0.10 75625.72 10700844.58 1.27 Mine's new_fserver, &sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95 &sbi->s_orphan_op_mutex[n]: 2176760 2202674 3.44 633129.10 55206091750.06 25063.21 3259467 8452918 0.10 349687.82 605683982.34 71.65 On an 80 core (160 thread) machine, ock_stat version 0.4 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Jan's alltests, &sbi->s_orphan_lock: 2762871 2763355 4.46 49043.39 1763499587.40 638.17 5878253 6475844 0.15 20508.98 70827300.79 10.94 Mine's alltests, &sbi->s_orphan_lock: 2690362 2690612 5.08 24977.33 1286260951.01 478.06 6031175 6475694 0.17 12247.20 70182042.72 10.84 &sbi->s_orphan_op_mutex[n]: 783176 785840 4.95 30358.58 432279688.66 550.09 2899889 6505883 0.16 30254.12 1668330140.08 256.43 Jan's custom, &sbi->s_orphan_lock: 5706466 5707069 4.54 44063.38 3312864313.18 580.48 11942088 13175060 0.15 15944.34 142660367.51 10.83 Mine's custom, &sbi->s_orphan_lock: 5518186 5518558 4.84 32040.05 2436898419.22 441.58 12290996 13175234 0.17 23160.65 141234888.88 10.72 &sbi->s_orphan_op_mutex[n]: 1565216 1569333 4.50 32527.02 788215876.94 502.26 5894074 13196979 0.16 71073.57 3128766227.92 237.08 Jan's dbase, &sbi->s_orphan_lock: 14453 14489 5.84 39442.57 8678179.21 598.95 119847 153686 0.17 4390.25 1406816.03 9.15 Mine's dbase, &sbi->s_orphan_lock: 13847 13868 6.23 31314.03 7982386.22 575.60 120332 153542 0.17 9354.86 1458061.28 9.50 &sbi->s_orphan_op_mutex[n]: 1700 1717 22.00 50566.24 1225749.82 713.89 85062 189435 0.16 31374.44 14476217.56 76.42 ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-06-02 17:45 ` Thavatchai Makphaibulchoke @ 2014-06-03 8:52 ` Jan Kara 2014-06-16 19:20 ` Thavatchai Makphaibulchoke 0 siblings, 1 reply; 25+ messages in thread From: Jan Kara @ 2014-06-03 8:52 UTC (permalink / raw) To: Thavatchai Makphaibulchoke; +Cc: Theodore Ts'o, Jan Kara, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 7409 bytes --] On Mon 02-06-14 11:45:32, Thavatchai Makphaibulchoke wrote: > On 05/20/2014 07:57 AM, Theodore Ts'o wrote: > > On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote: > > > > Thavatchai, it would be really great if you could do lock_stat runs > > with both Jan's latest patches as well as yours. We need to > > understand where the differences are coming from. > > > > As I understand things, there are two differences between Jan and your > > approaches. The first is that Jan is using the implicit locking of > > i_mutex to avoid needing to keep a hashed array of mutexes to > > synchronize an individual inode's being added or removed to the orphan > > list. > > > > The second is that you've split the orphan mutex into an on-disk mutex > > and a in-memory spinlock. > > > > Is it possible to split up your patch so we can measure the benefits > > of each of these two changes? More interestingly, is there a way we > > can use the your second change in concert with Jan's changes? > > > > Regards, > > > > - Ted > > > > Thanks to Jan, as she pointed out one optimization in orphan_addr() that > I've missed. > > After integrated that into my patch, I've rerun the following aim7 > workloads; alltests, custom, dbase, disk, fserver, new_fserver, shared > and short. Here are the results. > > On an 8 core (16 thread) machine, both my revised patch (with additional > optimization from Jan's oprhan_add()) and version 3 of Jan's patch give > about the same results, for most of the workloads, except fserver and > new_fserver, which Jan's outperforms about 9% and 16%, respectively. 
> > Here are the lock_stat output for disk, > Jan's patch, > lock_stat version 0.4 > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > &sbi->s_orphan_lock: 80189 80246 3.94 464489.22 77219615.47 962.29 503289 809004 0.10 476537.44 3424587.77 4.23 > Mine, > &sbi->s_orphan_lock: 82215 82259 3.61 640876.19 15098561.09 183.55 541422 794254 0.10 640368.86 4425140.61 5.57 > &sbi->s_orphan_op_mutex[n]: 102507 104880 4.21 1335087.21 1392773487.19 13279.69 398328 840120 0.11 1334711.17 397596137.90 473.26 > > For new_fserver, > Jan's patch, > &sbi->s_orphan_lock: 1063059 1063369 5.57 1073325.95 59535205188.94 55987.34 4525570 8446052 0.10 75625.72 10700844.58 1.27 > Mine, > &sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95 > &sbi->s_orphan_op_mutex[n]: 2176760 2202674 3.44 633129.10 55206091750.06 25063.21 3259467 8452918 0.10 349687.82 605683982.34 71.65 > > > On an 80 core (160 thread) machine, mine outpeforms Jan's in alltests, > custom, fserver, new_fserver and shared about the same margin it did over > the baseline, around 20% For all these workloads, Jan's patch does not > seem to show any noticeable improvement over baseline kernel. I'm > getting about the same performance with the rest of the workloads. > > Here are the lock_stat output for alltests, > Jan;'s, > &sbi->s_orphan_lock: 2762871 2763355 4.46 49043.39 1763499587.40 638.17 5878253 6475844 0.15 20508.98 70827300.79 10.94 > Mine, > &sbi->s_orphan_lock: 1171433 1172220 3.02 349678.21 553168029.92 471.90 5517262 8446052 0.09 254108.75 16504015.29 1.95 > &sbi->s_orphan_op_mutex[n]: 783176 785840 4.95 30358.58 432279688.66 550.09 2899889 6505883 0.16 30254.12 1668330140.08 256.43 > > For custom, > Jan's, > &sbi->s_orphan_lock: 5706466 5707069 4.54 44063.38 3312864313.18 580.48 11942088 13175060 0.15 15944.34 142660367.51 10.83 > Mine, > &sbi->s_orphan_lock: 5518186 5518558 4.84 32040.05 2436898419.22 441.58 12290996 13175234 0.17 23160.65 141234888.88 10.72 > &sbi->s_orphan_op_mutex[n]: 1565216 1569333 4.50 32527.02 788215876.94 502.26 5894074 13196979 0.16 71073.57 3128766227.92 237.08 > > For dbase, > Jan's, > &sbi->s_orphan_lock: 14453 14489 5.84 39442.57 8678179.21 598.95 119847 153686 0.17 4390.25 1406816.03 9.15 > Mine, > &sbi->s_orphan_lock: 13847 13868 6.23 31314.03 7982386.22 575.60 120332 153542 0.17 9354.86 1458061.28 9.50 > &sbi->s_orphan_op_mutex[n]: 1700 1717 22.00 50566.24 1225749.82 713.89 85062 189435 0.16 31374.44 14476217.56 76.42 > > In case the line-wrap making it hard to read, I've also attached the > results as a text file. > > The lock_stat seems to show that with my patch the s_orphan_lock performs > better across the board. But on a smaller machine, the hashed mutex > seems to offset out the performance gain in the s_oprhan_lock and > increase the hashed mutex size likely to make it perform better. 
  I'd interpret the data a bit differently :) With your patch the
contention for the resource - access to the orphan list - is split
between s_orphan_lock and s_orphan_op_mutex. For the smaller machine,
contending directly on s_orphan_lock is a win and we spend less time
waiting in total. For the large machine it seems beneficial to contend
on the hashed mutex first and only after that on the global lock. Likely
that reduces the amount of cacheline bouncing, or maybe the mutex is
more often acquired during the spinning phase, which reduces the
acquisition latency.

> Jan, if you could send me your orphan stress test, I could run lock_stat
> for more performance comparisons.
  Sure, it is attached.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: stress-orphan.c --]
[-- Type: text/x-c, Size: 1296 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>

#define COUNT 100
#define MAX_PROCS 1024

char wbuf[4096];

/* Repeatedly rewrite a 4k file and truncate it back byte by byte; every
 * shrinking truncate exercises the orphan list code. */
void run_test(char *base, int count)
{
	char pbuf[1024];
	int fd, i, j;

	sprintf(pbuf, "%s/file-%d", base, count);
	fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	for (i = 0; i < COUNT; i++) {
		if (pwrite(fd, wbuf, 4096, 0) != 4096) {
			perror("pwrite");
			exit(1);
		}
		for (j = 4095; j >= 1; j--) {
			if (ftruncate(fd, j) < 0) {
				perror("ftruncate");
				exit(1);
			}
		}
	}
}

int main(int argc, char **argv)
{
	int procs, i, j;
	pid_t pids[MAX_PROCS];

	if (argc != 3) {
		fprintf(stderr, "Usage: stress-orphan <processes> <dir>\n");
		return 1;
	}
	procs = strtol(argv[1], NULL, 10);
	if (procs > MAX_PROCS) {
		fprintf(stderr, "Too many processes!\n");
		return 1;
	}
	/* Each child works on its own file in the given directory. */
	for (i = 0; i < procs; i++) {
		pids[i] = fork();
		if (pids[i] < 0) {
			perror("fork");
			for (j = 0; j < i; j++)
				kill(pids[j], SIGKILL);
			return 1;
		}
		if (pids[i] == 0) {
			run_test(argv[2], i);
			exit(0);
		}
	}
	printf("Processes started.\n");
	for (i = 0; i < procs; i++)
		waitpid(pids[i], NULL, 0);
	return 0;
}

^ permalink raw reply	[flat|nested] 25+ messages in thread
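The hashed-mutex scheme being discussed in the message above can be pictured
with a small userspace analogue: each task first takes a mutex picked by
hashing its target object, and only then the single shared lock, so at most
one waiter per hash bucket ever queues on the shared lock at a time. This is
an illustration only, not Thavatchai's ext4 patch; NR_HASHED, hashed_lock,
global_lock and the counter they protect are all invented for the sketch
(build with gcc -pthread):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NR_HASHED 16

static pthread_mutex_t hashed_lock[NR_HASHED];
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;			/* stands in for the shared orphan list */

static void init_locks(void)
{
	int i;

	for (i = 0; i < NR_HASHED; i++)
		pthread_mutex_init(&hashed_lock[i], NULL);
}

/* Threads working on different ids contend on different hashed mutexes
 * first; only the winner of each hashed mutex moves on to the single
 * global lock, so fewer CPUs bounce the global lock's cacheline. */
static void locked_update(uint64_t id)
{
	pthread_mutex_t *h = &hashed_lock[id % NR_HASHED];

	pthread_mutex_lock(h);
	pthread_mutex_lock(&global_lock);
	counter++;			/* the actual shared-state update */
	pthread_mutex_unlock(&global_lock);
	pthread_mutex_unlock(h);
}

static void *worker(void *arg)
{
	uint64_t id = (uint64_t)(uintptr_t)arg;
	int i;

	for (i = 0; i < 100000; i++)
		locked_update(id);
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	int i;

	init_locks();
	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, (void *)(uintptr_t)i);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	printf("counter = %ld\n", counter);
	return 0;
}

Whether the extra layer pays off depends on how contended the shared lock is,
which is exactly what the measurements exchanged in this thread are probing.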
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-06-03 8:52 ` Jan Kara @ 2014-06-16 19:20 ` Thavatchai Makphaibulchoke 2014-06-17 9:29 ` Jan Kara 0 siblings, 1 reply; 25+ messages in thread
From: Thavatchai Makphaibulchoke @ 2014-06-16 19:20 UTC (permalink / raw)
To: Jan Kara; +Cc: Theodore Ts'o, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 5351 bytes --]

On 06/03/2014 02:52 AM, Jan Kara wrote:
> I'd interpret the data a bit differently :) With your patch the
> contention for the resource - access to the orphan list - is split
> between s_orphan_lock and s_orphan_op_mutex. For the smaller machine,
> contending directly on s_orphan_lock is a win and we spend less time
> waiting in total. For the large machine it seems beneficial to contend
> on the hashed mutex first and only after that on the global lock.
> Likely that reduces the amount of cacheline bouncing, or maybe the
> mutex is more often acquired during the spinning phase, which reduces
> the acquisition latency.
>
> Sure, it is attached.
>
> 								Honza
>

Thanks Jan for the test program.

Anyway, I modified the test a little so that we can also run multiple
incarnations of the test simultaneously, that is, generate the orphan
stress operations on multiple files. I have also attached the modified
test, just in case.

These are the results that I got. All values are real time in seconds,
computed over ten runs with journaling disabled. "w/o" stands for
without hashed mutexes, "with" for with hashed mutexes, "Proc" for the
number of processes, and "Files" for the number of files.

With only 1 file,

On an 8 core (16 thread) platform,

Proc |   1   |   20   |   40   |   80   |   160   |   400   |   800
-----------------------------------------------------------------------
     | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
-----------------------------------------------------------------------
w/o  |.7921|.0467|7.1342|.0316|12.4026|.3552|19.3930|.6917|22.7519|.7017|35.9374|1.658|66.7374|.4716
-----------------------------------------------------------------------
with |.7819|.0362|6.3302|.2119|12.0933|.6589|18.7514|.9698|24.1351|1.659|38.6480|.6809|67.4810|.2749

On an 80 core (160 thread) platform,

Proc |   40   |   80   |   100   |   400   |   800   |   1600
-----------------------------------------------------------------------
     | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
-----------------------------------------------------------------------
w/o  |44.8532|3.4991|67.8351|1.7534| 73.0616|2.4733|252.5798|1.1949|485.3289|5.7320|952.8874|2.0911
-----------------------------------------------------------------------
with |46.1134|3.3228|99.1550|1.4894|109.0272|1.3617|259.6128|2.5247|284.4386|4.6767|266.8664|7.7726

With one file, we would expect the version without hashed mutexes to
perform better than the one with them. The results do show that on the
80 core machine from 80 up to 400 processes. Surprisingly, there is no
difference across all process ranges tested on 8 core. Also on 80 core,
with hashed mutexes the time seems to steady out at around the high two
hundreds with 400 or more processes, and it significantly outperforms
the version without hashed mutexes at 800 or more processes.
With multiple files and only 1 process per file,

On an 8 core (16 thread) platform,

Files |   40   |   80   |   150   |   400   |   800
-------------------------------------------------------------------------
      | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
-------------------------------------------------------------------------
w/o   |3.3578|.0363|6.4533|.1204|12.1925|.2528|31.5862|.6016|63.9913|.3528
-------------------------------------------------------------------------
with  |3.2583|.0384|6.3228|.1075|11.8328|.2290|30.5394|.3220|62.7672|.3802

On an 80 core (160 thread) platform,

Files |   40   |   80   |   100   |   200   |   400   |   800   |   1200   |   1600
-------------------------------------------------------------------------------------
      | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD
-------------------------------------------------------------------------------------
w/o   |43.6507|2.9979|57.0404|1.8684|068.5557|1.2902|144.745|1.7939|52.7491|1.3585|487.8996|1.3997|715.1978|1.1224|942.5605|2.9629
-------------------------------------------------------------------------------------
with  |52.8003|2.1949|69.2455|1.2902|106.5026|1.8813|130.2995|7.8020|150.3648|3.4153|184.7233|6.0525|270.1533|3.2261|298.5705|3.1318

Again, there is not much difference on 8 core. On 80 core, the version
without hashed mutexes performs better than the one with them for file
counts from 40 up to somewhere below 200, while the version with hashed
mutexes significantly outperforms the one without at 400 or more files.

Overall there seems to be no performance difference on 8 core. On 80
core, the version with hashed mutexes, while it performs worse than the
one without in the lower ranges of both processes and files, seems to
scale better as both the process and file counts grow.

Again, the change with hashed mutexes does include the additional
optimization in orphan_add() introduced by your patch. Please let me
know if you need a copy of the modified patch with hashed mutexes for
verification.

Thanks, Mak.
[-- Attachment #2: stress_orphan.c --]
[-- Type: text/x-csrc, Size: 3166 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/shm.h>
#include <sys/times.h>
#include <stdint.h>

#define COUNT 100
#define MAX_PROCS 2048
#define MAX_ITERATIONS 2048
#define SHM_FAILED ((void *) -1)
#define NBPG 4096

char wbuf[4096];
int *g_mstart;

void run_test(char *base, int count)
{
	char pbuf[1024];
	int fd, i, j;

	sprintf(pbuf, "%s/file-%d", base, count);
	fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	for (i = 0; i < COUNT; i++) {
		if (pwrite(fd, wbuf, 4096, 0) != 4096) {
			perror("pwrite");
			exit(1);
		}
		for (j = 4095; j >= 1; j--) {
			if (ftruncate(fd, j) < 0) {
				perror("ftruncate");
				exit(1);
			}
		}
	}
}

int run_main(int procs, char *basedir)
{
	int i, j;
	pid_t pids[MAX_PROCS];
	clock_t start_clock;
	struct tms start_times;
	clock_t end_clock;
	struct tms end_times;

	while (!*g_mstart);

	start_clock = times(&start_times);
	for (i = 0; i < procs; i++) {
		pids[i] = fork();
		if (pids[i] < 0) {
			perror("fork");
			for (j = 0; j < i; j++)
				kill(pids[j], SIGKILL);
			return 1;
		}
		if (pids[i] == 0) {
			run_test(basedir, i);
			exit(0);
		}
	}
#ifdef PRINT_TIMES
	printf("Processes started for %s.\n", basedir);
#endif /* PRINT_TIMES */
	for (i = 0; i < procs; i++)
		waitpid(pids[i], NULL, 0);
	end_clock = times(&end_times);
#ifdef PRINT_TIMES
	printf("Real Time: %jd User Time %jd System Time %jd.\n",
	       (intmax_t)(end_clock - start_clock),
	       (intmax_t)(end_times.tms_utime - start_times.tms_utime),
	       (intmax_t)(end_times.tms_stime - start_times.tms_stime));
	fflush(stdout);
#endif /* PRINT_TIMES */
	return 0;
}

int main(int argc, char **argv)
{
	int procs, i, j;
	pid_t mpids[MAX_ITERATIONS];
	int niterations = 10;
	char basedir[64];
	int n, shmid;

	if (argc < 3) {
		fprintf(stderr, "Usage: stress-orphan <processes> <dir>"
			" [<iterations>]\n");
		return 1;
	}
	if (argc > 3) {
		niterations = atoi(argv[3]);
		if (niterations > MAX_ITERATIONS) {
			fprintf(stderr, "Warning: iterations %d > "
				" MAX %d.\n", niterations, MAX_ITERATIONS);
			niterations = MAX_ITERATIONS;
		}
	}
	if ((shmid = shmget(IPC_PRIVATE, NBPG, 0600)) < 0) {
		fprintf(stderr, "Error in shmget, errno %d.\n", errno);
		return 1;
	}
	if ((g_mstart = (int *)shmat(shmid, 0, 0)) == SHM_FAILED) {
		fprintf(stderr, "Error in shmat, errno %d.\n", errno);
		return 1;
	}
	*g_mstart = 0;
	procs = strtol(argv[1], NULL, 10);
	if (procs > MAX_PROCS) {
		fprintf(stderr, "Too many processes!\n");
		return 1;
	}
	strcpy(basedir, argv[2]);
	n = strlen(basedir);
	for (i = 0; i < niterations; i++) {
		sprintf(&basedir[n], "%d", i + 1);
		mpids[i] = fork();
		if (mpids[i] < 0) {
			perror("fork");
			for (j = 0; j < i; j++)
				kill(mpids[j], SIGKILL);
			return 1;
		}
		if (mpids[i] == 0) {
			run_main(procs, basedir);
			exit(0);
		}
	}
	printf("Main started for %d %s %d.\n", procs, basedir, niterations);
	*g_mstart = 1;
	for (i = 0; i < niterations; i++)
		waitpid(mpids[i], NULL, 0);
	printf("Main Processes Done.\n");
}

^ permalink raw reply	[flat|nested] 25+ messages in thread
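A brief usage note on the modified test above, for anyone rerunning it: it
takes the number of worker processes, a directory name prefix, and an
optional iteration count (default 10). Running it as, say,
./stress_orphan 40 /mnt/test 10 (the path is only an example) forks ten
top-level processes; each appends its iteration number to the prefix and
works under /mnt/test1 ... /mnt/test10 (those directories have to exist
already, since the program only creates files, not directories), forking 40
workers of its own, and all 400 workers are released at the same moment via
the shared-memory start flag.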
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-06-16 19:20 ` Thavatchai Makphaibulchoke @ 2014-06-17 9:29 ` Jan Kara 2014-06-18 4:38 ` Thavatchai Makphaibulchoke 0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2014-06-17 9:29 UTC (permalink / raw)
To: Thavatchai Makphaibulchoke; +Cc: Jan Kara, Theodore Ts'o, linux-ext4

On Mon 16-06-14 13:20:32, Thavatchai Makphaibulchoke wrote:
> On 06/03/2014 02:52 AM, Jan Kara wrote:
> > I'd interpret the data a bit differently :) With your patch the
> > contention for the resource - access to the orphan list - is split
> > between s_orphan_lock and s_orphan_op_mutex. For the smaller machine,
> > contending directly on s_orphan_lock is a win and we spend less time
> > waiting in total. For the large machine it seems beneficial to contend
> > on the hashed mutex first and only after that on the global lock.
> > Likely that reduces the amount of cacheline bouncing, or maybe the
> > mutex is more often acquired during the spinning phase, which reduces
> > the acquisition latency.
> >
> > Sure, it is attached.
> >
> > 								Honza
>
> Thanks Jan for the test program.
>
> Anyway, I modified the test a little so that we can also run multiple
> incarnations of the test simultaneously, that is, generate the orphan
> stress operations on multiple files. I have also attached the modified
> test, just in case.
  Hum, looking at your test program source I'm not sure what you mean.
Your program first forks 'niterations' times and each process starts
using a different directory. Then each of these processes forks 'procs'
times and each of those processes will use a different file in the
directory belonging to the parent. So what's the difference from just
running 'niterations' * 'procs' processes? After some thought I guess
the difference is in how the time to run on each individual file
contributes to the total average -
  (\sum_{i=1}^{procs} t_i)/procs
in the first case you ran, where t_i is the time to run the test for
file i, and
  (\max_{i=1}^{procs} t_i)
in the second case. But what's the point?

> These are the results that I got. All values are real time in seconds,
> computed over ten runs with journaling disabled. "w/o" stands for
> without hashed mutexes, "with" for with hashed mutexes, "Proc" for the
> number of processes, and "Files" for the number of files.
  What exactly do you mean by 'journaling disabled'? Did you run ext4 in
nojournal mode? That wouldn't really make sense because in nojournal
mode all orphan list operations are skipped... So what did you really
test?
> With only 1 file, > > On an 8 core (16 thread) platform, > > > Proc | 1 | 20 | 40 | 80 | 160 | 400 | 800 > ----------------------------------------------------------------------------------------------------- > | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD > ----------------------------------------------------------------------------------------------------- > w/o |.7921|.0467|7.1342|.0316|12.4026|.3552|19.3930|.6917|22.7519|.7017|35.9374|1.658|66.7374|.4716 > ----------------------------------------------------------------------------------------------------- > with |.7819|.0362|6.3302|.2119|12.0933|.6589|18.7514|.9698|24.1351|1.659|38.6480|.6809|67.4810|.2749 > > On a 80 core (160 thread) platform, > > Proc | 40 | 80 | 100 | 400 | 800 | 1600 > ----------------------------------------------------------------------------------------------------- > | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD > ----------------------------------------------------------------------------------------------------- > w/o |44.8532|3.4991|67.8351|1.7534| 73.0616|2.4733|252.5798|1.1949|485.3289|5.7320|952.8874|2.0911 > ----------------------------------------------------------------------------------------------------- > with |46.1134|3.3228|99.1550|1.4894|109.0272|1.3617|259.6128|2.5247|284.4386|4.6767|266.8664|7.7726 > > With one file, we would expect without hashed mutexes would perform > better than with. The results do show so on 80 core machine with 80 up > to 400 processes. Surprisingly there is no differences across all > process ranges tested on 8 core. Also on 80 core, with hashed mutexes > the time seems to be steadying out at around high two hundred something > with 400 or more processes and outperform without significantly with 800 > or more processes. > > With multiple files and only 1 process per file, > > On an 8 core (16 thread) platform, > > > Proc | 40 | 80 | 150 | 400 | 800 > ------------------------------------------------------------------------- > | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD > -------------------------------------------------------------------------- > w/o |3.3578|.0363|6.4533|.1204|12.1925|.2528|31.5862|.6016|63.9913|.3528 > ------------------------------------------------------------------------- > with |3.2583|.0384|6.3228|.1075|11.8328|.2290|30.5394|.3220|62.7672|.3802 > > On a 80 core (160 thread) platform, > > Proc| 40 | 80 | 100 | 200 | 400 | 800 | 1200 | 1600 > ------------------------------------------------------------------------------------------------------------------ > | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | Avg | SD | > ------------------------------------------------------------------------------------------------------------------- > w/o_|43.6507|2.9979|57.0404|1.8684|068.5557|1.2902|144.745|1.7939|52.7491|1.3585|487.8996|1.3997|715.1978|1.1224|942.5605|2.9629 ^^^^ this number is strange > ------------------------------------------------------------------------------------------------------------------- > with|52.8003|2.1949|69.2455|1.2902|106.5026|1.8813|130.2995|7.8020150.3648|3.4153|184.7233|6.0525|270.1533|3.2261|298.5705|3.1318 > > Again, there is not much difference on 8 core. On 80 core, without > hashed mutexes performs better than with hashed mutexes with the number > of files between 40 to less than 200. With hashed mutexes outperorm > without significantly with 400 or more files. > > Overall there seems to be no performance difference on 8 core. 
On 80 > core with hashed mutexes, while performs worse than with in the lower > ranges of both processes and files, seems to be scaling better with both > higher processes and files. Your numbers are interesting and seem to confirm that with really high contention it is advantageous to contend on smaller locks first (your hashed mutexes) and only after that on the global lock. But I'd like to hear answers to my previous questions before drawing any conclusions... > Again, the change with hashed mutexes does include the additional > optimization in orphan_add() introduced by your patch. Please let me > know if you need a copy of the modified patch with hashed mutexes for > verification. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-06-17 9:29 ` Jan Kara @ 2014-06-18 4:38 ` Thavatchai Makphaibulchoke 2014-06-18 10:37 ` Jan Kara 0 siblings, 1 reply; 25+ messages in thread
From: Thavatchai Makphaibulchoke @ 2014-06-18 4:38 UTC (permalink / raw)
To: Jan Kara; +Cc: Theodore Ts'o, linux-ext4

On 06/17/2014 03:29 AM, Jan Kara wrote:
> Hum, looking at your test program source I'm not sure what you mean.
> Your program first forks 'niterations' times and each process starts
> using a different directory. Then each of these processes forks 'procs'
> times and each of those processes will use a different file in the
> directory belonging to the parent. So what's the difference from just
> running 'niterations' * 'procs' processes? After some thought I guess
> the difference is in how the time to run on each individual file
> contributes to the total average -
>   (\sum_{i=1}^{procs} t_i)/procs
> in the first case you ran, where t_i is the time to run the test for
> file i, and
>   (\max_{i=1}^{procs} t_i)
> in the second case. But what's the point?
>

The original test program generates orphan traffic on only a single
file, which not only does not seem to be a fair comparison for the WM
(with hashed mutexes) algorithm, but also does not seem to represent a
realistic load. The modified test's intention is to compare performance
between the WM and WO (without hashed mutexes) under heavy orphan
traffic on more than a single file.

Instead of running multiple copies of the original test in the
background (for a short job we may not actually get overlapping traffic
on different inodes), the modified test runs multiple copies of the
original test and starts their execution as close to each other as
possible.

The result of my test is the average, over ten runs, of the maximum
time the original test takes to complete for a given number of
processes and files. Of course, with one copy (niterations of 1), we
get the equivalent of the original test.

> What exactly do you mean by 'journaling disabled'? Did you run ext4 in
> nojournal mode? That wouldn't really make sense because in nojournal
> mode all orphan list operations are skipped... So what did you really
> test?
>

Yes, sorry, my fault: disabling journaling should also inhibit the
orphan activities. Therefore there should not be any difference in
performance between the two.

I just discovered that the two different branches I used do not have
the same baseline. Let me recreate the two branches and redo my test.
I'll get back to you with the results. Sorry for the confusion.

Thanks, Mak.

> Your numbers are interesting and seem to confirm that with really high
> contention it is advantageous to contend on smaller locks first (your
> hashed mutexes) and only after that on the global lock. But I'd like to
> hear answers to my previous questions before drawing any conclusions...
>
> 								Honza
>

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-06-18 4:38 ` Thavatchai Makphaibulchoke @ 2014-06-18 10:37 ` Jan Kara 2014-07-22 4:35 ` Thavatchai Makphaibulchoke 0 siblings, 1 reply; 25+ messages in thread From: Jan Kara @ 2014-06-18 10:37 UTC (permalink / raw) To: Thavatchai Makphaibulchoke; +Cc: Jan Kara, Theodore Ts'o, linux-ext4 On Tue 17-06-14 22:38:32, Thavatchai Makphaibulchoke wrote: > On 06/17/2014 03:29 AM, Jan Kara wrote: > > Hum, looking at your test program source I'm not sure what do you mean. > > Your program first forks 'niteration' times and each process starts using a > > different directory. Then each of these processes forks 'procs' times and > > each of these processes will use a different file in the directory > > belonging to the parent. So what's the difference to just running > > 'niterations' * 'procs' processes? After some thought I guess the > > difference is in how time to run on each individual file contributes to the > > total average - > > (\sum_{i=1}^{procs} t_i)/procs > > in the first case you ran, where t_i is the time to run test for file i, and > > (max^{i=1}^{procs} t_i) > > in the second case. But what's the point? > > > > The original test program generates orphan traffic on only a single file, > which not only does not seem to be a fair comparison for the WM (with > hashed mutexes) algorithm, does not seem to represent realistic load. > The modified test's intention is to compare performance between the WM > and WO (without hashed mutexes) under heavy orphan traffic on more than a > single file. That's not true. The original test program creates one file per process: void run_test(char *base, int count) { ... sprintf(pbuf, "%s/file-%d", base, count); fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644); ... and forks given number of processes. Ah, maybe I see what you are getting at. My original test program doesn't bother with synchronizing start of the processes - I have manually set number of repetitions (COUNT) so that each process runs long enough on my machine that the differences in start time are negligible. Maybe you thought these processes run sequentially (and maybe they really did if you had fast enough setup). So I agree synchronization of start using shared memory as you did it is a good thing and gets more reliable numbers especially for high numbers of processes (which I didn't really test because of my limited HW) just there's no need to add another level of iteration... > Instead of running multiple copies of the original test in the background > (for a short job we may not actually get overlapping traffic on different > inodes), the modified test runs and start the excution, as close as > possible to each other, of multiple copies of the original test. > > The results of my test is the average, over ten runs, of the maximum time > the original test takes to complete under certain number of processes and > files. Of course, with one copy (niteration of 1), we get the > equivalence of the original test. > > > What do you exactly mean by 'journaling disabled'? Did you run ext4 in > > nojournal mode? That wouldn't really make sense because in nojournal mode > > all orphan list operations are skipped... So what did you really test? > > > > Yes, sorry my fault disabling journaling, should also inhibit the orphan > activities. Therefore there should not be any difference in performance > between the two. Exactly, it's strange you saw a difference. 
> I just discovered that the two different branches I used do not have the > same baseline. Let me recreate the two branches and redo my test. I'll > get back with you with the results. OK, looking forward to the results. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-06-18 10:37 ` Jan Kara @ 2014-07-22 4:35 ` Thavatchai Makphaibulchoke 2014-07-23 8:15 ` Jan Kara 0 siblings, 1 reply; 25+ messages in thread
From: Thavatchai Makphaibulchoke @ 2014-07-22 4:35 UTC (permalink / raw)
To: Jan Kara; +Cc: Theodore Ts'o, linux-ext4

On 06/18/2014 04:37 AM, Jan Kara wrote:
> That's not true. The original test program creates one file per process:
> void run_test(char *base, int count)
> {
> ...
>         sprintf(pbuf, "%s/file-%d", base, count);
>         fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
> ...
> and forks given number of processes. Ah, maybe I see what you are getting
> at. My original test program doesn't bother with synchronizing start of the
> processes - I have manually set number of repetitions (COUNT) so that each
> process runs long enough on my machine that the differences in start time
> are negligible. Maybe you thought these processes run sequentially (and
> maybe they really did if you had fast enough setup). So I agree
> synchronization of start using shared memory as you did it is a good thing
> and gets more reliable numbers especially for high numbers of processes
> (which I didn't really test because of my limited HW) just there's no need
> to add another level of iteration...
>

First, sorry for taking a while to get back.

Your original main is now my new run_main() function. The old main and
the new run_main() function practically do the same thing: immediately
fork the number of child processes indicated by argv[1], each running
the original run_test() function. By converting it to a run_main()
function, I'm trying to start each incarnation of the old main as close
to the others as possible, making sure I have some overlapping orphan
activities on different files. Actually I should have done the
synchronization in the run_test() function instead; that way both tests
would be closer to each other.

> Exactly, it's strange you saw a difference.
>
>> I just discovered that the two different branches I used do not have the
>> same baseline. Let me recreate the two branches and redo my test. I'll
>> get back with you with the results.
> OK, looking forward to the results.
>
> 								Honza
>

Here are the results with the 3.15-rc5 baseline kernel, using your
original test. Each result is an average over ten runs, except the
longer runs, marked with an '*', which are an average of only 5 runs.

On a 2 socket 16 thread machine,

----------------------------------------------
Num files   |   40  |  100  |  400  |  1000  |
----------------------------------------------
 W/O Mutexes|118.593|178.118|595.017|1480.165|
----------------------------------------------
 W Mutexes  |129.212|194.728|620.412|1430.908|
----------------------------------------------

On an 8 socket 160 thread machine,

--------------------------------------------------------
Num files   |   40  |  100  |   400  |  1000* |  1500* |
--------------------------------------------------------
 W/O Mutexes|387.257|644.270|1692.018|4254.766|7362.064|
--------------------------------------------------------
 W Mutexes  |328.490|557.508|1967.676|4721.060|7013.618|
--------------------------------------------------------

From the above data, it looks like without mutexes (WOM) performs better
across the board on the smaller machine, except at 1000 files. For the
larger machine, I give the edge to with mutexes (WM) for smaller numbers
of files, up until somewhere around 100-something to 400 files. It looks
like WM starts performing better again at 1500 files.
I've noticed that your test seems to generate a high volume of both add
and delete orphan operations for inodes that are either already on or
already off the orphan chain, favoring the WM. I've also done some data
collection to verify that: it turns out about half of both the delete
and add orphan operations do indeed fall into that category.

I did a further experiment by changing the decrement in the ftruncate()
loop of the run_test() function from 1 to a variable, ndec, configurable
from the command line, as shown here:

	for (i = 0; i < COUNT; i++) {
		if (pwrite(fd, wbuf, 4096, 0) != 4096) {
			perror("pwrite");
			exit(1);
		}
		for (j = 4096 - ndec; j >= 1; j -= ndec) {
			if (ftruncate(fd, j) < 0) {
				perror("ftruncate");
				exit(1);
			}
		}
	}

Here are more results with different decrement values.

On a 2 socket 16 thread machine,

with decrement of 32,
--------------------------------------------------------------
Num files   |  40 | 100 |  400 | 1000 | 1500 |  2000 |  2500 |
--------------------------------------------------------------
 W/O Mutexes|4.077|5.928|19.005|46.056|95.767|514.050|621.227|
--------------------------------------------------------------
 W Mutexes  |4.441|6.354|19.837|44.771|63.068| 85.997|284.084|
--------------------------------------------------------------

with decrement of 256,
-----------------------------------------------------------------
Num files   |  40 | 100 | 400 | 1000| 1500 | 2000 | 2500 | 3000 |
-----------------------------------------------------------------
 W/O Mutexes|0.482|0.708|2.486|6.072|11.111|54.994|69.695|81.657|
-----------------------------------------------------------------
 W Mutexes  |0.540|0.823|2.449|5.650| 8.146|11.179|26.453|34.705|
-----------------------------------------------------------------

with decrement of 4095,
-------------------------------------------------------------------------
Num files   |  40 | 100 | 400 | 1000| 1500| 2000| 2500| 3000| 4000| 5000|
-------------------------------------------------------------------------
 W/O Mutexes|0.063|0.010|0.235|0.479|0.665|1.701|2.620|3.812|4.986|7.047|
-------------------------------------------------------------------------
 W Mutexes  |0.064|0.011|0.217|0.462|0.682|0.886|1.374|2.091|3.602|4.650|
-------------------------------------------------------------------------

On an 8 socket 160 thread machine,

with decrement of 32,
--------------------------------------------------------------------
Num files   |  40  | 100  |  400 | 1000  | 1500  |  2000  |  2500* |
--------------------------------------------------------------------
 W/O Mutexes|15.461|21.501|54.009|131.591|219.795|2375.105|6401.232|
--------------------------------------------------------------------
 W Mutexes  |13.376|18.557|61.573|146.369|217.394| 375.293|2626.881|
--------------------------------------------------------------------

with decrement of 256,
-------------------------------------------------------------
Num files   |  40 | 100 | 400 | 1000 | 1500 | 2000  | 2500  |
-------------------------------------------------------------
 W/O Mutexes|1.903|2.709|7.212|16.573|26.634|260.666|745.265|
-------------------------------------------------------------
 W Mutexes  |1.643|2.319|7.429|17.641|26.059| 45.333|302.096|
-------------------------------------------------------------

with decrement of 4095,
---------------------------------------------------------
Num files   |  40 | 100 | 400 | 1000| 1500| 2000 | 2500 |
---------------------------------------------------------
 W/O Mutexes|0.160|0.231|0.746|1.640|2.307|13.586|49.986|
---------------------------------------------------------
 W Mutexes  |0.134|0.192|0.579|1.306|1.931| 3.332|15.481|
---------------------------------------------------------

Looks like the results are similar to those with a decrement of 1, with
the exception that on the smaller machine, with higher decrement values,
the WM and WOM perform closer to each other for small to medium numbers
of files. The WM does show better scaling for higher numbers of files on
both the small and the large core count machines.

Here are also aim7 results on an 8 socket 160 thread machine, at 2000
users,

            | % change in average jobs per minute, WM compared to WOM
----------------------------------------------------------------------
all_tests   | +42.02%
----------------------------------------------------------------------
custom      | +56.22%
----------------------------------------------------------------------
dbase       | 0%
----------------------------------------------------------------------
disk        | 0%
----------------------------------------------------------------------
fserver     | +51.23%
----------------------------------------------------------------------
new_dbase   | 0%
----------------------------------------------------------------------
new_fserver | +53.83%
----------------------------------------------------------------------
shared      | +44.37%
----------------------------------------------------------------------

Please let me know if you need any additional info or have any
questions, and also if you would like a copy of the patch for testing.

Thanks, Mak.

^ permalink raw reply	[flat|nested] 25+ messages in thread
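As a rough guide to how the decrement parameter changes the load in the runs
above: reading the modified loop, each pwrite() is followed by
floor((4095 - ndec)/ndec) + 1 shrinking ftruncate() calls, i.e. roughly
4095/ndec of them - about 4095 per iteration for ndec = 1, 127 for ndec = 32,
15 for ndec = 256, and a single call for ndec = 4095 - so, to a first
approximation, the per-file orphan add/del traffic drops by the same factor
by which the decrement grows.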
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-07-22 4:35 ` Thavatchai Makphaibulchoke @ 2014-07-23 8:15 ` Jan Kara 0 siblings, 0 replies; 25+ messages in thread From: Jan Kara @ 2014-07-23 8:15 UTC (permalink / raw) To: Thavatchai Makphaibulchoke; +Cc: Jan Kara, Theodore Ts'o, linux-ext4 Hello, On Mon 21-07-14 22:35:24, Thavatchai Makphaibulchoke wrote: > On 06/18/2014 04:37 AM, Jan Kara wrote: > > That's not true. The original test program creates one file per process: > > void run_test(char *base, int count) > > { > > ... > > sprintf(pbuf, "%s/file-%d", base, count); > > fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644); > > ... > > and forks given number of processes. Ah, maybe I see what you are getting > > at. My original test program doesn't bother with synchronizing start of the > > processes - I have manually set number of repetitions (COUNT) so that each > > process runs long enough on my machine that the differences in start time > > are negligible. Maybe you thought these processes run sequentially (and > > maybe they really did if you had fast enough setup). So I agree > > synchronization of start using shared memory as you did it is a good thing > > and gets more reliable numbers especially for high numbers of processes > > (which I didn't really test because of my limited HW) just there's no need > > to add another level of iteration... > > > First sorry for taking a little long to get back. > > Your original main is now my new run_main() function. The old main and > new run_main() function, practically are doing the same thing, fork and > immediately all the child processes as indicated by argv[1], running the > original run_test() function. By converting it to a run_main function, > I'm trying to start each incarnation of the old main as closed to each > other as possible, making sure I have some overlapping orphan activities > on different files. Actually I should have done the synchronization in > the run_test function instead, this way both tests should be closer to > each other. > > Here are the results with 3.15-rc5 baseline kernel, using your original > test. Each result is an average over ten runs, except those longer runs, > marked with an '*', which is an average of only 5 runs. > > On a 2 socket 16 thread machine, > > ---------------------------------- > Num files | 40 | 100 | 400 | 1000 | > ---------------------------------------------- > |W/O Mutexes|118.593|178.118|595.017|1480.165| > ----------------------------------------------- > |W Mutexes |129.212|194.728|620.412|1430.908| > ---------------------------------------------- > > On an 8 socket 160 thread machine, > > -------------------------------------------- > Num files | 40 | 100 | 400 | 1000* | 1500* | > ------------------------------------------------------- > |W/O Mutexes|387.257|644.270|1692.018|4254.766|7362.064| > ------------------------------------------------------- > |W Mutexes |328.490|557.508|1967.676|4721.060|7013.618| > -------------------------------------------------------- > > From the above data, looks like without mutexes (WOM) performs better > across the board on a smaller machine, except at 1000 files. For a > larger machine, I give the edge to with mutexes (WM) for smaller number > of files, until around 400 to 100 something files. Looks like with WOM > starts performing better again at 1500 files. 
> <snip results for larger truncates as those don't seem to show anything new> > Here are also aim7 results on an 8 socket 160 threads machine, at 2000 > users, > > | % of Average job per minutes of WM compared to WOM > ---------------------------------------------------------------- > all_tests | +42.02% > ---------------------------------------------------------------- > custom | +56.22% > ---------------------------------------------------------------- > dbase | 0% > ---------------------------------------------------------------- > disk | 0% > ---------------------------------------------------------------- > fserver | +51.23% > ---------------------------------------------------------------- > new_dbase | 0% > ---------------------------------------------------------------- > new_fserver|+53.83% > ---------------------------------------------------------------- > shared |+44.37% > ---------------------------------------------------------------- > > Please let me know if you have need any additional info or have any > question and also if you would like a copy of the patch for testing. Thanks for all the measurements! I don't have further questions I think. All your tests seem to point to the fact that when contention on orphan list is high, your hashed mutexes improve performance (especially when we are bouncing cache between sockets) while they have some cost for low contention cases. I'm not sure it is a worthwhile tradeoff but it's upto Ted as a maintainer to decide. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 0/2 v3] Improve orphan list scaling @ 2014-05-20 12:45 Jan Kara 2014-05-20 12:45 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara 0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2014-05-20 12:45 UTC (permalink / raw)
To: Ted Tso; +Cc: linux-ext4, Thavatchai Makphaibulchoke, Jan Kara

  Hello,

  here is another version of my patches to improve orphan list scaling
by reducing the amount of work done under the global s_orphan_mutex.
Since the previous version I've fixed some bugs (thanks Thavatchai!),
retested with updated xfstests to verify the problem Ted spotted is
fixed, and rerun the performance tests because the bugs had a
non-trivial impact on the functionality.

To stress orphan list operations I ran my artificial test program. The
test program runs a given number of processes, each process truncating
a 4k file by 1 byte until it reaches a size of 1 byte, and then the file
is extended to 4k again. The average times for 10 runs of the test
program on my 48-way box with ext4 on ramdisk are:

           Vanilla                     Patched
Procs     Avg       Stddev             Avg             Stddev
  1      2.769200   0.056194       2.750000 (-0.6%)    0.054772
  2      5.756500   0.313268       5.669000 (-1.5%)    0.587528
  4     11.852500   0.130221       8.311000 (-29.9%)   0.257544
 10     33.590900   0.394888      20.162000 (-40%)     0.189832
 20     71.035400   0.320914      55.854000 (-21.4%)   0.478815
 40    236.671100   2.856885     174.543000 (-26.3%)   0.974547

In the lockstat reports, s_orphan_mutex has been #1 in both cases;
however, the patches significantly reduced the contention. For 10
threads the numbers look like:

          con-bounces  contentions  waittime-min  waittime-max  waittime-total
Orig      7089335      7089335      9.07          3504220.69    1473754546.28
Patched   2547868      2547868      9.18          8218.64       547550185.12

          waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max
Orig      207.88        14487647     16381236      0.16          211.62
Patched   214.91        7994533      8191236       0.16          203.81

          holdtime-total  holdtime-avg
Orig      79738146.84     4.87
Patched   30660307.81     3.74

We can see the number of acquisitions dropped to a half (we now check
whether the inode already is / is not part of the orphan list before
acquiring s_orphan_mutex). The average hold time is somewhat smaller as
well, and given that the patched kernel doesn't have those 50% of
short-lived acquisitions just for checking whether the inode is part of
the orphan list, we can see that the patched kernel really does
significantly less work with s_orphan_lock held.

Changes since v2:
* Fixed a bug in ext4_orphan_del() leading to orphan list corruption -
  thanks to Thavatchai for pointing that out.
* Fixed a bug in ext4_orphan_del() that could lead to using freed inodes.

Changes since v1:
* Fixed up various bugs in error handling pointed out by Thavatchai and
  some others as well.
* Somewhat reduced critical sections under s_orphan_lock.

								Honza

^ permalink raw reply	[flat|nested] 25+ messages in thread
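To make the "check before taking the global lock" effect behind the halved
acquisition count above a bit more concrete, here is a minimal userspace
sketch. It is an analogy only, not the ext4 code: the node structure,
list_lock, and list_add_once() are invented for the example, and the
unlocked membership test is safe here only under the same kind of external
serialization that the real patch gets from the caller holding the inode's
i_mutex.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct node {
	struct node *next, *prev;
	bool on_list;		/* analogue of !list_empty(&ei->i_orphan) */
};

static struct node head = { &head, &head, false };
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Add 'n' to the shared list unless it is already there.  The membership
 * test runs before list_lock is taken, so calls that would be no-ops never
 * touch the global lock at all. */
static void list_add_once(struct node *n)
{
	if (n->on_list)
		return;

	pthread_mutex_lock(&list_lock);
	n->next = head.next;
	n->prev = &head;
	head.next->prev = n;
	head.next = n;
	n->on_list = true;
	pthread_mutex_unlock(&list_lock);
}

int main(void)
{
	struct node a = { NULL, NULL, false };

	list_add_once(&a);	/* slow path: takes list_lock once */
	list_add_once(&a);	/* fast path: returns without locking */
	printf("a on list: %d\n", a.on_list);
	return 0;
}

Callers whose item is already on the list return without ever touching
list_lock, which is where the roughly 50% drop in lock acquisitions quoted
in the cover letter comes from.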
* [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-20 12:45 [PATCH 0/2 v3] Improve orphan list scaling Jan Kara @ 2014-05-20 12:45 ` Jan Kara 2014-05-20 16:45 ` Thavatchai Makphaibulchoke 0 siblings, 1 reply; 25+ messages in thread From: Jan Kara @ 2014-05-20 12:45 UTC (permalink / raw) To: Ted Tso; +Cc: linux-ext4, Thavatchai Makphaibulchoke, Jan Kara Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that we avoid taking global s_orphan_lock in some cases and hold it for shorter time in other cases. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/namei.c | 109 +++++++++++++++++++++++++++++++++----------------------- 1 file changed, 65 insertions(+), 44 deletions(-) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 5fcaa85b6dc5..d0b1776efd92 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode) return 1; } -/* ext4_orphan_add() links an unlinked or truncated inode into a list of +/* + * ext4_orphan_add() links an unlinked or truncated inode into a list of * such inodes, starting at the superblock, in case we crash before the * file is closed/deleted, or in case the inode truncate spans multiple * transactions and the last transaction is not recovered after a crash. * * At filesystem recovery time, we walk this list deleting unlinked * inodes and truncating linked inodes in ext4_orphan_cleanup(). + * + * Orphan list manipulation functions must be called under i_mutex unless + * we are just creating the inode or deleting it. */ int ext4_orphan_add(handle_t *handle, struct inode *inode) { @@ -2553,13 +2557,19 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_iloc iloc; int err = 0, rc; + bool dirty = false; if (!sbi->s_journal) return 0; - mutex_lock(&sbi->s_orphan_lock); + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && + !mutex_is_locked(&inode->i_mutex)); + /* + * Exit early if inode already is on orphan list. This is a big speedup + * since we don't have to contend on the global s_orphan_lock. + */ if (!list_empty(&EXT4_I(inode)->i_orphan)) - goto out_unlock; + return 0; /* * Orphan handling is only valid for files with data blocks @@ -2573,44 +2583,47 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) BUFFER_TRACE(sbi->s_sbh, "get_write_access"); err = ext4_journal_get_write_access(handle, sbi->s_sbh); if (err) - goto out_unlock; + goto out; err = ext4_reserve_inode_write(handle, inode, &iloc); if (err) - goto out_unlock; + goto out; + + mutex_lock(&sbi->s_orphan_lock); /* * Due to previous errors inode may be already a part of on-disk * orphan list. If so skip on-disk list modification. */ - if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <= - (le32_to_cpu(sbi->s_es->s_inodes_count))) - goto mem_insert; - - /* Insert this inode at the head of the on-disk orphan list... */ - NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); - sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); - err = ext4_handle_dirty_super(handle, sb); - rc = ext4_mark_iloc_dirty(handle, inode, &iloc); - if (!err) - err = rc; - - /* Only add to the head of the in-memory list if all the - * previous operations succeeded. If the orphan_add is going to - * fail (possibly taking the journal offline), we can't risk - * leaving the inode on the orphan list: stray orphan-list - * entries can cause panics at unmount time. - * - * This is safe: on error we're going to ignore the orphan list - * anyway on the next recovery. 
*/ -mem_insert: - if (!err) - list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); + if (!NEXT_ORPHAN(inode) || NEXT_ORPHAN(inode) > + (le32_to_cpu(sbi->s_es->s_inodes_count))) { + /* Insert this inode at the head of the on-disk orphan list */ + NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); + sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); + dirty = true; + } + list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); + mutex_unlock(&sbi->s_orphan_lock); + if (dirty) { + err = ext4_handle_dirty_super(handle, sb); + rc = ext4_mark_iloc_dirty(handle, inode, &iloc); + if (!err) + err = rc; + if (err) { + /* + * We have to remove inode from in-memory list if + * addition to on disk orphan list failed. Stray orphan + * list entries can cause panics at unmount time. + */ + mutex_lock(&sbi->s_orphan_lock); + list_del(&EXT4_I(inode)->i_orphan); + mutex_unlock(&sbi->s_orphan_lock); + } + } jbd_debug(4, "superblock will point to %lu\n", inode->i_ino); jbd_debug(4, "orphan inode %lu will point to %d\n", inode->i_ino, NEXT_ORPHAN(inode)); -out_unlock: - mutex_unlock(&sbi->s_orphan_lock); +out: ext4_std_error(sb, err); return err; } @@ -2631,35 +2644,43 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) return 0; - mutex_lock(&sbi->s_orphan_lock); + WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && + !mutex_is_locked(&inode->i_mutex)); + /* Do this quick check before taking global s_orphan_lock. */ if (list_empty(&ei->i_orphan)) - goto out; + return 0; - ino_next = NEXT_ORPHAN(inode); - prev = ei->i_orphan.prev; + if (handle) { + /* Grab inode buffer early before taking global s_orphan_lock */ + err = ext4_reserve_inode_write(handle, inode, &iloc); + } + mutex_lock(&sbi->s_orphan_lock); jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino); + prev = ei->i_orphan.prev; list_del_init(&ei->i_orphan); /* If we're on an error path, we may not have a valid * transaction handle with which to update the orphan list on * disk, but we still need to remove the inode from the linked * list in memory. */ - if (!handle) - goto out; - - err = ext4_reserve_inode_write(handle, inode, &iloc); - if (err) + if (!handle || err) { + mutex_unlock(&sbi->s_orphan_lock); goto out_err; + } + ino_next = NEXT_ORPHAN(inode); if (prev == &sbi->s_orphan) { jbd_debug(4, "superblock will point to %u\n", ino_next); BUFFER_TRACE(sbi->s_sbh, "get_write_access"); err = ext4_journal_get_write_access(handle, sbi->s_sbh); - if (err) + if (err) { + mutex_unlock(&sbi->s_orphan_lock); goto out_brelse; + } sbi->s_es->s_last_orphan = cpu_to_le32(ino_next); + mutex_unlock(&sbi->s_orphan_lock); err = ext4_handle_dirty_super(handle, inode->i_sb); } else { struct ext4_iloc iloc2; @@ -2669,20 +2690,20 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) jbd_debug(4, "orphan inode %lu will point to %u\n", i_prev->i_ino, ino_next); err = ext4_reserve_inode_write(handle, i_prev, &iloc2); - if (err) + if (err) { + mutex_unlock(&sbi->s_orphan_lock); goto out_brelse; + } NEXT_ORPHAN(i_prev) = ino_next; err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2); + mutex_unlock(&sbi->s_orphan_lock); } if (err) goto out_brelse; NEXT_ORPHAN(inode) = 0; err = ext4_mark_iloc_dirty(handle, inode, &iloc); ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-20 12:45 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara @ 2014-05-20 16:45 ` Thavatchai Makphaibulchoke 2014-05-20 21:03 ` Jan Kara 0 siblings, 1 reply; 25+ messages in thread
From: Thavatchai Makphaibulchoke @ 2014-05-20 16:45 UTC (permalink / raw)
To: Jan Kara, Ted Tso; +Cc: linux-ext4

On 05/20/2014 06:45 AM, Jan Kara wrote:
> +	if (dirty) {
> +		err = ext4_handle_dirty_super(handle, sb);
> +		rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> +		if (!err)
> +			err = rc;
> +		if (err) {
> +			/*
> +			 * We have to remove inode from in-memory list if
> +			 * addition to on disk orphan list failed. Stray orphan
> +			 * list entries can cause panics at unmount time.
> +			 */
> +			mutex_lock(&sbi->s_orphan_lock);
> +			list_del(&EXT4_I(inode)->i_orphan);
> +			mutex_unlock(&sbi->s_orphan_lock);
> +		}
> +	}

Sorry Jan, I just noticed this.

I don't believe you can do this optimization either. Since you drop the
s_orphan_lock in between, you essentially have an interval where there
is a stray in-memory orphan entry, which could cause a panic as the
comment above mentions.

As for the comments regarding the ext4_mark_iloc_dirty() optimization,
in your case, since you are holding the i_mutex, should not that prevent
the inode from being reclaimed?

Thanks, Mak.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-20 16:45 ` Thavatchai Makphaibulchoke @ 2014-05-20 21:03 ` Jan Kara 2014-05-20 23:27 ` Thavatchai Makphaibulchoke 0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2014-05-20 21:03 UTC (permalink / raw)
To: Thavatchai Makphaibulchoke; +Cc: Jan Kara, Ted Tso, linux-ext4

On Tue 20-05-14 10:45:07, Thavatchai Makphaibulchoke wrote:
> On 05/20/2014 06:45 AM, Jan Kara wrote:
> > +	if (dirty) {
> > +		err = ext4_handle_dirty_super(handle, sb);
> > +		rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> > +		if (!err)
> > +			err = rc;
> > +		if (err) {
> > +			/*
> > +			 * We have to remove inode from in-memory list if
> > +			 * addition to on disk orphan list failed. Stray orphan
> > +			 * list entries can cause panics at unmount time.
> > +			 */
> > +			mutex_lock(&sbi->s_orphan_lock);
> > +			list_del(&EXT4_I(inode)->i_orphan);
> > +			mutex_unlock(&sbi->s_orphan_lock);
> > +		}
> > +	}
>
> Sorry Jan, I just noticed this.
>
> I don't believe you can do this optimization either. Since you drop the
> s_orphan_lock in between, you essentially have an interval where there
> is a stray in-memory orphan entry, which could cause a panic as the
> comment above mentions.
  No, I think we are fine in this case. I'm not sure exactly what race
you are thinking of, but unmount certainly cannot happen since we have a
reference to an active inode in the filesystem.

> As for the comments regarding the ext4_mark_iloc_dirty() optimization,
> in your case, since you are holding the i_mutex, should not that prevent
> the inode from being reclaimed?
  Yes, that prevents the inode from being reclaimed, but it doesn't
prevent the previous inode in the list from being reclaimed, and we also
need to update that one...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread
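An aside that may help in following the exchange above: the on-disk orphan
list is a singly linked chain - the superblock's s_last_orphan field names
the first orphan inode, and each orphan inode stores the number of the next
one (NEXT_ORPHAN() in the patches). The toy userspace model below is built
only from what the patch hunks in this thread show; the array-backed
"inodes" and the function names are invented for the illustration.

#include <stdio.h>

#define NINODES 16

/* Toy model: sb_last_orphan plays the role of the superblock's
 * s_last_orphan field, and next_orphan[i] the role of NEXT_ORPHAN()
 * for inode number i.  0 terminates the chain. */
static unsigned int sb_last_orphan;
static unsigned int next_orphan[NINODES];

static void toy_orphan_add(unsigned int ino)
{
	/* New orphans are inserted at the head of the on-disk chain. */
	next_orphan[ino] = sb_last_orphan;
	sb_last_orphan = ino;
}

static void toy_orphan_del(unsigned int ino, unsigned int prev)
{
	/* Whoever points at 'ino' must be rewritten: the superblock if
	 * 'ino' is the head, otherwise the previous orphan inode. */
	if (sb_last_orphan == ino)
		sb_last_orphan = next_orphan[ino];
	else
		next_orphan[prev] = next_orphan[ino];
	next_orphan[ino] = 0;
}

int main(void)
{
	toy_orphan_add(3);
	toy_orphan_add(5);
	toy_orphan_add(7);	/* chain: sb -> 7 -> 5 -> 3 -> 0 */
	toy_orphan_del(5, 7);	/* unlink from the middle: inode 7 changes */
	printf("head %u, next of 7 is %u, next of 3 is %u\n",
	       sb_last_orphan, next_orphan[7], next_orphan[3]);
	return 0;
}

Unlinking inode 5 from the middle of the chain means rewriting inode 7, a
different on-disk inode from the one being removed - which is Jan's point
above that holding the victim's i_mutex alone does not cover the predecessor
that also has to be updated.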
* Re: [PATCH 2/2] ext4: Reduce contention on s_orphan_lock 2014-05-20 21:03 ` Jan Kara @ 2014-05-20 23:27 ` Thavatchai Makphaibulchoke 0 siblings, 0 replies; 25+ messages in thread From: Thavatchai Makphaibulchoke @ 2014-05-20 23:27 UTC (permalink / raw) To: Jan Kara; +Cc: Ted Tso, linux-ext4 On 05/20/2014 03:03 PM, Jan Kara wrote: > No, I think we are fine in this case. I'm not sure what race you are > exactly thinking of but unmount cannot certainly happen since we have a > reference to active inode in the filesystem. > Thanks for the reply Jan. Yes, you are correct. Sorry I forgot about that. Thanks, Mak. ^ permalink raw reply [flat|nested] 25+ messages in thread
Thread overview: 25+ messages
2014-04-29 23:32 [PATCH 0/2] ext4: Reduce contention on s_orphan_lock Jan Kara
2014-04-29 23:32 ` [PATCH 1/2] ext4: Use sbi in ext4_orphan_del() Jan Kara
2014-04-29 23:32 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara
2014-05-02 21:56 ` Thavatchai Makphaibulchoke
2014-05-02 21:56 ` [PATCH 0/2] " Thavatchai Makphaibulchoke
2014-05-06 11:49 ` Jan Kara
2014-05-09  6:24 ` Thavatchai Makphaibulchoke
  -- strict thread matches above, loose matches on Subject: below --
2014-05-15 20:17 [PATCH 0/2 v2] Improve orphan list scaling Jan Kara
2014-05-15 20:17 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara
2014-05-20  3:23 ` Theodore Ts'o
2014-05-20  8:33 ` Thavatchai Makphaibulchoke
2014-05-20  9:18 ` Jan Kara
2014-05-20 13:57 ` Theodore Ts'o
2014-05-20 17:16 ` Thavatchai Makphaibulchoke
2014-06-02 17:45 ` Thavatchai Makphaibulchoke
2014-06-03  8:52 ` Jan Kara
2014-06-16 19:20 ` Thavatchai Makphaibulchoke
2014-06-17  9:29 ` Jan Kara
2014-06-18  4:38 ` Thavatchai Makphaibulchoke
2014-06-18 10:37 ` Jan Kara
2014-07-22  4:35 ` Thavatchai Makphaibulchoke
2014-07-23  8:15 ` Jan Kara
2014-05-20 12:45 [PATCH 0/2 v3] Improve orphan list scaling Jan Kara
2014-05-20 12:45 ` [PATCH 2/2] ext4: Reduce contention on s_orphan_lock Jan Kara
2014-05-20 16:45 ` Thavatchai Makphaibulchoke
2014-05-20 21:03 ` Jan Kara
2014-05-20 23:27 ` Thavatchai Makphaibulchoke