All of lore.kernel.org
 help / color / mirror / Atom feed
* [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
@ 2015-12-29 18:20 rgoldwyn at suse.de
  2015-12-30  1:47 ` Junxiao Bi
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: rgoldwyn at suse.de @ 2015-12-29 18:20 UTC (permalink / raw)
  To: ocfs2-devel

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

DLM does not cache locks. So, blocking lock and unlock
will only make the performance worse where contention over
the locks is high.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/ocfs2/dlmglue.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 20276e3..f92612e 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -2432,12 +2432,6 @@ bail:
  * done this we have to return AOP_TRUNCATED_PAGE so the aop method
  * that called us can bubble that back up into the VFS who will then
  * immediately retry the aop call.
- *
- * We do a blocking lock and immediate unlock before returning, though, so that
- * the lock has a great chance of being cached on this node by the time the VFS
- * calls back to retry the aop.    This has a potential to livelock as nodes
- * ping locks back and forth, but that's a risk we're willing to take to avoid
- * the lock inversion simply.
  */
 int ocfs2_inode_lock_with_page(struct inode *inode,
 			      struct buffer_head **ret_bh,
@@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
 	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
 	if (ret == -EAGAIN) {
 		unlock_page(page);
-		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
-			ocfs2_inode_unlock(inode, ex);
 		ret = AOP_TRUNCATED_PAGE;
 	}
 
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
  2015-12-29 18:20 [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock rgoldwyn at suse.de
@ 2015-12-30  1:47 ` Junxiao Bi
  2015-12-30  2:31 ` Gang He
       [not found] ` <5683B274020000F900023B30@suse.com>
  2 siblings, 0 replies; 8+ messages in thread
From: Junxiao Bi @ 2015-12-30  1:47 UTC (permalink / raw)
  To: ocfs2-devel

On 12/30/2015 02:20 AM, rgoldwyn at suse.de wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> DLM does not cache locks. So, blocking lock and unlock
> will only make the performance worse where contention over
> the locks is high.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Looks good.

Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
> ---
>  fs/ocfs2/dlmglue.c | 8 --------
>  1 file changed, 8 deletions(-)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 20276e3..f92612e 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -2432,12 +2432,6 @@ bail:
>   * done this we have to return AOP_TRUNCATED_PAGE so the aop method
>   * that called us can bubble that back up into the VFS who will then
>   * immediately retry the aop call.
> - *
> - * We do a blocking lock and immediate unlock before returning, though, so that
> - * the lock has a great chance of being cached on this node by the time the VFS
> - * calls back to retry the aop.    This has a potential to livelock as nodes
> - * ping locks back and forth, but that's a risk we're willing to take to avoid
> - * the lock inversion simply.
>   */
>  int ocfs2_inode_lock_with_page(struct inode *inode,
>  			      struct buffer_head **ret_bh,
> @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>  	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>  	if (ret == -EAGAIN) {
>  		unlock_page(page);
> -		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> -			ocfs2_inode_unlock(inode, ex);
>  		ret = AOP_TRUNCATED_PAGE;
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
  2015-12-29 18:20 [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock rgoldwyn at suse.de
  2015-12-30  1:47 ` Junxiao Bi
@ 2015-12-30  2:31 ` Gang He
  2016-01-07  5:03   ` Eric Ren
       [not found] ` <5683B274020000F900023B30@suse.com>
  2 siblings, 1 reply; 8+ messages in thread
From: Gang He @ 2015-12-30  2:31 UTC (permalink / raw)
  To: ocfs2-devel

Hello Goldwyn,

When read path can't get the DLM lock immediately (NONBLOCK way), next get the lock with BLOCK way, this behavior will cost some time (several msecs).
It looks make sense to delete that two line code. 
But why there are two line code existed? I just worry about, if we delete two line code, when read path can't get the DLM lock with NONBLOCK way, read path will retry to get this DLM lock repeatedly, this will lead to cost too much CPU (Not waiting in sleep).
I just worry about this possibility, Eric will test this case, and give a feedback.

Thanks
Gang   


>>> 
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> DLM does not cache locks. So, blocking lock and unlock
> will only make the performance worse where contention over
> the locks is high.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/ocfs2/dlmglue.c | 8 --------
>  1 file changed, 8 deletions(-)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 20276e3..f92612e 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -2432,12 +2432,6 @@ bail:
>   * done this we have to return AOP_TRUNCATED_PAGE so the aop method
>   * that called us can bubble that back up into the VFS who will then
>   * immediately retry the aop call.
> - *
> - * We do a blocking lock and immediate unlock before returning, though, so 
> that
> - * the lock has a great chance of being cached on this node by the time the 
> VFS
> - * calls back to retry the aop.    This has a potential to livelock as nodes
> - * ping locks back and forth, but that's a risk we're willing to take to 
> avoid
> - * the lock inversion simply.
>   */
>  int ocfs2_inode_lock_with_page(struct inode *inode,
>  			      struct buffer_head **ret_bh,
> @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>  	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>  	if (ret == -EAGAIN) {
>  		unlock_page(page);
> -		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> -			ocfs2_inode_unlock(inode, ex);
>  		ret = AOP_TRUNCATED_PAGE;
>  	}
>  
> -- 
> 2.6.2

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
       [not found] ` <5683B274020000F900023B30@suse.com>
@ 2015-12-30 18:35   ` Goldwyn Rodrigues
  2016-01-05  6:52     ` Gang He
  0 siblings, 1 reply; 8+ messages in thread
From: Goldwyn Rodrigues @ 2015-12-30 18:35 UTC (permalink / raw)
  To: ocfs2-devel



On 12/29/2015 08:31 PM, Gang He wrote:
> Hello Goldwyn,
>
> When read path can't get the DLM lock immediately (NONBLOCK way), next get the lock with BLOCK way, this behavior will cost some time (several msecs).
> It looks make sense to delete that two line code.
> But why there are two line code existed? I just worry about, if we delete two line code, when read path can't get the DLM lock with NONBLOCK way, read path will retry to get this DLM lock repeatedly, this will lead to cost too much CPU (Not waiting in sleep).
> I just worry about this possibility, Eric will test this case, and give a feedback.
>

If DLM cannot grant a lock, it simply queues it in the order it 
received, irrespective of lock type request.
Imagine a case of continuous contention of a PR with a EX lock (a 
typical ping pong):
If we request a blocking PR lock and an EX request follows. We will be 
waiting and getting the PR lock but the lock immediately granted after 
we unlock PR will be the EX lock. So, the nonblocking PR lock will be 
starved when we actually require it to perform useful work.

I assume this is antebellum code which probably existed when DLM locks 
were disk-based. I may be wrong though.


> Thanks
> Gang
>
>
>>>>
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> DLM does not cache locks. So, blocking lock and unlock
>> will only make the performance worse where contention over
>> the locks is high.
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> ---
>>   fs/ocfs2/dlmglue.c | 8 --------
>>   1 file changed, 8 deletions(-)
>>
>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>> index 20276e3..f92612e 100644
>> --- a/fs/ocfs2/dlmglue.c
>> +++ b/fs/ocfs2/dlmglue.c
>> @@ -2432,12 +2432,6 @@ bail:
>>    * done this we have to return AOP_TRUNCATED_PAGE so the aop method
>>    * that called us can bubble that back up into the VFS who will then
>>    * immediately retry the aop call.
>> - *
>> - * We do a blocking lock and immediate unlock before returning, though, so
>> that
>> - * the lock has a great chance of being cached on this node by the time the
>> VFS
>> - * calls back to retry the aop.    This has a potential to livelock as nodes
>> - * ping locks back and forth, but that's a risk we're willing to take to
>> avoid
>> - * the lock inversion simply.
>>    */
>>   int ocfs2_inode_lock_with_page(struct inode *inode,
>>   			      struct buffer_head **ret_bh,
>> @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>>   	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>>   	if (ret == -EAGAIN) {
>>   		unlock_page(page);
>> -		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>> -			ocfs2_inode_unlock(inode, ex);
>>   		ret = AOP_TRUNCATED_PAGE;
>>   	}
>>
>> --
>> 2.6.2

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
  2015-12-30 18:35   ` Goldwyn Rodrigues
@ 2016-01-05  6:52     ` Gang He
  0 siblings, 0 replies; 8+ messages in thread
From: Gang He @ 2016-01-05  6:52 UTC (permalink / raw)
  To: ocfs2-devel

Hello Goldwyn,


>>> 

> 
> On 12/29/2015 08:31 PM, Gang He wrote:
>> Hello Goldwyn,
>>
>> When read path can't get the DLM lock immediately (NONBLOCK way), next get 
> the lock with BLOCK way, this behavior will cost some time (several msecs).
>> It looks make sense to delete that two line code.
>> But why there are two line code existed? I just worry about, if we delete 
> two line code, when read path can't get the DLM lock with NONBLOCK way, read 
> path will retry to get this DLM lock repeatedly, this will lead to cost too 
> much CPU (Not waiting in sleep).
>> I just worry about this possibility, Eric will test this case, and give a 
> feedback.
>>
> 
> If DLM cannot grant a lock, it simply queues it in the order it 
> received, irrespective of lock type request.
If can't get the DLM inode lock  (ocfs2_inode_lock_with_page) in read path (ocfs2_readpage), 
it will return to VFS layer (do_generic_file_read), the do_generic_file_read function will re-call OCFS2 readpage()  function to read this page again,
actually, the readpage() function still not get the lock until write path unlocks the DLM inode lock, this will bring many time retries for consuming too much CPU.   
Of course, deleting these two line code will increase read path get the lock opportunities, but looks to bring CPU busy loop problem also. 

Thanks
Gang 

> Imagine a case of continuous contention of a PR with a EX lock (a 
> typical ping pong):
> If we request a blocking PR lock and an EX request follows. We will be 
> waiting and getting the PR lock but the lock immediately granted after 
> we unlock PR will be the EX lock. So, the nonblocking PR lock will be 
> starved when we actually require it to perform useful work.
> 
> I assume this is antebellum code which probably existed when DLM locks 
> were disk-based. I may be wrong though.
> 
> 
>> Thanks
>> Gang
>>
>>
>>>>>
>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>
>>> DLM does not cache locks. So, blocking lock and unlock
>>> will only make the performance worse where contention over
>>> the locks is high.
>>>
>>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>> ---
>>>   fs/ocfs2/dlmglue.c | 8 --------
>>>   1 file changed, 8 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>> index 20276e3..f92612e 100644
>>> --- a/fs/ocfs2/dlmglue.c
>>> +++ b/fs/ocfs2/dlmglue.c
>>> @@ -2432,12 +2432,6 @@ bail:
>>>    * done this we have to return AOP_TRUNCATED_PAGE so the aop method
>>>    * that called us can bubble that back up into the VFS who will then
>>>    * immediately retry the aop call.
>>> - *
>>> - * We do a blocking lock and immediate unlock before returning, though, so
>>> that
>>> - * the lock has a great chance of being cached on this node by the time the
>>> VFS
>>> - * calls back to retry the aop.    This has a potential to livelock as 
> nodes
>>> - * ping locks back and forth, but that's a risk we're willing to take to
>>> avoid
>>> - * the lock inversion simply.
>>>    */
>>>   int ocfs2_inode_lock_with_page(struct inode *inode,
>>>   			      struct buffer_head **ret_bh,
>>> @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>>>   	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>>>   	if (ret == -EAGAIN) {
>>>   		unlock_page(page);
>>> -		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>>> -			ocfs2_inode_unlock(inode, ex);
>>>   		ret = AOP_TRUNCATED_PAGE;
>>>   	}
>>>
>>> --
>>> 2.6.2
> 
> -- 
> Goldwyn

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
  2015-12-30  2:31 ` Gang He
@ 2016-01-07  5:03   ` Eric Ren
  2016-01-08  1:24     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Ren @ 2016-01-07  5:03 UTC (permalink / raw)
  To: ocfs2-devel

Hi,

On Tue, Dec 29, 2015 at 07:31:16PM -0700, He Gang wrote: 
> Hello Goldwyn,
> 
> When read path can't get the DLM lock immediately (NONBLOCK way), next get the lock with BLOCK way, this behavior will cost some time (several msecs).
> It looks make sense to delete that two line code. 
> But why there are two line code existed? I just worry about, if we delete two line code, when read path can't get the DLM lock with NONBLOCK way, read path will retry to get this DLM lock repeatedly, this will lead to cost too much CPU (Not waiting in sleep).
> I just worry about this possibility, Eric will test this case, and give a feedback.
Sorry for the late reply. After applying this patch, the performance is improved very much,
but "soft lockup" occurred several times(log message is placed at the end). The two lines:

       if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
           ocfs2_inode_unlock(inode, ex);

is similar with the lines of this commit:

commit c7e25e6e0b0486492c5faaf6312b37413642c48e
Author: Jan Kara <jack@suse.cz>
Date:   Thu Jun 23 22:51:47 2011 +0200

    ocfs2: Avoid livelock in ocfs2_readpage()
    
    When someone writes to an inode, readers accessing the same inode via
    ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
    do_generic_file_read() looks up the page again and retries ->readpage()
    when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
    readers, they can occupy all CPUs and in non-preempt kernel the system is
    deadlocked because writer holding ip_alloc_sem is never run to release the
    semaphore. Fix the problem by making reader block on ip_alloc_sem to break
    the busy loop.
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Joel Becker <jlbec@evilplan.org>
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 4c1ec8f..ba3ca1e 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -290,7 +290,15 @@ static int ocfs2_readpage(struct file *file, struct page *page)
        }
 
        if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
+               /*
+                * Unlock the page and cycle ip_alloc_sem so that we don't
+                * busyloop waiting for ip_alloc_sem to unlock
+                */
                ret = AOP_TRUNCATED_PAGE;
+               unlock_page(page);
+               unlock = 0;
+               down_read(&oi->ip_alloc_sem);
+               up_read(&oi->ip_alloc_sem);
                goto out_inode_unlock;
        }

So, this patch removing the two lines of code may result in "busy wait" potentially that
reading thread repeatedly checks to see if cluster lock is available. Please correct me if
I'm wrong. 

This is the "soft lookup" log:
Dec 30 17:30:40 n2 kernel: [  248.084012] BUG: soft lockup - CPU#3 stuck for 23s! [iomaker:4061]
Dec 30 17:30:40 n2 kernel: [  248.084015] Modules linked in: ocfs2(FN) jbd2 ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN) dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic ata_piix libata scsi_mod virtio_blk virtio virtio_ring
Dec 30 17:30:40 n2 kernel: [  248.084049] Supported: No, Unsupported modules are loaded
Dec 30 17:30:40 n2 kernel: [  248.084050] CPU 3 
Dec 30 17:30:40 n2 kernel: [  248.084051] Modules linked in: ocfs2(FN) jbd2 ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN) dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic ata_piix libata scsi_mod virtio_blk virtio virtio_ring
Dec 30 17:30:40 n2 kernel: [  248.084074] Supported: No, Unsupported modules are loaded
Dec 30 17:30:40 n2 kernel: [  248.084075] 
Dec 30 17:30:40 n2 kernel: [  248.084077] Pid: 4061, comm: iomaker Tainted: GF          N  3.0.76-0.11-default #1 Bochs Bochs
Dec 30 17:30:40 n2 kernel: [  248.084080] RIP: 0010:[<ffffffff8145c7a9>]  [<ffffffff8145c7a9>] _raw_spin_lock+0x9/0x20
Dec 30 17:30:40 n2 kernel: [  248.084087] RSP: 0018:ffff8800db3fbb10  EFLAGS: 00000206
Dec 30 17:30:40 n2 kernel: [  248.084088] RAX: 0000000071967196 RBX: ffff880114c1c000 RCX: 0000000000000004
Dec 30 17:30:40 n2 kernel: [  248.084090] RDX: 0000000000007195 RSI: 0000000000000000 RDI: ffff880114c1c0d0
Dec 30 17:30:40 n2 kernel: [  248.084091] RBP: ffff880114c1c0d0 R08: 0000000000000000 R09: 0000000000000003
Dec 30 17:30:40 n2 kernel: [  248.084092] R10: 0000000000000100 R11: 0000000000014819 R12: ffffffff81464f6e
Dec 30 17:30:40 n2 kernel: [  248.084094] R13: ffff880037b34908 R14: ffff880037b342c0 R15: ffff880037b34908
Dec 30 17:30:40 n2 kernel: [  248.084096] FS:  00007fdbe0f7a720(0000) GS:ffff88011fd80000(0000) knlGS:0000000000000000
Dec 30 17:30:40 n2 kernel: [  248.084097] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 30 17:30:40 n2 kernel: [  248.084099] CR2: 00007ff1df54b000 CR3: 00000000db2da000 CR4: 00000000000006e0
Dec 30 17:30:40 n2 kernel: [  248.084103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 30 17:30:40 n2 kernel: [  248.084106] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 30 17:30:40 n2 kernel: [  248.084108] Process iomaker (pid: 4061, threadinfo ffff8800db3fa000, task ffff880037b342c0)
Dec 30 17:30:40 n2 kernel: [  248.084109] Stack:
Dec 30 17:30:40 n2 kernel: [  248.084112]  ffffffffa04ab9b5 0000000000000000 00000039c2f58103 0000000000000000
Dec 30 17:30:40 n2 kernel: [  248.084115]  0000000000000000 ffff880114c1c0d0 ffff880114c1c0d0 ffff880114c1c0d0
Dec 30 17:30:40 n2 kernel: [  248.084117]  0000000000000000 ffff880114c1c000 ffff8800dba0ec38 0000000000000004
Dec 30 17:30:40 n2 kernel: [  248.084120] Call Trace:
Dec 30 17:30:40 n2 kernel: [  248.084145]  [<ffffffffa04ab9b5>] ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084182]  [<ffffffffa049aed8>] ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084204]  [<ffffffffa049b161>] ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084225]  [<ffffffffa048170f>] ocfs2_readpage+0x8f/0x210 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084234]  [<ffffffff810f938e>] do_generic_file_read+0x13e/0x490
Dec 30 17:30:40 n2 kernel: [  248.084238]  [<ffffffff810f9d3c>] generic_file_aio_read+0xfc/0x260
Dec 30 17:30:40 n2 kernel: [  248.084250]  [<ffffffffa04a168b>] ocfs2_file_aio_read+0x14b/0x390 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084265]  [<ffffffff811584b8>] do_sync_read+0xc8/0x110
Dec 30 17:30:40 n2 kernel: [  248.084268]  [<ffffffff81158c67>] vfs_read+0xc7/0x130
Dec 30 17:30:40 n2 kernel: [  248.084271]  [<ffffffff81158dd3>] sys_read+0x53/0xa0
Dec 30 17:30:40 n2 kernel: [  248.084275]  [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
Dec 30 17:30:40 n2 kernel: [  248.084280]  [<00007fdbe03532a0>] 0x7fdbe035329f
Dec 30 17:30:40 n2 kernel: [  248.084281] Code: 00 75 04 f0 0f b1 17 0f 94 c2 0f b6 c2 85 c0 ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 <0f> b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 eb f5 c3 0f 1f 44 
Dec 30 17:30:40 n2 kernel: [  248.084307] Call Trace:
Dec 30 17:30:40 n2 kernel: [  248.084319]  [<ffffffffa04ab9b5>] ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084344]  [<ffffffffa049aed8>] ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084366]  [<ffffffffa049b161>] ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084386]  [<ffffffffa048170f>] ocfs2_readpage+0x8f/0x210 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084394]  [<ffffffff810f938e>] do_generic_file_read+0x13e/0x490
Dec 30 17:30:40 n2 kernel: [  248.084397]  [<ffffffff810f9d3c>] generic_file_aio_read+0xfc/0x260
Dec 30 17:30:40 n2 kernel: [  248.084409]  [<ffffffffa04a168b>] ocfs2_file_aio_read+0x14b/0x390 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084423]  [<ffffffff811584b8>] do_sync_read+0xc8/0x110
Dec 30 17:30:40 n2 kernel: [  248.084426]  [<ffffffff81158c67>] vfs_read+0xc7/0x130
Dec 30 17:30:40 n2 kernel: [  248.084429]  [<ffffffff81158dd3>] sys_read+0x53/0xa0
Dec 30 17:30:40 n2 kernel: [  248.084431]  [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
Dec 30 17:30:40 n2 kernel: [  248.084435]  [<00007fdbe03532a0>] 0x7fdbe035329f

Thanks,
Eric

> 
> Thanks
> Gang   
> 
> 
> >>> 
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > DLM does not cache locks. So, blocking lock and unlock
> > will only make the performance worse where contention over
> > the locks is high.
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/ocfs2/dlmglue.c | 8 --------
> >  1 file changed, 8 deletions(-)
> > 
> > diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> > index 20276e3..f92612e 100644
> > --- a/fs/ocfs2/dlmglue.c
> > +++ b/fs/ocfs2/dlmglue.c
> > @@ -2432,12 +2432,6 @@ bail:
> >   * done this we have to return AOP_TRUNCATED_PAGE so the aop method
> >   * that called us can bubble that back up into the VFS who will then
> >   * immediately retry the aop call.
> > - *
> > - * We do a blocking lock and immediate unlock before returning, though, so 
> > that
> > - * the lock has a great chance of being cached on this node by the time the 
> > VFS
> > - * calls back to retry the aop.    This has a potential to livelock as nodes
> > - * ping locks back and forth, but that's a risk we're willing to take to 
> > avoid
> > - * the lock inversion simply.
> >   */
> >  int ocfs2_inode_lock_with_page(struct inode *inode,
> >  			      struct buffer_head **ret_bh,
> > @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
> >  	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
> >  	if (ret == -EAGAIN) {
> >  		unlock_page(page);
> > -		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> > -			ocfs2_inode_unlock(inode, ex);
> >  		ret = AOP_TRUNCATED_PAGE;
> >  	}
> >  
> > -- 
> > 2.6.2
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
  2016-01-07  5:03   ` Eric Ren
@ 2016-01-08  1:24     ` Goldwyn Rodrigues
  2016-01-08  2:16       ` Zhen Ren
  0 siblings, 1 reply; 8+ messages in thread
From: Goldwyn Rodrigues @ 2016-01-08  1:24 UTC (permalink / raw)
  To: ocfs2-devel

Hi Eric,

On 01/07/2016 10:33 AM, Eric Ren wrote:
> Hi,
>
> On Tue, Dec 29, 2015 at 07:31:16PM -0700, He Gang wrote:
>> Hello Goldwyn,
>>
>> When read path can't get the DLM lock immediately (NONBLOCK way), next get the lock with BLOCK way, this behavior will cost some time (several msecs).
>> It looks make sense to delete that two line code.
>> But why there are two line code existed? I just worry about, if we delete two line code, when read path can't get the DLM lock with NONBLOCK way, read path will retry to get this DLM lock repeatedly, this will lead to cost too much CPU (Not waiting in sleep).
>> I just worry about this possibility, Eric will test this case, and give a feedback.
> Sorry for the late reply. After applying this patch, the performance is improved very much,
> but "soft lockup" occurred several times(log message is placed at the end). The two lines:

Did you perform this test on the upstream kernel? I performed mine on 
upstream and I did not receive any softlockups. AFAICS, you performed it 
on 3.0 based kernels (perhaps SLE11SP3) which does not have the patch 
you have mentioned.


>
>         if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>             ocfs2_inode_unlock(inode, ex);
>
> is similar with the lines of this commit:
>
> commit c7e25e6e0b0486492c5faaf6312b37413642c48e
> Author: Jan Kara <jack@suse.cz>
> Date:   Thu Jun 23 22:51:47 2011 +0200
>
>      ocfs2: Avoid livelock in ocfs2_readpage()
>
>      When someone writes to an inode, readers accessing the same inode via
>      ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
>      do_generic_file_read() looks up the page again and retries ->readpage()
>      when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
>      readers, they can occupy all CPUs and in non-preempt kernel the system is
>      deadlocked because writer holding ip_alloc_sem is never run to release the
>      semaphore. Fix the problem by making reader block on ip_alloc_sem to break
>      the busy loop.
>
>      Signed-off-by: Jan Kara <jack@suse.cz>
>      Signed-off-by: Joel Becker <jlbec@evilplan.org>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 4c1ec8f..ba3ca1e 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -290,7 +290,15 @@ static int ocfs2_readpage(struct file *file, struct page *page)
>          }
>
>          if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
> +               /*
> +                * Unlock the page and cycle ip_alloc_sem so that we don't
> +                * busyloop waiting for ip_alloc_sem to unlock
> +                */
>                  ret = AOP_TRUNCATED_PAGE;
> +               unlock_page(page);
> +               unlock = 0;
> +               down_read(&oi->ip_alloc_sem);
> +               up_read(&oi->ip_alloc_sem);
>                  goto out_inode_unlock;
>          }
>
> So, this patch removing the two lines of code may result in "busy wait" potentially that
> reading thread repeatedly checks to see if cluster lock is available. Please correct me if
> I'm wrong.

Correct. In older kernels, not upstream. Right?

>
> This is the "soft lookup" log:
> Dec 30 17:30:40 n2 kernel: [  248.084012] BUG: soft lockup - CPU#3 stuck for 23s! [iomaker:4061]
> Dec 30 17:30:40 n2 kernel: [  248.084015] Modules linked in: ocfs2(FN) jbd2 ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN) dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic ata_piix libata scsi_mod virtio_blk virtio virtio_ring
> Dec 30 17:30:40 n2 kernel: [  248.084049] Supported: No, Unsupported modules are loaded
> Dec 30 17:30:40 n2 kernel: [  248.084050] CPU 3
> Dec 30 17:30:40 n2 kernel: [  248.084051] Modules linked in: ocfs2(FN) jbd2 ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN) dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic ata_piix libata scsi_mod virtio_blk virtio virtio_ring
> Dec 30 17:30:40 n2 kernel: [  248.084074] Supported: No, Unsupported modules are loaded
> Dec 30 17:30:40 n2 kernel: [  248.084075]
> Dec 30 17:30:40 n2 kernel: [  248.084077] Pid: 4061, comm: iomaker Tainted: GF          N  3.0.76-0.11-default #1 Bochs Bochs
> Dec 30 17:30:40 n2 kernel: [  248.084080] RIP: 0010:[<ffffffff8145c7a9>]  [<ffffffff8145c7a9>] _raw_spin_lock+0x9/0x20
> Dec 30 17:30:40 n2 kernel: [  248.084087] RSP: 0018:ffff8800db3fbb10  EFLAGS: 00000206
> Dec 30 17:30:40 n2 kernel: [  248.084088] RAX: 0000000071967196 RBX: ffff880114c1c000 RCX: 0000000000000004
> Dec 30 17:30:40 n2 kernel: [  248.084090] RDX: 0000000000007195 RSI: 0000000000000000 RDI: ffff880114c1c0d0
> Dec 30 17:30:40 n2 kernel: [  248.084091] RBP: ffff880114c1c0d0 R08: 0000000000000000 R09: 0000000000000003
> Dec 30 17:30:40 n2 kernel: [  248.084092] R10: 0000000000000100 R11: 0000000000014819 R12: ffffffff81464f6e
> Dec 30 17:30:40 n2 kernel: [  248.084094] R13: ffff880037b34908 R14: ffff880037b342c0 R15: ffff880037b34908
> Dec 30 17:30:40 n2 kernel: [  248.084096] FS:  00007fdbe0f7a720(0000) GS:ffff88011fd80000(0000) knlGS:0000000000000000
> Dec 30 17:30:40 n2 kernel: [  248.084097] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Dec 30 17:30:40 n2 kernel: [  248.084099] CR2: 00007ff1df54b000 CR3: 00000000db2da000 CR4: 00000000000006e0
> Dec 30 17:30:40 n2 kernel: [  248.084103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Dec 30 17:30:40 n2 kernel: [  248.084106] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Dec 30 17:30:40 n2 kernel: [  248.084108] Process iomaker (pid: 4061, threadinfo ffff8800db3fa000, task ffff880037b342c0)
> Dec 30 17:30:40 n2 kernel: [  248.084109] Stack:
> Dec 30 17:30:40 n2 kernel: [  248.084112]  ffffffffa04ab9b5 0000000000000000 00000039c2f58103 0000000000000000
> Dec 30 17:30:40 n2 kernel: [  248.084115]  0000000000000000 ffff880114c1c0d0 ffff880114c1c0d0 ffff880114c1c0d0
> Dec 30 17:30:40 n2 kernel: [  248.084117]  0000000000000000 ffff880114c1c000 ffff8800dba0ec38 0000000000000004
> Dec 30 17:30:40 n2 kernel: [  248.084120] Call Trace:
> Dec 30 17:30:40 n2 kernel: [  248.084145]  [<ffffffffa04ab9b5>] ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084182]  [<ffffffffa049aed8>] ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084204]  [<ffffffffa049b161>] ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084225]  [<ffffffffa048170f>] ocfs2_readpage+0x8f/0x210 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084234]  [<ffffffff810f938e>] do_generic_file_read+0x13e/0x490
> Dec 30 17:30:40 n2 kernel: [  248.084238]  [<ffffffff810f9d3c>] generic_file_aio_read+0xfc/0x260
> Dec 30 17:30:40 n2 kernel: [  248.084250]  [<ffffffffa04a168b>] ocfs2_file_aio_read+0x14b/0x390 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084265]  [<ffffffff811584b8>] do_sync_read+0xc8/0x110
> Dec 30 17:30:40 n2 kernel: [  248.084268]  [<ffffffff81158c67>] vfs_read+0xc7/0x130
> Dec 30 17:30:40 n2 kernel: [  248.084271]  [<ffffffff81158dd3>] sys_read+0x53/0xa0
> Dec 30 17:30:40 n2 kernel: [  248.084275]  [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
> Dec 30 17:30:40 n2 kernel: [  248.084280]  [<00007fdbe03532a0>] 0x7fdbe035329f
> Dec 30 17:30:40 n2 kernel: [  248.084281] Code: 00 75 04 f0 0f b1 17 0f 94 c2 0f b6 c2 85 c0 ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 <0f> b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 eb f5 c3 0f 1f 44
> Dec 30 17:30:40 n2 kernel: [  248.084307] Call Trace:
> Dec 30 17:30:40 n2 kernel: [  248.084319]  [<ffffffffa04ab9b5>] ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084344]  [<ffffffffa049aed8>] ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084366]  [<ffffffffa049b161>] ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084386]  [<ffffffffa048170f>] ocfs2_readpage+0x8f/0x210 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084394]  [<ffffffff810f938e>] do_generic_file_read+0x13e/0x490
> Dec 30 17:30:40 n2 kernel: [  248.084397]  [<ffffffff810f9d3c>] generic_file_aio_read+0xfc/0x260
> Dec 30 17:30:40 n2 kernel: [  248.084409]  [<ffffffffa04a168b>] ocfs2_file_aio_read+0x14b/0x390 [ocfs2]
> Dec 30 17:30:40 n2 kernel: [  248.084423]  [<ffffffff811584b8>] do_sync_read+0xc8/0x110
> Dec 30 17:30:40 n2 kernel: [  248.084426]  [<ffffffff81158c67>] vfs_read+0xc7/0x130
> Dec 30 17:30:40 n2 kernel: [  248.084429]  [<ffffffff81158dd3>] sys_read+0x53/0xa0
> Dec 30 17:30:40 n2 kernel: [  248.084431]  [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
> Dec 30 17:30:40 n2 kernel: [  248.084435]  [<00007fdbe03532a0>] 0x7fdbe035329f
>
> Thanks,
> Eric
>
>>
>> Thanks
>> Gang
>>
>>
>>>>>
>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>
>>> DLM does not cache locks. So, blocking lock and unlock
>>> will only make the performance worse where contention over
>>> the locks is high.
>>>
>>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>> ---
>>>   fs/ocfs2/dlmglue.c | 8 --------
>>>   1 file changed, 8 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>> index 20276e3..f92612e 100644
>>> --- a/fs/ocfs2/dlmglue.c
>>> +++ b/fs/ocfs2/dlmglue.c
>>> @@ -2432,12 +2432,6 @@ bail:
>>>    * done this we have to return AOP_TRUNCATED_PAGE so the aop method
>>>    * that called us can bubble that back up into the VFS who will then
>>>    * immediately retry the aop call.
>>> - *
>>> - * We do a blocking lock and immediate unlock before returning, though, so
>>> that
>>> - * the lock has a great chance of being cached on this node by the time the
>>> VFS
>>> - * calls back to retry the aop.    This has a potential to livelock as nodes
>>> - * ping locks back and forth, but that's a risk we're willing to take to
>>> avoid
>>> - * the lock inversion simply.
>>>    */
>>>   int ocfs2_inode_lock_with_page(struct inode *inode,
>>>   			      struct buffer_head **ret_bh,
>>> @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>>>   	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>>>   	if (ret == -EAGAIN) {
>>>   		unlock_page(page);
>>> -		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>>> -			ocfs2_inode_unlock(inode, ex);
>>>   		ret = AOP_TRUNCATED_PAGE;
>>>   	}
>>>
>>> --
>>> 2.6.2
>>
>>
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel at oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
  2016-01-08  1:24     ` Goldwyn Rodrigues
@ 2016-01-08  2:16       ` Zhen Ren
  0 siblings, 0 replies; 8+ messages in thread
From: Zhen Ren @ 2016-01-08  2:16 UTC (permalink / raw)
  To: ocfs2-devel

Hi!

 >>>
> Hi Eric, 
>  
> On 01/07/2016 10:33 AM, Eric Ren wrote: 
> > Hi, 
> > 
> > On Tue, Dec 29, 2015 at 07:31:16PM -0700, He Gang wrote: 
> >> Hello Goldwyn, 
> >> 
> >> When read path can't get the DLM lock immediately (NONBLOCK way), next get  
> the lock with BLOCK way, this behavior will cost some time (several msecs). 
> >> It looks make sense to delete that two line code. 
> >> But why there are two line code existed? I just worry about, if we delete  
> two line code, when read path can't get the DLM lock with NONBLOCK way, read  
> path will retry to get this DLM lock repeatedly, this will lead to cost too  
> much CPU (Not waiting in sleep). 
> >> I just worry about this possibility, Eric will test this case, and give a  
> feedback. 
> > Sorry for the late reply. After applying this patch, the performance is  
> improved very much, 
> > but "soft lockup" occurred several times(log message is placed at the end).  
> The two lines: 
>  
> Did you perform this test on the upstream kernel? I performed mine on  
> upstream and I did not receive any softlockups. AFAICS, you performed it  
> on 3.0 based kernels (perhaps SLE11SP3) which does not have the patch  
> you have mentioned. 

I did not test on upstream kernel yet. But I did on sles12 sp1 which's kernel version is 3.12.49, so far
also have not seen "soft lookup" on it. I checked  SLE11SP3 kernel and it really have this patch - "ocfs2: 
Avoid livelock in ocfs2_readpage() ".

So, why newer kernel is without the issue of soft lookup is strange for me.

>  
>  
> > 
> >         if (ocfs2_inode_lock(inode, ret_bh, ex) == 0) 
> >             ocfs2_inode_unlock(inode, ex); 
> > 
> > is similar with the lines of this commit: 
> > 
> > commit c7e25e6e0b0486492c5faaf6312b37413642c48e 
> > Author: Jan Kara <jack@suse.cz> 
> > Date:   Thu Jun 23 22:51:47 2011 +0200 
> > 
> >      ocfs2: Avoid livelock in ocfs2_readpage() 
> > 
> >      When someone writes to an inode, readers accessing the same inode via 
> >      ocfs2_readpage() just busyloop trying to get ip_alloc_sem because 
> >      do_generic_file_read() looks up the page again and retries ->readpage() 
> >      when previous attempt failed with AOP_TRUNCATED_PAGE. When there are  
> enough 
> >      readers, they can occupy all CPUs and in non-preempt kernel the system  
> is 
> >      deadlocked because writer holding ip_alloc_sem is never run to release  
> the 
> >      semaphore. Fix the problem by making reader block on ip_alloc_sem to  
> break 
> >      the busy loop. 
> > 
> >      Signed-off-by: Jan Kara <jack@suse.cz> 
> >      Signed-off-by: Joel Becker <jlbec@evilplan.org> 
> > diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c 
> > index 4c1ec8f..ba3ca1e 100644 
> > --- a/fs/ocfs2/aops.c 
> > +++ b/fs/ocfs2/aops.c 
> > @@ -290,7 +290,15 @@ static int ocfs2_readpage(struct file *file, struct  
> page *page) 
> >          } 
> > 
> >          if (down_read_trylock(&oi->ip_alloc_sem) == 0) { 
> > +               /* 
> > +                * Unlock the page and cycle ip_alloc_sem so that we don't 
> > +                * busyloop waiting for ip_alloc_sem to unlock 
> > +                */ 
> >                  ret = AOP_TRUNCATED_PAGE; 
> > +               unlock_page(page); 
> > +               unlock = 0; 
> > +               down_read(&oi->ip_alloc_sem); 
> > +               up_read(&oi->ip_alloc_sem); 
> >                  goto out_inode_unlock; 
> >          } 
> > 
> > So, this patch removing the two lines of code may result in "busy wait"  
> potentially that 
> > reading thread repeatedly checks to see if cluster lock is available.  
> Please correct me if 
> > I'm wrong. 
>  
> Correct. In older kernels, not upstream. Right? 

Right, but SLE11SP1 kernel has already included the patch above.

Thanks,
Eric

>  
> > 
> > This is the "soft lookup" log: 
> > Dec 30 17:30:40 n2 kernel: [  248.084012] BUG: soft lockup - CPU#3 stuck for  
> 23s! [iomaker:4061] 
> > Dec 30 17:30:40 n2 kernel: [  248.084015] Modules linked in: ocfs2(FN) jbd2  
> ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN)  
> dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp  
> libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod  
> ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp  
> button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm  
> i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor  
> thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua  
> scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic  
> ata_piix libata scsi_mod virtio_blk virtio virtio_ring 
> > Dec 30 17:30:40 n2 kernel: [  248.084049] Supported: No, Unsupported  
> modules are loaded 
> > Dec 30 17:30:40 n2 kernel: [  248.084050] CPU 3 
> > Dec 30 17:30:40 n2 kernel: [  248.084051] Modules linked in: ocfs2(FN) jbd2  
> ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN)  
> dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp  
> libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod  
> ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp  
> button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm  
> i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor  
> thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua  
> scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic  
> ata_piix libata scsi_mod virtio_blk virtio virtio_ring 
> > Dec 30 17:30:40 n2 kernel: [  248.084074] Supported: No, Unsupported  
> modules are loaded 
> > Dec 30 17:30:40 n2 kernel: [  248.084075] 
> > Dec 30 17:30:40 n2 kernel: [  248.084077] Pid: 4061, comm: iomaker Tainted:  
> GF          N  3.0.76-0.11-default #1 Bochs Bochs 
> > Dec 30 17:30:40 n2 kernel: [  248.084080] RIP: 0010:[<ffffffff8145c7a9>]   
> [<ffffffff8145c7a9>] _raw_spin_lock+0x9/0x20 
> > Dec 30 17:30:40 n2 kernel: [  248.084087] RSP: 0018:ffff8800db3fbb10   
> EFLAGS: 00000206 
> > Dec 30 17:30:40 n2 kernel: [  248.084088] RAX: 0000000071967196 RBX:  
> ffff880114c1c000 RCX: 0000000000000004 
> > Dec 30 17:30:40 n2 kernel: [  248.084090] RDX: 0000000000007195 RSI:  
> 0000000000000000 RDI: ffff880114c1c0d0 
> > Dec 30 17:30:40 n2 kernel: [  248.084091] RBP: ffff880114c1c0d0 R08:  
> 0000000000000000 R09: 0000000000000003 
> > Dec 30 17:30:40 n2 kernel: [  248.084092] R10: 0000000000000100 R11:  
> 0000000000014819 R12: ffffffff81464f6e 
> > Dec 30 17:30:40 n2 kernel: [  248.084094] R13: ffff880037b34908 R14:  
> ffff880037b342c0 R15: ffff880037b34908 
> > Dec 30 17:30:40 n2 kernel: [  248.084096] FS:  00007fdbe0f7a720(0000)  
> GS:ffff88011fd80000(0000) knlGS:0000000000000000 
> > Dec 30 17:30:40 n2 kernel: [  248.084097] CS:  0010 DS: 0000 ES: 0000 CR0:  
> 000000008005003b 
> > Dec 30 17:30:40 n2 kernel: [  248.084099] CR2: 00007ff1df54b000 CR3:  
> 00000000db2da000 CR4: 00000000000006e0 
> > Dec 30 17:30:40 n2 kernel: [  248.084103] DR0: 0000000000000000 DR1:  
> 0000000000000000 DR2: 0000000000000000 
> > Dec 30 17:30:40 n2 kernel: [  248.084106] DR3: 0000000000000000 DR6:  
> 00000000ffff0ff0 DR7: 0000000000000400 
> > Dec 30 17:30:40 n2 kernel: [  248.084108] Process iomaker (pid: 4061,  
> threadinfo ffff8800db3fa000, task ffff880037b342c0) 
> > Dec 30 17:30:40 n2 kernel: [  248.084109] Stack: 
> > Dec 30 17:30:40 n2 kernel: [  248.084112]  ffffffffa04ab9b5  
> 0000000000000000 00000039c2f58103 0000000000000000 
> > Dec 30 17:30:40 n2 kernel: [  248.084115]  0000000000000000  
> ffff880114c1c0d0 ffff880114c1c0d0 ffff880114c1c0d0 
> > Dec 30 17:30:40 n2 kernel: [  248.084117]  0000000000000000  
> ffff880114c1c000 ffff8800dba0ec38 0000000000000004 
> > Dec 30 17:30:40 n2 kernel: [  248.084120] Call Trace: 
> > Dec 30 17:30:40 n2 kernel: [  248.084145]  [<ffffffffa04ab9b5>]  
> ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084182]  [<ffffffffa049aed8>]  
> ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084204]  [<ffffffffa049b161>]  
> ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084225]  [<ffffffffa048170f>]  
> ocfs2_readpage+0x8f/0x210 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084234]  [<ffffffff810f938e>]  
> do_generic_file_read+0x13e/0x490 
> > Dec 30 17:30:40 n2 kernel: [  248.084238]  [<ffffffff810f9d3c>]  
> generic_file_aio_read+0xfc/0x260 
> > Dec 30 17:30:40 n2 kernel: [  248.084250]  [<ffffffffa04a168b>]  
> ocfs2_file_aio_read+0x14b/0x390 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084265]  [<ffffffff811584b8>]  
> do_sync_read+0xc8/0x110 
> > Dec 30 17:30:40 n2 kernel: [  248.084268]  [<ffffffff81158c67>]  
> vfs_read+0xc7/0x130 
> > Dec 30 17:30:40 n2 kernel: [  248.084271]  [<ffffffff81158dd3>]  
> sys_read+0x53/0xa0 
> > Dec 30 17:30:40 n2 kernel: [  248.084275]  [<ffffffff81464592>]  
> system_call_fastpath+0x16/0x1b 
> > Dec 30 17:30:40 n2 kernel: [  248.084280]  [<00007fdbe03532a0>]  
> 0x7fdbe035329f 
> > Dec 30 17:30:40 n2 kernel: [  248.084281] Code: 00 75 04 f0 0f b1 17 0f 94  
> c2 0f b6 c2 85 c0 ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8  
> 00 00 01 00 f0 0f c1 07 <0f> b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 eb f5 c3  
> 0f 1f 44 
> > Dec 30 17:30:40 n2 kernel: [  248.084307] Call Trace: 
> > Dec 30 17:30:40 n2 kernel: [  248.084319]  [<ffffffffa04ab9b5>]  
> ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084344]  [<ffffffffa049aed8>]  
> ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084366]  [<ffffffffa049b161>]  
> ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084386]  [<ffffffffa048170f>]  
> ocfs2_readpage+0x8f/0x210 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084394]  [<ffffffff810f938e>]  
> do_generic_file_read+0x13e/0x490 
> > Dec 30 17:30:40 n2 kernel: [  248.084397]  [<ffffffff810f9d3c>]  
> generic_file_aio_read+0xfc/0x260 
> > Dec 30 17:30:40 n2 kernel: [  248.084409]  [<ffffffffa04a168b>]  
> ocfs2_file_aio_read+0x14b/0x390 [ocfs2] 
> > Dec 30 17:30:40 n2 kernel: [  248.084423]  [<ffffffff811584b8>]  
> do_sync_read+0xc8/0x110 
> > Dec 30 17:30:40 n2 kernel: [  248.084426]  [<ffffffff81158c67>]  
> vfs_read+0xc7/0x130 
> > Dec 30 17:30:40 n2 kernel: [  248.084429]  [<ffffffff81158dd3>]  
> sys_read+0x53/0xa0 
> > Dec 30 17:30:40 n2 kernel: [  248.084431]  [<ffffffff81464592>]  
> system_call_fastpath+0x16/0x1b 
> > Dec 30 17:30:40 n2 kernel: [  248.084435]  [<00007fdbe03532a0>]  
> 0x7fdbe035329f 
> > 
> > Thanks, 
> > Eric 
> > 
> >> 
> >> Thanks 
> >> Gang 
> >> 
> >> 
> >>>>> 
> >>> From: Goldwyn Rodrigues <rgoldwyn@suse.com> 
> >>> 
> >>> DLM does not cache locks. So, blocking lock and unlock 
> >>> will only make the performance worse where contention over 
> >>> the locks is high. 
> >>> 
> >>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> 
> >>> --- 
> >>>   fs/ocfs2/dlmglue.c | 8 -------- 
> >>>   1 file changed, 8 deletions(-) 
> >>> 
> >>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c 
> >>> index 20276e3..f92612e 100644 
> >>> --- a/fs/ocfs2/dlmglue.c 
> >>> +++ b/fs/ocfs2/dlmglue.c 
> >>> @@ -2432,12 +2432,6 @@ bail: 
> >>>    * done this we have to return AOP_TRUNCATED_PAGE so the aop method 
> >>>    * that called us can bubble that back up into the VFS who will then 
> >>>    * immediately retry the aop call. 
> >>> - * 
> >>> - * We do a blocking lock and immediate unlock before returning, though, so 
> >>> that 
> >>> - * the lock has a great chance of being cached on this node by the time the 
> >>> VFS 
> >>> - * calls back to retry the aop.    This has a potential to livelock as  
> nodes 
> >>> - * ping locks back and forth, but that's a risk we're willing to take to 
> >>> avoid 
> >>> - * the lock inversion simply. 
> >>>    */ 
> >>>   int ocfs2_inode_lock_with_page(struct inode *inode, 
> >>>   			      struct buffer_head **ret_bh, 
> >>> @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode, 
> >>>   	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK); 
> >>>   	if (ret == -EAGAIN) { 
> >>>   		unlock_page(page); 
> >>> -		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0) 
> >>> -			ocfs2_inode_unlock(inode, ex); 
> >>>   		ret = AOP_TRUNCATED_PAGE; 
> >>>   	} 
> >>> 
> >>> -- 
> >>> 2.6.2 
> >> 
> >> 
> >> _______________________________________________ 
> >> Ocfs2-devel mailing list 
> >> Ocfs2-devel at oss.oracle.com 
> >> https://oss.oracle.com/mailman/listinfo/ocfs2-devel 
> >> 
 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-01-08  2:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-12-29 18:20 [Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock rgoldwyn at suse.de
2015-12-30  1:47 ` Junxiao Bi
2015-12-30  2:31 ` Gang He
2016-01-07  5:03   ` Eric Ren
2016-01-08  1:24     ` Goldwyn Rodrigues
2016-01-08  2:16       ` Zhen Ren
     [not found] ` <5683B274020000F900023B30@suse.com>
2015-12-30 18:35   ` Goldwyn Rodrigues
2016-01-05  6:52     ` Gang He

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.