From mboxrd@z Thu Jan  1 00:00:00 1970
From: alex chen <alex.chen@huawei.com>
Date: Thu, 2 Nov 2017 18:50:20 +0800
Subject: [Ocfs2-devel] [PATCH] ocfs2/cluster: unlock the o2hb_live_lock
	before the o2nm_depend_item()
In-Reply-To: <63ADC13FD55D6546B7DECE290D39E373CED73DB8@H3CMLB14-EX.srv.huawei-3com.com>
References: <59F86F83.7010501@huawei.com>
	<2581fe81-0b89-0385-a8e2-b82ca66ec114@gmail.com>
	<59F97160.9000506@huawei.com>
	<63ADC13FD55D6546B7DECE290D39E373CED73664@H3CMLB14-EX.srv.huawei-3com.com>
	<59F9981A.6090503@huawei.com>
	<63ADC13FD55D6546B7DECE290D39E373CED73783@H3CMLB14-EX.srv.huawei-3com.com>
	<59FA945C.1070002@huawei.com>
	<63ADC13FD55D6546B7DECE290D39E373CED73DB8@H3CMLB14-EX.srv.huawei-3com.com>
Message-ID: <59FAF86C.2020208@huawei.com>
List-Id: <ocfs2-devel.oss.oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: ocfs2-devel@oss.oracle.com

Hi Changwei,

Thanks for you reply.

On 2017/11/2 13:58, Changwei Ge wrote:
> On 2017/11/2 11:45, alex chen wrote:
>> Hi Changwei,
>>
>> On 2017/11/1 17:59, Changwei Ge wrote:
>>> Hi Alex,
>>>
>>> On 2017/11/1 17:52, alex chen wrote:
>>>> Hi Changwei,
>>>>
>>>> Thanks for you reply.
>>>>
>>>> On 2017/11/1 16:28, Changwei Ge wrote:
>>>>> Hi Alex,
>>>>>
>>>>> On 2017/11/1 15:05, alex chen wrote:
>>>>>> Hi Joseph and Changwei,
>>>>>>
>>>>>> It's our basic principle that the function in which may sleep can't be called
>>>>>> within spinlock hold.
>>>>>
>>>>> I suppose this principle is a suggestion not a restriction.
>>>>>
>>>>
>>>> It is a very very bad idea to sleep while holding a spin lock.
>>>> If a process grabs a spinlock and goes to sleep before releasing it.
>>>> Then this process yields the control of the CPU to a second process.
>>>> Unfortunately the second process also need to grab the spinlock and it will
>>>> busy wait. On an uniprocessor machine the second process will lock the CPU not
>>>> allowing the first process to wake up and release the spinlock so the second
>>>> process can't continue, it is basically a deadlock.
>>>
>>> I agree. This soft lockup truly exists. This point is OK for me.
>>>
>>>>
>>>>>>
>>>>>> On 2017/11/1 9:03, Joseph Qi wrote:
>>>>>>> Hi Alex,
>>>>>>>
>>>>>>> On 17/10/31 20:41, alex chen wrote:
>>>>>>>> In the following situation, the down_write() will be called under
>>>>>>>> the spin_lock(), which may lead a soft lockup:
>>>>>>>> o2hb_region_inc_user
>>>>>>>>     spin_lock(&o2hb_live_lock)
>>>>>>>>      o2hb_region_pin
>>>>>>>>       o2nm_depend_item
>>>>>>>>        configfs_depend_item
>>>>>>>>         inode_lock
>>>>>>>>          down_write
>>>>>>>>          -->here may sleep and reschedule
>>>>>>>>
>>>>>>>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>>>>>>>> get item reference in advance to prevent the region to be released.
>>>>>>>>
>>>>>>>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>>>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>>>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>>>>>>>> ---
>>>>>>>>     fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>>>>>>>     1 file changed, 8 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>>>>>>> index d020604..f1142a9 100644
>>>>>>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>>>>>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>>>>>>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>>>>     		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>>>>>>>     			goto skip_pin;
>>>>>>>>
>>>>>>>> +		config_item_get(&reg->hr_item);
>>>>>>>> +		spin_unlock(&o2hb_live_lock);
>>>>>>>> +
>>>>>>> If unlock here, the iteration of o2hb_all_regions is no longer safe.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joseph
>>>>>>>
>>>>>>
>>>>>> In local heartbeat mode, here we already found the region and will break the loop after
>>>>>> depending item, we get the item reference before spin_unlock(), that means the region will
>>>>>> never be released by the o2hb_region_release() until we put the item reference after
>>>>>> spin_lock(&o2hb_live_lock), so we can safely iterate over the list.
>>>>>> In global heartbeat mode, it doesn't matter that some regions may be deleted after
>>>>>> spin_unlock(), because we just pin all the active regions.
>>>>>
>>>>> But we are still iterating elements on global list  *o2hb_all_regions*,
>>>>> right? How can we guarantee that no more elements will be attached or
>>>>> detached while the lock is unlocked.
>>>>>
>>>>
>>>> In global heartbeat mode, we pin all regions if there is the first dependent user and
>>>> unpin all regions if the number of dependent users falls to zero.
>>>> So there is not exist another region will be attached or detached after spin_unlock().
>>>
>>> Well, as for local heartbeat mode, I think, we still need to protect
>>> that global list.
>>>
>>> Thanks,
>>> Changwei
>>>
>> For the local heartbeat mode, as I said before, here we already found the region and will
>> break the loop after depending item, so we just need to protect the region during depending
>> item. The global list doesn't need to protect anymore.
> 
> Hi Alex,
> Um, I think this is too tricky.
>

The o2hb_live_lock mainly protect the "o2hb_all_regions" not the region and we will not iterate
the list o2hb_all_regions any more after we find the region. So we can unlock here.
If you have any other opinions, please let me know and I really appreciate you if you can express
your opinions as detailed as possible.

>>
>> I find a problem in my patch. The reg->hr_item_pinned should be protect by the o2hb_live_lock,
>> so we should call spin_lock(&o2hb_live_lock) immediately after o2nm_depend_item();
>> I will modified the patch and resend the patch
> Well, if you do so to revise your patch, how can you keep the 
> consistency between setting ::hr_item_pinned and pining ::hr_item.
> 
> What will ::hr_item_pinned stand for against your revision?
> I think this fix is fragile.
> 

Firstly, the ::hr_item_pinned is set in o2hb_region_pin() when mounting filesystem and cleared
in o2hb_region_unpin() when umounting filesystem. mount and umount the filesystem can only be called
one by one. So the ::hr_item_pinned will not be set and cleared concurrently.
Secondly, Although the o2hb_region_pin() is multiple calls, but are called serial in the function
dlm_register_domain_handlers(). So the ::hr_item_pinned will not be set concurrently.
So we don't need to worry about the consistency of the ::hr_item_pinned.

> Besides that the issue reported by Fungguang seems to be a crash issue 
> rather than a lockup one.
> 
> Thanks,
> Changwei
> 
Um, I think the CONFIG_DEBUG_ATOMIC_SLEEP had been defined in the kernel tested by Fengguang,
so the warning "BUG: sleeping function..." could be printed. But normally, the
CONFIG_DEBUG_ATOMIC_SLEEP would not be defined in the released kernel. so it is a soft lockup.
You can read the function ___might_sleep() for more details.

Thanks,
Alex
>>
>> Thank,
>> Alex
>>>>
>>>> Thank,
>>>> Alex
>>>>>>
>>>>>> Thanks,
>>>>>> Alex
>>>>>>
>>>>>>>>     		/* Ignore ENOENT only for local hb (userdlm domain) */
>>>>>>>>     		ret = o2nm_depend_item(&reg->hr_item);
>>>>>>>>     		if (!ret) {
>>>>>>>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>>>>     			else {
>>>>>>>>     				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>>>>>>>     				     uuid, ret);
>>>>>>>> +				config_item_put(&reg->hr_item);
>>>>>>>> +				spin_lock(&o2hb_live_lock);
>>>>>>>>     				break;
>>>>>>>>     			}
>>>>>>>>     		}
>>>>>>>> +
>>>>>>>> +		config_item_put(&reg->hr_item);
>>>>>>>> +		spin_lock(&o2hb_live_lock);
>>>>>>>>     skip_pin:
>>>>>>>>     		if (found)
>>>>>>>>     			break;
>>>>>>>> -- 1.9.5.msysgit.1
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>>
>>>
>>>
>>> .
>>>
>>
>>
> 
> 
> .
>