Subject: Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
To: "Liam R. Howlett"
From: Miaohe Lin
CC: Muchun Song, Thorvald Natvig, Linux-MM
Message-ID: <76f33f3b-f61f-efe7-f63f-1b2e0efaf71d@huawei.com>
Date: Tue, 30 Jan 2024 10:14:35 +0800
In-Reply-To: <20240129161735.6gmjsswx62o4pbja@revolver>

On 2024/1/30 0:17, Liam R. Howlett wrote:
> * Miaohe Lin [240129 07:56]:
>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>
>>>>
>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig wrote:
>>>>>
>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>
>>>>> Based on some added pr_warn calls, we believe the following is happening:
>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>> is true.
>>>>>
>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>> lock for the same vma.
>>>>>
>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>> not hold.
>>>>
>>>> Thanks for your report. ->vm_private_data was introduced by the
>>>> series [1], so I suspect it was caused by that. But I hadn't reviewed
>>>> it at the time (actually, it is a little complex in the pmd sharing
>>>> case). I saw Miaohe had reviewed many of those patches.
>>>>
>>>> CC Miaohe, maybe he has some ideas on this.
>>>>
>>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> dmesg:
>>>>> WARNING: bad unlock balance detected!
>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>> -------------------------------------
>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>> [] hugetlb_vma_unlock_write+0x48/0x60
>>>>> but there are no more locks to release!
>>>
>>> Thanks for your report. It seems there's a race:
>>>
>>> CPU 1                                     CPU 2
>>> fork                                      hugetlbfs_fallocate
>>>  dup_mmap                                  hugetlbfs_punch_hole
>>>   i_mmap_lock_write(mapping);
>>>   vma_interval_tree_insert_after  -- Child vma is visible through i_mmap tree.
>>>   i_mmap_unlock_write(mapping);
>>>   hugetlb_dup_vma_private         -- Clear vma_lock outside i_mmap_rwsem!
>>>                                              i_mmap_lock_write(mapping);
>>>                                              hugetlb_vmdelete_list
>>>                                               vma_interval_tree_foreach
>>>                                                hugetlb_vma_trylock_write  -- Vma_lock is cleared.
>>>   tmp->vm_ops->open               -- Alloc new vma_lock outside i_mmap_rwsem!
>>>                                                hugetlb_vma_unlock_write   -- Vma_lock is assigned!!!
>>>                                              i_mmap_unlock_write(mapping);
>>>
>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside the
>>> i_mmap_rwsem lock, so there may be other bugs behind this. But I'm not
>>> really sure; I will take a closer look next week.
>>
>>
>> This can be fixed by deferring vma_interval_tree_insert_after() until the
>> vma is fully initialized, but I'm not sure whether there are side effects
>> with this patch.
>>
>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 47ff3b35352e..2ef2711452e0 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  		} else if (anon_vma_fork(tmp, mpnt))
>>  			goto fail_nomem_anon_vma_fork;
>>  		vm_flags_clear(tmp, VM_LOCKED_MASK);
>> -		file = tmp->vm_file;
>> -		if (file) {
>> -			struct address_space *mapping = file->f_mapping;
>> -
>> -			get_file(file);
>> -			i_mmap_lock_write(mapping);
>> -			if (vma_is_shared_maywrite(tmp))
>> -				mapping_allow_writable(mapping);
>> -			flush_dcache_mmap_lock(mapping);
>> -			/* insert tmp into the share list, just after mpnt */
>> -			vma_interval_tree_insert_after(tmp, mpnt,
>> -					&mapping->i_mmap);
>> -			flush_dcache_mmap_unlock(mapping);
>> -			i_mmap_unlock_write(mapping);
>> -		}
>>
>>  		/*
>>  		 * Copy/update hugetlb private vma information.
>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  		if (tmp->vm_ops && tmp->vm_ops->open)
>>  			tmp->vm_ops->open(tmp);
>>
>> +		file = tmp->vm_file;
>> +		if (file) {
>> +			struct address_space *mapping = file->f_mapping;
>> +
>> +			get_file(file);
>> +			i_mmap_lock_write(mapping);
>> +			if (vma_is_shared_maywrite(tmp))
>> +				mapping_allow_writable(mapping);
>> +			flush_dcache_mmap_lock(mapping);
>> +			/* insert tmp into the share list, just after mpnt */
>> +			vma_interval_tree_insert_after(tmp, mpnt,
>> +					&mapping->i_mmap);
>> +			flush_dcache_mmap_unlock(mapping);
>> +			i_mmap_unlock_write(mapping);
>> +		}
>> +
>>  		if (retval) {
>>  			mpnt = vma_next(&vmi);
>>  			goto loop_out;
>>
>>
>
> How is this possible?  I thought, as specified in mm/rmap.c, that the
> hugetlbfs path would be holding the mmap lock (which is also held in the
> fork path)?

The fork path holds the mmap lock of parent A and of the other children
(except the first child B), while the hugetlbfs path holds the mmap lock of
the first child B. So the mmap lock won't help here because it comes from a
different mm. Or am I missing something?
Thanks.

>
> That is, the mmap_lock must be held before the i_mmap_lock_write()
>
> Am I missing something?  Do we need an update to mm/rmap.c?
>
> Thanks,
> Liam
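
For readers trying to follow the imbalance, here is a minimal userspace model
of the logic described above. This is not the kernel code: the struct and
function names are simplified stand-ins for hugetlb_vma_trylock_write() and
hugetlb_vma_unlock_write(), and a pthread rwlock stands in for rw_sema. It
only illustrates how a vma_lock allocated between the trylock and the unlock
leads to an unlock without a matching lock.

/* Simplified userspace model, not kernel code. Build with: gcc -pthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct vma_lock {
	pthread_rwlock_t rw_sema;	/* stands in for the kernel rw_semaphore */
};

struct vma {
	struct vma_lock *vm_private_data;
};

/* Models hugetlb_vma_trylock_write(): nothing to take when no lock is set. */
static int vma_trylock_write(struct vma *vma)
{
	if (!vma->vm_private_data)
		return 1;	/* "success" without acquiring anything */
	return pthread_rwlock_trywrlock(&vma->vm_private_data->rw_sema) == 0;
}

/* Models hugetlb_vma_unlock_write(): re-reads vm_private_data at unlock time. */
static void vma_unlock_write(struct vma *vma)
{
	if (vma->vm_private_data)
		pthread_rwlock_unlock(&vma->vm_private_data->rw_sema);
}

int main(void)
{
	struct vma vma = { .vm_private_data = NULL };

	vma_trylock_write(&vma);	/* vm_private_data is NULL: nothing taken */

	/* A racing hugetlb_vm_op_open() allocates the lock in the window... */
	vma.vm_private_data = malloc(sizeof(*vma.vm_private_data));
	pthread_rwlock_init(&vma.vm_private_data->rw_sema, NULL);

	/* ...so the unlock now releases an rwlock that was never write-locked,
	 * which is the "bad unlock balance" the kernel warns about. */
	vma_unlock_write(&vma);
	printf("released a lock that was never taken\n");
	return 0;
}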
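
And, to make the process relationships concrete, an untested sketch of the
scenario as I read the thread. The hugetlbfs path "/mnt/huge/testfile", the
2MB size, and the loop structure are illustrative assumptions, not a known
reproducer: parent A maps a shared hugetlb file and keeps forking short-lived
children (dup_mmap -> hugetlb_vm_op_open on the new copies), while the first
child B keeps punching holes in the same range (MADV_REMOVE ->
hugetlbfs_punch_hole -> hugetlb_vmdelete_list).

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define LEN (2UL * 1024 * 1024)		/* one 2MB huge page; adjust to your setup */

int main(void)
{
	int fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, LEN) < 0)
		return 1;

	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;			/* fault the huge page in */

	if (fork() == 0) {
		/* first child B: punch holes via MADV_REMOVE in a loop */
		for (;;)
			madvise(p, LEN, MADV_REMOVE);
	}

	for (;;) {
		/* parent A: keep forking so dup_mmap races with B's punch hole */
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
	}
}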