Date: Wed, 25 Jan 2023 11:02:34 -0500
From: Peter Xu
To: Yang Shi
Cc: Mike Kravetz, linux-mm@kvack.org, Naoya Horiguchi, David Rientjes,
 Michal Hocko, Matthew Wilcox, David Hildenbrand, James Houghton,
 Muchun Song
Subject: Re: A mapcount riddle
Content-Type: text/plain; charset=utf-8
On Tue, Jan 24, 2023 at 03:29:53PM -0800, Yang Shi wrote:
> On Tue, Jan 24, 2023 at 3:00 PM Peter Xu wrote:
> >
> > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > Q: How can a page be mapped into multiple processes and have a
> > >    mapcount of 1?
> > >
> > > A: It is a hugetlb page referenced by a shared PMD.
> > >
> > > I was looking to expose some basic information about PMD sharing via
> > > /proc/smaps.  After adding the code, I started a couple of processes
> > > sharing a large hugetlb mapping that would result in the use of
> > > shared PMDs.  When I looked at the output of /proc/smaps, I saw
> > > my new metric counting the number of shared PMDs.  However, what
> > > stood out was that the entire mapping was listed as Private_Hugetlb.
> > > WTH???  It certainly was shared!  The routine smaps_hugetlb_range
> > > decides between Private_Hugetlb and Shared_Hugetlb with this code:
> > >
> > >         if (page) {
> > >                 int mapcount = page_mapcount(page);
> > >
> > >                 if (mapcount >= 2)
> > >                         mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > >                 else
> > >                         mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > >         }
> >
> > This is definitely unfortunate..
> >
> > >
> > > After spending some time looking for issues in the page_mapcount code,
> > > I came to the realization that the mapcount of hugetlb pages only
> > > referenced by a shared PMD would be 1 no matter how many processes had
> > > mapped the page.  When a page is first faulted, the mapcount is set to 1.
> > > When faulted in other processes, the shared PMD is added to the page
> > > table of the other processes.  No increase of mapcount will occur.
> > >
> > > At first thought this seems bad.  However, I believe this has been the
> > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > unaware of any reported issues.  I did an audit of code looking at
> > > mapcount.  In addition to the above issue with smaps, there appears
> > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > without appropriate privilege.
> > >
> > >         /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > >         if (flags & (MPOL_MF_MOVE_ALL) ||
> > >             (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > >                 if (isolate_hugetlb(page, qp->pagelist) &&
> > >                     (flags & MPOL_MF_STRICT))
> > >                         /*
> > >                          * Failed to isolate page but allow migrating pages
> > >                          * which have been queued.
> > >                          */
> > >                         ret = 1;
> > >         }
> > >
> > > I will prepare fixes for both of these.  However, I wanted to ask if
> > > anyone has ideas about other potential issues with this?
> >
> > This reminded me whether things should be checked already before this
> > happens.  E.g. when trying to share a pmd, would it make sense to check
> > the vma mempolicy before doing so?
> >
> > Then the question is: if pmd sharing only happens between vmas that share
> > the same memory policy, would the above mapcount==1 check be acceptable
> > even if the page is shared by multiple processes?
>
> I don't think so.  One process might change its policy, for example,
> bind to another node, and then trigger migration of the hugepage due to
> the incorrect mapcount.  The example code pasted by Mike actually
> comes from mbind if I remember correctly.

Yes, or any other page migration.  The above was a purely wild idea: share
the pmd only between vmas whose attributes match (shared, mapping alignment,
etc.).  That could also take the mempolicy into account, so that when
migrating one page on the shared pmd we could make the decision for all
sharers with mapcount==1, because that single mapcount would stand for all
the sharers of the page as long as they share the same mempolicy.

If that idea were applied, we would also need to unshare during mbind() when
the mempolicy of a hugetlb vma changes in this path, because once the
mempolicy changes the vma attributes no longer match, so pmd sharing no
longer holds.

But please ignore this whole thing - I don't think it would resolve the
generic mapcount problem with pmd sharing anyway.  It's just something that
came to mind when I read it.

> I'm wondering whether we could use refcount instead of mapcount to
> determine if a hugetlb page is shared or not, assuming refcounting for
> hugetlb pages behaves similarly to base pages (incremented when mapped
> by a new process or pinned).  If it is pinned (for example, by GUP) we
> can't migrate it either.

I think refcount has the same issue: it is not accounted either when the pmd
page is shared.

--
Peter Xu
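
For completeness, below is a minimal userspace sketch (not from this thread;
the file name, sizes, and setup are illustrative) of the situation being
discussed.  It assumes x86_64 with 2MB default huge pages, a kernel with
hugetlb PMD sharing and MFD_HUGETLB support, and at least 512 huge pages
reserved via /proc/sys/vm/nr_hugepages.  The 1GB size and 1GB alignment are
the conditions PMD sharing requires on that configuration.  Both processes
map and touch the same shared hugetlb region, yet each should see the whole
range reported under Private_Hugetlb in its smaps, because every page's
mapcount stays at 1.

/*
 * mapcount-riddle.c - illustrative sketch only.
 *
 * Two processes map the same 1GB hugetlb region.  With hugetlb PMD
 * sharing in effect, the child reuses the parent's PMD page, so the
 * per-page mapcount stays at 1 and smaps accounts the whole range as
 * Private_Hugetlb even though it is shared.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define PUD_SIZE (1UL << 30)  /* span/alignment needed for PMD sharing on x86_64 */

static void dump_hugetlb_smaps(const char *who)
{
        char cmd[160];

        /* Print only the non-zero hugetlb counters of our own smaps. */
        snprintf(cmd, sizeof(cmd),
                 "grep -E 'Shared_Hugetlb|Private_Hugetlb' /proc/%d/smaps | grep -v ' 0 kB'",
                 getpid());
        printf("--- %s (pid %d) ---\n", who, getpid());
        system(cmd);
}

int main(void)
{
        int fd = memfd_create("mapcount-riddle", MFD_CLOEXEC | MFD_HUGETLB);

        if (fd < 0 || ftruncate(fd, PUD_SIZE)) {
                perror("hugetlb memfd");
                return 1;
        }

        /*
         * Reserve 2GB of address space so the hugetlb mapping can be placed
         * at a 1GB-aligned address; PMD sharing only kicks in when the VMA
         * is PUD_SIZE-aligned and at least PUD_SIZE long.
         */
        char *slack = mmap(NULL, 2 * PUD_SIZE, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (slack == MAP_FAILED) {
                perror("reserve");
                return 1;
        }
        char *aligned = (char *)(((unsigned long)slack + PUD_SIZE - 1) &
                                 ~(PUD_SIZE - 1));
        char *p = mmap(aligned, PUD_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap hugetlb");
                return 1;
        }

        memset(p, 1, PUD_SIZE);         /* parent faults in all 512 huge pages */

        pid_t child = fork();
        if (child == 0) {
                memset(p, 2, PUD_SIZE); /* child faults via the shared PMD page */
                dump_hugetlb_smaps("child");
                _exit(0);
        }
        waitpid(child, NULL, 0);
        dump_hugetlb_smaps("parent");
        return 0;
}

Build with gcc and run with the huge page reservation in place; if
MFD_HUGETLB or the reservation is missing, the memfd_create, mmap, or memset
step fails outright rather than producing misleading output.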