From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: SeongJae Park
Cc: akpm@linux-foundation.org, markubo@amazon.com, SeongJae Park, Jonathan.Cameron@Huawei.com, acme@kernel.org, alexander.shishkin@linux.intel.com, amit@kernel.org, benh@kernel.crashing.org, brendanhiggins@google.com, corbet@lwn.net, dwmw@amazon.com, elver@google.com, fan.du@intel.com, foersleo@amazon.de, greg@kroah.com, gthelen@google.com, guoju.fgj@alibaba-inc.com, jgowans@amazon.com, joe@perches.com, mgorman@suse.de, mheyne@amazon.de, minchan@kernel.org, mingo@redhat.com, namhyung@kernel.org, peterz@infradead.org, riel@surriel.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, shakeelb@google.com, shuah@kernel.org, sieberf@amazon.com, snu@zelle79.org, vbabka@suse.cz, vdavydov.dev@gmail.com, zgf574564920@gmail.com, linux-damon@amazon.com, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v34 05/13] mm/damon: Implement primitives for the virtual memory address spaces
Date: Thu, 26 Aug 2021 23:42:19 +0200
Message-ID: <3b094493-9c1e-6024-bfd5-7eca66399b7e@redhat.com>
In-Reply-To: <20210826172920.4877-1-sjpark@amazon.de>
References: <20210826172920.4877-1-sjpark@amazon.de>
On 26.08.21 19:29, SeongJae Park wrote:
> From: SeongJae Park
>
> Hello David,
>
>
> On Thu, 26 Aug 2021 16:09:23 +0200 David Hildenbrand wrote:
>
>>> +static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
>>> +{
>>> +	pte_t *pte = NULL;
>>> +	pmd_t *pmd = NULL;
>>> +	spinlock_t *ptl;
>>> +
>>
>> I just stumbled over this, sorry for the dumb questions:
>
> I appreciate the great questions!
>
>>
>>
>> a) What do we know about that region we are messing with?
>>
>> AFAIU, just like follow_pte() and follow_pfn(), follow_invalidate_pte()
>> should only be called on VM_IO and raw VM_PFNMAP mappings in general
>> (see the doc of follow_pte()). Do you even know that it's within a
>> single VMA and that there are no concurrent modifications?
>
> We have no idea about the region at this moment. However, if we
> successfully get the pte or pmd under the protection of the page table
> lock, we ensure that the page for the pte or pmd is an online LRU page,
> using damon_get_page(), before updating the pte or pmd's PAGE_ACCESSED
> bit. We release the page table lock only after the update.
>
> And a concurrent VMA change doesn't matter here, because we read and write
> only the page table. If the address is not mapped or not backed by LRU
> pages, we simply treat it as not accessed.

Reading and writing page tables is the real problem.

>
>>
>> b) Which locks are we holding?
>>
>> I hope we're holding the mmap lock in read mode at least.
>> Or how are you
>> making sure there are no concurrent modifications to page tables / VMA
>> layout ... ?
>>
>>> +	if (follow_invalidate_pte(mm, addr, NULL, &pte, &pmd, &ptl))
>
> All the operations are protected by the page table lock of the pte or pmd,
> so no concurrent page table modification can happen. As previously
> mentioned, because we read and update only the page table, we don't care
> about VMAs, and therefore we don't need to hold the mmap lock here.

See below; that's unfortunately not sufficient.

>
> Outside of this function, DAMON reads the VMAs to learn which address
> ranges are not mapped, and uses that information to avoid inefficiently
> checking access to those areas. Nevertheless, that happens only
> occasionally (once per 60 seconds by default), and it holds the mmap read
> lock in that case.
>
> Nonetheless, I agree that the usage of follow_invalidate_pte() here could
> be very confusing to readers. It might be better to implement and use
> DAMON's own page table walk logic. Of course, I might be missing something
> important. If you think so, please don't hesitate to yell at me.

I'm certainly not going to yell :) But unfortunately I'll have to tell you
that what you are doing is, in my understanding, fundamentally broken.

See, page tables might get removed at any time:

a) by munmap() code, even while holding the mmap semaphore in read (!) mode

b) by khugepaged, holding the mmap lock in write mode

The rules are (ignoring the rmap side of things):

a) You can walk page tables inside a known VMA with the mmap semaphore held
in read mode. If you drop the mmap sem, you have to re-validate the VMA!
Anything could have changed in the meantime. This is essentially what
mm/pagewalk.c does.

b) You can walk page tables ignoring VMAs with the mmap semaphore held in
write mode.

c) You can walk page tables locklessly if the architecture supports it and
you have interrupts disabled the whole time. But you are not allowed to
write.
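For reference, rule (a) looks roughly like the following kernel-style
sketch. This is only an illustration of the mm/pagewalk.c pattern, not the
DAMON patch or a complete implementation; it won't build standalone, and
the helper names mkold_pte_entry/mkold_safely are invented:

```c
/* Sketch: walking page tables the mm/pagewalk.c way -- mmap lock held in
 * read mode, VMAs validated by the walker. Helper names are hypothetical. */
#include <linux/mm.h>
#include <linux/mmap_lock.h>
#include <linux/pagewalk.h>

static int mkold_pte_entry(pte_t *pte, unsigned long addr,
			   unsigned long next, struct mm_walk *walk)
{
	/* For VMA-based walks, the walker calls this with the PTE lock
	 * held, so it is safe to test/clear the accessed bit here. */
	return 0;
}

static const struct mm_walk_ops mkold_ops = {
	.pte_entry = mkold_pte_entry,
};

static void mkold_safely(struct mm_struct *mm, unsigned long addr)
{
	mmap_read_lock(mm);
	/* walk_page_range() only visits page tables covered by VMAs of
	 * this mm; dropping the lock would require re-validating them. */
	walk_page_range(mm, addr, addr + PAGE_SIZE, &mkold_ops, NULL);
	mmap_read_unlock(mm);
}
```

The point is that the walker itself checks the VMA layout under the mmap
read lock, so it can never dereference a page table that munmap() or
khugepaged has already torn down.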
With what you're doing, you might end up reading random garbage as page
table pointers, or writing random garbage to pages that are no longer used
as page tables.

Take a look at mm/gup.c:lockless_pages_from_mm() to see how difficult it is
to walk page tables locklessly. And it only works because the page table
freeing code synchronizes either via IPI or fake-RCU before actually
freeing a page table.

follow_invalidate_pte() is, in general, the wrong thing to use. It's
specialized to VM_IO and VM_PFNMAP mappings. Take a look at the difference
in complexity between follow_invalidate_pte() and mm/pagewalk.c!

I'm really sorry, but as far as I can tell, this is locking-wise broken,
and follow_invalidate_pte() is the wrong interface to use here.

Someone can most certainly correct me if I'm wrong, or if I'm missing
something regarding your implementation, but if you take a look around, you
won't find any code walking page tables without at least holding the mmap
sem in read mode -- for a good reason.

-- 
Thanks,

David / dhildenb