From: David Hildenbrand
Organization: Red Hat
To: Peter Xu
Cc: Shakeel Butt, "Kirill A. Shutemov", Yang Shi, Zi Yan, Matthew Wilcox,
    Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    "George G. Davis"
Date: Fri, 26 Nov 2021 10:04:01 +0100
Subject: Re: [PATCH] mm: split thp synchronously on MADV_DONTNEED
References: <20211120201230.920082-1-shakeelb@google.com>
    <25b36a5c-5bbd-5423-0c67-05cd6c1432a7@redhat.com>
    <57d649c8-fe13-17cd-8819-2cd93500a79c@redhat.com>

On 26.11.21 03:52, Peter Xu wrote:
> On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote:
>> On 25.11.21 11:24, Peter Xu wrote:
>>> On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
>>>>> Do we have a performance evaluation how much overhead is added e.g., for
>>>>> a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that
>>>>> covers the whole THP?
>>>>
>>>> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
>>>> x86 for both settings you suggested. I don't see any statistically
>>>> significant difference with and without the patch. Let me know if you
>>>> want me to try something else.
>>>
>>> I'm a bit surprised that sync split thp didn't bring any extra overhead.
>>>
>>> "unmap whole thp" is understandable from that pov, because afaict that won't
>>> even trigger any thp split anyway even delayed, if this is the simplest case
>>> that only this process mapped this thp, and it mapped once.
>>>
>>> For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be
>>> fast; while what I don't understand since thp split requires all hand-made work
>>> for copying thp flags into small pages and so on, so I thought there should at
>>> least be some overhead measured. Shakeel, could there be something overlooked
>>> in the test, or maybe it's me that overlooked?
>>>
>>> I had the same concern as what Kirill/Matthew raised in the other thread - I'm
>>> worried proactively splitting simply because any 4k page is zapped might
>>> quickly free up 2m thps in the system and I'm not sure whether it'll exaggerate
>>> the defragmentation of the system memory in general. I'm also not sure whether
>>> that's ideal for some very common workload that frequently uses DONTNEED to
>>> proactively drop some pages.
>>
>> The pageblock corresponding to the THP is movable. So (unless we start
>> spilling unmovable allocations into movable pageblocks) we'd only place
>> movable allocations in there. Compaction will be able to migrate to
>> re-create a free THP.
>>
>> In contrast I think, compaction will happily skip over the THP and
>> ignore it, because it has no clue that the THP could be repurposed by
>> split+migrate (at least I am not aware of code that does it).
>>
>> Unless I am missing something, with the above in mind it could make
>> sense to split as soon as possible, even before we're under memory
>> pressure -- for example, for proactive compaction.
>>
>> [proactive compaction could try splitting first as well I think]
>
> But we can't rely on proactive compaction for rapid operations, because it's
> still adding overhead to the overall system by split+merge, right?

Yes, but there is also direct compaction that can be triggered without
the shrinker getting involved. I think we can summarize as "there might
not be a right or wrong time to split". An application that
MADV_DONTNEEDs/munmaps sub-THP memory has told us that it doesn't want
to consume that memory, yet it looks like it's still consuming it.

I do wonder how THPs on the deferred split queue behave with respect to
page migration -- memory offlining, alloc_contig_range(). I saw reports
that there are some cases where THP can be problematic when
stress-testing THP:
https://lkml.kernel.org/r/20210903162102.GA10039@mam-gdavis-dt

But I'm not sure if that's related to deferred splitting. Most probably
not.

>
> +compaction_proactiveness
> +========================
> + ...
> +Note that compaction has a non-trivial system-wide impact as pages
> +belonging to different processes are moved around, which could also lead
> +to latency spikes in unsuspecting applications. The kernel employs
> +various heuristics to avoid wasting CPU cycles if it detects that
> +proactive compaction is not being effective.
>
> Delaying split makes sense to me because after all the kernel is not aware of
> the userspace's preference, so the best thing is to do nothing until necessary.
>
> Proactively split thps in dontneed/unmap added an assumption that the userspace
> wants to break the pages by default. It's 100% true for Shakeel's use case,
> but I'm not sure whether it may always be true. That's why I thought maybe a
> new interface is more proper, so we at least won't break anyone by accident.

Well, we already broke the PMD into PTEs. So the performance gain, at
least for that user, is really gone until we "fix that" again via
khugepaged -- which might just be configured to not "fix" it if there
are empty PTEs.

It certainly gets interesting if you have a COW huge page and only one
party zaps/unmaps some part.

-- 
Thanks,

David / dhildenb
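
For reference, the "unmap 4k upon thp" case discussed in this thread can be
sketched from userspace roughly as follows. This is only an illustrative
sketch, not Shakeel's benchmark and not part of the patch: the helper name
zap_one_subpage() is hypothetical, a 2 MiB PMD-sized THP (as on x86-64) is
assumed, and MADV_HUGEPAGE is only a hint, so the range may end up backed by
base pages depending on the THP configuration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: PMD-sized THP is 2 MiB, as on x86-64. */
#define THP_SIZE (2UL << 20)

/*
 * Hypothetical helper: populate one THP-aligned 2 MiB anonymous range, zap a
 * single base page in its middle with MADV_DONTNEED, and check that only the
 * zapped page reads back as zero. Returns 0 on success, -1 on failure.
 */
int zap_one_subpage(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 2 * THP_SIZE;      /* over-allocate so we can align */
    char *map, *thp, *victim;
    int ret = -1;

    if (page <= 0)
        return -1;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return -1;

    /* Round up to the next THP-aligned address inside the mapping. */
    thp = (char *)(((uintptr_t)map + THP_SIZE - 1) & ~(uintptr_t)(THP_SIZE - 1));
    assert(thp + THP_SIZE <= map + len);

#ifdef MADV_HUGEPAGE
    /* Hint only; failure (e.g. THP disabled) is deliberately ignored. */
    (void)madvise(thp, THP_SIZE, MADV_HUGEPAGE);
#endif

    /* Fault in the whole range with a non-zero pattern. */
    memset(thp, 0xaa, THP_SIZE);

    /* Zap one base page in the middle of the (potential) THP. */
    victim = thp + THP_SIZE / 2;
    if (madvise(victim, (size_t)page, MADV_DONTNEED) != 0)
        goto out;

    /* The zapped page is zero-filled on next access; neighbours are intact. */
    if (victim[0] == 0 && victim[page - 1] == 0 &&
        (unsigned char)victim[-1] == 0xaa &&
        (unsigned char)victim[page] == 0xaa)
        ret = 0;
out:
    munmap(map, len);
    return ret;
}
```

The data check behaves the same whether or not the range was THP-backed and
whether or not the THP gets split. Whether a THP was actually in place can be
checked separately, for example via the AnonHugePages field in
/proc/self/smaps before and after the madvise() call.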