From: Yang Shi
Date: Tue, 24 May 2022 13:02:29 -0700
Subject: Re: [RFC] mm: MADV_COLLAPSE semantics
To: "Zach O'Keefe"
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox, Michal Hocko, Peter Xu, Song Liu, Linux MM, Rongwei Wang, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins, "Kirill A. Shutemov", Minchan Kim, SeongJae Park, Pasha Tatashin

On Mon, May 23, 2022 at 5:19 PM Zach O'Keefe wrote:
>
> Hey All,
>
> I'm sending this out before the v6 of "mm: userspace hugepage
> collapse" for the purposes of aligning on and finalizing the semantics
> of the proposed MADV_COLLAPSE madvise(2) mode.
>
> Background:
>
> So far, thanks to everyone's input, we've aligned on:
> - MADV_COLLAPSE specifies its own hugepage allocation semantics (it
>   allows direct reclaim/compaction).
> - MADV_COLLAPSE ignores khugepaged heuristics
>   (/sys/kernel/mm/transparent_hugepage/khugepaged/max_pte_* and
>   young/referenced page requirements).
>
> In terms of THP _eligibility_, in v5 it was proposed that
> MADV_COLLAPSE follow existing THP eligibility semantics
> (/sys/kernel/mm/transparent_hugepage/enabled + the VMA flags of the
> VMA being collapsed)[1].
>
> However, Rongwei Wang kindly pointed out that the usability of
> process_madvise(MADV_COLLAPSE) on a system in "madvise" THP mode was
> limited. I agreed to include process_madvise(2) support for
> MADV_[NO]HUGEPAGE in v6, but following a discussion with David H., I
> think that was a mistake. Namely, as David kindly pointed out, there
> exist programs that don't work with THP and have good reason to
> disable it.
> The example provided was postcopy live migration in QEMU, which
> explicitly disables THP right before faulting in any pages.
>
> Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> but otherwise would attempt to collapse.

I do agree to respect VM_NOHUGEPAGE and "never". Collapsing VMAs that
are not madvised in "madvise" mode sounds OK to me, but I'm not so sure.

>
> Why? If someone(*), somewhere told us not to use THPs, then don't
> override that decision. Otherwise, this is an explicit, safe(**)
> request made on behalf of ourselves, or by a CAP_SYS_ADMIN process,
> and shouldn't be blocked by interfaces meant to guide the
> "transparent" part of THPs.
>
> Other options considered:
>
> I considered variations of setting VM_HUGEPAGE only if calling on
> behalf of self or if VM_NOHUGEPAGE is not set. However, I didn't like
> this because there isn't a way to undo the operation: If we supported
> process_madvise(MADV_NOHUGEPAGE), we would have to let the application
> clear VM_NOHUGEPAGE because outside processes can't/shouldn't. It
> would have to require some *new* madvise mode like MADV_CLEARHUGEPAGE
> (that would fail if calling on behalf of another process and
> VM_NOHUGEPAGE set) to clear VM_[NO]HUGEPAGE.
>
> A possible downside to the proposed approach is that, if in "madvise"
> THP mode and collapsing a VMA not marked VM_HUGEPAGE, it's now the
> caller's responsibility to monitor and recollapse this memory back
> into THPs. However, in practice this likely means an explicit
> MADV_DONTNEED (please let me know if there are other important cases
> here), and presumably it's the caller's job to do the monitoring anyway.

Page reclaim could also split the THP, and that may happen at any time.
I'm not sure how users or callers could monitor it.

>
> Thanks again for taking the time to read / provide input here. I think
> this is the last point to clear up before releasing a v6 that should
> hopefully have all the functionality we need.
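
To make the monitoring concern concrete, here is a minimal sketch of what a caller could do today, assuming it is acceptable to poll /proc/<pid>/smaps and compare the per-mapping AnonHugePages fields between snapshots. The helper names (anon_huge_kb, thp_lost_kb, read_smaps) are made up for illustration and nothing in this thread proposes them. The sketch only covers anonymous THPs, and it can only notice a split after the fact, which is the point above: there is no notification when reclaim splits a page.

```python
import re

# Matches per-mapping smaps lines such as "AnonHugePages:      2048 kB".
_ANON_HUGE = re.compile(r"^AnonHugePages:\s+(\d+)\s+kB", re.MULTILINE)


def anon_huge_kb(smaps_text: str) -> int:
    """Sum the AnonHugePages fields (in kB) across all mappings."""
    return sum(int(kb) for kb in _ANON_HUGE.findall(smaps_text))


def thp_lost_kb(before: str, after: str) -> int:
    """kB of THP-backed anonymous memory that went away between snapshots.

    Returns 0 if THP coverage stayed the same or grew.
    """
    return max(0, anon_huge_kb(before) - anon_huge_kb(after))


def read_smaps(pid: str = "self") -> str:
    """Read one smaps snapshot (Linux only; hypothetical helper)."""
    with open(f"/proc/{pid}/smaps") as f:
        return f.read()
```

A caller would take a snapshot with read_smaps() after collapsing, take another one later, and recollapse if thp_lost_kb() is non-zero. How often to poll is left open, and any split between two polls goes unnoticed until the next one.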
>
> Best,
> Zach
>
> ---
>
> (*) If we could verify that "never" THP mode was used _only_ for
> debugging, then I'd actually opt to ignore "never" in MADV_COLLAPSE.
> It's the last dependency MADV_COLLAPSE has on the sysfs THP interface
> and would provide a convenient way to test/debug MADV_COLLAPSE with
> khugepaged / at-fault disabled.
> (**) I suppose there could exist applications that see THP "madvise"
> mode, never call MADV_HUGEPAGE, and so assume THPs will never be
> found.
>
> [1] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
> [2] https://lore.kernel.org/linux-mm/502a3ced-f3c6-7117-3b24-d80d204d66ee@linux.alibaba.com/
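
For reference, the difference between the v5 eligibility rule and the rule proposed in this RFC can be written down as a small truth-table sketch. This is a simplified model, not kernel code: the function names are hypothetical, the VMA flags are reduced to two booleans, and the real eligibility checks look at more than the sysfs mode and these two flags. The only format assumed is that /sys/kernel/mm/transparent_hugepage/enabled marks the active mode with brackets, as in "always [madvise] never".

```python
import re


def active_thp_mode(enabled_text: str) -> str:
    """Return the bracketed mode, e.g. 'madvise' from 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", enabled_text)
    if match is None:
        raise ValueError(f"no active mode found in {enabled_text!r}")
    return match.group(1)


def eligible_v5(mode: str, vm_hugepage: bool, vm_nohugepage: bool) -> bool:
    """v5 model: follow existing THP eligibility (sysfs mode + VMA flags)."""
    if mode == "never" or vm_nohugepage:
        return False
    if mode == "always":
        return True
    # "madvise" mode: only VMAs that were marked with MADV_HUGEPAGE qualify.
    return vm_hugepage


def eligible_proposed(mode: str, vm_hugepage: bool, vm_nohugepage: bool) -> bool:
    """Proposed model: refuse only if someone explicitly opted out of THP."""
    # vm_hugepage is intentionally unused: a missing MADV_HUGEPAGE no longer
    # blocks an explicit collapse request.
    return mode != "never" and not vm_nohugepage
```

The two models differ in exactly one case: "madvise" mode with neither VM_HUGEPAGE nor VM_NOHUGEPAGE set, where v5 refuses and the proposal collapses. That is the case the question above is about.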