Date: Wed, 13 Apr 2022 14:52:08 -0300
From: Jason Gunthorpe
To: David Hildenbrand
Cc: Sean Christopherson, Andy Lutomirski, Chao Peng, kvm list,
 Linux Kernel Mailing List, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 Linux API, qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet,
 Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, the arch/x86 maintainers, "H. Peter Anvin",
 Hugh Dickins, Jeff Layton, "J. Bruce Fields", Andrew Morton, Mike Rapoport,
 Steven Price, "Maciej S. Szmigiero",
Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , "Kirill A. Shutemov" , "Nakajima, Jun" , Dave Hansen , Andi Kleen Subject: Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK Message-ID: <20220413175208.GI64706@ziepe.ca> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> <20220310140911.50924-5-chao.p.peng@linux.intel.com> <02e18c90-196e-409e-b2ac-822aceea8891@www.fastmail.com> <7ab689e7-e04d-5693-f899-d2d785b09892@redhat.com> <20220412143636.GG64706@ziepe.ca> <1686fd2d-d9c3-ec12-32df-8c4c5ae26b08@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1686fd2d-d9c3-ec12-32df-8c4c5ae26b08@redhat.com> Precedence: bulk List-ID: X-Mailing-List: linux-api@vger.kernel.org On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote: > On 12.04.22 16:36, Jason Gunthorpe wrote: > > On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote: > > > >> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he > >> past already with secretmem, it's not 100% that good of a fit (unmovable > >> is worth than mlocked). But it gets the job done for now at least. > > > > No, it doesn't. There are too many different interpretations how > > MELOCK is supposed to work > > > > eg VFIO accounts per-process so hostile users can just fork to go past > > it. > > > > RDMA is per-process but uses a different counter, so you can double up > > > > iouring is per-user and users a 3rd counter, so it can triple up on > > the above two > > Thanks for that summary, very helpful. I kicked off a big discussion when I suggested to change vfio to use the same as io_uring We may still end up trying it, but the major concern is that libvirt sets the RLIMIT_MEMLOCK and if we touch anything here - including fixing RDMA, or anything really, it becomes a uAPI break for libvirt.. > >> So I'm open for alternative to limit the amount of unmovable memory we > >> might allocate for user space, and then we could convert seretmem as well. > > > > I think it has to be cgroup based considering where we are now :\ > > Most probably. I think the important lessons we learned are that > > * mlocked != unmovable. > * RLIMIT_MEMLOCK should most probably never have been abused for > unmovable memory (especially, long-term pinning) The trouble is I'm not sure how anything can correctly/meaningfully set a limit. Consider qemu where we might have 3 different things all pinning the same page (rdma, iouring, vfio) - should the cgroup give 3x the limit? What use is that really? IMHO there are only two meaningful scenarios - either you are unpriv and limited to a very small number for your user/cgroup - or you are priv and you can do whatever you want. The idea we can fine tune this to exactly the right amount for a workload does not seem realistic and ends up exporting internal kernel decisions into a uAPI.. Jason