From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.1 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C4806C2B9F7 for ; Fri, 28 May 2021 16:51:58 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 45E566139A for ; Fri, 28 May 2021 16:51:58 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 45E566139A Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id CB1886B0036; Fri, 28 May 2021 12:51:57 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C61AC6B006E; Fri, 28 May 2021 12:51:57 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AB42D6B0070; Fri, 28 May 2021 12:51:57 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0221.hostedemail.com [216.40.44.221]) by kanga.kvack.org (Postfix) with ESMTP id 706AC6B0036 for ; Fri, 28 May 2021 12:51:57 -0400 (EDT) Received: from smtpin23.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 75C89A8CB for ; Fri, 28 May 2021 16:51:55 +0000 (UTC) X-FDA: 78191231790.23.7F50B9A Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by imf15.hostedemail.com (Postfix) with ESMTP id C9CD3A00014F for ; Fri, 28 May 2021 16:51:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1622220714; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=hCpZDLzGfuvpzK9WauKlTe9JpEyhJUtHOj4EKePAcT8=; b=fZeAHGfagx2aWKDdolk2ZncZwcHwQ+fxnObAZQI+H8cLxN1/JlVfngKe4iXyHpJTYbhO/C BioPdT++PgGeP9l8KSzkVwCKkgUWI6hy3Tp6POwJMuCHun7kwAocplMe5HIj+R3zxQ16ti b8OVBa5wlf/lqTJ2yPVPL0UNyE4mPQw= Received: from mail-ed1-f71.google.com (mail-ed1-f71.google.com [209.85.208.71]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-559-Oe4s-y3UOESFVUYj9tbGNA-1; Fri, 28 May 2021 12:51:52 -0400 X-MC-Unique: Oe4s-y3UOESFVUYj9tbGNA-1 Received: by mail-ed1-f71.google.com with SMTP id n6-20020a0564020606b029038cdc241890so2397160edv.20 for ; Fri, 28 May 2021 09:51:52 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:to:cc:references:from:organization:subject :message-id:date:user-agent:mime-version:in-reply-to :content-language:content-transfer-encoding; bh=hCpZDLzGfuvpzK9WauKlTe9JpEyhJUtHOj4EKePAcT8=; b=OzFUxr3Wp44SSe9GfxFmQU5jO/Q4FQFOZNe4iivoWm0B0LLbsOIQRxBHg10hqZr7c2 us+RG+hyWBmXpqcs1ubW36Wzbfmb658hO4ztesGYxOhdEcQdeusClfNQGMNLy6pHN5Sl gEOxg6A5NPYcELBOK8RDbjS+aIVGtBPlcl6DbQsYelYlgP1UDxAIODdORMgpdoRFz1qK JEHHTp5R8nioIG+jxazwR8q4PwjhL0+szVg9I2MSNRxirZLjNceEoy0iScaHPY4wGIJi rE2SPi4UoGFPjNorljqPvOz85326QUCwVtow/YGOIBOZC59hevxgG2HrDojcYXwwJGrN 0QDA== X-Gm-Message-State: AOAM530efjUzd8DlnNyxFU50nN0hDUrxjg7PIT810p9lVd0JF3aCbzj5 OPo3FaIMIHb2AJ8Rykuo6BjebPUb0+yzGmHgNleqxdw5idAxZFKuEruCxmYKC8qLgeMqS1Q5QlB uZTEoQItwAeU= X-Received: by 2002:a05:6402:22f1:: with SMTP id dn17mr10807145edb.286.1622220711678; Fri, 28 May 2021 09:51:51 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyq5SQF7umO2y05aiV0Je3yHNv8BRSDCCoe3Aa0Z344MrU4znNVekOImJM82sObFVcWUvBgKw== X-Received: by 2002:a05:6402:22f1:: with SMTP id dn17mr10807115edb.286.1622220711360; Fri, 28 May 2021 09:51:51 -0700 (PDT) Received: from [192.168.3.132] (p5b0c6870.dip0.t-ipconnect.de. [91.12.104.112]) by smtp.gmail.com with ESMTPSA id bu1sm2598573ejb.116.2021.05.28.09.51.50 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 28 May 2021 09:51:50 -0700 (PDT) To: Dan Williams Cc: Bjorn Helgaas , Greg KH , Arnd Bergmann , Ingo Molnar , Kees Cook , Matthew Wilcox , Russell King , Andrew Morton , Linux Kernel Mailing List , Linux MM , Linux PCI , Daniel Vetter , =?UTF-8?Q?Krzysztof_Wilczy=c5=84ski?= , Jason Gunthorpe , Christoph Hellwig References: <159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com> <20210527205845.GA1421476@bjorn-Precision-5520> <7eaf4c10-292e-18d7-e8ce-3a6b72122381@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH v4] /dev/mem: Revoke mappings when a driver claims the region Message-ID: Date: Fri, 28 May 2021 18:51:50 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.10.1 MIME-Version: 1.0 In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=fZeAHGfa; dmarc=pass (policy=none) header.from=redhat.com; spf=none (imf15.hostedemail.com: domain of david@redhat.com has no SPF policy when checking 216.205.24.124) smtp.mailfrom=david@redhat.com X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: C9CD3A00014F X-Stat-Signature: h7c5oxguiehz93rwhfbatsafud6dchjw X-HE-Tag: 1622220709-969619 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 28.05.21 18:42, Dan Williams wrote: > On Fri, May 28, 2021 at 1:58 AM David Hildenbrand wr= ote: >> >> On 27.05.21 23:30, Dan Williams wrote: >>> On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas wr= ote: >>>> >>>> [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci] >>>> >>>> On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote: >>>>> Close the hole of holding a mapping over kernel driver takeover eve= nt of >>>>> a given address range. >>>>> >>>>> Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges") >>>>> introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the >>>>> kernel against scenarios where a /dev/mem user tramples memory that= a >>>>> kernel driver owns. However, this protection only prevents *new* re= ad(), >>>>> write() and mmap() requests. Established mappings prior to the driv= er >>>>> calling request_mem_region() are left alone. >>>>> >>>>> Especially with persistent memory, and the core kernel metadata tha= t is >>>>> stored there, there are plentiful scenarios for a /dev/mem user to >>>>> violate the expectations of the driver and cause amplified damage. >>>>> >>>>> Teach request_mem_region() to find and shoot down active /dev/mem >>>>> mappings that it believes it has successfully claimed for the exclu= sive >>>>> use of the driver. Effectively a driver call to request_mem_region(= ) >>>>> becomes a hole-punch on the /dev/mem device. >>>> >>>> This idea of hole-punching /dev/mem has since been extended to PCI >>>> BARs via [1]. >>>> >>>> Correct me if I'm wrong: I think this means that if a user process h= as >>>> mmapped a PCI BAR via sysfs, and a kernel driver subsequently reques= ts >>>> that region via pci_request_region() or similar, we punch holes in t= he >>>> the user process mmap. The driver might be happy, but my guess is t= he >>>> user starts seeing segmentation violations for no obvious reason and >>>> is not happy. >>>> >>>> Apart from the user process issue, the implementation of [1] is >>>> problematic for PCI because the mmappable sysfs attributes now depen= d >>>> on iomem_init_inode(), an fs_initcall, which means they can't be >>>> static attributes, which ultimately leads to races in creating them. >>> >>> See the comments in iomem_get_mapping(), and revoke_iomem(): >>> >>> /* >>> * Check that the initialization has completed. Losing the = race >>> * is ok because it means drivers are claiming resources be= fore >>> * the fs_initcall level of init and prevent iomem_get_mapp= ing users >>> * from establishing mappings. >>> */ >>> >>> ...the observation being that it is ok for the revocation inode to >>> come on later in the boot process because userspace won't be able to >>> use the fs yet. So any missed calls to revoke_iomem() would fall back >>> to userspace just seeing the resource busy in the first instance. I.e= . >>> through the normal devmem_is_allowed() exclusion. >>> >>>> >>>> So I'm raising the question of whether this hole-punch is the right >>>> strategy. >>>> >>>> - Prior to revoke_iomem(), __request_region() was very >>>> self-contained and really only depended on the resource tree. = Now >>>> it depends on a lot of higher-level MM machinery to shoot down >>>> mappings of other tasks. This adds quite a bit of complexity = and >>>> some new ordering constraints. >>>> >>>> - Punching holes in the address space of an existing process see= ms >>>> unfriendly. Maybe the driver's __request_region() should fail >>>> instead, since the driver should be prepared to handle failure >>>> there anyway. >>> >>> It's prepared to handle failure, but in this case it is dealing with = a >>> root user of 2 minds. >>> >>>> >>>> - [2] suggests that the hole punch protects drivers from /dev/me= m >>>> writers, especially with persistent memory. I'm not really >>>> convinced. The hole punch does nothing to prevent a user proc= ess >>>> from mmapping and corrupting something before the driver loads= . >>> >>> The motivation for this was a case that was swapping between /dev/mem >>> access and /dev/pmem0 access and they forgot to stop using /dev/mem >>> when they switched to /dev/pmem0. If root wants to use /dev/mem it ca= n >>> use it, if root wants to stop the driver from loading it can set >>> mopdrobe policy or manually unbind, and if root asks the kernel to >>> load the driver while it is actively using /dev/mem something has to >>> give. Given root has other options to stop a driver the decision to >>> revoke userspace access when root messes up and causes a collision >>> seems prudent to me. >>> >> >> Is there a real use case for mapping pmem via /dev/mem or could we jus= t >> prohibit the access to these areas completely? >=20 > The kernel offers conflicting access to iomem resources and a > long-standing mechanism to enforce mutual exclusion > (CONFIG_IO_STRICT_DEVMEM) between those interfaces. That mechanism was > found to be incomplete for the case where a /dev/mem mapping is > maintained after a kernel driver is attached, and incomplete for other > mechanisms to map iomem like pci-sysfs. This was found with PMEM, but > the issue is larger and applies to userspace drivers / debug in > general. >=20 >> What's the use case for "swapping between /dev/mem access and /dev/pme= m0 >> access" ? >=20 > "Who knows". I mean, I know in this case it was a platform validation > test using /dev/mem for "reasons", but I am not sure that is relevant > to the wider concern. If CONFIG_IO_STRICT_DEVMEM=3Dn exclusion is > enforced when drivers pass the IORESOURCE_EXCLUSIVE flag, if > CONFIG_IO_STRICT_DEVMEM=3Dy exclusion is enforced whenever the kernel > marks a resource IORESOURCE_BUSY, and if kernel lockdown is enabled > the driver state is moot as LOCKDOWN_DEV_MEM and LOCKDOWN_PCI_ACCESS > policy is in effect. >=20 I was thinking about a mechanism to permanently disallow /dev/mem access=20 to specific memory regions (BUSY or not) in any /dev/mem mode. In my=20 case, it would apply to the whole virtio-mem provided memory region.=20 Once the driver is loaded, it would disallow access to the whole region. I thought about doing it via the kernel resource tree, extending the=20 EXCLUSIVE flag to !BUSY SYSRAM regions. But a simplistic list managed in=20 /dev/mem code would also be possible. That's why I wondered if we could just disallow access to these physical=20 PMEM memory regions right from the start similarly, such that we don't=20 have to really care about revoking in case of PMEM anymore. --=20 Thanks, David / dhildenb