Date: Fri, 13 Jun 2025 15:15:19 -0400
From: Peter Xu
To: Jason Gunthorpe
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org,
    Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro, David Hildenbrand,
    Nico Pache
Subject: Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
References: <20250613134111.469884-1-peterx@redhat.com>
    <20250613134111.469884-6-peterx@redhat.com>
    <20250613142903.GL1174925@nvidia.com>
    <20250613160956.GN1174925@nvidia.com>
In-Reply-To: <20250613160956.GN1174925@nvidia.com>

On Fri, Jun 13, 2025 at 01:09:56PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 11:26:40AM -0400, Peter Xu wrote:
> > On Fri, Jun 13, 2025 at 11:29:03AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:
> > >
> > > > +        /* Choose the alignment */
> > > > +        if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > > > +                ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > +                                                   flags, PUD_SIZE, 0);
> > > > +                if (ret)
> > > > +                        return ret;
> > > > +        }
> > > > +
> > > > +        if (phys_len >= PMD_SIZE) {
> > > > +                ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > +                                                   flags, PMD_SIZE, 0);
> > > > +                if (ret)
> > > > +                        return ret;
> > > > +        }
> > >
> > > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > > 4k ARM with we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > > contiguity to get a 32*16k=1GB option.
> > >
> > > Forcing to only align to the PMD or PUD seems suboptimal..
> >
> > Right, however the cont-pte / cont-pmd are still not supported in huge
> > pfnmaps in general?  It'll definitely be nice if someone could look at that
> > from ARM perspective, then provide support of both in one shot.
>
> Maybe leave behind a comment about this.
> I've been poking around if
> someone would do the ARM PFNMAP support but can't report any commitment.

I wasn't sure where best to leave a note about the overall pfnmap effort,
so I added one to the commit message of this patch:

        Note 2: Currently contiguous pgtable entries (for example, cont-pte)
        are not yet supported for huge pfnmaps in general.  This patch does
        not take them into account either.  Separate work will be needed to
        enable contiguous pgtable entries on archs that support them.

>
> > > > +fallback:
> > > > +        return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > >
> > > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > > thp_get_unmapped_area_vmflags() too?
> > >
> > > Is there any reason the caller should have to do a retry?
> >
> > We would still need thp_get_unmapped_area_vmflags() because that encodes
> > PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> > as a generic helper.
>
> There is only one caller for thp_get_unmapped_area_vmflags(), just
> open code PMD_SIZE there and thin this whole thing out. It reads
> better like that anyhow:
>
>         } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
>                    && !addr /* no hint */
>                    && IS_ALIGNED(len, PMD_SIZE)) {
>                 /* Ensures that larger anonymous mappings are THP aligned. */
>                 addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
>                                                     flags, vm_flags, PMD_SIZE);
>
> > That was ok, however that loses some flexibility when the caller wants to
> > try with different alignments, exactly like above: currently, it was trying
> > to do a first attempt of PUD mapping then fallback to PMD if that fails.
>
> Oh, that's a good point, I didn't notice that subtle bit.
>
> But then maybe that is showing the API is just wrong and the core code
> should be trying to find the best alignment not the caller. Like we
> can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?
>
> VFIO would just pass the BAR size, implying the best alignment, and
> the core implementation will try to get the largest VMA alignment that
> snaps to an arch supported page contiguity, testing each of the arches
> page size possibilities in turn.
>
> That sounds like a much better API than pushing this into drivers??

Yes, it would be nice if core mm could evolve to make this easier to
support.  The question, though, is how to pass the needed information to
core mm.

For example, a vfio device file currently represents the whole device, and
it is VFIO that defines what the MMIO region offsets mean.  So core mm
cannot easily tell which BAR VFIO is mapping if all it receives is an
mmap() request.  Even if we assume core mm provides some vma flag for
that, it won't be per-vma; it needs to be decided case by case for each
mmap() request, based at least on the pgoff and len being mapped.

And the BARs of a single device can certainly have different sizes, so
mmap() calls on the same device fd can ask for different alignments.

This is similar to many other get_unmapped_area() users.  For example, see
v4l2_m2m_get_unmapped_area(), which likewise needs to know at least which
part of the file is being mapped:

        if (offset < DST_QUEUE_OFF_BASE) {
                vq = v4l2_m2m_get_src_vq(fh->m2m_ctx);
        } else {
                vq = v4l2_m2m_get_dst_vq(fh->m2m_ctx);
                pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT);
        }

Such flexibility might still be needed for now, until we know how to
provide the abstraction.

Meanwhile, existing get_unmapped_area() users can have other constraints,
where the decision may depend on any parameter passed in, not only pgoff.
So even if we provide the whole pgoff info, it might not be enough.

-- 
Peter Xu
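
P.S. To make the "try the largest alignment first, then fall back" idea
discussed above concrete, here is a minimal userspace sketch.  It is not
the code from the patch: the sketch_* names and the two sizes (2 MiB and
1 GiB, as on a 4K-page x86-64-like layout) are assumptions made up for
illustration.  It only shows how a per-region length could select an
alignment, and how a virtual address could be chosen so that its offset
within that alignment matches the physical address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes for illustration only; not taken from the patch. */
#define SKETCH_PMD_SIZE (UINT64_C(1) << 21) /* 2 MiB */
#define SKETCH_PUD_SIZE (UINT64_C(1) << 30) /* 1 GiB */

/* Candidate alignments, largest first. */
static const uint64_t sketch_aligns[] = { SKETCH_PUD_SIZE, SKETCH_PMD_SIZE };

/*
 * Return the largest candidate alignment that a region of region_len
 * bytes is big enough to use, or 0 if only base pages make sense.
 */
static uint64_t sketch_pick_alignment(uint64_t region_len)
{
    size_t n = sizeof(sketch_aligns) / sizeof(sketch_aligns[0]);

    for (size_t i = 0; i < n; i++)
        if (region_len >= sketch_aligns[i])
            return sketch_aligns[i];
    return 0;
}

/*
 * Return the lowest address >= floor whose offset within align equals
 * the offset of phys_addr within align, so huge entries can line up.
 */
static uint64_t sketch_align_virt(uint64_t floor, uint64_t phys_addr,
                                  uint64_t align)
{
    /* align must be a non-zero power of two. */
    assert(align != 0 && (align & (align - 1)) == 0);

    uint64_t want = phys_addr & (align - 1);
    uint64_t addr = (floor & ~(align - 1)) + want;

    if (addr < floor)
        addr += align;
    return addr;
}
```

A caller would pass the length of the region being mapped (for VFIO, the
BAR that the pgoff decodes to) to sketch_pick_alignment() and, if the
result is non-zero, use sketch_align_virt() to adjust a candidate
address.  Different BARs on the same fd would naturally get different
alignments this way, which is the per-request decision described above.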