From: Nikita Kalyazin <kalyazin@amazon.com>
To: David Hildenbrand
Cc: Sean Christopherson
Subject: Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
Date: Wed, 20 Nov 2024 15:58:29 +0000
Message-ID: <55b6b3ec-eaa8-494b-9bc7-741fe0c3bc63@amazon.com>
References: <20241024095429.54052-1-kalyazin@amazon.com>
 <08aeaf6e-dc89-413a-86a6-b9772c9b2faf@amazon.com>
 <01b0a528-bec0-41d7-80f6-8afe213bd56b@redhat.com>
X-Mailing-List: kvm@vger.kernel.org

On 20/11/2024 15:13, David Hildenbrand wrote:
> Hi!

Hi! :)

>> Results:
>> - MAP_PRIVATE: 968 ms
>> - MAP_SHARED: 1646 ms
>
> At least here it is expected to some degree: as soon as the page cache
> is involved, map/unmap gets slower, because we are effectively
> maintaining two data structures (page tables + page cache) instead of
> only a single one (page tables).
>
> Can you make sure that THP/large folios don't interfere in your
> experiments (e.g., madvise(MADV_NOHUGEPAGE))?

I was using the transparent_hugepage=never command-line argument in my
testing:

  $ cat /sys/kernel/mm/transparent_hugepage/enabled
  always madvise [never]

Is that sufficient to exclude the THP/large-folio factor?
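In case it is not, I can also opt the test mapping out of THP
explicitly with the madvise() you suggest.  A minimal sketch of the
MAP_SHARED side of the measurement with that added (the fd, size, and
4 KiB page size are placeholders, not my actual harness):

  #include <stddef.h>
  #include <sys/mman.h>

  static void populate_shared(int fd, size_t size)
  {
          unsigned char *p;
          size_t i;

          p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, 0);
          if (p == MAP_FAILED)
                  return;
          /* Rule out THP for this mapping regardless of the global knob. */
          madvise(p, size, MADV_NOHUGEPAGE);
          /* Fault in every base page; 4 KiB pages assumed. */
          for (i = 0; i < size; i += 4096)
                  p[i] = 1;
          munmap(p, size);
  }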
>> While this logic is intuitive, its performance effect is more
>> significant than I would expect.
>
> Yes. How much of the performance difference would remain if you hack
> out the atomic op just to play with it? I suspect there will still be
> some difference.

I have tried that, but could not see any noticeable difference in the
overall results.  It looks like a big portion of the bottleneck has
moved from shmem_get_folio_gfp/folio_mark_uptodate to
finish_fault/__pte_offset_map_lock somehow.  I have no good explanation
for why:

Orig:
  - 69.62% do_fault
     + 44.61% __do_fault
     + 20.26% filemap_map_pages
     +  3.48% finish_fault

Hacked:
  - 67.39% do_fault
     + 32.45% __do_fault
     + 21.87% filemap_map_pages
     + 11.97% finish_fault

Orig:
  -  3.48% finish_fault
     -  1.28% set_pte_range
           0.96% folio_add_file_rmap_ptes
     -  0.91% __pte_offset_map_lock
           0.54% _raw_spin_lock

Hacked:
  - 11.97% finish_fault
     -  8.59% __pte_offset_map_lock
        -  6.27% _raw_spin_lock
              preempt_count_add
           1.00% __pte_offset_map
     -  1.28% set_pte_range
        - folio_add_file_rmap_ptes
             __mod_node_page_state

> Note that we might improve allocation times with guest_memfd when
> allocating larger folios.

I suppose that may not always be an option, depending on the
requirements for consistency of the allocation latency.  E.g. if a
large folio isn't available at the time, performance would degrade to
the base case (please correct me if I'm missing something).

> Heh, now I spot that your comment was a reply to a series.

Yeah, sorry if it wasn't obvious.

> If your ioctl is supposed to do more than "allocating memory" like
> MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
> Because for allocating memory, we would want to use fallocate()
> instead.  I assume you want to "allocate+copy"?

Yes, the ultimate use case is "allocate+copy".

> I'll note that, as we're moving in the direction of moving
> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
> ioctls and think about something generic.

Good point, thanks.  Are we at the stage where a concrete API has been
proposed yet?  I might have missed that.

> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?

I can't immediately see why it shouldn't work; a rough sketch of how I
read that flow is below (see P.S.).  My main concern would still be the
latency of the population stage: I can't see why it would improve
compared to what we have now, because my feeling is that this is linked
to the sharedness property of guest_memfd.

> Cheers,
>
> David / dhildenb
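P.S. For concreteness, the populate-then-convert flow I understand you
to mean would look roughly like the below.  The mmap() of the
guest_memfd is the WIP (hypothetical) part; the conversion side uses
the existing KVM_SET_MEMORY_ATTRIBUTES ioctl.  It also assumes, for
simplicity, that the range starts at offset 0 of the guest_memfd and is
bound to @gpa:

  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <linux/kvm.h>

  static int populate_then_convert(int vm_fd, int gmem_fd, __u64 gpa,
                                   const void *src, size_t len)
  {
          struct kvm_memory_attributes attr = {
                  .address = gpa,
                  .size = len,
                  .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
          };
          void *dst;

          /* Populate while the range is still shared (WIP mmap support). */
          dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                     gmem_fd, 0);
          if (dst == MAP_FAILED)
                  return -1;
          memcpy(dst, src, len);
          munmap(dst, len);

          /* Convert the now-populated range to private. */
          return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
  }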