Date: Tue, 10 Sep 2024 15:33:42 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Jens Axboe
Cc: Linux-MM, Yu Zhao, Andrew Morton, Muchun Song
Subject: Re: Hugepage program taking forever to exit
Message-ID: <20240910193342.GA108220@cmpxchg.org>
In-Reply-To: <02ffa542-ce49-4755-9d2b-29841f9973e0@kernel.dk>
On Tue, Sep 10, 2024 at 12:21:42PM -0600, Jens Axboe wrote:
> Hi,
>
> Investigating another issue, I wrote the following simple program that
> allocates and faults in 500 1GB huge pages, and then registers them
> with io_uring.
> Each step is timed:
>
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
>
> and as expected, faulting in the pages takes (by far) the longest. From
> the above, you'd also expect the total runtime to be around ~39 seconds.
> But it is not... In fact it takes 82 seconds in total for this program
> to have exited. Looking at why, I see:
>
> [<0>] __wait_rcu_gp+0x12b/0x160
> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
> [<0>] update_and_free_pages_bulk+0x4c/0x220
> [<0>] return_unused_surplus_pages+0x80/0xa0
> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
> [<0>] hugetlb_vm_op_close+0x160/0x180
> [<0>] remove_vma+0x20/0x60
> [<0>] exit_mmap+0x199/0x340
> [<0>] mmput+0x49/0x110
> [<0>] do_exit+0x261/0x9b0
> [<0>] do_group_exit+0x2c/0x80
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] x64_sys_call+0x714/0x720
> [<0>] do_syscall_64+0x5b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Yeah, this looks wrong to me:

void hugetlb_vmemmap_optimize_folios(struct hstate *h,
				     struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);

	list_for_each_entry(folio, folio_list, lru) {
		int ret = hugetlb_vmemmap_split_folio(h, folio);

		/*
		 * Splitting the PMD requires allocating a page, thus let's
		 * fail early once we encounter the first OOM. No point in
		 * retrying as it can be dynamically done on remap with the
		 * memory we get back from the vmemmap deduplication.
		 */
		if (ret == -ENOMEM)
			break;
	}

	flush_tlb_all();

	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	synchronize_rcu();

	list_for_each_entry(folio, folio_list, lru) {
		int ret;

		ret = __hugetlb_vmemmap_optimize_folio(h, folio,
						       &vmemmap_pages,
						       VMEMMAP_REMAP_NO_TLB_FLUSH);

		/*
		 * Pages to be freed may have been accumulated. If we
		 * encounter an ENOMEM, free what we have and try again.
		 * This can occur in the case that both splitting fails
		 * halfway and head page allocation also failed. In this
		 * case __hugetlb_vmemmap_optimize_folio() would free memory
		 * allowing more vmemmap remaps to occur.
		 */
		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
			flush_tlb_all();
			free_vmemmap_page_list(&vmemmap_pages);
			INIT_LIST_HEAD(&vmemmap_pages);
			__hugetlb_vmemmap_optimize_folio(h, folio,
							 &vmemmap_pages,
							 VMEMMAP_REMAP_NO_TLB_FLUSH);
		}
	}

	flush_tlb_all();
	free_vmemmap_page_list(&vmemmap_pages);
}

If you don't have HVO enabled, then hugetlb_vmemmap_split_folio() does
nothing. And __hugetlb_vmemmap_optimize_folio() also does nothing,
leaving &vmemmap_pages empty and free_vmemmap_page_list() a nop.

So what's left is: it flushes the TLB twice and waits for RCU. What for
exactly?

The same is true for hugetlb_vmemmap_optimize_folio() and the
corresponding split function, which waits for RCU on every page being
allocated and freed, even if the vmemmap is left alone.

Surely all those RCU waits and tlb flushes should be guarded by whether
the HVO is actually enabled, no?