From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 22 Mar 2023 14:25:27 +0000
Subject: Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
From: Ryan Roberts
To: "Yin, Fengwei" , Andrew Morton , "Matthew Wilcox (Oracle)" , Yu Zhao
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
References: <20230317105802.2634004-1-ryan.roberts@arm.com>
 <7b63b922-9c19-b1ed-a6f6-72e1b25f38f6@intel.com>
In-Reply-To: <7b63b922-9c19-b1ed-a6f6-72e1b25f38f6@intel.com>
Content-Type: text/plain; charset="us-ascii"

On 22/03/2023 13:36, Yin, Fengwei wrote:
>
>
> On 3/22/2023 8:03 PM, Ryan Roberts wrote:
>> Hi Matthew,
>>
>> On 17/03/2023 10:57, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> [...]
>>>
>>> Bug(s)
>>> ======
>>>
>>> When I run this code without the last (workaround) patch, with DEBUG_VM
>>> et al, PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly
>>> these relate to invalid kernel addresses (which usually look like either
>>> NULL + a small offset, or mostly zeros with a few mid-order bits set + a
>>> small offset) or lockdep complaining about a bad unlock balance. Call
>>> stacks are often in madvise_free_pte_range(), but I've seen them in
>>> filesystem code too. (I can email example oopses out separately if anyone
>>> wants to review them).
>>> My hunch is that struct pages adjacent to the folio are being corrupted,
>>> but I don't have hard evidence.
>>>
>>> When adding the workaround patch, which prevents madvise_free_pte_range()
>>> from attempting to split a large folio, I never see any issues. Although
>>> I'm not putting the system under memory pressure, so I guess I might see
>>> the same types of problem crop up under swap, etc.
>>>
>>> I've reviewed most of the code within split_folio() and can't find any
>>> smoking gun, but I wonder if there are implicit assumptions about the
>>> large folio being PMD-sized that I'm now obviously breaking?
>>>
>>> The code in madvise_free_pte_range():
>>>
>>>             if (folio_test_large(folio)) {
>>>                     if (folio_mapcount(folio) != 1)
>>>                             goto out;
>>>                     folio_get(folio);
>>>                     if (!folio_trylock(folio)) {
>>>                             folio_put(folio);
>>>                             goto out;
>>>                     }
>>>                     pte_unmap_unlock(orig_pte, ptl);
>>>                     if (split_folio(folio)) {
>>>                             folio_unlock(folio);
>>>                             folio_put(folio);
>>>                             orig_pte = pte_offset_map_lock(mm, pmd,
>>>                                                            addr, &ptl);
>>>                             goto out;
>>>                     }
>>>                     ...
>>>             }
>>
>> I've noticed that it's folio_split() with a folio order of 1 that causes my
>> problems. And I also see that the page cache code always explicitly avoids
>> allocating order-1 folios:
>>
>> void page_cache_ra_order(struct readahead_control *ractl,
>>                 struct file_ra_state *ra, unsigned int new_order)
>> {
>>         ...
>>
>>         while (index <= limit) {
>>                 unsigned int order = new_order;
>>
>>                 /* Align with smaller pages if needed */
>>                 if (index & ((1UL << order) - 1)) {
>>                         order = __ffs(index);
>>                         if (order == 1)
>>                                 order = 0;
>>                 }
>>                 /* Don't allocate pages past EOF */
>>                 while (index + (1UL << order) - 1 > limit) {
>>                         if (--order == 1)
>>                                 order = 0;
>>                 }
>>                 err = ra_alloc_folio(ractl, index, mark, order, gfp);
>>                 if (err)
>>                         break;
>>                 index += 1UL << order;
>>         }
>>
>>         ...
>> }
>>
>> Matthew, what is the reason for this? I suspect it's guarding against the
>> same problem I'm seeing.
>>
>> If I explicitly prevent order-1 allocations for anon pages, I'm unable to
>> cause any oops/panic/etc. I'd just like to understand the root cause.
> Checked the struct folio definition. The _deferred_list is in the third page
> struct. My understanding is that to support folio split, the folio order
> must be >= 2.

Thanks. Yep, looks like we have found the root cause - thanks for your help!

I've updated calc_anonymous_folio_order() to only use a non-0 order if THP is
available, and in that case, to never allocate order-1. I think that both
fixes the problem and manages the dependency we have on THP:

static void calc_anonymous_folio_order(struct vm_fault *vmf,
                                       int *order_out,
                                       unsigned long *addr_out)
{
        /*
         * The aim here is to determine what size of folio we should allocate
         * for this fault. Factors include:
         * - Folio must be naturally aligned within VA space
         * - Folio must not breach boundaries of vma
         * - Folio must be fully contained inside one pmd entry
         * - Folio must not overlap any non-none ptes
         * - Order must not be higher than *order_out upon entry
         *
         * Additionally, we do not allow order-1 since this breaks
         * assumptions elsewhere in the mm; THP pages must be at least
         * order-2 (since they store state up to the 3rd struct page
         * subpage), and these pages must be THP in order to correctly use
         * pre-existing THP infrastructure such as folio_split().
         *
         * As a consequence of relying on the THP infrastructure, if the
         * system does not support THP, we always fall back to order-0.
         *
         * Note that the caller may or may not choose to lock the pte. If
         * unlocked, the calculation should be considered an estimate that
         * will need to be validated under the lock.
         */
        struct vm_area_struct *vma = vmf->vma;
        int nr;
        int order;
        unsigned long addr;
        pte_t *pte;
        pte_t *first_set = NULL;
        int ret;

        if (has_transparent_hugepage()) {
                order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);

                for (; order > 1; order--) {
                        nr = 1 << order;
                        addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
                        pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

                        /* Check vma bounds. */
                        if (addr < vma->vm_start ||
                            addr + nr * PAGE_SIZE > vma->vm_end)
                                continue;

                        /* Ptes covered by order already known to be none. */
                        if (pte + nr <= first_set)
                                break;

                        /* Already found set pte in range covered by order. */
                        if (pte <= first_set)
                                continue;

                        /* Need to check if all the ptes are none. */
                        ret = check_all_ptes_none(pte, nr);
                        if (ret == nr)
                                break;

                        first_set = pte + ret;
                }

                if (order == 1)
                        order = 0;
        } else {
                order = 0;
        }

        *order_out = order;
        *addr_out = order > 0 ? addr : vmf->address;
}

>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks,
>> Ryan
>>
>>
>>
>>>
>>> Will normally skip my large folios because they have a mapcount > 1, due
>>> to incrementing mapcount for each pte, unlike PMD mapped pages. But on
>>> occasion it will see a mapcount of 1 and proceed. So I guess this is
>>> racing against reclaim or CoW in this case?
>>>
>>> I also see it's doing a dance to take the folio lock and drop the ptl.
>>> Perhaps my large anon folio is not using the folio lock in the same way
>>> as a THP would and we are therefore not getting the expected
>>> serialization?
>>>
>>> I'd really appreciate any suggestions for how to progress here!
>>>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel