From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7318EC54E49 for ; Fri, 8 Mar 2024 01:26:12 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id BDB926B030D; Thu, 7 Mar 2024 20:26:11 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id B8B9E6B030E; Thu, 7 Mar 2024 20:26:11 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A543E6B030F; Thu, 7 Mar 2024 20:26:11 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 8FB956B030D for ; Thu, 7 Mar 2024 20:26:11 -0500 (EST) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 52465C03FB for ; Fri, 8 Mar 2024 01:26:11 +0000 (UTC) X-FDA: 81872130942.21.34DE04A Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf30.hostedemail.com (Postfix) with ESMTP id B5C548000E for ; Fri, 8 Mar 2024 01:26:09 +0000 (UTC) Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=dxT9k5Vn; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf30.hostedemail.com: domain of 3MGnqZQoKCOEbRVUbDKPHGJRRJOH.FRPOLQXa-PPNYDFN.RUJ@flex--yosryahmed.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3MGnqZQoKCOEbRVUbDKPHGJRRJOH.FRPOLQXa-PPNYDFN.RUJ@flex--yosryahmed.bounces.google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709861169; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=eXseg/OXOOaXO0QfrFbk0NsY7v7P+AZODbEPqN/H3o8=; b=fBWLts9eq52zKj2V2dbK/q1aUEsGARgP3b9hQ5chdGW+odmkj5XJUPkkN91eX/Jmn2euko oSucKUtJTBRFKaxCkyZFWPoB6CIZS4j4VmH2C9FL3/mIL6kP9UerYmfvD58TUjRRNv2aaL zQXd+Y3gXP4f0APjBw0UibdKG0qKJ3s= ARC-Authentication-Results: i=1; imf30.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=dxT9k5Vn; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf30.hostedemail.com: domain of 3MGnqZQoKCOEbRVUbDKPHGJRRJOH.FRPOLQXa-PPNYDFN.RUJ@flex--yosryahmed.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3MGnqZQoKCOEbRVUbDKPHGJRRJOH.FRPOLQXa-PPNYDFN.RUJ@flex--yosryahmed.bounces.google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709861169; a=rsa-sha256; cv=none; b=iFfIT1TZxB9vO8ZHjO8Cw4lzOYzq09dFhw21bP+PxhRVJwcKae6IlhuOd1unfD8q86fIHt aR4MRGtd3/xvPiwhEMQRkBDUjtdhzNhMVheMCxA4pmcof3k2Vs+ZV/10rS+u6VlWuQWu6t 1nDLiZzjua/YhUdZGOPksZFbJ6vPNEA= Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-dc6b26783b4so743200276.0 for ; Thu, 07 Mar 2024 17:26:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1709861169; x=1710465969; darn=kvack.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=eXseg/OXOOaXO0QfrFbk0NsY7v7P+AZODbEPqN/H3o8=; b=dxT9k5VnIb1aXrvXJZKlUkjWGYa0acSt9ZoSk/rF5OSAJZm/F9U1AJK/7iRQh2/SDI AzxLFkvVNghR8Br4DnNVTs3eW2kf7GgbC1NWLr0lbCZbeZHxmBPdSPnahRtUVYZpz7e1 rzYdAfELGQeF1LJpFq1c/ePgxe+UoKdgUQpb6531GPZGiK/c0XiQtgycTTdH1VG3bb9B lRPsEgUZsCYHUxpzaTWA8cy8G/zPLdvOvX3bnBJh1zqY0LfAVeMnkqIUvcdZ1CDeq/ep qYAlKpJtghdR8MAS/z8wGTvb++1S067u3kFdxWDLsO0HJultkyIuv+6TQiaU5SPQMyjF wETQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709861169; x=1710465969; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=eXseg/OXOOaXO0QfrFbk0NsY7v7P+AZODbEPqN/H3o8=; b=T9BhRUF2bPQyWIizPkNiin+zVG1sKbxgl5/yzKCRwCu9nCqcq0V0Byhj7oVt3Yo0kg 76OdkRQacRBK43K/duYpiq8jE8UMlNVje/DvfMD4mW21cCsYGlC3uUnWmCiEGIBfMMZR w8pUCJ5WpFbGi2jQWqbYcct2jKFYlJfdp+PmjSwjHQOHV2ta8BAiEdX+JZamkCen8q8J gHfiKl5VGnr6m6OHyeCe7BtoYOHKScC3lrX6GTP/yqqGEIqiH+lmnscpBEWycSSOdpjF 5GKb5tWus08+qfC+lFauUawE6JfSRs2vpMd6knvZRwuDRTD33mz2toev1NDjrrd1TRa2 kzqQ== X-Forwarded-Encrypted: i=1; AJvYcCVui7R2kgdtt3nsUVyXAoKIh/2v7NDmRFAIoMf9ePhFY6F7VH0a8SHM0JTf56ZcAu8WXwQT0dGAs2z8s1IFRxIO/vc= X-Gm-Message-State: AOJu0Yz7nxKYhcTrAoTRtJB+sydxPJIZiE4k6hMqjrQRWf5MHGNLmxB7 ugfumMZyK1/PSSqiz80oz+ghghGAi/NeuEosUwV/zrmH0oh//+hPBJFP393igHGe2L3vYAPIFUE xilhqFuZ8HkAgsNsjgA== X-Google-Smtp-Source: AGHT+IFwilqH/mUypD/jBtdQECzEjD7iyvfhQwzvfhI6NuQeEvEpas/utpcVa7OIlx3iC/kKszoRugEj08/+/mjC X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:29b4]) (user=yosryahmed job=sendgmr) by 2002:a05:6902:18d3:b0:dc7:82ba:ba6e with SMTP id ck19-20020a05690218d300b00dc782baba6emr666616ybb.7.1709861168825; Thu, 07 Mar 2024 17:26:08 -0800 (PST) Date: Fri, 8 Mar 2024 01:26:06 +0000 In-Reply-To: Mime-Version: 1.0 References: <20240307133916.3782068-1-yosryahmed@google.com> <20240307133916.3782068-3-yosryahmed@google.com> <83b019e2-1b84-491a-b0b9-beb02e45d80c@intel.com> Message-ID: Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching From: Yosry Ahmed To: Dave Hansen Cc: Andrew Morton , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "Kirill A. Shutemov" , x86@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="us-ascii" X-Rspam-User: X-Stat-Signature: zt6e7og5gtbynwg1ciro6s99t1t9s391 X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: B5C548000E X-HE-Tag: 1709861169-208242 X-HE-Meta: U2FsdGVkX1+uNjYEWSddXCmlU19Rf80hy58vnUQMOPzeKFidbogELYnNMG3l6rREdhd/QgayldpOqNwosVNkBd7vvjsH6pZq1zZk2mPAe2okrdkoaT5VObppEINjwSXDT/y5J6lddA2+r/XSIKqHZeHdOt8qHHdKIJAdAwI+IfBHIbkFjBTHuo/HAVA/M5gGYloVS5+MNWo+8V7UP+VKqmCqwt48K8qSHFIUvslDsmE4GojBj7VkIaVntGUMLc64cppRr7d1KJDpQg6ZsNZPGMht078qnxz/Jj/ZxfWJChPKog5zt7ulGbLE44QpDD34i8Or1oaeOwb9zDYpXlNfzWfVy51vgdxgJFWKOP2quuFUNh++IYEGZ1ACc8Pas8qSk2aBLeTolKwhlOqGtKcD4f5a1Zdx2Yb5K3E13MZQK6c2G6f0uVX5fMWOKiuvaFxXxUVaOYHOy2LVqxr9WMT+/MpdxJH+ovf/wEsnA05WXvj+Ko31yiRY9b20d+CXJMRLX8+Rv9f8ye9S+NYzaguRTrmO0AkbNlqmqBtbasMk4jRcIRZWuimCKCL63PZ30IGb98Biz2iSaJkFZiZt1Ksx+hTQxsjyn6pzG1FNgmeUuItk2EU0kifCsQhHKXggOJ9RTLWVPEMJRYfry6ZwscK6HDwmcHJ+QV6685YEDtke2PPOj2AmKVZQIfk7GVHt2vQ3FuZRscjvEZkFCxY5Zkb/ExyKChzeaWTldchkBGv1bR4TbknPPoEgomKVXHm5f17spY02oGT326U87woKSOPh/p0c2MczabjqHyYCricabQVRHY5sJcx2on9nKIVEdAFrf8q7HvS/RZmoe5w6mMr9tXI9BWGhmBmNbQxrRwYy08zbz2+V4fjcHzw0cuHPR8oqQvsyOI7NmGMgW5zapEf81UwcFYEdvSE+gxOuGCCIX0G13GkTXtH/tzt/donTrfD7e+ZGHpbnqwm6RSY0WWZ gS/mhThg HlBBFfW5A2Ba8njZAkAxpsb+BX+rTzNwmaGgb/OKh/eYXbxojIW28QRkeZ9Xoqh3Rysnqit0l/0+TC74DZ61oQU/Krurerw3RZQW/uUvPu8zfFrLMJdHv5Ouaei2FW9T/bD2wPdnGJMkNkyd5odMgJ8SzQ4ZE/7p+XOn/yyynr+K0ttbLJXq5i2uuDkNXNqu1ofBNUMwtpLUGKeN0xNtbyB3X8TZOCjckLNKAmh4cGoBWkv14kKwe3L/IVbtmfsdHvd9Jc9wEbZHKA/O6NlVKLmR0lW1OxFHIxrSPs312jyF4Wd3hNI/2dU2eOciBBsnUadIikMHlXuagwftCx9WFAeVmTWC3l+NKnYIn0HFMHjhGjrXgUBzRdsdRQiI4iCCNC/wl X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Mar 07, 2024 at 10:29:29PM +0000, Yosry Ahmed wrote: > On Thu, Mar 07, 2024 at 01:39:53PM -0800, Dave Hansen wrote: > > On 3/7/24 13:04, Yosry Ahmed wrote: > > > I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt > > > hacky and more importantly doesn't make it clear in switch_mm_irqs_off() > > > that we correctly handle LAM updates. We can certainly add a comment, > > > but I think an explicit check for CPU LAM vs. mm LAM is much clearer. > > > > > > WDYT? > > > > The mm generations are literally there so that if the mm changes that > > all the CPUs know they need an update. Changing LAM enabling is 100% > > consistent with telling other CPUs that they need an update. > > > > I'd be curious of Andy feels differently though. > > The mm generations are TLB-specific and all the code using them implies > as such (e.g. look at the checks in switch_mm_irqs_off() when prev == > next). We can go around and update comments and/or function names to > make them more generic, but this seems excessive. If we don't, the code > becomes less clear imo. > > I agree that the use case here is essentially the same (let other > CPUs know they need to write CR3), but I still think that since the LAM > case is just a simple one-time enablement, an explicit check in > switch_mm_irqs_off() would be clearer. > > Just my 2c, let me know what you prefer :) > > > > > >> Considering how fun this code path is, a little effort at an actual > > >> reproduction would be really appreciated. > > > > > > I tried reproducing it but gave up quickly. We need a certain sequence > > > of events to happen: > > > > > > CPU 1 CPU 2 > > > kthread_use_mm() > > > /* user thread enables LAM */ > > > context_switch() > > > context_switch() /* to user thread */ > > > > First, it would be fine to either create a new kthread for reproduction > > purposes or to hack an existing one. For instance, have have the LAM > > prctl() take an extra ref on the mm and stick it in a global variable: > > > > mmgrab(current->mm); > > global_mm = current->mm; > > > > Then in the kthread, grab the mm and use it: > > > > while (!global_mm); > > kthread_use_mm(global_mm); > > ... check for the race > > mmdrop(global_mm); > > > > You can also hackily wait for thread to move with a stupid spin loop: > > > > while (smp_processor_id() != 1); > > > > and then actually move it with sched_setaffinity() from userspace. That > > can make it easier to get that series of events to happen in lockstep. > > I will take a stab at doing something similar and let you know, thanks. I came up with a kernel patch that I *think* may reproduce the problem with enough iterations. Userspace only needs to enable LAM, so I think the selftest can be enough to trigger it. However, there is no hardware with LAM at my disposal, and IIUC I cannot use QEMU without KVM to run a kernel with LAM. I was planning to do more testing before sending a non-RFC version, but apparently I cannot do any testing beyond building at this point (including reproducing) :/ Let me know how you want to proceed. I can send a non-RFC v1 based on the feedback I got on the RFC, but it will only be build tested. For the record, here is the diff that I *think* may reproduce the bug: diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 33b268747bb7b..c37a8c26a3c21 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -750,8 +750,25 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr) #define LAM_U57_BITS 6 +static int kthread_fn(void *_mm) +{ + struct mm_struct *mm = _mm; + + /* + * Wait for LAM to be enabled then schedule. Hopefully we will context + * switch directly into the task that enabled LAM due to CPU pinning. + */ + kthread_use_mm(mm); + while (!test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags)); + schedule(); + return 0; +} + static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits) { + struct task_struct *kthread_task; + int kthread_cpu; + if (!cpu_feature_enabled(X86_FEATURE_LAM)) return -ENODEV; @@ -782,10 +799,22 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits) return -EINVAL; } + /* Pin the task to the current CPU */ + set_cpus_allowed_ptr(current, cpumask_of(smp_processor_id())); + + /* Run a kthread on another CPU and wait for it to start */ + kthread_cpu = cpumask_next_wrap(smp_processor_id(), cpu_online_mask, 0, false), + kthread_task = kthread_run_on_cpu(kthread_fn, mm, kthread_cpu, "lam_repro_kthread"); + while (!task_is_running(kthread_task)); + write_cr3(__read_cr3() | mm->context.lam_cr3_mask); set_tlbstate_lam_mode(mm); set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags); + /* Move the task to the kthread CPU */ + set_cpus_allowed_ptr(current, cpumask_of(kthread_cpu)); + mmap_write_unlock(mm); return 0; diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 51f9f56941058..3afb53f1a1901 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -593,7 +593,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next, next_tlb_gen = atomic64_read(&next->context.tlb_gen); if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == next_tlb_gen) - return; + BUG_ON(new_lam != tlbstate_lam_cr3_mask()); /* * TLB contents went out of date while we were in lazy