Date: Sat, 4 Jan 2025 13:46:27 +0800
From: Baoquan He
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
References: <20241230174621.61185-1-ryncsn@gmail.com> <20241230174621.61185-8-ryncsn@gmail.com>
In-Reply-To: <20241230174621.61185-8-ryncsn@gmail.com>
On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song
> 
> The flag SWP_SCANNING was used as an indicator of whether a device
> is being scanned for allocation, and prevents swapoff. Combined with
> SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
> 
> 1. Swapoff clears SWP_WRITEOK, allocation requests will see
>    ~SWP_WRITEOK and abort as it's serialized by si->lock.
> 2. Swapoff unuses all allocated entries.
> 3. Swapoff waits for the SWP_SCANNING flag to be cleared, so ongoing
>    allocations will stop, preventing UAF.
> 4. Now swapoff can free everything safely.
> 
> This makes the allocation path have a hard dependency on
> si->lock. Allocation always has to acquire si->lock first for
> setting SWP_SCANNING and checking SWP_WRITEOK.
> 
> This commit removes this flag, and just uses the existing per-CPU
> refcount instead to prevent UAF in step 3, which serves well for
> such usage without dependency on si->lock, and scales very well too.
> Just hold a reference during the whole scan and allocation process.
> Swapoff will kill and wait for the counter.
> 
> And to prevent any allocation from happening after step 1, so the
> unuse in step 2 can ensure all slots are free, swapoff will acquire
> the ci->lock of each cluster one by one to ensure all allocations
> see ~SWP_WRITEOK and abort.

Changing to use si->users is great, but I am wondering why we need to
acquire each ci->lock now. After step 1, we have cleared SWP_WRITEOK
and taken the si off the swap_avail_heads list. No matter what, we
just need to wait for p->comm's completion and continue; why bother
looping to acquire each ci->lock?

> 
> This way these dependencies on si->lock are gone. And it is worth
> noting we can't kill the refcount as the first step for swapoff, as
> the unuse process has to acquire the refcount.
> 
> Signed-off-by: Kairui Song
> ---
>  include/linux/swap.h |  1 -
>  mm/swapfile.c        | 90 ++++++++++++++++++++++++++++----------------
>  2 files changed, 57 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e1eeea6307cd..02120f1005d5 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -219,7 +219,6 @@ enum {
>  	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
>  	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
>  	/* add others here before... */
> -	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index e6e58cfb5178..99fd0b0d84a2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -658,6 +658,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>  {
>  	unsigned int nr_pages = 1 << order;
>  
> +	lockdep_assert_held(&ci->lock);
> +
>  	if (!(si->flags & SWP_WRITEOK))
>  		return false;
>  
> @@ -1059,8 +1061,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  {
>  	int n_ret = 0;
>  
> -	si->flags += SWP_SCANNING;
> -
>  	while (n_ret < nr) {
>  		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
>  
> @@ -1069,8 +1069,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>  
> -	si->flags -= SWP_SCANNING;
> -
>  	return n_ret;
>  }
>  
> @@ -1112,6 +1110,22 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return cluster_alloc_swap(si, usage, nr, slots, order);
>  }
>  
> +static bool get_swap_device_info(struct swap_info_struct *si)
> +{
> +	if (!percpu_ref_tryget_live(&si->users))
> +		return false;
> +	/*
> +	 * Guarantee the si->users are checked before accessing other
> +	 * fields of swap_info_struct, and si->flags (SWP_WRITEOK) is
> +	 * up to dated.
> +	 *
> +	 * Paired with the spin_unlock() after setup_swap_info() in
> +	 * enable_swap_info(), and smp_wmb() in swapoff.
> +	 */
> +	smp_rmb();
> +	return true;
> +}
> +
>  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  {
>  	int order = swap_entry_order(entry_order);
> @@ -1139,13 +1153,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  			/* requeue si to after same-priority siblings */
>  			plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  			spin_unlock(&swap_avail_lock);
> -			spin_lock(&si->lock);
> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -					n_goal, swp_entries, order);
> -			spin_unlock(&si->lock);
> -			if (n_ret || size > 1)
> -				goto check_out;
> -			cond_resched();
> +			if (get_swap_device_info(si)) {
> +				spin_lock(&si->lock);
> +				n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> +						n_goal, swp_entries, order);
> +				spin_unlock(&si->lock);
> +				put_swap_device(si);
> +				if (n_ret || size > 1)
> +					goto check_out;
> +				cond_resched();
> +			}
>  
>  			spin_lock(&swap_avail_lock);
>  			/*
> @@ -1296,16 +1313,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  	si = swp_swap_info(entry);
>  	if (!si)
>  		goto bad_nofile;
> -	if (!percpu_ref_tryget_live(&si->users))
> +	if (!get_swap_device_info(si))
>  		goto out;
> -	/*
> -	 * Guarantee the si->users are checked before accessing other
> -	 * fields of swap_info_struct.
> -	 *
> -	 * Paired with the spin_unlock() after setup_swap_info() in
> -	 * enable_swap_info().
> -	 */
> -	smp_rmb();
>  	offset = swp_offset(entry);
>  	if (offset >= si->max)
>  		goto put_out;
> @@ -1785,10 +1794,13 @@ swp_entry_t get_swap_page_of_type(int type)
>  		goto fail;
>  
>  	/* This is called for allocating swap entry, not cache */
> -	spin_lock(&si->lock);
> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
> -		atomic_long_dec(&nr_swap_pages);
> -	spin_unlock(&si->lock);
> +	if (get_swap_device_info(si)) {
> +		spin_lock(&si->lock);
> +		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
> +			atomic_long_dec(&nr_swap_pages);
> +		spin_unlock(&si->lock);
> +		put_swap_device(si);
> +	}
>  fail:
>  	return entry;
>  }
> @@ -2562,6 +2574,25 @@ bool has_usable_swap(void)
>  	return ret;
>  }
>  
> +/*
> + * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
> + * see the updated flags, so there will be no more allocations.
> + */
> +static void wait_for_allocation(struct swap_info_struct *si)
> +{
> +	unsigned long offset;
> +	unsigned long end = ALIGN(si->max, SWAPFILE_CLUSTER);
> +	struct swap_cluster_info *ci;
> +
> +	BUG_ON(si->flags & SWP_WRITEOK);
> +
> +	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
> +		ci = lock_cluster(si, offset);
> +		unlock_cluster(ci);
> +		offset += SWAPFILE_CLUSTER;
> +	}
> +}
> +
>  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  {
>  	struct swap_info_struct *p = NULL;
> @@ -2632,6 +2663,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
>  
> +	wait_for_allocation(p);
> +
>  	disable_swap_slots_cache_lock();
>  
>  	set_current_oom_origin();
> @@ -2674,15 +2707,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	spin_lock(&p->lock);
>  	drain_mmlist();
>  
> -	/* wait for anyone still in scan_swap_map_slots */
> -	while (p->flags >= SWP_SCANNING) {
> -		spin_unlock(&p->lock);
> -		spin_unlock(&swap_lock);
> -		schedule_timeout_uninterruptible(1);
> -		spin_lock(&swap_lock);
> -		spin_lock(&p->lock);
> -	}
> -
>  	swap_file = p->swap_file;
>  	p->swap_file = NULL;
>  	p->max = 0;
> -- 
> 2.47.1
> 
> 