Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Baoquan He <bhe@redhat.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Chris Li <chrisl@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Hugh Dickins <hughd@google.com>,
	Yosry Ahmed <yosryahmed@google.com>,
	"Huang, Ying" <ying.huang@linux.alibaba.com>,
	Nhat Pham <nphamcs@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Kalesh Singh <kaleshsingh@google.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
Date: Fri, 10 Jan 2025 19:23:47 +0800	[thread overview]
Message-ID: <Z4EDQx2TwLokw1hJ@MiWiFi-R3L-srv> (raw)
In-Reply-To: <CAMgjq7Ad3v+oyBH8598-0emcqnLNK9izS-azgn89J5q=a=6N1w@mail.gmail.com>

On 01/09/25 at 10:15am, Kairui Song wrote:
> On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
> >
> 
> Thanks for the very detailed review!
> 
> > On 12/31/24 at 01:46am, Kairui Song wrote:
> > ......snip.....
> > > ---
> > >  include/linux/swap.h |   3 +-
> > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > >   * throughput.
> > >   */
> > >  struct percpu_cluster {
> > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > >  };
> > >
> > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > >                                       /* list of cluster that contains at least one free slot */
> > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > >                                       /* list of cluster that are fragmented or contented */
> > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > >       unsigned int pages;             /* total of usable pages of swap */
> > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 7795a3d27273..dadd4fead689 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
...snip...
> > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > >
> > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > >  {
> > > -     lockdep_assert_held(&si->lock);
> > >       lockdep_assert_held(&ci->lock);
> > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > >       ci->order = 0;
> > >  }
> > >
> > > +/*
> > > + * Isolate and lock the first cluster that is not contented on a list,
> > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > + * with list status, so cluster updaters can always know the cluster
> > > + * list status without touching si lock.
> > > + *
> > > + * Note it's possible that all clusters on a list are contented so
> > > + * this returns NULL for an non-empty list.
> > > + */
> > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > +             struct swap_info_struct *si, struct list_head *list)
> > > +{
> > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > +
> > > +     spin_lock(&si->lock);
> > > +
> > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > +             goto out;
> > > +
> > > +     list_for_each_entry(ci, list, list) {
> > > +             if (!spin_trylock(&ci->lock))
> > > +                     continue;
> > > +
> > > +             /* We may only isolate and clear flags of following lists */
> > > +             VM_BUG_ON(!ci->flags);
> > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > +
> > > +             list_del(&ci->list);
> > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > +             ret = ci;
> > > +             break;
> > > +     }
> > > +out:
> > > +     spin_unlock(&si->lock);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > >  /*
> > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > - * will be added to free cluster list. caller should hold si->lock.
> > > -*/
> > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > + */
> > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > >  {
> > >       struct swap_cluster_info *ci;
> > > +     bool ret = false;
> > >       unsigned int idx;
> > >
> > > +     spin_lock(&si->lock);
> > >       while (!list_empty(&si->discard_clusters)) {
> > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > +             /*
> > > +              * Delete the cluster from list but don't clear its flags until
> > > +              * discard is done, so isolation and relocation will skip it.
> > > +              */
> > >               list_del(&ci->list);
> >
> > I don't understand above comment. ci has been taken off list. While
> > allocation need isolate from a usable list. Even though we clear
> > ci->flags now, how come isolation and relocation will touch it. I may
> > miss anything here.
> 
> There are many cases, one possible and common situation is that the
> percpu cluster (si->percpu_cluster of another CPU) is still pointing
> to it.
> 
> Also, this commit removed protection of si lock on allocation, and
> allocation path may also drop ci lock to call reclaim, which means one
> cluster could be used or freed by anyone before allocator reacquire
> the ci lock again. In that case, the allocator could see a discard
> cluster.
> 
> So we don't clear the discard flag, in case anyone misuse it.
> 
> I can add more inline comments on this, this is already some related
> comments above the function relocate_cluster, could add some more
> referencing that.

Thanks for your great explanation. I understand that si->percpu_cluster
could point to a discarded ci, and a ci could be got from non-full,
frag lists but later become discarded if that ci is freed on other cpu
during cluster_reclaim_range() invocation. I haven't got how isolation
could see a discarded ci in cluster_isolate_lock(). Could you help give
an exmaple on how that happen?

Surely, I understand keeping the discarded flag is very necessary so
that checking like cluster_is_usable() will return expected value.

And by the way, I haven't got when the ' if (!ci->count)' case could
happen in relocate_cluster() since we have filtered away discarded ci
with the 'if (cluster_is_discard(ci))' checking. I asked in another
thread, could you help explain it?

static void relocate_cluster(struct swap_info_struct *si,
                             struct swap_cluster_info *ci)
{               
        lockdep_assert_held(&ci->lock); 
                
        /* Discard cluster must remain off-list or on discard list */
        if (cluster_is_discard(ci))
                return;
                
        if (!ci->count) {
                free_cluster(si, ci);
...
}
> 
> >
> > > -             /* Must clear flag when taking a cluster off-list */
> > > -             ci->flags = CLUSTER_FLAG_NONE;
> > >               idx = cluster_index(si, ci);
> > >               spin_unlock(&si->lock);
> > > -
> > >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > >                               SWAPFILE_CLUSTER);
> > >
> > > -             spin_lock(&si->lock);
> > >               spin_lock(&ci->lock);
> > > -             __free_cluster(si, ci);
> > > +             /*
> > > +              * Discard is done, clear its flags as it's now off-list,
> > > +              * then return the cluster to allocation list.
> > > +              */
> > > +             ci->flags = CLUSTER_FLAG_NONE;
> > >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > >                               0, SWAPFILE_CLUSTER);
> > > +             __free_cluster(si, ci);
> > >               spin_unlock(&ci->lock);
> > > +             ret = true;
> > > +             spin_lock(&si->lock);
> > >       }
> > > +     spin_unlock(&si->lock);
> > > +     return ret;
> > >  }
> > >
> > >  static void swap_discard_work(struct work_struct *work)
> > ......snip....
> > > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> > >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > >                                             unsigned char usage)
> > >  {
> > > -     struct percpu_cluster *cluster;
> > >       struct swap_cluster_info *ci;
> > >       unsigned int offset, found = 0;
> > >
> > > -new_cluster:
> > > -     lockdep_assert_held(&si->lock);
> > > -     cluster = this_cpu_ptr(si->percpu_cluster);
> > > -     offset = cluster->next[order];
> > > +     /* Fast path using per CPU cluster */
> > > +     local_lock(&si->percpu_cluster->lock);
> > > +     offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > >       if (offset) {
> > > -             offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > > +             ci = lock_cluster(si, offset);
> > > +             /* Cluster could have been used by another order */
> > > +             if (cluster_is_usable(ci, order)) {
> > > +                     if (cluster_is_free(ci))
> > > +                             offset = cluster_offset(si, ci);
> > > +                     offset = alloc_swap_scan_cluster(si, offset, &found,
> > > +                                                      order, usage);
> > > +             } else {
> > > +                     unlock_cluster(ci);
> > > +             }
> > >               if (found)
> > >                       goto done;
> > >       }
> > >
> > > -     if (!list_empty(&si->free_clusters)) {
> > > -             ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > > -             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > > -             /*
> > > -              * Either we didn't touch the cluster due to swapoff,
> > > -              * or the allocation must success.
> > > -              */
> > > -             VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > > -             goto done;
> > > +new_cluster:
> > > +     ci = cluster_isolate_lock(si, &si->free_clusters);
> > > +     if (ci) {
> > > +             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > > +                                              &found, order, usage);
> > > +             if (found)
> > > +                     goto done;
> > >       }
> > >
> > >       /* Try reclaim from full clusters if free clusters list is drained */
> > > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > >               swap_reclaim_full_clusters(si, false);
> > >
> > >       if (order < PMD_ORDER) {
> > > -             unsigned int frags = 0;
> > > +             unsigned int frags = 0, frags_existing;
> > >
> > > -             while (!list_empty(&si->nonfull_clusters[order])) {
> > > -                     ci = list_first_entry(&si->nonfull_clusters[order],
> > > -                                           struct swap_cluster_info, list);
> > > -                     cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > > +             while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> > >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >                                                        &found, order, usage);
> > > -                     frags++;
> > > +                     /*
> > > +                      * With `fragmenting` set to true, it will surely take
> >                                  ~~~~~~~~~~~
> >                          wondering what 'fragmenting' means here.
> 
> This comment is a bit out of context indeed, it actually trying to say
> the alloc_swap_scan_cluster call above should move the cluster to
> tail. I'll update the comment.
> 
> 
> 
> >
> > > +                      * the cluster off nonfull list
> > > +                      */
> > >                       if (found)
> > >                               goto done;
> > > +                     frags++;
> > >               }
> > >
> > > -             /*
> > > -              * Nonfull clusters are moved to frag tail if we reached
> > > -              * here, count them too, don't over scan the frag list.
> > > -              */
> > > -             while (frags < si->frag_cluster_nr[order]) {
> > > -                     ci = list_first_entry(&si->frag_clusters[order],
> > > -                                           struct swap_cluster_info, list);
> > > +             frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > > +             while (frags < frags_existing &&
> > > +                    (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > > +                     atomic_long_dec(&si->frag_cluster_nr[order]);
> > >                       /*
> > > -                      * Rotate the frag list to iterate, they were all failing
> > > -                      * high order allocation or moved here due to per-CPU usage,
> > > -                      * this help keeping usable cluster ahead.
> > > +                      * Rotate the frag list to iterate, they were all
> > > +                      * failing high order allocation or moved here due to
> > > +                      * per-CPU usage, but they could contain newly released
> > > +                      * reclaimable (eg. lazy-freed swap cache) slots.
> > >                        */
> > > -                     list_move_tail(&ci->list, &si->frag_clusters[order]);
> > >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >                                                        &found, order, usage);
> > > -                     frags++;
> > >                       if (found)
> > >                               goto done;
> > > +                     frags++;
> > >               }
> > >       }
> > >
> > > -     if (!list_empty(&si->discard_clusters)) {
> > > -             /*
> > > -              * we don't have free cluster but have some clusters in
> > > -              * discarding, do discard now and reclaim them, then
> > > -              * reread cluster_next_cpu since we dropped si->lock
> > > -              */
> > > -             swap_do_scheduled_discard(si);
> > > +     /*
> > > +      * We don't have free cluster but have some clusters in
> > > +      * discarding, do discard now and reclaim them, then
> > > +      * reread cluster_next_cpu since we dropped si->lock
> > > +      */
> > > +     if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > >               goto new_cluster;
> > > -     }
> > >
> > >       if (order)
> > >               goto done;
> > .....
> >
> >
>

next prev parent reply	other threads:[~2025-01-10 11:24 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
2024-12-30 17:46 ` [PATCH v3 01/13] mm, swap: minor clean up for swap entry allocation Kairui Song
2025-01-09  4:04   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 02/13] mm, swap: fold swap_info_get_cont in the only caller Kairui Song
2025-01-09  4:05   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 03/13] mm, swap: remove old allocation path for HDD Kairui Song
2025-01-09  4:06   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 04/13] mm, swap: use cluster lock " Kairui Song
2025-01-09  4:07   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 05/13] mm, swap: clean up device availability check Kairui Song
2025-01-09  4:08   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 06/13] mm, swap: clean up plist removal and adding Kairui Song
2025-01-02  8:59   ` Baoquan He
2025-01-03  8:07     ` Kairui Song
2024-12-30 17:46 ` [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage Kairui Song
2025-01-04  5:46   ` Baoquan He
2025-01-13  5:34     ` Kairui Song
2025-01-20  2:39       ` Baoquan He
2025-01-27  9:19         ` Kairui Song
2025-02-05  9:18           ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes Kairui Song
2025-01-06  8:43   ` Baoquan He
2025-01-13  5:49     ` Kairui Song
2024-12-30 17:46 ` [PATCH v3 09/13] mm, swap: reduce contention on device lock Kairui Song
2025-01-06 10:12   ` Baoquan He
2025-01-08 11:09   ` Baoquan He
2025-01-09  2:15     ` Kairui Song
2025-01-10 11:23       ` Baoquan He [this message]
2025-01-13  6:33         ` Kairui Song
2025-01-13  8:07           ` Kairui Song
2024-12-30 17:46 ` [PATCH v3 10/13] mm, swap: simplify percpu cluster updating Kairui Song
2025-01-09  2:07   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 11/13] mm, swap: introduce a helper for retrieving cluster from offset Kairui Song
2024-12-30 17:46 ` [PATCH v3 12/13] mm, swap: use a global swap cluster for non-rotation devices Kairui Song
2024-12-30 17:46 ` [PATCH v3 13/13] mm, swap_slots: remove slot cache for freeing path Kairui Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z4EDQx2TwLokw1hJ@MiWiFi-R3L-srv \
    --to=bhe@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisl@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=kaleshsingh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=ryan.roberts@arm.com \
    --cc=ryncsn@gmail.com \
    --cc=v-songbaohua@oppo.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.