Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Huang\, Ying" <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>,  <linux-kernel@vger.kernel.org>,
	 Michal Hocko <mhocko@suse.com>,
	 Minchan Kim <minchan@kernel.org>,
	 Tim Chen <tim.c.chen@linux.intel.com>,
	 Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache
Date: Mon, 18 May 2020 13:52:03 +0800	[thread overview]
Message-ID: <874ksdlri4.fsf@yhuang-dev.intel.com> (raw)
In-Reply-To: <20200515151920.34f147b7b0baa3017cd7a2bb@linux-foundation.org> (Andrew Morton's message of "Fri, 15 May 2020 15:19:20 -0700")

Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 14 May 2020 15:04:24 +0800 Huang Ying <ying.huang@intel.com> wrote:
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 35be7a7271f4..9f1343b066c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	 */
>>  
>>  	si->flags += SWP_SCANNING;
>> -	scan_base = offset = si->cluster_next;
>> +	/*
>> +	 * Use percpu scan base for SSD to reduce lock contention on
>> +	 * cluster and swap cache.  For HDD, sequential access is more
>> +	 * important.
>> +	 */
>> +	if (si->flags & SWP_SOLIDSTATE)
>> +		scan_base = this_cpu_read(*si->cluster_next_cpu);
>> +	else
>> +		scan_base = si->cluster_next;
>> +	offset = scan_base;
>
> Do we need to make SSD differ from spinning here?  Do bad things happen
> if !SWP_SOLIDSTATE devices use the per-cpu cache?

I think the swapout throughput may be affected.  Because HDD seek is
necessary to swapout for multiple CPUs, if per-cpu cluster_next is used.
But I just realized that per-cpu swap slots cache will cause seek too.
If we really care about the performance to use HDD as swap, maybe we
should disable per-cpu swap slots cache for HDD too?

>>  	/* SSD algorithm */
>>  	if (si->cluster_info) {
>> @@ -835,7 +844,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	unlock_cluster(ci);
>>  
>>  	swap_range_alloc(si, offset, 1);
>> -	si->cluster_next = offset + 1;
>> +	if (si->flags & SWP_SOLIDSTATE)
>> +		this_cpu_write(*si->cluster_next_cpu, offset + 1);
>> +	else
>> +		si->cluster_next = offset + 1;
>>  	slots[n_ret++] = swp_entry(si->type, offset);
>>  
>>  	/* got enough slots or reach max slots? */
>> @@ -2828,6 +2840,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
>>  	if (!p)
>>  		return ERR_PTR(-ENOMEM);
>> +	p->cluster_next_cpu = alloc_percpu(unsigned int);
>> +	if (!p->cluster_next_cpu) {
>> +		kvfree(p);
>> +		return ERR_PTR(-ENOMEM);
>> +	}
>>  
>>  	spin_lock(&swap_lock);
>>  	for (type = 0; type < nr_swapfiles; type++) {
>> @@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
>>  
>>  	p->lowest_bit  = 1;
>>  	p->cluster_next = 1;
>> +	for_each_possible_cpu(i)
>> +		per_cpu(*p->cluster_next_cpu, i) = 1;
>>  	p->cluster_nr = 0;
>>  
>>  	maxpages = max_swapfile_size();
>> @@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		 * SSD
>>  		 */
>>  		p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>
> We shouldn't need to do this now?

Yes.  Thanks for pointing this out.  Will delete this in the future
version.

>> +		for_each_possible_cpu(cpu) {
>> +			per_cpu(*p->cluster_next_cpu, cpu) =
>> +				1 + prandom_u32_max(p->highest_bit);
>> +		}
>
> Would there be any benefit in spreading these out evenly?  Intervals of
> (p->highest_bit/num_possible_cpus())?  That would reduce collisions,
> but not for very long I guess.

These may be spread more evenly with
(p->highest_bit/num_possible_cpus()).  I just worry about the possible
situation that num_possible_cpus() >> num_online_cpus().  Where current
method works better?

> Speaking of which, I wonder if there are failure modes in which all the
> CPUs end up getting into sync.
>
> And is it the case that if two or more CPUs have the same (or similar)
> per_cpu(*p->cluster_next_cpu, cpu), they'll each end up pointlessly
> scanning slots which another CPU has just scanned, thus rather
> defeating the purpose of having the cluster_next cache?
>
> IOW, should there be some additional collision avoidance scheme to
> prevent a CPU from pointing its cluster_ext into a 64MB trunk which
> another CPU is already using?

Yes.  That sounds reasonable.  How about something as below,

When per-cpu cluster_next is assigned, if the new value is in a
different 64MB (or larger) trunk of the old value, we will assign a
random value between p->lowest_bit and p->highest_bit to per-cpu
cluster_next.

This can reduce the possibility of collision to be almost 0 if there's
enough free swap slots.  And this is easy to be implemented, especially
considering the following situation,

  (p->highest_bit - p->lowest_bit) / 64MB < num_online_cpus()

> And should it really be a per-cpu thing?  That's rather arbitrary. 
> Perhaps we would get better swap locality by making swap_cluster_next a
> per-process (per-mm_struct) thing?

I think per-cpu is enough.  Because this is a scalability issue, as long
as we work on different 64MB trunks on different CPUs, the scalability
will be good.  I don't find there's any value to use differnt 64MB
trunks on a single CPU.

Best Regards,
Huang, Ying

next prev parent reply	other threads:[~2020-05-18  5:52 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-14  7:04 [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache Huang Ying
2020-05-15 22:19 ` Andrew Morton
2020-05-18  5:52   ` Huang, Ying [this message]
2020-05-15 23:51 ` Daniel Jordan
2020-05-18  6:37   ` Huang, Ying
2020-05-19  1:39     ` Daniel Jordan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874ksdlri4.fsf@yhuang-dev.intel.com \
    --to=ying.huang@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=tim.c.chen@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.