All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Huang\, Ying" <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>,  <linux-kernel@vger.kernel.org>,
	 Daniel Jordan <daniel.m.jordan@oracle.com>,
	 Michal Hocko <mhocko@suse.com>,
	 Minchan Kim <minchan@kernel.org>,
	 Tim Chen <tim.c.chen@linux.intel.com>,
	 Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation
Date: Thu, 21 May 2020 11:24:40 +0800	[thread overview]
Message-ID: <87o8qihsw7.fsf@yhuang-dev.intel.com> (raw)
In-Reply-To: <20200520195102.2343f746e88a2bec5c29ef5b@linux-foundation.org> (Andrew Morton's message of "Wed, 20 May 2020 19:51:02 -0700")

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 20 May 2020 11:15:02 +0800 Huang Ying <ying.huang@intel.com> wrote:
>
>> In some swap scalability test, it is found that there are heavy lock
>> contention on swap cache even if we have split one swap cache radix
>> tree per swap device to one swap cache radix tree every 64 MB trunk in
>> commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>> 
>> The reason is as follow.  After the swap device becomes fragmented so
>> that there's no free swap cluster, the swap device will be scanned
>> linearly to find the free swap slots.  swap_info_struct->cluster_next
>> is the next scanning base that is shared by all CPUs.  So nearby free
>> swap slots will be allocated for different CPUs.  The probability for
>> multiple CPUs to operate on the same 64 MB trunk is high.  This causes
>> the lock contention on the swap cache.
>> 
>> To solve the issue, in this patch, for SSD swap device, a percpu
>> version next scanning base (cluster_next_cpu) is added.  Every CPU
>> will use its own per-cpu next scanning base.  And after finishing
>> scanning a 64MB trunk, the per-cpu scanning base will be changed to
>> the beginning of another randomly selected 64MB trunk.  In this way,
>> the probability for multiple CPUs to operate on the same 64 MB trunk
>> is reduced greatly.  Thus the lock contention is reduced too.  For
>> HDD, because sequential access is more important for IO performance,
>> the original shared next scanning base is used.
>> 
>> To test the patch, we have run 16-process pmbench memory benchmark on
>> a 2-socket server machine with 48 cores.  One ram disk is configured
>
> What does "ram disk" mean here?  Which drivers(s) are in use and backed
> by what sort of memory?

We use the following kernel command line

memmap=48G!6G memmap=48G!68G

to create 2 DRAM based /dev/pmem disks (48GB each).  Then we use these
ram disks as swap devices.

>> as the swap device per socket.  The pmbench working-set size is much
>> larger than the available memory so that swapping is triggered.  The
>> memory read/write ratio is 80/20 and the accessing pattern is random.
>> In the original implementation, the lock contention on the swap cache
>> is heavy.  The perf profiling data of the lock contention code path is
>> as following,
>> 
>> _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.91
>> _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:               7.11
>> _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
>> _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     1.66
>> _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.29
>> _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.03
>> _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        0.93
>> 
>> After applying this patch, it becomes,
>> 
>> _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
>> _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      2.3
>> _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     2.26
>> _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        1.8
>> _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.19
>> 
>> The lock contention on the swap cache is almost eliminated.
>> 
>> And the pmbench score increases 18.5%.  The swapin throughput
>> increases 18.7% from 2.96 GB/s to 3.51 GB/s.  While the swapout
>> throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s.
>
> If this was backed by plain old RAM, can we assume that the performance
> improvement on SSD swap is still good?

We need really fast disk to show the benefit.  I have tried this on 2
Intel P3600 NVMe disks.  The performance improvement is only about 1%.
The improvement should be better on the faster disks, such as Intel
Optane disk.  I will try to find some to test.

> Does the ram disk actually set SWP_SOLIDSTATE?

Yes.  "blk_queue_flag_set(QUEUE_FLAG_NONROT, q)" is called in
drivers/nvdimm/pmem.c.

Best Regards,
Huang, Ying


  reply	other threads:[~2020-05-21  3:24 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-20  3:15 [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation Huang Ying
2020-05-21  2:51 ` Andrew Morton
2020-05-21  3:24   ` Huang, Ying [this message]
2020-06-24  3:31     ` Huang, Ying
2020-05-21 13:35 ` Daniel Jordan
2020-05-22  5:58   ` Huang, Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87o8qihsw7.fsf@yhuang-dev.intel.com \
    --to=ying.huang@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=daniel.m.jordan@oracle.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=tim.c.chen@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.