From: "Huang, Ying" <ying.huang@intel.com>
To: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
dave.hansen@intel.com, ak@linux.intel.com, aaron.lu@intel.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickins <hughd@google.com>, Shaohua Li <shli@kernel.org>,
Rik van Riel <riel@redhat.com>,
Andrea Arcangeli <aarcange@redhat.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Hillf Danton <hillf.zj@alibaba-inc.com>,
Christian Borntraeger <borntraeger@de.ibm.com>,
Jonathan Corbet <corbet@lwn.net>,
jack@suse.cz
Subject: Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
Date: Thu, 05 Jan 2017 14:44:10 +0800
Message-ID: <87bmvmhupx.fsf@yhuang-dev.intel.com>
In-Reply-To: <20170105063200.GE24371@bbox> (Minchan Kim's message of "Thu, 5 Jan 2017 15:32:00 +0900")
Minchan Kim <minchan@kernel.org> writes:
> Hi,
>
> On Thu, Jan 05, 2017 at 09:33:55AM +0800, Huang, Ying wrote:
>> Hi, Minchan,
>>
>> Minchan Kim <minchan@kernel.org> writes:
>> [snip]
>> >
>> > The patchset uses several techniques to reduce lock contention, for example,
>> > batching alloc/free, fine-grained locks, and cluster distribution to avoid cache
>> > false sharing. Each item has different complexity and benefits, so could you
>> > show the numbers for each step of the patchset? It would be better to include the
>> > numbers in each description. They help show how important each patch is when we
>> > consider the complexity of the patch.
>>
>> Here is the test data.
>
> Thanks!
>
>>
>> We test the vm-scalability swap-w-seq test case with 32 processes on a
>> Xeon E5 v3 system. The swap device used is a RAM-simulated PMEM
>> (persistent memory) device. To test sequential swap-out, the
>> test case creates 32 processes, which sequentially allocate and write to
>> anonymous pages until the RAM and part of the swap device are used
>> up.
>>
>> The patchset is rebased on v4.9-rc8. The baseline performance is as
>> follows,
>>
>> "vmstat.swap.so": 1428002,
>
> What does it mean? vmstat.pswpout?
This is the average of the swap "so" column in the /usr/bin/vmstat output. We run
/usr/bin/vmstat with:
/usr/bin/vmstat -n 1
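
As an illustration of how such an average can be derived, here is a minimal
sketch that averages the "so" column of captured vmstat output. The function
name and the sample rows are hypothetical and made up for the example; they
are not our test script or our test data.

```python
def average_swap_out(vmstat_text: str) -> float:
    """Average the 'so' column over all sample rows of vmstat output."""
    rows = [line.split() for line in vmstat_text.strip().splitlines()]
    # The column-name row is the one that contains both "si" and "so".
    header = next(row for row in rows if "si" in row and "so" in row)
    so_index = header.index("so")
    # Keep only data rows: same width as the header, numeric "so" field.
    samples = [
        int(row[so_index])
        for row in rows
        if len(row) == len(header) and row[so_index].isdigit()
    ]
    return sum(samples) / len(samples)


# Hypothetical two-sample capture, only to show the expected layout.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 100000   2000  30000    0  100     0     0  100  200  1  1 98  0  0
 2  0    300  90000   2000  30000    0  300     0     0  100  200  1  1 98  0  0
"""

print(average_swap_out(SAMPLE))
```
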
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 13.94,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": 13.75,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.swapcache_free.__remove_mapping.shrink_page_list": 7.05,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.page_swapcount.try_to_free_swap.swap_writepage": 7.03,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.__swap_duplicate.swap_duplicate.try_to_unmap_one.rmap_walk_anon": 7.02,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 6.83,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.page_check_address_transhuge.page_referenced_one.rmap_walk_anon.rmap_walk": 0.81,
>
> Numbers mean overhead percentage reported by perf?
Yes.
>> >> Patch 1 is a clean up patch.
>> >
>> > Could it be a separate patch?
>> >
>> >> Patch 2 creates a lock per cluster. This gives us a more fine-grained lock
>> >> that can be used for accessing swap_map, without locking the whole
>> >> swap device.
>>
>> After patch 2, the result is as follows,
>>
>> "vmstat.swap.so": 1481704,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 27.53,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": 27.01,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages.drain_local_pages": 1.03,
>>
>> The swap out throughput is at the same level, but the lock contention on
>> swap_info_struct->lock is eliminated.
>>
>> >> Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
>> >> the rate at which we have to contend for the radix tree.
>> >
>>
>> After patch 3,
>>
>> "vmstat.swap.so": 2050097,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 43.27,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 4.84,
>>
>> The swap out throughput is improved by about 43% compared with the baseline.
>> The lock contention on the swap cache radix tree lock is eliminated.
>> swap_info_struct->lock in get_swap_page() becomes the most heavily
>> contended lock.
>
> The numbers are great! Please include them in each patch description.
> And let me ask once more about one thing I said earlier about patch 2.
>
> ""
> I hope you split it into three steps to make review easier. You can create some
> functions like swap_map_lock and cluster_lock, which are wrapper functions that
> just hold swap_lock. It doesn't change anything from a performance point of view,
> but it clearly shows which kind of lock we should use in a specific context.
>
> Then, you can introduce a more fine-grained lock in the next patch and apply it to
> those wrapper functions.
>
> And in the last patch, you can adjust the cluster distribution to avoid false sharing.
> And the description should include how bad it is in testing, so that it is worth it.
> ""
>
> It makes review easier, I believe.
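
For readers following the thread, the staged approach suggested in the quoted
text could be sketched roughly as below. This is a user-space illustration
only, assuming hypothetical names (SwapDevice, fine_grained) and an assumed
cluster size of 256 slots; it is not the patchset code. With
fine_grained=False the wrapper just returns the device-wide lock (the first
step); with fine_grained=True it returns a per-cluster lock (the second step),
and the callers do not change.

```python
import threading

SWAPFILE_CLUSTER = 256  # slots per cluster; assumed value for illustration


class SwapDevice:
    """Toy model of a swap device with a usage-count map (hypothetical)."""

    def __init__(self, nr_slots, fine_grained=False):
        self.swap_map = [0] * nr_slots
        self.swap_lock = threading.Lock()  # device-wide lock
        nr_clusters = (nr_slots + SWAPFILE_CLUSTER - 1) // SWAPFILE_CLUSTER
        self.cluster_locks = [threading.Lock() for _ in range(nr_clusters)]
        self.fine_grained = fine_grained

    def cluster_lock(self, offset):
        # Step 1: the wrapper only names the intent and returns swap_lock.
        # Step 2: the same wrapper returns the lock of the owning cluster.
        if self.fine_grained:
            return self.cluster_locks[offset // SWAPFILE_CLUSTER]
        return self.swap_lock

    def swap_duplicate(self, offset):
        # Callers look the same in both steps; only the wrapper changes.
        with self.cluster_lock(offset):
            self.swap_map[offset] += 1
            return self.swap_map[offset]


coarse = SwapDevice(1024)
fine = SwapDevice(1024, fine_grained=True)
print(coarse.cluster_lock(0) is coarse.cluster_lock(256))  # same lock
print(fine.cluster_lock(0) is fine.cluster_lock(256))      # different locks
```
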
Sorry, personally, I don't like this way of organizing the patchset. So
unless more people ask for it, I would still like to keep the
current organization.
Best Regards,
Huang, Ying
>>
>> >
>> >> Patch 4 eliminates unnecessary page allocation for read ahead.
>> >
>> > Could it be a separate patch?
>> >
>> >> Patches 5-9 create a per-CPU cache of the swap slots, so we don't have
>> >> to contend on the swap device to get a swap slot or to release
>> >> a swap slot. And we allocate and release the swap slots
>> >> in batches for better efficiency.
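
The batching idea described above can be illustrated with a small user-space
sketch. All names here (SwapDevice, SlotCache, the batch size of 64, the
lock_acquisitions counter) are assumptions made for the example and do not
come from the patches; a plain object stands in for one CPU's cache. The point
is only that the shared device lock is taken once per batch instead of once
per slot.

```python
import threading


class SwapDevice:
    """Shared pool of free slots protected by one lock (hypothetical)."""

    def __init__(self, nr_slots):
        self.lock = threading.Lock()
        self.free_slots = list(range(nr_slots))
        self.lock_acquisitions = 0  # counts how often the shared lock is taken

    def get_slots(self, n):
        with self.lock:
            self.lock_acquisitions += 1
            batch, self.free_slots = self.free_slots[:n], self.free_slots[n:]
            return batch

    def put_slots(self, slots):
        with self.lock:
            self.lock_acquisitions += 1
            self.free_slots.extend(slots)


class SlotCache:
    """Stand-in for one CPU's private slot cache (hypothetical)."""

    def __init__(self, device, batch=64):
        self.device = device
        self.batch = batch
        self.alloc = []    # slots already taken from the device
        self.to_free = []  # freed slots not yet returned to the device

    def get_slot(self):
        if not self.alloc:
            # Refill in one batch: one lock acquisition for many slots.
            self.alloc = self.device.get_slots(self.batch)
        return self.alloc.pop() if self.alloc else None

    def free_slot(self, slot):
        self.to_free.append(slot)
        if len(self.to_free) >= self.batch:
            # Return a whole batch under a single lock acquisition.
            self.device.put_slots(self.to_free)
            self.to_free = []


dev = SwapDevice(1024)
cache = SlotCache(dev, batch=64)
slots = [cache.get_slot() for _ in range(128)]
for slot in slots:
    cache.free_slot(slot)
print(dev.lock_acquisitions)  # lock taken per batch, not per slot
```
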
>>
>> After patch 9,
>>
>> "vmstat.swap.so": 4170746,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.swapcache_free_entries.free_swap_slot.free_swap_and_cache.unmap_page_range": 13.91,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 8.56,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma": 2.56,
>> "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap.shrink_page_list": 2.47,
>>
>> The swap out throughput is improved by about 192% compared with the
>> baseline. There is still some lock contention on
>> swap_info_struct->lock, but the pressure begins to shift to the buddy
>> system now.
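
The quoted percentages follow directly from the "vmstat.swap.so" values
reported above. A short check, where only the helper name
improvement_percent is introduced for the example:

```python
# "vmstat.swap.so" values quoted in this message.
BASELINE = 1428002
RESULTS = {"patch 2": 1481704, "patch 3": 2050097, "patch 9": 4170746}


def improvement_percent(value, baseline=BASELINE):
    """Relative throughput change versus the baseline, in percent."""
    return (value - baseline) / baseline * 100.0


for name, value in RESULTS.items():
    print(f"{name}: {improvement_percent(value):+.1f}%")
# patch 2: +3.8%
# patch 3: +43.6%
# patch 9: +192.1%
```
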
>>
>> Best Regards,
>> Huang, Ying
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>