* [PATCH/RFC] mm/swapfile: reduce kswapd overhead by not filling up disks
@ 2015-12-11 15:09 Christian Borntraeger
2015-12-21 15:58 ` Vlastimil Babka
0 siblings, 1 reply; 2+ messages in thread
From: Christian Borntraeger @ 2015-12-11 15:09 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, Christian Borntraeger
If a user has more than one swap disk with different priorities, the
swap code will fill up the highest-priority disk until the last block
is used.
The swap code will continue to scan the first disk even when it is
already filling the 2nd or 3rd disk.
We have seen kswapd running at 100% CPU, with the majority of hits
in the scanning code of scan_swap_map, even for non-rotational disks
when this happens.
For example with 3 disks
disk1 99.9%
disk2 10%
disk3 0%
it will scan the bitmap of disk1 (and as the disk is full the
cluster optimization does not trigger) for every page that will
likely go to disk2 anyway.
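
To make the cost concrete, here is a minimal user-space sketch of the
situation described above. It is not the kernel's scan_swap_map(); the
function name, the byte-per-slot map and the probe counter are
assumptions made for illustration only. It shows that when only the last
slot of a map is free, a linear scan inspects every entry before it
succeeds.

```c
#include <stddef.h>

/*
 * Hypothetical user-space model, not the kernel's scan_swap_map():
 * map[i] == 0 means slot i is free, anything else means in use.
 * Returns the index of the first free slot at or after `start`
 * (wrapping around once), or n if the map is full. *probes counts
 * how many entries were inspected.
 */
static size_t find_free_slot(const unsigned char *map, size_t n,
			     size_t start, size_t *probes)
{
	size_t i;

	*probes = 0;
	for (i = 0; i < n; i++) {
		size_t idx = (start + i) % n;

		(*probes)++;
		if (map[idx] == 0)
			return idx;
	}
	return n;
}
```

In this model the number of probes per allocation grows with the size of
the map once the device is nearly full, which is the kind of work that
would be repeated for every page that ends up on disk2 anyway.
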
By doing a first scan that only uses up to 98%, we force the swap
code to use the 2nd disk slightly earlier, but it reduces kswapd
CPU usage significantly. The 2nd scan will then allow filling
the remaining 2%, again starting with the highest-priority disk.
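
The two-pass idea can be sketched in user space as follows. This is only
an illustration under stated assumptions: struct swap_dev,
above_98_percent() and pick_device() are hypothetical names, the array is
assumed to be sorted by descending priority, and locking and requeueing
are left out entirely. The sketch compares inuse_pages * 100 against
pages * 98; the expression in the diff, si->inuse_pages / 98 * 100,
divides before multiplying, so its left-hand side is rounded down to a
multiple of 100.

```c
#include <stddef.h>

/* Hypothetical stand-in for the swap_info_struct fields used here. */
struct swap_dev {
	unsigned long pages;		/* total slots */
	unsigned long inuse_pages;	/* slots in use */
};

/*
 * True if more than 98% of the device is in use. Multiplying both
 * sides avoids integer truncation; this assumes the products fit in
 * an unsigned long.
 */
static int above_98_percent(const struct swap_dev *d)
{
	return d->inuse_pages * 100 > d->pages * 98;
}

/*
 * devs[] is assumed to be sorted by descending priority.
 * Pass 0 skips devices above the threshold; pass 1 accepts any device
 * that still has a free slot. Returns the chosen index, or -1 if all
 * devices are full.
 */
static int pick_device(const struct swap_dev *devs, size_t n)
{
	int pass;
	size_t i;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			if (devs[i].inuse_pages >= devs[i].pages)
				continue;	/* full */
			if (pass == 0 && above_98_percent(&devs[i]))
				continue;	/* keep the last 2% for pass 1 */
			return (int)i;
		}
	}
	return -1;
}
```

With devices at 99.9%, 10% and 0% usage, the first pass skips the first
device and picks the second; once every device is above 98%, the second
pass falls back to the highest-priority device that still has a free
slot.
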
The code does not affect cases with all the same swap priorities,
unless all disks are about 98% full.
There is one issue with this approach: if there is a mix of
same and different priorities, the code will loop too often due
to the requeue, so any idea for a better fix is welcome.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
mm/swapfile.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5887731..d3817cf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -640,6 +640,7 @@ swp_entry_t get_swap_page(void)
 {
 	struct swap_info_struct *si, *next;
 	pgoff_t offset;
+	bool first = true;
 
 	if (atomic_long_read(&nr_swap_pages) <= 0)
 		goto noswap;
@@ -653,6 +654,12 @@ start_over:
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
+		/* at 98% usage lets try the other swaps */
+		if (first && si->inuse_pages / 98 * 100 > si->pages) {
+			spin_lock(&swap_avail_lock);
+			spin_unlock(&si->lock);
+			goto nextsi;
+		}
 		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
 			spin_lock(&swap_avail_lock);
 			if (plist_node_empty(&si->avail_list)) {
@@ -692,6 +699,10 @@ nextsi:
 		if (plist_node_empty(&next->avail_list))
 			goto start_over;
 	}
+	if (first) {
+		first = false;
+		goto start_over;
+	}
 
 	spin_unlock(&swap_avail_lock);
 
--
2.3.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH/RFC] mm/swapfile: reduce kswapd overhead by not filling up disks
2015-12-11 15:09 [PATCH/RFC] mm/swapfile: reduce kswapd overhead by not filling up disks Christian Borntraeger
@ 2015-12-21 15:58 ` Vlastimil Babka
0 siblings, 0 replies; 2+ messages in thread
From: Vlastimil Babka @ 2015-12-21 15:58 UTC (permalink / raw)
To: Christian Borntraeger, linux-mm; +Cc: linux-kernel
On 12/11/2015 04:09 PM, Christian Borntraeger wrote:
> If a user has more than one swap disk with different priorities, the
> swap code will fill up the highest-priority disk until the last block
> is used.
> The swap code will continue to scan the first disk even when it is
> already filling the 2nd or 3rd disk.
> We have seen kswapd running at 100% CPU, with the majority of hits
> in the scanning code of scan_swap_map, even for non-rotational disks
> when this happens.
> For example with 3 disks
> disk1 99.9%
> disk2 10%
> disk3 0%
> it will scan the bitmap of disk1 (and as the disk is full the
> cluster optimization does not trigger) for every page that will
> likely go to disk2 anyway.
>
> By doing a first scan that only uses up to 98%, we force the swap
> code to use the 2nd disk slightly earlier, but it reduces kswapd
> CPU usage significantly. The 2nd scan will then allow filling
> the remaining 2%, again starting with the highest-priority disk.
>
> The code does not affect cases with all the same swap priorities,
> unless all disks are about 98% full.
> There is one issue with this approach: if there is a mix of
> same and different priorities, the code will loop too often due
> to the requeue, so any idea for a better fix is welcome.
>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
IMHO you should resend and CC the relevant people directly (e.g. via
./scripts/get_maintainer.pl), or this might simply get lost on the
high-volume mailing lists.
Note that I'm not familiar with this code. But my first thought would be
to put a cache with batch-refill/free before the bitmap. During the
"first" round, only consider si's with enough free slots to satisfy a
whole batch refill.
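
A rough user-space sketch of what such a cache could look like, purely as
an illustration: struct slot_cache, SLOT_BATCH, cache_refill() and
cache_get() are hypothetical names and not existing kernel interfaces,
the map uses one byte per slot, and the batch-free side and all locking
are omitted. The refill scans the map once per batch and, as in the
"first" round above, declines a device that cannot satisfy a whole batch.

```c
#include <stddef.h>

#define SLOT_BATCH 4	/* hypothetical batch size */

/* Hypothetical cache of pre-reserved slot indices. */
struct slot_cache {
	size_t slots[SLOT_BATCH];
	size_t count;
};

/*
 * Refill an empty cache from one device's map in a single scan.
 * map[i] == 0 means free; reserved slots are marked 1.
 * `nfree` is the caller's count of free slots on this device.
 * Returns 0 without touching the map unless a whole batch fits,
 * otherwise the number of slots reserved.
 */
static size_t cache_refill(struct slot_cache *c, unsigned char *map,
			   size_t n, size_t nfree)
{
	size_t i;

	if (nfree < SLOT_BATCH)
		return 0;
	c->count = 0;
	for (i = 0; i < n && c->count < SLOT_BATCH; i++) {
		if (map[i] == 0) {
			map[i] = 1;
			c->slots[c->count++] = i;
		}
	}
	return c->count;
}

/* Pop one slot from the cache; returns 0 if the cache is empty. */
static int cache_get(struct slot_cache *c, size_t *slot)
{
	if (c->count == 0)
		return 0;
	*slot = c->slots[--c->count];
	return 1;
}
```

In this model the bitmap is walked once per SLOT_BATCH allocations
instead of once per page, and a nearly full device is skipped cheaply by
comparing its free count with the batch size.
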
> ---
> mm/swapfile.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5887731..d3817cf 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -640,6 +640,7 @@ swp_entry_t get_swap_page(void)
>  {
>  	struct swap_info_struct *si, *next;
>  	pgoff_t offset;
> +	bool first = true;
>
>  	if (atomic_long_read(&nr_swap_pages) <= 0)
>  		goto noswap;
> @@ -653,6 +654,12 @@ start_over:
>  		plist_requeue(&si->avail_list, &swap_avail_head);
>  		spin_unlock(&swap_avail_lock);
>  		spin_lock(&si->lock);
> +		/* at 98% usage lets try the other swaps */
> +		if (first && si->inuse_pages / 98 * 100 > si->pages) {
> +			spin_lock(&swap_avail_lock);
> +			spin_unlock(&si->lock);
> +			goto nextsi;
> +		}
>  		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
>  			spin_lock(&swap_avail_lock);
>  			if (plist_node_empty(&si->avail_list)) {
> @@ -692,6 +699,10 @@ nextsi:
>  		if (plist_node_empty(&next->avail_list))
>  			goto start_over;
>  	}
> +	if (first) {
> +		first = false;
> +		goto start_over;
> +	}
>
>  	spin_unlock(&swap_avail_lock);
>
>