From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f52.google.com (mail-wm0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 777E46B0253 for ; Fri, 11 Dec 2015 10:09:16 -0500 (EST) Received: by mail-wm0-f52.google.com with SMTP id c17so15119315wmd.1 for ; Fri, 11 Dec 2015 07:09:16 -0800 (PST) Received: from e06smtp11.uk.ibm.com (e06smtp11.uk.ibm.com. [195.75.94.107]) by mx.google.com with ESMTPS id a6si5608681wmh.59.2015.12.11.07.09.14 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 11 Dec 2015 07:09:15 -0800 (PST) Received: from localhost by e06smtp11.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Fri, 11 Dec 2015 15:09:14 -0000 Received: from b06cxnps4075.portsmouth.uk.ibm.com (d06relay12.portsmouth.uk.ibm.com [9.149.109.197]) by d06dlp02.portsmouth.uk.ibm.com (Postfix) with ESMTP id 4EB01219005E for ; Fri, 11 Dec 2015 15:09:05 +0000 (GMT) Received: from d06av07.portsmouth.uk.ibm.com (d06av07.portsmouth.uk.ibm.com [9.149.37.248]) by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id tBBF9CK642729498 for ; Fri, 11 Dec 2015 15:09:13 GMT Received: from d06av07.portsmouth.uk.ibm.com (localhost [127.0.0.1]) by d06av07.portsmouth.uk.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id tBBF9Cnw010041 for ; Fri, 11 Dec 2015 08:09:12 -0700 From: Christian Borntraeger Subject: [PATCH/RFC] mm/swapfile: reduce kswapd overhead by not filling up disks Date: Fri, 11 Dec 2015 16:09:34 +0100 Message-Id: <1449846574-35511-1-git-send-email-borntraeger@de.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Christian Borntraeger if a user has more than one swap disk with different priorities, the swap code will fill up the hight prio disk until the last block is used. The swap code will continue to scan the first disk also when its already filling the 2nd or 3rd disk. We have seen kswapd running at 100% CPU, with the majority of hits in the scanning code of scan_swap_map, even for non-rotational disks when this happens. For example with 3 disks disk1 99.9% disk2 10% disk3 0% it will scan the bitmap of disk1 (and as the disk is full the cluster optimization does not trigger) for every page that will likely go to disk2 anyway. By doing a first scan that only uses up to 98%, we force the swap code to use the 2nd disk slightly earlier, but it reduces kswapd cpu usage significantly. The 2nd scan will then allow to fill the remaining 2%, again starting with the highest prio disk. The code does not affect cases with all the same swap priorities, unless all disks are about 98% full. There is one issue with mythis approach: If there is a mix between same and different priorities, the code will loop too often due to the requeue, so and idea for a better fix is welcome. Signed-off-by: Christian Borntraeger --- mm/swapfile.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/mm/swapfile.c b/mm/swapfile.c index 5887731..d3817cf 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -640,6 +640,7 @@ swp_entry_t get_swap_page(void) { struct swap_info_struct *si, *next; pgoff_t offset; + bool first = true; if (atomic_long_read(&nr_swap_pages) <= 0) goto noswap; @@ -653,6 +654,12 @@ start_over: plist_requeue(&si->avail_list, &swap_avail_head); spin_unlock(&swap_avail_lock); spin_lock(&si->lock); + /* at 98% usage lets try the other swaps */ + if (first && si->inuse_pages / 98 * 100 > si->pages) { + spin_lock(&swap_avail_lock); + spin_unlock(&si->lock); + goto nextsi; + } if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) { spin_lock(&swap_avail_lock); if (plist_node_empty(&si->avail_list)) { @@ -692,6 +699,10 @@ nextsi: if (plist_node_empty(&next->avail_list)) goto start_over; } + if (first) { + first = false; + goto start_over; + } spin_unlock(&swap_avail_lock); -- 2.3.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org