From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx163.postini.com [74.125.245.163]) by kanga.kvack.org (Postfix) with SMTP id DE3646B0004 for ; Tue, 22 Jan 2013 01:53:53 -0500 (EST)
Received: by mail-pa0-f45.google.com with SMTP id bg2so3875054pad.32 for ; Mon, 21 Jan 2013 22:53:53 -0800 (PST)
Date: Tue, 22 Jan 2013 14:53:41 +0800
From: Shaohua Li
Subject: [LSF/MM TOPIC]swap improvements for fast SSD
Message-ID: <20130122065341.GA1850@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
List-ID:
To: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org

Hi,

Because of its high density, low power and low price, flash storage (SSD) is a good
candidate to partially replace DRAM. A quick answer for this is using SSD as
swap. But Linux swap is designed for slow hard disk storage. There are a lot of
challenges to efficiently use SSD for swap:

1. Lock contention (swap_lock, anon_vma mutex, swap address space lock).
2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flushes. This
overhead is very high even in a normal 2-socket machine.
3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
which makes the swap IO pattern interleaved. The block layer isn't always efficient
at merging requests. Such an IO pattern also makes swap prefetch hard.
4. Swap map scan overhead. The in-memory swap map scan walks an array, which is
very inefficient, especially if swap storage is fast.
5. SSD related optimization, mainly discard support.
6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
aren't always adjacent in the LRU list, so page reclaim will not swap such pages
to adjacent storage sectors. This makes swap prefetch hard.
7. Alternative page reclaim policy to bias reclaiming anonymous pages.
Currently reclaiming anonymous pages is considered harder than reclaiming file pages,
so we bias reclaiming file pages.
If there is high speed swap storage, we are
considering doing swap more aggressively.
8. Huge page swap. Huge page swap can solve a lot of the problems above, but neither
THP nor hugetlbfs supports swap.

I have made some progress in these areas recently:
http://marc.info/?l=linux-mm&m=134665691021172&w=2
http://marc.info/?l=linux-mm&m=135336039115191&w=2
http://marc.info/?l=linux-mm&m=135882182225444&w=2
http://marc.info/?l=linux-mm&m=135754636926984&w=2
http://marc.info/?l=linux-mm&m=135754634526979&w=2
But a lot of problems remain. I'd like to discuss the issues at the meeting.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id E47D06B0005 for ; Wed, 23 Jan 2013 02:58:10 -0500 (EST)
Date: Wed, 23 Jan 2013 16:58:08 +0900
From: Minchan Kim
Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD
Message-ID: <20130123075808.GH2723@blaptop>
References: <20130122065341.GA1850@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130122065341.GA1850@kernel.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Shaohua Li
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel

On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> Hi,
>
> Because of high density, low power and low price, flash storage (SSD) is a good
> candidate to partially replace DRAM. A quick answer for this is using SSD as
> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
> challenges to efficiently use SSD for swap:

Many of the items below could also apply to in-memory swap such as zram and zcache.

>
> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> 2.
TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This > overhead is very high even in a normal 2-socket machine. > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap, > which makes swap IO pattern is interleave. Block layer isn't always efficient > to do request merge. Such IO pattern also makes swap prefetch hard. Agreed. > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is > very inefficient, especially if swap storage is fast. Agreed. > 5. SSD related optimization, mainly discard support > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages > aren't always in LRU list adjacently, so page reclaim will not swap such pages > in adjacent storage sectors. This makes swap prefetch hard. One of problem is LRU churning and I wanted to try to fix it. http://marc.info/?l=linux-mm&m=130978831028952&w=4 > 7. Alternative page reclaim policy to bias reclaiming anonymous page. > Currently reclaim anonymous page is considering harder than reclaim file pages, > so we bias reclaiming file pages. If there are high speed swap storage, we are > considering doing swap more aggressively. Yeb. We need it. I tried it with extending vm_swappiness to 200. From: Minchan Kim Date: Mon, 3 Dec 2012 16:21:00 +0900 Subject: [PATCH] mm: increase swappiness to 200 We have thought swap out cost is very high but it's not true if we use fast device like swap-over-zram. Nonetheless, we can swap out 1:1 ratio of anon and page cache at most. It's not enough to use swap device fully so we encounter OOM kill while there are many free space in zram swap device. It's never what we want. This patch makes swap out aggressively. 
Cc: Luigi Semenzato
Signed-off-by: Minchan Kim
---
 kernel/sysctl.c |    3 ++-
 mm/vmscan.c     |    6 ++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 693e0ed..f1dbd9d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
 static int __maybe_unused three = 3;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
+extern int max_swappiness;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
 #endif
@@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
 		.mode = 0644,
 		.proc_handler = proc_dointvec_minmax,
 		.extra1 = &zero,
-		.extra2 = &one_hundred,
+		.extra2 = &max_swappiness,
 	},
 #ifdef CONFIG_HUGETLB_PAGE
 	{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 53dcde9..64f3c21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -53,6 +53,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+int max_swappiness = 200;
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc)
 	return mem_cgroup_swappiness(sc->target_mem_cgroup);
 }
 
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned. The relative value of each set of LRU lists is determined
@@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 
 	/*
-	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
 	 */
 	anon_prio = vmscan_swappiness(sc);
-	file_prio = 200 - anon_prio;
+	file_prio = max_swappiness - anon_prio;
 
 	/*
 	 * OK, so we have swap space and a fair amount of page cache
--
1.7.9.5

> 8. Huge page swap. Huge page swap can solve a lot of problems above, but both
> THP and hugetlbfs don't support swap.

Another item is indirection layers. Please read Rik's mail below.
Indirection layers could give a lot of flexibility to backends and be helpful
for defragmentation.

One idea I am considering is to make swap devices hierarchical,
NOT priority-based. I mean that currently swap devices are used up in priority order.
That's not a good fit if we use fast swap and slow swap at the same time.
I'd like to consume the fast swap device (e.g., in-memory swap) first, then
migrate some of the swap pages from fast swap to slow swap to
make room in fast swap. It could solve the concern below.
In addition, buffering via in-memory swap could build big chunks aligned
to the slow device's block size, so migration from fast swap to slow swap
could be faster and the wear-out problem would go away, too.

Quote from last KS2012 - http://lwn.net/Articles/516538/

"Andrea Arcangeli was also concerned that the first pages to be evicted from
memory are, by definition of the LRU page order, the ones that are least likely
to be used in the future. These are the pages that should be going to secondary
storage and more frequently used pages should be going to zcache. As it stands,
zcache may fill up with no-longer-used pages and then the system continues to
move used pages from and to the disk."

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id 893BF6B0008 for ; Wed, 23 Jan 2013 11:57:52 -0500 (EST)
Received: from /spool/local by e7.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
Violators will be prosecuted for from ; Wed, 23 Jan 2013 11:57:51 -0500 Received: from d01relay03.pok.ibm.com (d01relay03.pok.ibm.com [9.56.227.235]) by d01dlp03.pok.ibm.com (Postfix) with ESMTP id D7675C9003C for ; Wed, 23 Jan 2013 11:57:01 -0500 (EST) Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay03.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0NGv1PS143264 for ; Wed, 23 Jan 2013 11:57:01 -0500 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0NGv1Vx031880 for ; Wed, 23 Jan 2013 14:57:01 -0200 Message-ID: <51001658.7000507@linux.vnet.ibm.com> Date: Wed, 23 Jan 2013 10:56:56 -0600 From: Seth Jennings MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD References: <20130122065341.GA1850@kernel.org> In-Reply-To: <20130122065341.GA1850@kernel.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Minchan Kim On 01/22/2013 12:53 AM, Shaohua Li wrote: > Hi, > > Because of high density, low power and low price, flash storage (SSD) is a good > candidate to partially replace DRAM. A quick answer for this is using SSD as > swap. But Linux swap is designed for slow hard disk storage. There are a lot of > challenges to efficiently use SSD for swap: > > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This > overhead is very high even in a normal 2-socket machine. > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap, > which makes swap IO pattern is interleave. Block layer isn't always efficient > to do request merge. Such IO pattern also makes swap prefetch hard. > 4. Swap map scan overhead. 
Swap in-memory map scan scans an array, which is > very inefficient, especially if swap storage is fast. > 5. SSD related optimization, mainly discard support > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages > aren't always in LRU list adjacently, so page reclaim will not swap such pages > in adjacent storage sectors. This makes swap prefetch hard. > 7. Alternative page reclaim policy to bias reclaiming anonymous page. > Currently reclaim anonymous page is considering harder than reclaim file pages, > so we bias reclaiming file pages. If there are high speed swap storage, we are > considering doing swap more aggressively. > 8. Huge page swap. Huge page swap can solve a lot of problems above, but both > THP and hugetlbfs don't support swap. I too have also observed these issues in my work with zswap, especially the lock contentions mentioned in 1 and the prefetch situation in 3 and 6 that contains heuristics for rotational media. I'd be very interested in discussing these issues and potential solutions. Thanks to Minchan for the discussion about the front last year's summits. Seth > > I had some progresses in these areas recently: > http://marc.info/?l=linux-mm&m=134665691021172&w=2 > http://marc.info/?l=linux-mm&m=135336039115191&w=2 > http://marc.info/?l=linux-mm&m=135882182225444&w=2 > http://marc.info/?l=linux-mm&m=135754636926984&w=2 > http://marc.info/?l=linux-mm&m=135754634526979&w=2 > But a lot of problems remain. I'd like to discuss the issues at the meeting. > > Thanks, > Shaohua > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id 4CA9F6B0008 for ; Wed, 23 Jan 2013 14:04:33 -0500 (EST) Received: from /spool/local by e9.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Wed, 23 Jan 2013 14:04:31 -0500 Received: from d01relay01.pok.ibm.com (d01relay01.pok.ibm.com [9.56.227.233]) by d01dlp01.pok.ibm.com (Postfix) with ESMTP id 7867B38C803F for ; Wed, 23 Jan 2013 14:04:29 -0500 (EST) Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay01.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0NJ4Sbf322244 for ; Wed, 23 Jan 2013 14:04:29 -0500 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0NJ4RjU020710 for ; Wed, 23 Jan 2013 14:04:28 -0500 Message-ID: <51003439.2070505@linux.vnet.ibm.com> Date: Wed, 23 Jan 2013 13:04:25 -0600 From: Seth Jennings MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> In-Reply-To: <20130123075808.GH2723@blaptop> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Shaohua Li , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel On 01/23/2013 01:58 AM, Minchan Kim wrote: > Currently, the page table entries that have swapped out pages > associated with them contain a swap entry, pointing directly > at the swap device and swap slot containing the data. Meanwhile, > the swap count lives in a separate array. > > The redesign we are considering moving the swap entry to the > page cache radix tree for the swapper_space and having the pte > contain only the offset into the swapper_space. 
The swap count
> info can also fit inside the swapper_space page cache radix
> tree (at least on 64 bits - on 32 bits we may need to get
> creative or accept a smaller max amount of swap space).

Correct me if I'm wrong, but this recent patchset creating a
swapper_space per type would mess this up, right? The offset alone
would no longer be sufficient to access the proper swapper_space.

Why not just continue to store the entire swap entry (type and offset)
in the pte? Were you planning to use the type space in the pte for
something else?

Seth

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx171.postini.com [74.125.245.171]) by kanga.kvack.org (Postfix) with SMTP id C22346B0008 for ; Wed, 23 Jan 2013 18:05:45 -0500 (EST)
MIME-Version: 1.0
Message-ID:
Date: Wed, 23 Jan 2013 15:05:22 -0800 (PST)
From: Dan Magenheimer
Subject: RE: [LSF/MM TOPIC]swap improvements for fast SSD
References: <766b9855-adf5-47ce-9484-971f88ff0e54@default>
In-Reply-To: <766b9855-adf5-47ce-9484-971f88ff0e54@default>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: shli@fusionio.com
Cc: linux-mm@kvack.org

I would be very interested in this topic.

> Because of high density, low power and low price, flash storage (SSD) is a good
> candidate to partially replace DRAM. A quick answer for this is using SSD as
> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
> challenges to efficiently use SSD for swap:
>
> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
> overhead is very high even in a normal 2-socket machine.
> 3. Better swap IO pattern.
Both direct and kswapd page reclaim can do swap,
> which makes swap IO pattern is interleave. Block layer isn't always efficient
> to do request merge. Such IO pattern also makes swap prefetch hard.

Shaohua --

Have you considered the possibility of subverting the block layer entirely
and accessing the SSD like slow RAM rather than a fast I/O device? E.g.
something like NVME and as in this paper?

http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf

If you think this could be an option, it could make a very
interesting backend to frontswap (something like ramster).

Dan

> 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is
> very inefficient, especially if swap storage is fast.
> 5. SSD related optimization, mainly discard support
> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
> aren't always in LRU list adjacently, so page reclaim will not swap such pages
> in adjacent storage sectors. This makes swap prefetch hard.
> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
> Currently reclaim anonymous page is considering harder than reclaim file pages,
> so we bias reclaiming file pages. If there are high speed swap storage, we are
> considering doing swap more aggressively.
> 8. Huge page swap. Huge page swap can solve a lot of problems above, but both
> THP and hugetlbfs don't support swap.
>
> I had some progresses in these areas recently:
> http://marc.info/?l=linux-mm&m=134665691021172&w=2
> http://marc.info/?l=linux-mm&m=135336039115191&w=2
> http://marc.info/?l=linux-mm&m=135882182225444&w=2
> http://marc.info/?l=linux-mm&m=135754636926984&w=2
> http://marc.info/?l=linux-mm&m=135754634526979&w=2
> But a lot of problems remain. I'd like to discuss the issues at the meeting.
>
> Thanks,
> Shaohua
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id 880076B0005 for ; Wed, 23 Jan 2013 20:41:02 -0500 (EST) Date: Thu, 24 Jan 2013 10:40:59 +0900 From: Minchan Kim Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Message-ID: <20130124014059.GA22654@blaptop> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <51003439.2070505@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51003439.2070505@linux.vnet.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Seth Jennings Cc: Shaohua Li , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel Hi Seth, On Wed, Jan 23, 2013 at 01:04:25PM -0600, Seth Jennings wrote: > On 01/23/2013 01:58 AM, Minchan Kim wrote: > > Currently, the page table entries that have swapped out pages > > associated with them contain a swap entry, pointing directly > > at the swap device and swap slot containing the data. Meanwhile, > > the swap count lives in a separate array. > > > > The redesign we are considering moving the swap entry to the > > page cache radix tree for the swapper_space and having the pte > > contain only the offset into the swapper_space. The swap count > > info can also fit inside the swapper_space page cache radix > > tree (at least on 64 bits - on 32 bits we may need to get > > creative or accept a smaller max amount of swap space). > > Correct me if I'm wrong, but this recent patchset creating a > swapper_space per type would mess this up right? The offset alone > would no longer be sufficient to access the proper swapper_space. If I understand Rik's idea correctly, it doesn't mess up. 
That's because we already use (swp_type, swp_offset) as the offset into
swapper_space. So although he said "pte contains only the offset into the
swapper_space", that doesn't mean we would store only swp_offset in the pte;
we would store the swapper_space offset in the pte.

Old:

do_swap_page
	swp_entry_t entry = pte_to_swp_entry(pte);
	if (!lookup_swap_cache(entry))
		swapin_readahead(entry);

New:

do_swap_page
	pgoff_t offset = pte_to_swp_offset(pte);
	if (!lookup_swap_cache(offset)) {
		swp_entry_t entry = offset_to_swp_entry(offset);
		swapin_readahead(entry);
	}

IOW, the old entry and the new offset would be the same value.

>
> Why not just continue to store the entire swap entry (type and offset)
> in the pte? Where you planning to use the type space in the pte for
> something else?

No plan, unless I missed something. :)

>
> Seth

--
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
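
The point made above, that a single swapper_space offset can carry both
swp_type and swp_offset, can be illustrated with a small user-space sketch.
Everything here is an assumption made for illustration: the demo_ names, the
5-bit type field and the 64-bit layout are hypothetical and are not the
kernel's swp_entry_t encoding, which is architecture-specific.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout, for illustration only: the top DEMO_TYPE_BITS bits
 * select the swap device (type), the remaining low bits are the slot offset.
 */
#define DEMO_TYPE_BITS   5
#define DEMO_TYPE_SHIFT  (64 - DEMO_TYPE_BITS)
#define DEMO_OFFSET_MASK ((UINT64_C(1) << DEMO_TYPE_SHIFT) - 1)

typedef struct { uint64_t val; } demo_swp_entry_t;

/* Pack (type, offset) into one value. */
static inline demo_swp_entry_t demo_swp_entry(unsigned int type, uint64_t offset)
{
	demo_swp_entry_t e;

	assert(type < (1u << DEMO_TYPE_BITS));
	assert(offset <= DEMO_OFFSET_MASK);
	e.val = ((uint64_t)type << DEMO_TYPE_SHIFT) | offset;
	return e;
}

/* Recover the swap device index from the packed value. */
static inline unsigned int demo_swp_type(demo_swp_entry_t e)
{
	return (unsigned int)(e.val >> DEMO_TYPE_SHIFT);
}

/* Recover the slot offset within that device. */
static inline uint64_t demo_swp_offset(demo_swp_entry_t e)
{
	return e.val & DEMO_OFFSET_MASK;
}

/*
 * In this sketch the "offset into swapper_space" is simply the packed value,
 * so converting between the two views is an identity operation.
 */
static inline uint64_t demo_entry_to_space_offset(demo_swp_entry_t e)
{
	return e.val;
}

static inline demo_swp_entry_t demo_space_offset_to_entry(uint64_t space_offset)
{
	demo_swp_entry_t e = { space_offset };
	return e;
}
```

Under these assumptions, a value built with demo_swp_entry(3, 4096) and passed
through demo_entry_to_space_offset() and back still decodes to type 3 and
offset 4096, which is the sense in which the old entry and the new offset are
the same value.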
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx170.postini.com [74.125.245.170]) by kanga.kvack.org (Postfix) with SMTP id 22CC16B0005 for ; Wed, 23 Jan 2013 21:03:03 -0500 (EST) Received: by mail-da0-f53.google.com with SMTP id x6so3976615dac.26 for ; Wed, 23 Jan 2013 18:03:02 -0800 (PST) Date: Thu, 24 Jan 2013 10:02:50 +0800 From: Shaohua Li Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Message-ID: <20130124020250.GA32496@kernel.org> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130123075808.GH2723@blaptop> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel On Wed, Jan 23, 2013 at 04:58:08PM +0900, Minchan Kim wrote: > On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote: > > Hi, > > > > Because of high density, low power and low price, flash storage (SSD) is a good > > candidate to partially replace DRAM. A quick answer for this is using SSD as > > swap. But Linux swap is designed for slow hard disk storage. There are a lot of > > challenges to efficiently use SSD for swap: > > Many of below item could be applied in in-memory swap like zram, zcache. > > > > > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This > > overhead is very high even in a normal 2-socket machine. > > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap, > > which makes swap IO pattern is interleave. Block layer isn't always efficient > > to do request merge. Such IO pattern also makes swap prefetch hard. > > Agreed. > > > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is > > very inefficient, especially if swap storage is fast. > > Agreed. 
>
> > 5. SSD related optimization, mainly discard support
> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
> > aren't always in LRU list adjacently, so page reclaim will not swap such pages
> > in adjacent storage sectors. This makes swap prefetch hard.
>
> One of problem is LRU churning and I wanted to try to fix it.
> http://marc.info/?l=linux-mm&m=130978831028952&w=4

Yes, LRU churning is a problem. Another problem is that sequentially accessed
pages are not added to the LRU list adjacently when multiple tasks are running
and consuming memory at the same time. The percpu pagevec helps a little, but
its size isn't large.

> > 7. Alternative page reclaim policy to bias reclaiming anonymous page.
> > Currently reclaim anonymous page is considering harder than reclaim file pages,
> > so we bias reclaiming file pages. If there are high speed swap storage, we are
> > considering doing swap more aggressively.
>
> Yeb. We need it. I tried it with extending vm_swappiness to 200.
>
> From: Minchan Kim
> Date: Mon, 3 Dec 2012 16:21:00 +0900
> Subject: [PATCH] mm: increase swappiness to 200

I had exactly the same code in my tree. And actually I found that if swappiness
is set to 200, zone reclaim has a problem. I have a patch for it, but haven't
posted it yet.

swappiness doesn't solve all the problems here. Anonymous pages start on the
active list, and the rotation logic is biased toward anonymous pages too. So
even if you set a high swappiness, file pages can still be easily reclaimed.

> > 8. Huge page swap. Huge page swap can solve a lot of problems above, but both
> > THP and hugetlbfs don't support swap.
>
> Another items are indirection layers. Please read Rik's mail below.
> Indirection layers could give many flexibility to backends and helpful
> for defragmentation.
>
> One of idea I am considering is that makes hierarchy swap devices,
> NOT priority-based. I mean currently swap devices are used up by priority order.
> It's not good fit if we use fast swap and slow swap at the same time. > I'd like to consume fast swap device (ex, in-memory swap) firstly, then > I want to migrate some of swap pages from fast swap to slow swap to > make room for fast swap. It could solve below concern. > In addition, buffering via in-memory swap could make big chunk which is aligned > to slow device's block size so migration speed from fast swap to slow swap > could be enhanced so wear out problem would go away, too. This looks interesting. Thanks, Shaohua -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx185.postini.com [74.125.245.185]) by kanga.kvack.org (Postfix) with SMTP id D37F16B0005 for ; Wed, 23 Jan 2013 21:11:21 -0500 (EST) Received: by mail-pa0-f48.google.com with SMTP id fa1so5222574pad.7 for ; Wed, 23 Jan 2013 18:11:21 -0800 (PST) Date: Thu, 24 Jan 2013 10:11:08 +0800 From: Shaohua Li Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Message-ID: <20130124021108.GB32496@kernel.org> References: <766b9855-adf5-47ce-9484-971f88ff0e54@default> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Dan Magenheimer Cc: shli@fusionio.com, linux-mm@kvack.org On Wed, Jan 23, 2013 at 03:05:22PM -0800, Dan Magenheimer wrote: > I would be very interested in this topic. > > > Because of high density, low power and low price, flash storage (SSD) is a good > > candidate to partially replace DRAM. A quick answer for this is using SSD as > > swap. But Linux swap is designed for slow hard disk storage. There are a lot of > > challenges to efficiently use SSD for swap: > > > > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > > 2. TLB flush overhead. 
To reclaim one page, we need at least 2 TLB flush. This > > overhead is very high even in a normal 2-socket machine. > > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap, > > which makes swap IO pattern is interleave. Block layer isn't always efficient > > to do request merge. Such IO pattern also makes swap prefetch hard. > > Shaohua -- > > Have you considered the possibility of subverting the block layer entirely > and accessing the SSD like slow RAM rather than a fast I/O device? E.g. > something like NVME and as in this paper? > > http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf > > If you think this could be an option, it could make a very > interesting backend to frontswap (something like ramster). We had discussion about this before, but looks this requires very low latency storage, didn't take it serious yet. Thanks, Shaohua -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx138.postini.com [74.125.245.138]) by kanga.kvack.org (Postfix) with SMTP id F18976B0008 for ; Thu, 24 Jan 2013 01:28:57 -0500 (EST) Received: by mail-ie0-f180.google.com with SMTP id c10so14830839ieb.25 for ; Wed, 23 Jan 2013 22:28:57 -0800 (PST) Message-ID: <1359008927.1375.7.camel@kernel> Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD From: Simon Jeons Date: Thu, 24 Jan 2013 00:28:47 -0600 In-Reply-To: <20130122065341.GA1850@kernel.org> References: <20130122065341.GA1850@kernel.org> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org On Tue, 2013-01-22 at 14:53 +0800, Shaohua Li wrote: > Hi, > > Because of high density, low power and low price, flash storage (SSD) is a good > candidate to partially replace DRAM. A quick answer for this is using SSD as > swap. But Linux swap is designed for slow hard disk storage. There are a lot of > challenges to efficiently use SSD for swap: > > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This Which 2 TLB flush? > overhead is very high even in a normal 2-socket machine. > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap, > which makes swap IO pattern is interleave. Block layer isn't always efficient > to do request merge. Such IO pattern also makes swap prefetch hard. > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is > very inefficient, especially if swap storage is fast. > 5. SSD related optimization, mainly discard support > 6. Better swap prefetch algorithm. 
Besides item 3, sequentially accessed pages > aren't always in LRU list adjacently, so page reclaim will not swap such pages > in adjacent storage sectors. This makes swap prefetch hard. > 7. Alternative page reclaim policy to bias reclaiming anonymous page. > Currently reclaim anonymous page is considering harder than reclaim file pages, > so we bias reclaiming file pages. If there are high speed swap storage, we are > considering doing swap more aggressively. > 8. Huge page swap. Huge page swap can solve a lot of problems above, but both > THP and hugetlbfs don't support swap. > > I had some progresses in these areas recently: > http://marc.info/?l=linux-mm&m=134665691021172&w=2 > http://marc.info/?l=linux-mm&m=135336039115191&w=2 > http://marc.info/?l=linux-mm&m=135882182225444&w=2 > http://marc.info/?l=linux-mm&m=135754636926984&w=2 > http://marc.info/?l=linux-mm&m=135754634526979&w=2 > But a lot of problems remain. I'd like to discuss the issues at the meeting. > > Thanks, > Shaohua > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx164.postini.com [74.125.245.164]) by kanga.kvack.org (Postfix) with SMTP id 7065B6B0008 for ; Thu, 24 Jan 2013 02:52:11 -0500 (EST) Received: by mail-ie0-f172.google.com with SMTP id c13so14909726ieb.17 for ; Wed, 23 Jan 2013 23:52:10 -0800 (PST) Message-ID: <1359013924.1375.8.camel@kernel> Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD From: Simon Jeons Date: Thu, 24 Jan 2013 01:52:04 -0600 In-Reply-To: <20130123075808.GH2723@blaptop> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Shaohua Li , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote: > On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote: > > Hi, > > > > Because of high density, low power and low price, flash storage (SSD) is a good > > candidate to partially replace DRAM. A quick answer for this is using SSD as > > swap. But Linux swap is designed for slow hard disk storage. There are a lot of > > challenges to efficiently use SSD for swap: > > Many of below item could be applied in in-memory swap like zram, zcache. > > > > > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This > > overhead is very high even in a normal 2-socket machine. > > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap, > > which makes swap IO pattern is interleave. Block layer isn't always efficient > > to do request merge. Such IO pattern also makes swap prefetch hard. > > Agreed. > > > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is > > very inefficient, especially if swap storage is fast. 
>
> Agreed.
>
> > 5. SSD related optimization, mainly discard support
> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
> > aren't always in LRU list adjacently, so page reclaim will not swap such pages
> > in adjacent storage sectors. This makes swap prefetch hard.
>
> One of problem is LRU churning and I wanted to try to fix it.
> http://marc.info/?l=linux-mm&m=130978831028952&w=4

What is the "LRU history" you mention in your LRU churning patchset?

>
> > 7. Alternative page reclaim policy to bias reclaiming anonymous page.
> > Currently reclaim anonymous page is considering harder than reclaim file pages,
> > so we bias reclaiming file pages. If there are high speed swap storage, we are
> > considering doing swap more aggressively.
>
> Yeb. We need it. I tried it with extending vm_swappiness to 200.
>
> From: Minchan Kim
> Date: Mon, 3 Dec 2012 16:21:00 +0900
> Subject: [PATCH] mm: increase swappiness to 200
>
> We have thought swap out cost is very high but it's not true
> if we use fast device like swap-over-zram. Nonetheless, we can
> swap out 1:1 ratio of anon and page cache at most.
> It's not enough to use swap device fully so we encounter OOM kill
> while there are many free space in zram swap device. It's never
> what we want.
>
> This patch makes swap out aggressively.
>
> Cc: Luigi Semenzato
> Signed-off-by: Minchan Kim
> ---
>  kernel/sysctl.c | 3 ++-
>  mm/vmscan.c | 6 ++++--
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 693e0ed..f1dbd9d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
>  static int __maybe_unused three = 3;
>  static unsigned long one_ul = 1;
>  static int one_hundred = 100;
> +extern int max_swappiness;
>  #ifdef CONFIG_PRINTK
>  static int ten_thousand = 10000;
>  #endif
> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
>  		.mode = 0644,
>  		.proc_handler = proc_dointvec_minmax,
>  		.extra1 = &zero,
> -		.extra2 = &one_hundred,
> +		.extra2 = &max_swappiness,
>  	},
>  #ifdef CONFIG_HUGETLB_PAGE
>  	{
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 53dcde9..64f3c21 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -53,6 +53,8 @@
>  #define CREATE_TRACE_POINTS
>  #include
>
> +int max_swappiness = 200;
> +
>  struct scan_control {
>  	/* Incremented by the number of inactive pages that were scanned */
>  	unsigned long nr_scanned;
> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc)
>  	return mem_cgroup_swappiness(sc->target_mem_cgroup);
>  }
>
> +
>  /*
>   * Determine how aggressively the anon and file LRU lists should be
>   * scanned. The relative value of each set of LRU lists is determined
> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	}
>
>  	/*
> -	 * With swappiness at 100, anonymous and file have the same priority.
>  	 * This scanning priority is essentially the inverse of IO cost.
>  	 */
>  	anon_prio = vmscan_swappiness(sc);
> -	file_prio = 200 - anon_prio;
> +	file_prio = max_swappiness - anon_prio;
>
>  	/*
>  	 * OK, so we have swap space and a fair amount of page cache
> --
> 1.7.9.5
>
> > 8. Huge page swap. Huge page swap can solve a lot of problems above, but both
> > THP and hugetlbfs don't support swap.
> > Another items are indirection layers. Please read Rik's mail below. > Indirection layers could give many flexibility to backends and helpful > for defragmentation. > > One of idea I am considering is that makes hierarchy swap devides, > NOT priority-based. I mean currently swap devices are used up by prioirty order. > It's not good fit if we use fast swap and slow swap at the same time. > I'd like to consume fast swap device (ex, in-memory swap) firstly, then > I want to migrate some of swap pages from fast swap to slow swap to > make room for fast swap. It could solve below concern. > In addition, buffering via in-memory swap could make big chunk which is aligned > to slow device's block size so migration speed from fast swap to slow swap > could be enhanced so wear out problem would go away, too. > > Quote from last KS2012 - http://lwn.net/Articles/516538/ > "Andrea Arcangeli was also concerned that the first pages to be evicted from > memory are, by definition of the LRU page order, the ones that are least likely > to be used in the future. These are the pages that should be going to secondary > storage and more frequently used pages should be going to zcache. As it stands, > zcache may fill up with no-longer-used pages and then the system continues to > move used pages from and to the disk." > > From riel@redhat.com Sun Apr 10 17:50:10 2011 > Date: Sun, 10 Apr 2011 20:50:01 -0400 > From: Rik van Riel > To: Linux Memory Management List > Subject: [LSF/Collab] swap cache redesign idea > > On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were > sitting in the hallway talking about yet more VM things. > > During that discussion, we came up with a way to redesign the > swap cache. During my flight home, I came with ideas on how > to use that redesign, that may make the changes worthwhile. 
> > Currently, the page table entries that have swapped out pages > associated with them contain a swap entry, pointing directly > at the swap device and swap slot containing the data. Meanwhile, > the swap count lives in a separate array. > > The redesign we are considering moving the swap entry to the > page cache radix tree for the swapper_space and having the pte > contain only the offset into the swapper_space. The swap count > info can also fit inside the swapper_space page cache radix > tree (at least on 64 bits - on 32 bits we may need to get > creative or accept a smaller max amount of swap space). > > This extra layer of indirection allows us to do several things: > > 1) get rid of the virtual address scanning swapoff; instead > we just swap the data in and mark the pages as present in > the swapper_space radix tree > > 2) free swap entries as the are read in, without waiting for > the process to fault it in - this may be useful for memory > types that have a large erase block > > 3) together with the defragmentation from (2), we can always > do writes in large aligned blocks - the extra indirection > will make it relatively easy to have special backend code > for different kinds of swap space, since all the state can > now live in just one place > > 4) skip writeout of zero-filled pages - this can be a big help > for KVM virtual machines running Windows, since Windows zeroes > out free pages; simply discarding a zero-filled page is not > at all simple in the current VM, where we would have to iterate > over all the ptes to free the swap entry before being able to > free the swap cache page (I am not sure how that locking would > even work) > > with the extra layer of indirection, the locking for this scheme > can be trivial - either the faulting process gets the old page, > or it gets a new one, either way it'll be zero filled > > 5) skip writeout of pages the guest has marked as free - same as > above, with the same easier locking > > Only one real 
question remaining - how do we handle the swap count > in the new scheme? On 64 bit systems we have enough space in the > radix tree, on 32 bit systems maybe we'll have to start overflowing > into the "swap_count_continued" logic a little sooner than we are > now and reduce the maximum swap size a little? > > > > > I had some progresses in these areas recently: > > http://marc.info/?l=linux-mm&m=134665691021172&w=2 > > http://marc.info/?l=linux-mm&m=135336039115191&w=2 > > http://marc.info/?l=linux-mm&m=135882182225444&w=2 > > http://marc.info/?l=linux-mm&m=135754636926984&w=2 > > http://marc.info/?l=linux-mm&m=135754634526979&w=2 > > But a lot of problems remain. I'd like to discuss the issues at the meeting. > > I have an interest on this topic. > Thnaks. > > > > > Thanks, > > Shaohua > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
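A minimal sketch of the scan-priority arithmetic that the quoted swappiness patch touches, written as standalone C rather than kernel code. The helper scan_priorities() and struct scan_prio are hypothetical names; only anon_prio, file_prio and max_swappiness come from the patch. It assumes the sysctl handler has already clamped swappiness to the range 0..max_swappiness.

```c
#include <assert.h>

/* Hypothetical container for the two weights computed in get_scan_count(). */
struct scan_prio {
	int anon_prio;	/* weight for scanning anonymous pages */
	int file_prio;	/* weight for scanning page-cache pages */
};

/*
 * Hypothetical helper, not kernel code: mirrors the two assignments
 *	anon_prio = vmscan_swappiness(sc);
 *	file_prio = max_swappiness - anon_prio;
 * from the quoted patch.
 */
static struct scan_prio scan_priorities(int swappiness, int max_swappiness)
{
	struct scan_prio p;

	/* Assumption: the sysctl has clamped the value to [0, max_swappiness]. */
	assert(swappiness >= 0 && swappiness <= max_swappiness);

	p.anon_prio = swappiness;
	p.file_prio = max_swappiness - swappiness;
	return p;
}
```

With the old sysctl cap of 100, the most aggressive setting yields anon_prio 100 and file_prio 100, which is the 1:1 ratio the changelog describes. With the cap raised to 200, a setting of 200 yields file_prio 0.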
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx144.postini.com [74.125.245.144]) by kanga.kvack.org (Postfix) with SMTP id 7374B6B0002 for ; Thu, 24 Jan 2013 03:30:04 -0500 (EST) Received: by mail-ie0-f175.google.com with SMTP id qd14so15229955ieb.34 for ; Thu, 24 Jan 2013 00:30:03 -0800 (PST) Message-ID: <1359016192.2866.1.camel@kernel> Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD From: Simon Jeons Date: Thu, 24 Jan 2013 02:29:52 -0600 In-Reply-To: <20130124014059.GA22654@blaptop> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <51003439.2070505@linux.vnet.ibm.com> <20130124014059.GA22654@blaptop> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Seth Jennings , Shaohua Li , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel On Thu, 2013-01-24 at 10:40 +0900, Minchan Kim wrote: > Hi Seth, > > On Wed, Jan 23, 2013 at 01:04:25PM -0600, Seth Jennings wrote: > > On 01/23/2013 01:58 AM, Minchan Kim wrote: > > > Currently, the page table entries that have swapped out pages > > > associated with them contain a swap entry, pointing directly > > > at the swap device and swap slot containing the data. Meanwhile, > > > the swap count lives in a separate array. > > > > > > The redesign we are considering moving the swap entry to the > > > page cache radix tree for the swapper_space and having the pte > > > contain only the offset into the swapper_space. The swap count > > > info can also fit inside the swapper_space page cache radix > > > tree (at least on 64 bits - on 32 bits we may need to get > > > creative or accept a smaller max amount of swap space). > > > > Correct me if I'm wrong, but this recent patchset creating a > > swapper_space per type would mess this up right? 
> > The offset alone
> > would no longer be sufficient to access the proper swapper_space.
>
> If I understand Rik's idea correctly, it doesn't mess up. Because we already
> have used (swp_type, swp_offset) as offset of swapper_space so although
> he mentioned "pte contains only the offset into the swapper_space",
> it doesn't mean we will store only swp_offset in pte but store offset of
> swapper_space in pte.
>
> old :
> do_swap_page
>         swp_entry_t entry = pte_to_swp_entry(pte);
>         if (!lookup_swap_cache(entry))
>                 swapin_readahead(entry);
>
> New :
> do_swap_page
>         pgoff_t offset = pte_to_swp_offset(pte);
>         if (!lookup_swap_cache(offset)) {
>                 swp_entry_t entry = offset_to_swp_entry(offset);
>                 swapin_readahead(entry);
>         }
>

Since Shaohua changed the logic so that each swap partition has its own address_space, the idea described above can no longer work, correct?

> IOW, entry of old and offset of new would be same value.
>
> >
> > Why not just continue to store the entire swap entry (type and offset)
> > in the pte? Were you planning to use the type space in the pte for
> > something else?
>
> No plan if I didn't miss something. :)
>
> >
> > Seth
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: email@kvack.org
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
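The point made in the quoted reply, that (swp_type, swp_offset) together already form a single lookup key, can be illustrated with a small standalone sketch. The bit layout below (5 type bits above 59 offset bits) and every sketch_* name are assumptions made for illustration only; the kernel's real swp_entry_t encoding is architecture-dependent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, chosen only for this sketch. */
#define SKETCH_TYPE_BITS	5u
#define SKETCH_OFFSET_BITS	(64u - SKETCH_TYPE_BITS)
#define SKETCH_OFFSET_MASK	((UINT64_C(1) << SKETCH_OFFSET_BITS) - 1)

/* One word that identifies both the swap device and the slot on it. */
typedef struct { uint64_t val; } sketch_entry_t;

/* Pack a device index (type) and a slot number (offset) into one key. */
static sketch_entry_t sketch_entry(unsigned type, uint64_t offset)
{
	sketch_entry_t e;

	assert(type < (1u << SKETCH_TYPE_BITS));
	assert(offset <= SKETCH_OFFSET_MASK);
	e.val = ((uint64_t)type << SKETCH_OFFSET_BITS) | offset;
	return e;
}

/* Recover the device index from the packed key. */
static unsigned sketch_type(sketch_entry_t e)
{
	return (unsigned)(e.val >> SKETCH_OFFSET_BITS);
}

/* Recover the slot number from the packed key. */
static uint64_t sketch_offset(sketch_entry_t e)
{
	return e.val & SKETCH_OFFSET_MASK;
}
```

Because the type occupies the high bits, two devices that use the same slot number still produce different keys, so a single key is enough to find both the device and the slot.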
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id AF9E96B0002 for ; Thu, 24 Jan 2013 04:10:05 -0500 (EST) Received: by mail-ie0-f181.google.com with SMTP id 16so15088344iea.26 for ; Thu, 24 Jan 2013 01:10:05 -0800 (PST) Message-ID: <1359018598.2866.5.camel@kernel> Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD From: Simon Jeons Date: Thu, 24 Jan 2013 03:09:58 -0600 In-Reply-To: <20130123075808.GH2723@blaptop> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Shaohua Li , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel Hi Minchan, On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote: > On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote: > > Hi, > > > > Because of high density, low power and low price, flash storage (SSD) is a good > > candidate to partially replace DRAM. A quick answer for this is using SSD as > > swap. But Linux swap is designed for slow hard disk storage. There are a lot of > > challenges to efficiently use SSD for swap: > > Many of below item could be applied in in-memory swap like zram, zcache. > > > > > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This > > overhead is very high even in a normal 2-socket machine. > > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap, > > which makes swap IO pattern is interleave. Block layer isn't always efficient > > to do request merge. Such IO pattern also makes swap prefetch hard. > > Agreed. > > > 4. Swap map scan overhead. 
Swap in-memory map scan scans an array, which is > > very inefficient, especially if swap storage is fast. > > Agreed. > > > 5. SSD related optimization, mainly discard support > > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages > > aren't always in LRU list adjacently, so page reclaim will not swap such pages > > in adjacent storage sectors. This makes swap prefetch hard. > > One of problem is LRU churning and I wanted to try to fix it. > http://marc.info/?l=linux-mm&m=130978831028952&w=4 > > > 7. Alternative page reclaim policy to bias reclaiming anonymous page. > > Currently reclaim anonymous page is considering harder than reclaim file pages, > > so we bias reclaiming file pages. If there are high speed swap storage, we are > > considering doing swap more aggressively. > > Yeb. We need it. I tried it with extending vm_swappiness to 200. > > From: Minchan Kim > Date: Mon, 3 Dec 2012 16:21:00 +0900 > Subject: [PATCH] mm: increase swappiness to 200 > > We have thought swap out cost is very high but it's not true > if we use fast device like swap-over-zram. Nonetheless, we can > swap out 1:1 ratio of anon and page cache at most. > It's not enough to use swap device fully so we encounter OOM kill > while there are many free space in zram swap device. It's never > what we want. > > This patch makes swap out aggressively. 
> > Cc: Luigi Semenzato > Signed-off-by: Minchan Kim > --- > kernel/sysctl.c | 3 ++- > mm/vmscan.c | 6 ++++-- > 2 files changed, 6 insertions(+), 3 deletions(-) > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 693e0ed..f1dbd9d 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -130,6 +130,7 @@ static int __maybe_unused two = 2; > static int __maybe_unused three = 3; > static unsigned long one_ul = 1; > static int one_hundred = 100; > +extern int max_swappiness; > #ifdef CONFIG_PRINTK > static int ten_thousand = 10000; > #endif > @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = { > .mode = 0644, > .proc_handler = proc_dointvec_minmax, > .extra1 = &zero, > - .extra2 = &one_hundred, > + .extra2 = &max_swappiness, > }, > #ifdef CONFIG_HUGETLB_PAGE > { > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 53dcde9..64f3c21 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -53,6 +53,8 @@ > #define CREATE_TRACE_POINTS > #include > > +int max_swappiness = 200; > + > struct scan_control { > /* Incremented by the number of inactive pages that were scanned */ > unsigned long nr_scanned; > @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc) > return mem_cgroup_swappiness(sc->target_mem_cgroup); > } > > + > /* > * Determine how aggressively the anon and file LRU lists should be > * scanned. The relative value of each set of LRU lists is determined > @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, > } > > /* > - * With swappiness at 100, anonymous and file have the same priority. > * This scanning priority is essentially the inverse of IO cost. > */ > anon_prio = vmscan_swappiness(sc); > - file_prio = 200 - anon_prio; > + file_prio = max_swappiness - anon_prio; > > /* > * OK, so we have swap space and a fair amount of page cache > -- > 1.7.9.5 > > > 8. Huge page swap. Huge page swap can solve a lot of problems above, but both > > THP and hugetlbfs don't support swap. 
> > Another items are indirection layers. Please read Rik's mail below. > Indirection layers could give many flexibility to backends and helpful > for defragmentation. > > One of idea I am considering is that makes hierarchy swap devides, > NOT priority-based. I mean currently swap devices are used up by prioirty order. > It's not good fit if we use fast swap and slow swap at the same time. > I'd like to consume fast swap device (ex, in-memory swap) firstly, then > I want to migrate some of swap pages from fast swap to slow swap to > make room for fast swap. It could solve below concern. > In addition, buffering via in-memory swap could make big chunk which is aligned > to slow device's block size so migration speed from fast swap to slow swap > could be enhanced so wear out problem would go away, too. > > Quote from last KS2012 - http://lwn.net/Articles/516538/ > "Andrea Arcangeli was also concerned that the first pages to be evicted from > memory are, by definition of the LRU page order, the ones that are least likely > to be used in the future. These are the pages that should be going to secondary > storage and more frequently used pages should be going to zcache. As it stands, > zcache may fill up with no-longer-used pages and then the system continues to > move used pages from and to the disk." > > From riel@redhat.com Sun Apr 10 17:50:10 2011 > Date: Sun, 10 Apr 2011 20:50:01 -0400 > From: Rik van Riel > To: Linux Memory Management List > Subject: [LSF/Collab] swap cache redesign idea > > On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were > sitting in the hallway talking about yet more VM things. > > During that discussion, we came up with a way to redesign the > swap cache. During my flight home, I came with ideas on how > to use that redesign, that may make the changes worthwhile. 
>
> Currently, the page table entries that have swapped out pages
> associated with them contain a swap entry, pointing directly
> at the swap device and swap slot containing the data. Meanwhile,
> the swap count lives in a separate array.
>
> The redesign we are considering is moving the swap entry to the
> page cache radix tree for the swapper_space and having the pte
> contain only the offset into the swapper_space. The swap count
> info can also fit inside the swapper_space page cache radix
> tree (at least on 64 bits - on 32 bits we may need to get
> creative or accept a smaller max amount of swap space).
>
> This extra layer of indirection allows us to do several things:
>
> 1) get rid of the virtual address scanning swapoff; instead
>    we just swap the data in and mark the pages as present in
>    the swapper_space radix tree

Will the radix tree store all the rmap information for the pages? If not, how would the pages be located?

>
> 2) free swap entries as they are read in, without waiting for
>    the process to fault it in - this may be useful for memory
>    types that have a large erase block
>
> 3) together with the defragmentation from (2), we can always
>    do writes in large aligned blocks - the extra indirection
>    will make it relatively easy to have special backend code
>    for different kinds of swap space, since all the state can
>    now live in just one place
>
> 4) skip writeout of zero-filled pages - this can be a big help
>    for KVM virtual machines running Windows, since Windows zeroes
>    out free pages; simply discarding a zero-filled page is not
>    at all simple in the current VM, where we would have to iterate
>    over all the ptes to free the swap entry before being able to
>    free the swap cache page (I am not sure how that locking would
>    even work)
>
>    with the extra layer of indirection, the locking for this scheme
>    can be trivial - either the faulting process gets the old page,
>    or it gets a new one, either way it'll be zero filled
>
> 5) skip writeout of pages the guest has
marked as free - same as > above, with the same easier locking > > Only one real question remaining - how do we handle the swap count > in the new scheme? On 64 bit systems we have enough space in the > radix tree, on 32 bit systems maybe we'll have to start overflowing > into the "swap_count_continued" logic a little sooner than we are > now and reduce the maximum swap size a little? > > > > > I had some progresses in these areas recently: > > http://marc.info/?l=linux-mm&m=134665691021172&w=2 > > http://marc.info/?l=linux-mm&m=135336039115191&w=2 > > http://marc.info/?l=linux-mm&m=135882182225444&w=2 > > http://marc.info/?l=linux-mm&m=135754636926984&w=2 > > http://marc.info/?l=linux-mm&m=135754634526979&w=2 > > But a lot of problems remain. I'd like to discuss the issues at the meeting. > > I have an interest on this topic. > Thnaks. > > > > > Thanks, > > Shaohua > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
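Item 4 in the quoted mail depends on recognizing a zero-filled page before it is written out. A minimal standalone sketch of such a check follows. The function name page_is_zero_filled() is hypothetical, and the byte-wise loop is chosen for clarity, not speed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: return true if every byte of the buffer is zero.
 * A caller could use the result to skip writeout and drop the page instead.
 */
static bool page_is_zero_filled(const void *page, size_t size)
{
	const unsigned char *p = page;
	size_t i;

	assert(page != NULL);
	for (i = 0; i < size; i++) {
		if (p[i] != 0)
			return false;	/* found data, the page must be written */
	}
	return true;
}
```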
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx191.postini.com [74.125.245.191]) by kanga.kvack.org (Postfix) with SMTP id 59EDE6B0005 for ; Fri, 25 Jan 2013 23:40:56 -0500 (EST) Received: by mail-ie0-f170.google.com with SMTP id c11so270651ieb.15 for ; Fri, 25 Jan 2013 20:40:55 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1359018598.2866.5.camel@kernel> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <1359018598.2866.5.camel@kernel> Date: Sat, 26 Jan 2013 13:40:55 +0900 Message-ID: Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD From: Kyungmin Park Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: Minchan Kim , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel , Simon Jeons Hi, On 1/24/13, Simon Jeons wrote: > Hi Minchan, > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote: >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote: >> > Hi, >> > >> > Because of high density, low power and low price, flash storage (SSD) is >> > a good >> > candidate to partially replace DRAM. A quick answer for this is using >> > SSD as >> > swap. But Linux swap is designed for slow hard disk storage. There are a >> > lot of >> > challenges to efficiently use SSD for swap: >> >> Many of below item could be applied in in-memory swap like zram, zcache. >> >> > >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space >> > lock) >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB >> > flush. This >> > overhead is very high even in a normal 2-socket machine. >> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do >> > swap, >> > which makes swap IO pattern is interleave. Block layer isn't always >> > efficient >> > to do request merge. Such IO pattern also makes swap prefetch hard. >> >> Agreed. >> >> > 4. Swap map scan overhead. 
>> > Swap in-memory map scan scans an array, which is
>> > very inefficient, especially if swap storage is fast.
>>
>> Agreed.
>>

5. SSD related optimization, mainly discard support.

The swap code currently works on individual swap slots, which means it cannot
make good use of discard: a discard has to cover at least 2 pages before it
gives a meaningful performance gain. That figure is for eMMC; an SSD needs even
more pages per discard.

To address this, I am considering the batched discard approach used by
filesystems: *sometimes* scan all the empty slots and issue discards over runs
of contiguous free swap slots that are as long as possible.

What do you think?

Thank you,
Kyungmin Park

P.S. Optimizing swap on eMMC raises almost the same topics, so I'm very
interested in this one.

>> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
>> > aren't always in LRU list adjacently, so page reclaim will not swap such pages
>> > in adjacent storage sectors. This makes swap prefetch hard.
>>
>> One of problem is LRU churning and I wanted to try to fix it.
>> http://marc.info/?l=linux-mm&m=130978831028952&w=4
>>
>> > 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>> > Currently reclaim anonymous page is considering harder than reclaim file pages,
>> > so we bias reclaiming file pages. If there are high speed swap storage, we are
>> > considering doing swap more aggressively.
>>
>> Yeb. We need it. I tried it with extending vm_swappiness to 200.
>>
>> From: Minchan Kim
>> Date: Mon, 3 Dec 2012 16:21:00 +0900
>> Subject: [PATCH] mm: increase swappiness to 200
>>
>> We have thought swap out cost is very high but it's not true
>> if we use fast device like swap-over-zram. Nonetheless, we can
>> swap out 1:1 ratio of anon and page cache at most.
>> It's not enough to use swap device fully so we encounter OOM kill
>> while there are many free space in zram swap device. It's never
>> what we want.
>>
>> This patch makes swap out aggressively.
>> >> Cc: Luigi Semenzato >> Signed-off-by: Minchan Kim >> --- >> kernel/sysctl.c | 3 ++- >> mm/vmscan.c | 6 ++++-- >> 2 files changed, 6 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c >> index 693e0ed..f1dbd9d 100644 >> --- a/kernel/sysctl.c >> +++ b/kernel/sysctl.c >> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2; >> static int __maybe_unused three = 3; >> static unsigned long one_ul = 1; >> static int one_hundred = 100; >> +extern int max_swappiness; >> #ifdef CONFIG_PRINTK >> static int ten_thousand = 10000; >> #endif >> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = { >> .mode = 0644, >> .proc_handler = proc_dointvec_minmax, >> .extra1 = &zero, >> - .extra2 = &one_hundred, >> + .extra2 = &max_swappiness, >> }, >> #ifdef CONFIG_HUGETLB_PAGE >> { >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index 53dcde9..64f3c21 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -53,6 +53,8 @@ >> #define CREATE_TRACE_POINTS >> #include >> >> +int max_swappiness = 200; >> + >> struct scan_control { >> /* Incremented by the number of inactive pages that were scanned >> */ >> unsigned long nr_scanned; >> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control >> *sc) >> return mem_cgroup_swappiness(sc->target_mem_cgroup); >> } >> >> + >> /* >> * Determine how aggressively the anon and file LRU lists should be >> * scanned. The relative value of each set of LRU lists is determined >> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, >> struct scan_control *sc, >> } >> >> /* >> - * With swappiness at 100, anonymous and file have the same >> priority. >> * This scanning priority is essentially the inverse of IO cost. >> */ >> anon_prio = vmscan_swappiness(sc); >> - file_prio = 200 - anon_prio; >> + file_prio = max_swappiness - anon_prio; >> >> /* >> * OK, so we have swap space and a fair amount of page cache >> -- >> 1.7.9.5 >> >> > 8. Huge page swap. 
Huge page swap can solve a lot of problems above, but >> > both >> > THP and hugetlbfs don't support swap. >> >> Another items are indirection layers. Please read Rik's mail below. >> Indirection layers could give many flexibility to backends and helpful >> for defragmentation. >> >> One of idea I am considering is that makes hierarchy swap devides, >> NOT priority-based. I mean currently swap devices are used up by prioirty >> order. >> It's not good fit if we use fast swap and slow swap at the same time. >> I'd like to consume fast swap device (ex, in-memory swap) firstly, then >> I want to migrate some of swap pages from fast swap to slow swap to >> make room for fast swap. It could solve below concern. >> In addition, buffering via in-memory swap could make big chunk which is >> aligned >> to slow device's block size so migration speed from fast swap to slow >> swap >> could be enhanced so wear out problem would go away, too. >> >> Quote from last KS2012 - http://lwn.net/Articles/516538/ >> "Andrea Arcangeli was also concerned that the first pages to be evicted >> from >> memory are, by definition of the LRU page order, the ones that are least >> likely >> to be used in the future. These are the pages that should be going to >> secondary >> storage and more frequently used pages should be going to zcache. As it >> stands, >> zcache may fill up with no-longer-used pages and then the system continues >> to >> move used pages from and to the disk." >> >> From riel@redhat.com Sun Apr 10 17:50:10 2011 >> Date: Sun, 10 Apr 2011 20:50:01 -0400 >> From: Rik van Riel >> To: Linux Memory Management List >> Subject: [LSF/Collab] swap cache redesign idea >> >> On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were >> sitting in the hallway talking about yet more VM things. >> >> During that discussion, we came up with a way to redesign the >> swap cache. During my flight home, I came with ideas on how >> to use that redesign, that may make the changes worthwhile. 
>> >> Currently, the page table entries that have swapped out pages >> associated with them contain a swap entry, pointing directly >> at the swap device and swap slot containing the data. Meanwhile, >> the swap count lives in a separate array. >> >> The redesign we are considering moving the swap entry to the >> page cache radix tree for the swapper_space and having the pte >> contain only the offset into the swapper_space. The swap count >> info can also fit inside the swapper_space page cache radix >> tree (at least on 64 bits - on 32 bits we may need to get >> creative or accept a smaller max amount of swap space). >> >> This extra layer of indirection allows us to do several things: >> >> 1) get rid of the virtual address scanning swapoff; instead >> we just swap the data in and mark the pages as present in >> the swapper_space radix tree > > If radix tree will store all rmap to the pages? If not, how to position > the pages? > >> >> 2) free swap entries as the are read in, without waiting for >> the process to fault it in - this may be useful for memory >> types that have a large erase block >> >> 3) together with the defragmentation from (2), we can always >> do writes in large aligned blocks - the extra indirection >> will make it relatively easy to have special backend code >> for different kinds of swap space, since all the state can >> now live in just one place >> >> 4) skip writeout of zero-filled pages - this can be a big help >> for KVM virtual machines running Windows, since Windows zeroes >> out free pages; simply discarding a zero-filled page is not >> at all simple in the current VM, where we would have to iterate >> over all the ptes to free the swap entry before being able to >> free the swap cache page (I am not sure how that locking would >> even work) >> >> with the extra layer of indirection, the locking for this scheme >> can be trivial - either the faulting process gets the old page, >> or it gets a new one, either way it'll be zero 
filled >> >> 5) skip writeout of pages the guest has marked as free - same as >> above, with the same easier locking >> >> Only one real question remaining - how do we handle the swap count >> in the new scheme? On 64 bit systems we have enough space in the >> radix tree, on 32 bit systems maybe we'll have to start overflowing >> into the "swap_count_continued" logic a little sooner than we are >> now and reduce the maximum swap size a little? >> >> > >> > I had some progresses in these areas recently: >> > http://marc.info/?l=linux-mm&m=134665691021172&w=2 >> > http://marc.info/?l=linux-mm&m=135336039115191&w=2 >> > http://marc.info/?l=linux-mm&m=135882182225444&w=2 >> > http://marc.info/?l=linux-mm&m=135754636926984&w=2 >> > http://marc.info/?l=linux-mm&m=135754634526979&w=2 >> > But a lot of problems remain. I'd like to discuss the issues at the >> > meeting. >> >> I have an interest on this topic. >> Thnaks. >> >> > >> > Thanks, >> > Shaohua >> > >> > -- >> > To unsubscribe, send a message with 'unsubscribe linux-mm' in >> > the body to majordomo@kvack.org. For more info on Linux MM, >> > see: http://www.linux-mm.org/ . >> > Don't email: email@kvack.org >> > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id 7B32D6B0005 for ; Sat, 26 Jan 2013 19:26:36 -0500 (EST) Received: by mail-pb0-f47.google.com with SMTP id rp8so72434pbb.20 for ; Sat, 26 Jan 2013 16:26:35 -0800 (PST) Message-ID: <1359246393.4159.1.camel@kernel> Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD From: Simon Jeons Date: Sat, 26 Jan 2013 18:26:33 -0600 In-Reply-To: References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <1359018598.2866.5.camel@kernel> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Kyungmin Park Cc: Shaohua Li , Minchan Kim , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel On Sat, 2013-01-26 at 13:40 +0900, Kyungmin Park wrote: > Hi, > > On 1/24/13, Simon Jeons wrote: > > Hi Minchan, > > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote: > >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote: > >> > Hi, > >> > > >> > Because of high density, low power and low price, flash storage (SSD) is > >> > a good > >> > candidate to partially replace DRAM. A quick answer for this is using > >> > SSD as > >> > swap. But Linux swap is designed for slow hard disk storage. There are a > >> > lot of > >> > challenges to efficiently use SSD for swap: > >> > >> Many of below item could be applied in in-memory swap like zram, zcache. > >> > >> > > >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space > >> > lock) > >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB > >> > flush. This > >> > overhead is very high even in a normal 2-socket machine. > >> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do > >> > swap, > >> > which makes swap IO pattern is interleave. 
Block layer isn't always
> >> > efficient
> >> > to do request merge. Such IO pattern also makes swap prefetch hard.
> >>
> >> Agreed.
> >>
> >> > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which
> >> > is
> >> > very inefficient, especially if swap storage is fast.
> >>
> >> Agreed.
> >>
>

Hi Kyungmin,

> 5. SSD related optimization, mainly discard support.
>
> Now swap codes are based on each swap slots. it means it can't
> optimize discard feature since getting meaningful performance gain, it
> requires 2 pages at least. Of course it's based on eMMC. In case of
> SSD. it requires more pages to support discard.

Could you explain what the 2 pages (or more pages) you mentioned are used
for? Why are they needed? I'm interested in this.

>
> To address issue. I consider the batched discard approach used at filesystem.
> *Sometime* scan all empty slot and it issues discard continuous swap
> slots as many as possible.
>
> How to you think?
>
> Thank you,
> Kyungmin Park
>
> P.S., It's almost same topics to optimize the eMMC with swap. I mean
> I"m very interested with this topics.
>
> >> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
> >> > pages
> >> > aren't always in LRU list adjacently, so page reclaim will not swap such
> >> > pages
> >> > in adjacent storage sectors. This makes swap prefetch hard.
> >>
> >> One of problem is LRU churning and I wanted to try to fix it.
> >> http://marc.info/?l=linux-mm&m=130978831028952&w=4
> >>
> >> > 7. Alternative page reclaim policy to bias reclaiming anonymous page.
> >> > Currently reclaim anonymous page is considering harder than reclaim file
> >> > pages,
> >> > so we bias reclaiming file pages. If there are high speed swap storage,
> >> > we are
> >> > considering doing swap more aggressively.
> >>
> >> Yeb. We need it. I tried it with extending vm_swappiness to 200.
> >> > >> From: Minchan Kim > >> Date: Mon, 3 Dec 2012 16:21:00 +0900 > >> Subject: [PATCH] mm: increase swappiness to 200 > >> > >> We have thought swap out cost is very high but it's not true > >> if we use fast device like swap-over-zram. Nonetheless, we can > >> swap out 1:1 ratio of anon and page cache at most. > >> It's not enough to use swap device fully so we encounter OOM kill > >> while there are many free space in zram swap device. It's never > >> what we want. > >> > >> This patch makes swap out aggressively. > >> > >> Cc: Luigi Semenzato > >> Signed-off-by: Minchan Kim > >> --- > >> kernel/sysctl.c | 3 ++- > >> mm/vmscan.c | 6 ++++-- > >> 2 files changed, 6 insertions(+), 3 deletions(-) > >> > >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c > >> index 693e0ed..f1dbd9d 100644 > >> --- a/kernel/sysctl.c > >> +++ b/kernel/sysctl.c > >> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2; > >> static int __maybe_unused three = 3; > >> static unsigned long one_ul = 1; > >> static int one_hundred = 100; > >> +extern int max_swappiness; > >> #ifdef CONFIG_PRINTK > >> static int ten_thousand = 10000; > >> #endif > >> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = { > >> .mode = 0644, > >> .proc_handler = proc_dointvec_minmax, > >> .extra1 = &zero, > >> - .extra2 = &one_hundred, > >> + .extra2 = &max_swappiness, > >> }, > >> #ifdef CONFIG_HUGETLB_PAGE > >> { > >> diff --git a/mm/vmscan.c b/mm/vmscan.c > >> index 53dcde9..64f3c21 100644 > >> --- a/mm/vmscan.c > >> +++ b/mm/vmscan.c > >> @@ -53,6 +53,8 @@ > >> #define CREATE_TRACE_POINTS > >> #include > >> > >> +int max_swappiness = 200; > >> + > >> struct scan_control { > >> /* Incremented by the number of inactive pages that were scanned > >> */ > >> unsigned long nr_scanned; > >> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control > >> *sc) > >> return mem_cgroup_swappiness(sc->target_mem_cgroup); > >> } > >> > >> + > >> /* > >> * Determine how aggressively the anon and file 
LRU lists should be > >> * scanned. The relative value of each set of LRU lists is determined > >> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, > >> struct scan_control *sc, > >> } > >> > >> /* > >> - * With swappiness at 100, anonymous and file have the same > >> priority. > >> * This scanning priority is essentially the inverse of IO cost. > >> */ > >> anon_prio = vmscan_swappiness(sc); > >> - file_prio = 200 - anon_prio; > >> + file_prio = max_swappiness - anon_prio; > >> > >> /* > >> * OK, so we have swap space and a fair amount of page cache > >> -- > >> 1.7.9.5 > >> > >> > 8. Huge page swap. Huge page swap can solve a lot of problems above, but > >> > both > >> > THP and hugetlbfs don't support swap. > >> > >> Another items are indirection layers. Please read Rik's mail below. > >> Indirection layers could give many flexibility to backends and helpful > >> for defragmentation. > >> > >> One of idea I am considering is that makes hierarchy swap devides, > >> NOT priority-based. I mean currently swap devices are used up by prioirty > >> order. > >> It's not good fit if we use fast swap and slow swap at the same time. > >> I'd like to consume fast swap device (ex, in-memory swap) firstly, then > >> I want to migrate some of swap pages from fast swap to slow swap to > >> make room for fast swap. It could solve below concern. > >> In addition, buffering via in-memory swap could make big chunk which is > >> aligned > >> to slow device's block size so migration speed from fast swap to slow > >> swap > >> could be enhanced so wear out problem would go away, too. > >> > >> Quote from last KS2012 - http://lwn.net/Articles/516538/ > >> "Andrea Arcangeli was also concerned that the first pages to be evicted > >> from > >> memory are, by definition of the LRU page order, the ones that are least > >> likely > >> to be used in the future. 
These are the pages that should be going to > >> secondary > >> storage and more frequently used pages should be going to zcache. As it > >> stands, > >> zcache may fill up with no-longer-used pages and then the system continues > >> to > >> move used pages from and to the disk." > >> > >> From riel@redhat.com Sun Apr 10 17:50:10 2011 > >> Date: Sun, 10 Apr 2011 20:50:01 -0400 > >> From: Rik van Riel > >> To: Linux Memory Management List > >> Subject: [LSF/Collab] swap cache redesign idea > >> > >> On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were > >> sitting in the hallway talking about yet more VM things. > >> > >> During that discussion, we came up with a way to redesign the > >> swap cache. During my flight home, I came with ideas on how > >> to use that redesign, that may make the changes worthwhile. > >> > >> Currently, the page table entries that have swapped out pages > >> associated with them contain a swap entry, pointing directly > >> at the swap device and swap slot containing the data. Meanwhile, > >> the swap count lives in a separate array. > >> > >> The redesign we are considering moving the swap entry to the > >> page cache radix tree for the swapper_space and having the pte > >> contain only the offset into the swapper_space. The swap count > >> info can also fit inside the swapper_space page cache radix > >> tree (at least on 64 bits - on 32 bits we may need to get > >> creative or accept a smaller max amount of swap space). > >> > >> This extra layer of indirection allows us to do several things: > >> > >> 1) get rid of the virtual address scanning swapoff; instead > >> we just swap the data in and mark the pages as present in > >> the swapper_space radix tree > > > > If radix tree will store all rmap to the pages? If not, how to position > > the pages? 
> > > >> > >> 2) free swap entries as the are read in, without waiting for > >> the process to fault it in - this may be useful for memory > >> types that have a large erase block > >> > >> 3) together with the defragmentation from (2), we can always > >> do writes in large aligned blocks - the extra indirection > >> will make it relatively easy to have special backend code > >> for different kinds of swap space, since all the state can > >> now live in just one place > >> > >> 4) skip writeout of zero-filled pages - this can be a big help > >> for KVM virtual machines running Windows, since Windows zeroes > >> out free pages; simply discarding a zero-filled page is not > >> at all simple in the current VM, where we would have to iterate > >> over all the ptes to free the swap entry before being able to > >> free the swap cache page (I am not sure how that locking would > >> even work) > >> > >> with the extra layer of indirection, the locking for this scheme > >> can be trivial - either the faulting process gets the old page, > >> or it gets a new one, either way it'll be zero filled > >> > >> 5) skip writeout of pages the guest has marked as free - same as > >> above, with the same easier locking > >> > >> Only one real question remaining - how do we handle the swap count > >> in the new scheme? On 64 bit systems we have enough space in the > >> radix tree, on 32 bit systems maybe we'll have to start overflowing > >> into the "swap_count_continued" logic a little sooner than we are > >> now and reduce the maximum swap size a little? > >> > >> > > >> > I had some progresses in these areas recently: > >> > http://marc.info/?l=linux-mm&m=134665691021172&w=2 > >> > http://marc.info/?l=linux-mm&m=135336039115191&w=2 > >> > http://marc.info/?l=linux-mm&m=135882182225444&w=2 > >> > http://marc.info/?l=linux-mm&m=135754636926984&w=2 > >> > http://marc.info/?l=linux-mm&m=135754634526979&w=2 > >> > But a lot of problems remain. 
I'd like to discuss the issues at the > >> > meeting. > >> > >> I have an interest on this topic. > >> Thnaks. > >> > >> > > >> > Thanks, > >> > Shaohua > >> > > >> > -- > >> > To unsubscribe, send a message with 'unsubscribe linux-mm' in > >> > the body to majordomo@kvack.org. For more info on Linux MM, > >> > see: http://www.linux-mm.org/ . > >> > Don't email: email@kvack.org > >> > > > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: email@kvack.org > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx195.postini.com [74.125.245.195]) by kanga.kvack.org (Postfix) with SMTP id 5C9B16B0005 for ; Sun, 27 Jan 2013 09:19:09 -0500 (EST) Received: by mail-pb0-f48.google.com with SMTP id wy12so1006916pbc.7 for ; Sun, 27 Jan 2013 06:19:08 -0800 (PST) Date: Sun, 27 Jan 2013 22:18:53 +0800 From: Shaohua Li Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Message-ID: <20130127141853.GB27019@kernel.org> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <1359018598.2866.5.camel@kernel> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Kyungmin Park Cc: Minchan Kim , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel , Simon Jeons On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote: > Hi, > > On 1/24/13, Simon Jeons wrote: > > Hi Minchan, > > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote: > >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote: > >> > Hi, > >> > > >> > Because of high density, low power and low price, flash storage (SSD) is 
> >> > a good > >> > candidate to partially replace DRAM. A quick answer for this is using > >> > SSD as > >> > swap. But Linux swap is designed for slow hard disk storage. There are a > >> > lot of > >> > challenges to efficiently use SSD for swap: > >> > >> Many of below item could be applied in in-memory swap like zram, zcache. > >> > >> > > >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space > >> > lock) > >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB > >> > flush. This > >> > overhead is very high even in a normal 2-socket machine. > >> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do > >> > swap, > >> > which makes swap IO pattern is interleave. Block layer isn't always > >> > efficient > >> > to do request merge. Such IO pattern also makes swap prefetch hard. > >> > >> Agreed. > >> > >> > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which > >> > is > >> > very inefficient, especially if swap storage is fast. > >> > >> Agreed. > >> > > 5. SSD related optimization, mainly discard support. > > Now swap codes are based on each swap slots. it means it can't > optimize discard feature since getting meaningful performance gain, it > requires 2 pages at least. Of course it's based on eMMC. In case of > SSD. it requires more pages to support discard. > > To address issue. I consider the batched discard approach used at filesystem. > *Sometime* scan all empty slot and it issues discard continuous swap > slots as many as possible. I posted a patch to make discard async before, which is almost good to me, though we still discard a cluster. http://marc.info/?l=linux-mm&m=135087309208120&w=2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
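The batched discard idea quoted above (scan for empty slots, then discard
contiguous runs) can be sketched roughly as follows. This is a standalone
illustration and not kernel code: the swap map is modeled as a plain byte
array where 0 means "free", and struct sketch_run and
sketch_collect_free_runs() are hypothetical names. The min_len parameter
stands in for the "2 pages at least" remark; a real implementation would
issue a discard for each run instead of just recording it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one contiguous range of free swap slots. */
struct sketch_run {
    size_t start;   /* first free slot of the run */
    size_t len;     /* number of consecutive free slots */
};

/*
 * Scan a simplified swap map (0 = free, non-zero = in use) and collect
 * runs of at least min_len consecutive free slots into out[].
 * Returns the number of runs stored, at most max_out.
 */
static size_t sketch_collect_free_runs(const unsigned char *map, size_t nslots,
                                       size_t min_len,
                                       struct sketch_run *out, size_t max_out)
{
    size_t found = 0;
    size_t i = 0;

    assert(map != NULL && out != NULL);

    while (i < nslots && found < max_out) {
        size_t start;

        if (map[i] != 0) {          /* slot in use, keep scanning */
            i++;
            continue;
        }

        start = i;                  /* beginning of a free run */
        while (i < nslots && map[i] == 0)
            i++;

        if (i - start >= min_len) { /* long enough to be worth a discard */
            out[found].start = start;
            out[found].len = i - start;
            found++;
        }
    }
    return found;
}
```

With min_len set to 2, an isolated free slot is skipped and only the longer
runs are reported, so each discard request covers as many slots as possible.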
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx103.postini.com [74.125.245.103]) by kanga.kvack.org (Postfix) with SMTP id 733906B0008 for ; Mon, 28 Jan 2013 02:37:15 -0500 (EST) Received: by mail-ia0-f176.google.com with SMTP id i18so3785698iac.21 for ; Sun, 27 Jan 2013 23:37:14 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20130127141853.GB27019@kernel.org> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <1359018598.2866.5.camel@kernel> <20130127141853.GB27019@kernel.org> Date: Mon, 28 Jan 2013 16:37:14 +0900 Message-ID: Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD From: Kyungmin Park Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: Minchan Kim , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel , Simon Jeons On Sun, Jan 27, 2013 at 11:18 PM, Shaohua Li wrote: > On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote: >> Hi, >> >> On 1/24/13, Simon Jeons wrote: >> > Hi Minchan, >> > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote: >> >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote: >> >> > Hi, >> >> > >> >> > Because of high density, low power and low price, flash storage (SSD) is >> >> > a good >> >> > candidate to partially replace DRAM. A quick answer for this is using >> >> > SSD as >> >> > swap. But Linux swap is designed for slow hard disk storage. There are a >> >> > lot of >> >> > challenges to efficiently use SSD for swap: >> >> >> >> Many of below item could be applied in in-memory swap like zram, zcache. >> >> >> >> > >> >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space >> >> > lock) >> >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB >> >> > flush. This >> >> > overhead is very high even in a normal 2-socket machine. >> >> > 3. Better swap IO pattern. 
Both direct and kswapd page reclaim can do
>> >> > swap,
>> >> > which makes swap IO pattern is interleave. Block layer isn't always
>> >> > efficient
>> >> > to do request merge. Such IO pattern also makes swap prefetch hard.
>> >>
>> >> Agreed.
>> >>
>> >> > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which
>> >> > is
>> >> > very inefficient, especially if swap storage is fast.
>> >>
>> >> Agreed.
>> >>
>>
>> 5. SSD related optimization, mainly discard support.
>>
>> Now swap codes are based on each swap slots. it means it can't
>> optimize discard feature since getting meaningful performance gain, it
>> requires 2 pages at least. Of course it's based on eMMC. In case of
>> SSD. it requires more pages to support discard.
>>
>> To address issue. I consider the batched discard approach used at filesystem.
>> *Sometime* scan all empty slot and it issues discard continuous swap
>> slots as many as possible.
>
> I posted a patch to make discard async before, which is almost good to me, though we
> still discard a cluster.
> http://marc.info/?l=linux-mm&m=135087309208120&w=2

I found your previous patches. It's almost the same concept as batched
discard. I'm testing your patches now.
BTW, which test program do you use? For now we only test some scenarios
and check those scenarios.
There's no generic tool to measure the performance gain.

After testing, I'll share the results.

Thank you,
Kyungmin Park

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx141.postini.com [74.125.245.141]) by kanga.kvack.org (Postfix) with SMTP id 19B9B6B0005 for ; Fri, 1 Feb 2013 07:37:44 -0500 (EST) Received: by mail-ia0-f182.google.com with SMTP id w33so5291283iag.27 for ; Fri, 01 Feb 2013 04:37:43 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <1359018598.2866.5.camel@kernel> <20130127141853.GB27019@kernel.org> Date: Fri, 1 Feb 2013 21:37:43 +0900 Message-ID: Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD From: Kyungmin Park Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: Minchan Kim , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel , Simon Jeons On Mon, Jan 28, 2013 at 4:37 PM, Kyungmin Park wrote: > On Sun, Jan 27, 2013 at 11:18 PM, Shaohua Li wrote: >> On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote: >>> Hi, >>> >>> On 1/24/13, Simon Jeons wrote: >>> > Hi Minchan, >>> > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote: >>> >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote: >>> >> > Hi, >>> >> > >>> >> > Because of high density, low power and low price, flash storage (SSD) is >>> >> > a good >>> >> > candidate to partially replace DRAM. A quick answer for this is using >>> >> > SSD as >>> >> > swap. But Linux swap is designed for slow hard disk storage. There are a >>> >> > lot of >>> >> > challenges to efficiently use SSD for swap: >>> >> >>> >> Many of below item could be applied in in-memory swap like zram, zcache. >>> >> >>> >> > >>> >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space >>> >> > lock) >>> >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB >>> >> > flush. This >>> >> > overhead is very high even in a normal 2-socket machine. >>> >> > 3. Better swap IO pattern. 
Both direct and kswapd page reclaim can do
>>> >> > swap,
>>> >> > which makes swap IO pattern is interleave. Block layer isn't always
>>> >> > efficient
>>> >> > to do request merge. Such IO pattern also makes swap prefetch hard.
>>> >>
>>> >> Agreed.
>>> >>
>>> >> > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which
>>> >> > is
>>> >> > very inefficient, especially if swap storage is fast.
>>> >>
>>> >> Agreed.
>>> >>
>>>
>>> 5. SSD related optimization, mainly discard support.
>>>
>>> Now swap codes are based on each swap slots. it means it can't
>>> optimize discard feature since getting meaningful performance gain, it
>>> requires 2 pages at least. Of course it's based on eMMC. In case of
>>> SSD. it requires more pages to support discard.
>>>
>>> To address issue. I consider the batched discard approach used at filesystem.
>>> *Sometime* scan all empty slot and it issues discard continuous swap
>>> slots as many as possible.
>>
>> I posted a patch to make discard async before, which is almost good to me, though we
>> still discard a cluster.
>> http://marc.info/?l=linux-mm&m=135087309208120&w=2
>
> I found your previous patches, It's almost same concept as batched
> discard. Now I'm testing your patches.
> BTW, which test program do you use? Now we just testing some scenario
> and check scenario only.
> There's no generic tool to measure improved performance gain.
>
> After test, I'll share the results.

Update: it shows a good performance gain, about 4 times over the previous one.
Feel free to add.

Tested-by: Kyungmin Park

>
> Thank you,
> Kyungmin Park

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx145.postini.com [74.125.245.145]) by kanga.kvack.org (Postfix) with SMTP id 2C6B06B0005 for ; Sun, 3 Feb 2013 23:56:16 -0500 (EST) Received: by mail-da0-f49.google.com with SMTP id v40so2459783dad.22 for ; Sun, 03 Feb 2013 20:56:15 -0800 (PST) Date: Sun, 3 Feb 2013 20:56:15 -0800 (PST) From: Hugh Dickins Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD In-Reply-To: <20130127141853.GB27019@kernel.org> Message-ID: References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <1359018598.2866.5.camel@kernel> <20130127141853.GB27019@kernel.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: Kyungmin Park , Minchan Kim , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel , Simon Jeons On Sun, 27 Jan 2013, Shaohua Li wrote: > On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote: > > 5. SSD related optimization, mainly discard support. > > > > Now swap codes are based on each swap slots. it means it can't > > optimize discard feature since getting meaningful performance gain, it > > requires 2 pages at least. Of course it's based on eMMC. In case of > > SSD. it requires more pages to support discard. > > > > To address issue. I consider the batched discard approach used at filesystem. > > *Sometime* scan all empty slot and it issues discard continuous swap > > slots as many as possible. > > I posted a patch to make discard async before, which is almost good to me, > though we still discard a cluster. > http://marc.info/?l=linux-mm&m=135087309208120&w=2 Any reason why you point to 2012/10/22 patch rather than the 2012/11/19? 

Seeing this reminded me to take your 1/2 and 2/2 (of 11/19) out again and
give them a fresh run - though they were easier to apply to 3.8-rc rather
than mmotm with your locking changes, so it was 3.8-rc6 I tried.

As I reported in private mail last year, I wish you'd remove the "buddy"
from description of your 1/2 allocator, that just misled me; but I've not
experienced any problem with the allocator, and I still like the direction
you take with improving swap discard in 2/2.

This time around I've not yet seen any "swap_free: Unused swap offset entry"
messages (despite forgetting to include your later SWAP_MAP_BAD addition to
__swap_duplicate() - I still haven't thought that through to be honest),
but did again get the VM_BUG_ON(error == -EEXIST) in __add_to_swap_cache()
called from add_to_swap() from shrink_page_list().

Since it came after 1.5 hours of load, I didn't give it much thought,
and just went on to test other things, thinking I could easily reproduce
it later; but have failed to do so in many hours since. Still trying.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx145.postini.com [74.125.245.145]) by kanga.kvack.org (Postfix) with SMTP id CD82B6B0002 for ; Tue, 19 Feb 2013 01:15:27 -0500 (EST) Received: by mail-pa0-f53.google.com with SMTP id bg4so3187846pad.26 for ; Mon, 18 Feb 2013 22:15:27 -0800 (PST) Date: Tue, 19 Feb 2013 14:15:12 +0800 From: Shaohua Li Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Message-ID: <20130219061512.GA14921@kernel.org> References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <1359018598.2866.5.camel@kernel> <20130127141853.GB27019@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Kyungmin Park , Minchan Kim , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel , Simon Jeons On Sun, Feb 03, 2013 at 08:56:15PM -0800, Hugh Dickins wrote: > On Sun, 27 Jan 2013, Shaohua Li wrote: > > On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote: > > > 5. SSD related optimization, mainly discard support. > > > > > > Now swap codes are based on each swap slots. it means it can't > > > optimize discard feature since getting meaningful performance gain, it > > > requires 2 pages at least. Of course it's based on eMMC. In case of > > > SSD. it requires more pages to support discard. > > > > > > To address issue. I consider the batched discard approach used at filesystem. > > > *Sometime* scan all empty slot and it issues discard continuous swap > > > slots as many as possible. > > > > I posted a patch to make discard async before, which is almost good to me, > > though we still discard a cluster. > > http://marc.info/?l=linux-mm&m=135087309208120&w=2 > > Any reason why you point to 2012/10/22 patch rather than the 2012/11/19? 
>
> Seeing this reminded me to take your 1/2 and 2/2 (of 11/19) out again and
> give them a fresh run - though they were easier to apply to 3.8-rc rather
> than mmotm with your locking changes, so it was 3.8-rc6 I tried.
>
> As I reported in private mail last year, I wish you'd remove the "buddy"
> from description of your 1/2 allocator, that just misled me; but I've not
> experienced any problem with the allocator, and I still like the direction
> you take with improving swap discard in 2/2.
>
> This time around I've not yet seen any "swap_free: Unused swap offset entry"
> messages (despite forgetting to include your later SWAP_MAP_BAD addition to
> __swap_duplicate() - I still haven't thought that through to be honest),
> but did again get the VM_BUG_ON(error == -EEXIST) in __add_to_swap_cache()
> called from add_to_swap() from shrink_page_list().
>
> Since it came after 1.5 hours of load, I didn't give it much thought,
> and just went on to test other things, thinking I could easily reproduce
> it later; but have failed to do so in many hours since. Still trying.

Missed this mail, sorry. I'm planning to repost the patches against linux-next (because
of the locking changes) and will include the SWAP_MAP_BAD change. I did see
problems without the SWAP_MAP_BAD change.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id 3E6BB6B0002 for ; Tue, 19 Feb 2013 14:42:41 -0500 (EST) Received: by mail-pb0-f44.google.com with SMTP id wz12so2437833pbc.17 for ; Tue, 19 Feb 2013 11:42:40 -0800 (PST) Date: Tue, 19 Feb 2013 11:41:53 -0800 (PST) From: Hugh Dickins Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD In-Reply-To: <20130219061512.GA14921@kernel.org> Message-ID: References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <1359018598.2866.5.camel@kernel> <20130127141853.GB27019@kernel.org> <20130219061512.GA14921@kernel.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: Kyungmin Park , Minchan Kim , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel , Simon Jeons On Tue, 19 Feb 2013, Shaohua Li wrote: > On Sun, Feb 03, 2013 at 08:56:15PM -0800, Hugh Dickins wrote: > > > > Seeing this reminded me to take your 1/2 and 2/2 (of 11/19) out again and > > give them a fresh run - though they were easier to apply to 3.8-rc rather > > than mmotm with your locking changes, so it was 3.8-rc6 I tried. > > > > As I reported in private mail last year, I wish you'd remove the "buddy" > > from description of your 1/2 allocator, that just misled me; but I've not > > experienced any problem with the allocator, and I still like the direction > > you take with improving swap discard in 2/2. > > > > This time around I've not yet seen any "swap_free: Unused swap offset entry" > > messages (despite forgetting to include your later SWAP_MAP_BAD addition to > > __swap_duplicate() - I still haven't thought that through to be honest), > > but did again get the VM_BUG_ON(error == -EEXIST) in __add_to_swap_cache() > > called from add_to_swap() from shrink_page_list(). 
> > > > Since it came after 1.5 hours of load, I didn't give it much thought, > > and just went on to test other things, thinking I could easily reproduce > > it later; but have failed to do so in many hours since. Still trying. > > Missed this mail, sorry. I'm planing to repost the patches against linux-next (because > of the locking changes) and will include the SWAP_MAP_BAD change. I did see > problems without the SWAP_MAP_BAD change. Good, I'll take a look at them then. I did manage to hit the VM_BUG_ON(error == -EEXIST) in __add_to_swap_cache() again with those patches, and verified that there really was another page sitting in its radix_tree slot. Although I've never succeeded in reproducing this without your patches, I'm pretty sure they're not to blame, and that they perhaps just alter the timing in such a way as to make this more likely to happen. I believe (without actual evidence) that it's a race with swapin_readahead(): its read_swap_cache_async() coming in and reading into its own page, in between the swap slot being allocated from the swap_map with SWAP_HAS_CACHE and add_to_swap()'s page actually being inserted into the swap cache. I've not prepared a fix for it yet, but it shouldn't be a worry. Something I learnt in looking through the radix_tree to find the right slot, a benefit of your per-device swapper_spaces that we had not anticipated: once you have multiple swap areas (because the swp_entry_t is arranged with the "type" at the top to get the offsets contiguous), the single-swapper_space radix_tree becomes very sparse, with matching high height and lots of silly levels of radix_tree_nodes - I had to go down 10 levels, despite having only two 1.5GB swap areas. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
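The 10-level figure in Hugh's mail is consistent with simple arithmetic on the width of the radix_tree index. The following is a minimal sketch, not kernel code, and its constants are assumptions rather than values stated in the thread: 4K pages, 6 index bits per radix_tree_node (64 slots), and a 64-bit swp_entry_t with the "type" shifted up by 57 bits. The helper names are made up for illustration.

```python
PAGE_SIZE = 4096          # assumed: 4K base pages
MAP_SHIFT = 6             # assumed: 64 slots per radix_tree_node
SWP_TYPE_SHIFT = 57       # assumed: 64-bit entry, "type" kept in the top bits


def swp_entry(swp_type: int, offset: int) -> int:
    """Hypothetical helper: pack (type, offset) the way the mail describes."""
    return (swp_type << SWP_TYPE_SHIFT) | offset


def radix_height(max_index: int, map_shift: int = MAP_SHIFT) -> int:
    """Levels needed to reach max_index when each level consumes map_shift bits."""
    bits = max(max_index.bit_length(), 1)
    return -(-bits // map_shift)  # ceiling division


pages_per_area = (3 * 2**30 // 2) // PAGE_SIZE   # slots in one 1.5GB swap area
last_offset = pages_per_area - 1

# Tree indexed by offset alone (a hypothetical per-area layout).
height_offset_only = radix_height(last_offset)
# Single shared tree: the second area (type 1) sets a bit near the top of the index.
height_with_type = radix_height(swp_entry(1, last_offset))

print(height_offset_only, height_with_type)  # prints "4 10"
```

Under these assumptions the last slot of the second area needs 58 index bits, which gives 10 levels, while the offsets of a 1.5GB area on their own fit in 19 bits, or 4 levels.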
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx171.postini.com [74.125.245.171]) by kanga.kvack.org (Postfix) with SMTP id 53E046B0027 for ; Fri, 15 Mar 2013 05:39:47 -0400 (EDT) Received: by mail-yh0-f48.google.com with SMTP id q12so552488yhf.21 for ; Fri, 15 Mar 2013 02:39:46 -0700 (PDT) Message-ID: <5142EC5A.4010509@gmail.com> Date: Fri, 15 Mar 2013 17:39:38 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD References: <20130122065341.GA1850@kernel.org> In-Reply-To: <20130122065341.GA1850@kernel.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Hugh Dickins , Minchan Kim , Rik van Riel On 01/22/2013 02:53 PM, Shaohua Li wrote: > Hi, > > Because of high density, low power and low price, flash storage (SSD) is a good > candidate to partially replace DRAM. A quick answer for this is using SSD as > swap. But Linux swap is designed for slow hard disk storage. There are a lot of > challenges to efficiently use SSD for swap: > > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This > overhead is very high even in a normal 2-socket machine. > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap, > which makes swap IO pattern is interleave. Block layer isn't always efficient > to do request merge. Such IO pattern also makes swap prefetch hard. > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is > very inefficient, especially if swap storage is fast. > 5. SSD related optimization, mainly discard support > 6. Better swap prefetch algorithm. 
Besides item 3, sequentially accessed pages > aren't always in LRU list adjacently, so page reclaim will not swap such pages > in adjacent storage sectors. This makes swap prefetch hard. > 7. Alternative page reclaim policy to bias reclaiming anonymous page. > Currently reclaim anonymous page is considering harder than reclaim file pages, > so we bias reclaiming file pages. If there are high speed swap storage, we are > considering doing swap more aggressively. > 8. Huge page swap. Huge page swap can solve a lot of problems above, but both > THP and hugetlbfs don't support swap. Could you tell me in which workloads the inability to swap out hugetlb/THP pages hurts your performance? Is it worth it? > > I had some progresses in these areas recently: > http://marc.info/?l=linux-mm&m=134665691021172&w=2 > http://marc.info/?l=linux-mm&m=135336039115191&w=2 > http://marc.info/?l=linux-mm&m=135882182225444&w=2 > http://marc.info/?l=linux-mm&m=135754636926984&w=2 > http://marc.info/?l=linux-mm&m=135754634526979&w=2 > But a lot of problems remain. I'd like to discuss the issues at the meeting. > > Thanks, > Shaohua > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx137.postini.com [74.125.245.137]) by kanga.kvack.org (Postfix) with SMTP id 1DC226B003D for ; Mon, 18 Mar 2013 06:38:42 -0400 (EDT) Message-ID: <5146EEA5.4030003@oracle.com> Date: Mon, 18 Mar 2013 18:38:29 +0800 From: Bob Liu MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD References: <20130122065341.GA1850@kernel.org> <5142EC5A.4010509@gmail.com> In-Reply-To: <5142EC5A.4010509@gmail.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Shaohua Li , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Hugh Dickins , Minchan Kim , Rik van Riel , dan.magenheimer@oracle.com, sjenning@linux.vnet.ibm.com, rcj@linux.vnet.ibm.com On 03/15/2013 05:39 PM, Simon Jeons wrote: > On 01/22/2013 02:53 PM, Shaohua Li wrote: >> Hi, >> >> Because of high density, low power and low price, flash storage (SSD) >> is a good >> candidate to partially replace DRAM. A quick answer for this is using >> SSD as >> swap. But Linux swap is designed for slow hard disk storage. There are >> a lot of >> challenges to efficiently use SSD for swap: >> >> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) >> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB >> flush. This >> overhead is very high even in a normal 2-socket machine. >> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do >> swap, >> which makes swap IO pattern is interleave. Block layer isn't always >> efficient >> to do request merge. Such IO pattern also makes swap prefetch hard. >> 4. Swap map scan overhead. Swap in-memory map scan scans an array, >> which is >> very inefficient, especially if swap storage is fast. >> 5. SSD related optimization, mainly discard support >> 6. Better swap prefetch algorithm. 
Besides item 3, sequentially >> accessed pages >> aren't always in LRU list adjacently, so page reclaim will not swap >> such pages >> in adjacent storage sectors. This makes swap prefetch hard. >> 7. Alternative page reclaim policy to bias reclaiming anonymous page. >> Currently reclaim anonymous page is considering harder than reclaim >> file pages, >> so we bias reclaiming file pages. If there are high speed swap >> storage, we are >> considering doing swap more aggressively. >> 8. Huge page swap. Huge page swap can solve a lot of problems above, >> but both >> THP and hugetlbfs don't support swap. > > Could you tell me in which workload hugetlb/thp pages can't swapout > influence your performance? Is it worth? > I'm also very interested in this workload. I think hugetlb/THP pages can be a potential user of zprojects like zswap/zcache. We can try to compress those pages before breaking them into normal pages. -- Regards, -Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id 126CE6B0005 for ; Mon, 18 Mar 2013 21:27:42 -0400 (EDT) Received: by mail-pd0-f171.google.com with SMTP id 10so1008806pdc.16 for ; Mon, 18 Mar 2013 18:27:42 -0700 (PDT) Date: Tue, 19 Mar 2013 09:27:25 +0800 From: Shaohua Li Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Message-ID: <20130319012725.GA28880@kernel.org> References: <20130122065341.GA1850@kernel.org> <5142EC5A.4010509@gmail.com> <5146EEA5.4030003@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5146EEA5.4030003@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Bob Liu Cc: Simon Jeons , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Hugh Dickins , Minchan Kim , Rik van Riel , dan.magenheimer@oracle.com, sjenning@linux.vnet.ibm.com, rcj@linux.vnet.ibm.com On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote: > > On 03/15/2013 05:39 PM, Simon Jeons wrote: > > On 01/22/2013 02:53 PM, Shaohua Li wrote: > >> Hi, > >> > >> Because of high density, low power and low price, flash storage (SSD) > >> is a good > >> candidate to partially replace DRAM. A quick answer for this is using > >> SSD as > >> swap. But Linux swap is designed for slow hard disk storage. There are > >> a lot of > >> challenges to efficiently use SSD for swap: > >> > >> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > >> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB > >> flush. This > >> overhead is very high even in a normal 2-socket machine. > >> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do > >> swap, > >> which makes swap IO pattern is interleave. Block layer isn't always > >> efficient > >> to do request merge. Such IO pattern also makes swap prefetch hard. > >> 4. 
Swap map scan overhead. Swap in-memory map scan scans an array, > >> which is > >> very inefficient, especially if swap storage is fast. > >> 5. SSD related optimization, mainly discard support > >> 6. Better swap prefetch algorithm. Besides item 3, sequentially > >> accessed pages > >> aren't always in LRU list adjacently, so page reclaim will not swap > >> such pages > >> in adjacent storage sectors. This makes swap prefetch hard. > >> 7. Alternative page reclaim policy to bias reclaiming anonymous page. > >> Currently reclaim anonymous page is considering harder than reclaim > >> file pages, > >> so we bias reclaiming file pages. If there are high speed swap > >> storage, we are > >> considering doing swap more aggressively. > >> 8. Huge page swap. Huge page swap can solve a lot of problems above, > >> but both > >> THP and hugetlbfs don't support swap. > > > > Could you tell me in which workload hugetlb/thp pages can't swapout > > influence your performance? Is it worth? > > > > I'm also very interesting in this workload. > I think hugetlb/thp pages can be a potential user of zprojects like > zswap/zcache. > We can try to compress those pages before breaking them to normal pages. I don't have a particular workload, and for obvious reasons I don't have data. What I expect from swapping out hugetlb/THP pages is to reduce some overheads (e.g., TLB flushes) and improve the IO pattern. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id D91486B0006 for ; Mon, 18 Mar 2013 21:32:46 -0400 (EDT) Received: by mail-da0-f41.google.com with SMTP id w4so71691dam.28 for ; Mon, 18 Mar 2013 18:32:46 -0700 (PDT) Message-ID: <5147C037.5020707@gmail.com> Date: Tue, 19 Mar 2013 09:32:39 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD References: <20130122065341.GA1850@kernel.org> <5142EC5A.4010509@gmail.com> <5146EEA5.4030003@oracle.com> <20130319012725.GA28880@kernel.org> In-Reply-To: <20130319012725.GA28880@kernel.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: Bob Liu , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Hugh Dickins , Minchan Kim , Rik van Riel , dan.magenheimer@oracle.com, sjenning@linux.vnet.ibm.com, rcj@linux.vnet.ibm.com Hi Shaohua, On 03/19/2013 09:27 AM, Shaohua Li wrote: > On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote: >> On 03/15/2013 05:39 PM, Simon Jeons wrote: >>> On 01/22/2013 02:53 PM, Shaohua Li wrote: >>>> Hi, >>>> >>>> Because of high density, low power and low price, flash storage (SSD) >>>> is a good >>>> candidate to partially replace DRAM. A quick answer for this is using >>>> SSD as >>>> swap. But Linux swap is designed for slow hard disk storage. There are >>>> a lot of >>>> challenges to efficiently use SSD for swap: >>>> >>>> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) >>>> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB >>>> flush. This >>>> overhead is very high even in a normal 2-socket machine. >>>> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do >>>> swap, >>>> which makes swap IO pattern is interleave. 
Block layer isn't always >>>> efficient >>>> to do request merge. Such IO pattern also makes swap prefetch hard. >>>> 4. Swap map scan overhead. Swap in-memory map scan scans an array, >>>> which is >>>> very inefficient, especially if swap storage is fast. >>>> 5. SSD related optimization, mainly discard support >>>> 6. Better swap prefetch algorithm. Besides item 3, sequentially >>>> accessed pages >>>> aren't always in LRU list adjacently, so page reclaim will not swap >>>> such pages >>>> in adjacent storage sectors. This makes swap prefetch hard. >>>> 7. Alternative page reclaim policy to bias reclaiming anonymous page. >>>> Currently reclaim anonymous page is considering harder than reclaim >>>> file pages, >>>> so we bias reclaiming file pages. If there are high speed swap >>>> storage, we are >>>> considering doing swap more aggressively. >>>> 8. Huge page swap. Huge page swap can solve a lot of problems above, >>>> but both >>>> THP and hugetlbfs don't support swap. >>> Could you tell me in which workload hugetlb/thp pages can't swapout >>> influence your performance? Is it worth? >>> >> I'm also very interesting in this workload. >> I think hugetlb/thp pages can be a potential user of zprojects like >> zswap/zcache. >> We can try to compress those pages before breaking them to normal pages. > I don't have particular workload and don't have data for obvious reason. What I > expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and > improve IO pattern. Do you have any idea about how to implement this feature? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id 10B7E6B0005 for ; Tue, 19 Mar 2013 00:25:46 -0400 (EDT) Received: from /spool/local by e23smtp08.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 19 Mar 2013 14:23:43 +1000 Received: from d23relay05.au.ibm.com (d23relay05.au.ibm.com [9.190.235.152]) by d23dlp01.au.ibm.com (Postfix) with ESMTP id 139062CE8052 for ; Tue, 19 Mar 2013 15:25:39 +1100 (EST) Received: from d23av02.au.ibm.com (d23av02.au.ibm.com [9.190.235.138]) by d23relay05.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r2J4Cc8l262622 for ; Tue, 19 Mar 2013 15:12:38 +1100 Received: from d23av02.au.ibm.com (loopback [127.0.0.1]) by d23av02.au.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r2J4Pcvt004093 for ; Tue, 19 Mar 2013 15:25:38 +1100 Date: Tue, 19 Mar 2013 12:25:36 +0800 From: Wanpeng Li Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Message-ID: <20130319042536.GA4700@hacker.(null)> Reply-To: Wanpeng Li References: <20130122065341.GA1850@kernel.org> <5142EC5A.4010509@gmail.com> <5146EEA5.4030003@oracle.com> <20130319012725.GA28880@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130319012725.GA28880@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: Bob Liu , Simon Jeons , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Hugh Dickins , Minchan Kim , Rik van Riel , dan.magenheimer@oracle.com, sjenning@linux.vnet.ibm.com, rcj@linux.vnet.ibm.com On Tue, Mar 19, 2013 at 09:27:25AM +0800, Shaohua Li wrote: >On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote: >> >> On 03/15/2013 05:39 PM, Simon Jeons wrote: >> > On 01/22/2013 02:53 PM, Shaohua Li wrote: >> >> Hi, >> >> >> >> Because of high density, low power and low price, flash storage 
(SSD) >> >> is a good >> >> candidate to partially replace DRAM. A quick answer for this is using >> >> SSD as >> >> swap. But Linux swap is designed for slow hard disk storage. There are >> >> a lot of >> >> challenges to efficiently use SSD for swap: >> >> >> >> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) >> >> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB >> >> flush. This >> >> overhead is very high even in a normal 2-socket machine. >> >> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do >> >> swap, >> >> which makes swap IO pattern is interleave. Block layer isn't always >> >> efficient >> >> to do request merge. Such IO pattern also makes swap prefetch hard. >> >> 4. Swap map scan overhead. Swap in-memory map scan scans an array, >> >> which is >> >> very inefficient, especially if swap storage is fast. >> >> 5. SSD related optimization, mainly discard support >> >> 6. Better swap prefetch algorithm. Besides item 3, sequentially >> >> accessed pages >> >> aren't always in LRU list adjacently, so page reclaim will not swap >> >> such pages >> >> in adjacent storage sectors. This makes swap prefetch hard. >> >> 7. Alternative page reclaim policy to bias reclaiming anonymous page. >> >> Currently reclaim anonymous page is considering harder than reclaim >> >> file pages, >> >> so we bias reclaiming file pages. If there are high speed swap >> >> storage, we are >> >> considering doing swap more aggressively. >> >> 8. Huge page swap. Huge page swap can solve a lot of problems above, >> >> but both >> >> THP and hugetlbfs don't support swap. >> > >> > Could you tell me in which workload hugetlb/thp pages can't swapout >> > influence your performance? Is it worth? >> > >> >> I'm also very interesting in this workload. >> I think hugetlb/thp pages can be a potential user of zprojects like >> zswap/zcache. >> We can try to compress those pages before breaking them to normal pages. 
> >I don't have particular workload and don't have data for obvious reason. What I >expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and >improve IO pattern. Hi Shaohua and Bob, I'm doing this work currently. :-) Regards, Wanpeng Li > >-- >To unsubscribe, send a message with 'unsubscribe linux-mm' in >the body to majordomo@kvack.org. For more info on Linux MM, >see: http://www.linux-mm.org/ . >Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx140.postini.com [74.125.245.140]) by kanga.kvack.org (Postfix) with SMTP id 70DF06B0005 for ; Tue, 19 Mar 2013 01:57:47 -0400 (EDT) Received: by mail-pb0-f44.google.com with SMTP id wz12so132467pbc.3 for ; Mon, 18 Mar 2013 22:57:46 -0700 (PDT) Date: Tue, 19 Mar 2013 13:57:06 +0800 From: Shaohua Li Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Message-ID: <20130319055706.GA24130@kernel.org> References: <20130122065341.GA1850@kernel.org> <5142EC5A.4010509@gmail.com> <5146EEA5.4030003@oracle.com> <20130319012725.GA28880@kernel.org> <5147C037.5020707@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5147C037.5020707@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Bob Liu , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Hugh Dickins , Minchan Kim , Rik van Riel , dan.magenheimer@oracle.com, sjenning@linux.vnet.ibm.com, rcj@linux.vnet.ibm.com On Tue, Mar 19, 2013 at 09:32:39AM +0800, Simon Jeons wrote: > Hi Shaohua, > On 03/19/2013 09:27 AM, Shaohua Li wrote: > >On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote: > >>On 03/15/2013 05:39 PM, Simon Jeons wrote: > >>>On 01/22/2013 02:53 PM, Shaohua Li wrote: > >>>>Hi, > >>>> > >>>>Because of high 
density, low power and low price, flash storage (SSD) > >>>>is a good > >>>>candidate to partially replace DRAM. A quick answer for this is using > >>>>SSD as > >>>>swap. But Linux swap is designed for slow hard disk storage. There are > >>>>a lot of > >>>>challenges to efficiently use SSD for swap: > >>>> > >>>>1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) > >>>>2. TLB flush overhead. To reclaim one page, we need at least 2 TLB > >>>>flush. This > >>>>overhead is very high even in a normal 2-socket machine. > >>>>3. Better swap IO pattern. Both direct and kswapd page reclaim can do > >>>>swap, > >>>>which makes swap IO pattern is interleave. Block layer isn't always > >>>>efficient > >>>>to do request merge. Such IO pattern also makes swap prefetch hard. > >>>>4. Swap map scan overhead. Swap in-memory map scan scans an array, > >>>>which is > >>>>very inefficient, especially if swap storage is fast. > >>>>5. SSD related optimization, mainly discard support > >>>>6. Better swap prefetch algorithm. Besides item 3, sequentially > >>>>accessed pages > >>>>aren't always in LRU list adjacently, so page reclaim will not swap > >>>>such pages > >>>>in adjacent storage sectors. This makes swap prefetch hard. > >>>>7. Alternative page reclaim policy to bias reclaiming anonymous page. > >>>>Currently reclaim anonymous page is considering harder than reclaim > >>>>file pages, > >>>>so we bias reclaiming file pages. If there are high speed swap > >>>>storage, we are > >>>>considering doing swap more aggressively. > >>>>8. Huge page swap. Huge page swap can solve a lot of problems above, > >>>>but both > >>>>THP and hugetlbfs don't support swap. > >>>Could you tell me in which workload hugetlb/thp pages can't swapout > >>>influence your performance? Is it worth? > >>> > >>I'm also very interesting in this workload. > >>I think hugetlb/thp pages can be a potential user of zprojects like > >>zswap/zcache. 
> >>We can try to compress those pages before breaking them to normal pages. > >I don't have particular workload and don't have data for obvious reason. What I > >expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and > >improve IO pattern. > Do you have any idea about implement this feature? I haven't looked at hugetlb yet, but for THP, maybe it's overkill to really do 2M page swapping. My idea is to provide a special version of add_to_swap + try_to_unmap in page reclaim. We still split the huge page, but during the split we also do the 'unmap' to avoid unnecessary TLB flushes. During the split, tail pages should be added back to the page_list of shrink_page_list() instead of the LRU list, so the tail pages can be paged out soon. In this way, we can use the existing swap code (without having to change arch code or swap space allocation, for example) and reach my goal (reduce TLB flushes and improve the IO pattern). That said, I haven't done any coding yet, so this might actually be just wrong, but I'll try it some time. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
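The list-handling part of the idea above can be illustrated with a toy model. This is only a sketch of where the tail pages go after a split. It is not kernel code, it does not model unmapping or TLB flushes, and Shaohua states that nothing had been coded at the time. The function name, the tuple representation of pages and the 512 subpage count (2M / 4K) are assumptions made for illustration.

```python
from collections import deque

HPAGE_NR = 512  # assumed: one 2M huge page splits into 512 4K pages


def reclaim_pass(page_list, tails_to_page_list):
    """Toy model of one pass over a reclaim page_list.

    Pages are tuples: ("huge", id) or ("base", id, index).
    Returns (pages written out in this pass, pages put back on the LRU).
    """
    work = deque(page_list)
    written, back_on_lru = [], []
    while work:
        page = work.popleft()
        if page[0] == "huge":
            subpages = [("base", page[1], i) for i in range(HPAGE_NR)]
            head, tails = subpages[0], subpages[1:]
            if tails_to_page_list:
                # Proposed: keep the tails on the local list, right behind the head.
                work.extendleft(reversed(tails))
            else:
                # Alternative in this model: tails go to the LRU and wait for a later pass.
                back_on_lru.extend(tails)
            work.appendleft(head)
            continue
        written.append(page)
    return written, back_on_lru


written, deferred = reclaim_pass([("huge", 0), ("base", 1, 0)], tails_to_page_list=True)
print(len(written), len(deferred))
```

In this model, keeping the tails on the local list means all 512 subpages are written out in the same pass and in order, which is the property that would help the IO pattern. Sending the tails to the LRU leaves 511 of them for a later pass.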
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id E1E926B0005 for ; Tue, 19 Mar 2013 02:10:14 -0400 (EDT) Received: by mail-pb0-f46.google.com with SMTP id uo15so136969pbc.33 for ; Mon, 18 Mar 2013 23:10:14 -0700 (PDT) Message-ID: <5148013F.6090703@gmail.com> Date: Tue, 19 Mar 2013 14:10:07 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD References: <20130122065341.GA1850@kernel.org> <5142EC5A.4010509@gmail.com> <5146EEA5.4030003@oracle.com> <20130319012725.GA28880@kernel.org> <5147C037.5020707@gmail.com> <20130319055706.GA24130@kernel.org> In-Reply-To: <20130319055706.GA24130@kernel.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Shaohua Li Cc: Bob Liu , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Hugh Dickins , Minchan Kim , Rik van Riel , dan.magenheimer@oracle.com, sjenning@linux.vnet.ibm.com, rcj@linux.vnet.ibm.com Hi Shaohua, On 03/19/2013 01:57 PM, Shaohua Li wrote: > On Tue, Mar 19, 2013 at 09:32:39AM +0800, Simon Jeons wrote: >> Hi Shaohua, >> On 03/19/2013 09:27 AM, Shaohua Li wrote: >>> On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote: >>>> On 03/15/2013 05:39 PM, Simon Jeons wrote: >>>>> On 01/22/2013 02:53 PM, Shaohua Li wrote: >>>>>> Hi, >>>>>> >>>>>> Because of high density, low power and low price, flash storage (SSD) >>>>>> is a good >>>>>> candidate to partially replace DRAM. A quick answer for this is using >>>>>> SSD as >>>>>> swap. But Linux swap is designed for slow hard disk storage. There are >>>>>> a lot of >>>>>> challenges to efficiently use SSD for swap: >>>>>> >>>>>> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock) >>>>>> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB >>>>>> flush. 
This >>>>>> overhead is very high even in a normal 2-socket machine. >>>>>> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do >>>>>> swap, >>>>>> which makes swap IO pattern is interleave. Block layer isn't always >>>>>> efficient >>>>>> to do request merge. Such IO pattern also makes swap prefetch hard. >>>>>> 4. Swap map scan overhead. Swap in-memory map scan scans an array, >>>>>> which is >>>>>> very inefficient, especially if swap storage is fast. >>>>>> 5. SSD related optimization, mainly discard support >>>>>> 6. Better swap prefetch algorithm. Besides item 3, sequentially >>>>>> accessed pages >>>>>> aren't always in LRU list adjacently, so page reclaim will not swap >>>>>> such pages >>>>>> in adjacent storage sectors. This makes swap prefetch hard. >>>>>> 7. Alternative page reclaim policy to bias reclaiming anonymous page. >>>>>> Currently reclaim anonymous page is considering harder than reclaim >>>>>> file pages, >>>>>> so we bias reclaiming file pages. If there are high speed swap >>>>>> storage, we are >>>>>> considering doing swap more aggressively. >>>>>> 8. Huge page swap. Huge page swap can solve a lot of problems above, >>>>>> but both >>>>>> THP and hugetlbfs don't support swap. >>>>> Could you tell me in which workload hugetlb/thp pages can't swapout >>>>> influence your performance? Is it worth? >>>>> >>>> I'm also very interesting in this workload. >>>> I think hugetlb/thp pages can be a potential user of zprojects like >>>> zswap/zcache. >>>> We can try to compress those pages before breaking them to normal pages. >>> I don't have particular workload and don't have data for obvious reason. What I >>> expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and >>> improve IO pattern. >> Do you have any idea about implement this feature? > Didn't look at hugetlb yet, but for THP, maybe it's an overkill to really do 2M > page swapping. 
My idea is to provide a special version of add_to_swap + > try_to_unmap in page reclaim. We still do huge page split, but in the split, we > also do 'unmap' to reduce unnecessary TLB flush. In the split, tail pages > should be added back to page_list of shrink_page_list() instead of lru list, so > tail pages can be pageout soon. In this way, we can use existing swap code (not > bothering changing arch code and swap space allocation for example) and reach > my goal (reduce tlb flush and improve IO pattern). But that said, I didn't do > any coding yet, this might be just wrong actually, but I'll try some time. What will happen when swapin? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wanpeng Li Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD Date: Tue, 19 Mar 2013 12:25:36 +0800 Message-ID: <44411.2792647958$1363667176@news.gmane.org> References: <20130122065341.GA1850@kernel.org> <5142EC5A.4010509@gmail.com> <5146EEA5.4030003@oracle.com> <20130319012725.GA28880@kernel.org> Reply-To: Wanpeng Li Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from kanga.kvack.org ([205.233.56.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1UHo7p-0008L9-Bh for glkm-linux-mm-2@m.gmane.org; Tue, 19 Mar 2013 05:26:13 +0100 Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id 10B7E6B0005 for ; Tue, 19 Mar 2013 00:25:46 -0400 (EDT) Received: from /spool/local by e23smtp08.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Tue, 19 Mar 2013 14:23:43 +1000
Received: from d23relay05.au.ibm.com (d23relay05.au.ibm.com [9.190.235.152]) by d23dlp01.au.ibm.com (Postfix) with ESMTP id 139062CE8052 for ; Tue, 19 Mar 2013 15:25:39 +1100 (EST)
Received: from d23av02.au.ibm.com (d23av02.au.ibm.com [9.190.235.138]) by d23relay05.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r2J4Cc8l262622 for ; Tue, 19 Mar 2013 15:12:38 +1100
Received: from d23av02.au.ibm.com (loopback [127.0.0.1]) by d23av02.au.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r2J4Pcvt004093 for ; Tue, 19 Mar 2013 15:25:38 +1100
Content-Disposition: inline
In-Reply-To: <20130319012725.GA28880@kernel.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Shaohua Li
Cc: Bob Liu, Simon Jeons, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Hugh Dickins, Minchan Kim, Rik van Riel, dan.magenheimer@oracle.com, sjenning@linux.vnet.ibm.com, rcj@linux.vnet.ibm.com

On Tue, Mar 19, 2013 at 09:27:25AM +0800, Shaohua Li wrote:
>On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote:
>>
>> On 03/15/2013 05:39 PM, Simon Jeons wrote:
>> > On 01/22/2013 02:53 PM, Shaohua Li wrote:
>> >> Hi,
>> >>
>> >> Because of high density, low power and low price, flash storage (SSD) is a good
>> >> candidate to partially replace DRAM. A quick answer for this is using SSD as
>> >> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
>> >> challenges to efficiently use SSD for swap:
>> >>
>> >> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
>> >> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
>> >> overhead is very high even in a normal 2-socket machine.
>> >> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
>> >> which makes swap IO pattern is interleave. Block layer isn't always efficient
>> >> to do request merge. Such IO pattern also makes swap prefetch hard.
>> >> 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is
>> >> very inefficient, especially if swap storage is fast.
>> >> 5. SSD related optimization, mainly discard support
>> >> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
>> >> aren't always in LRU list adjacently, so page reclaim will not swap such pages
>> >> in adjacent storage sectors. This makes swap prefetch hard.
>> >> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>> >> Currently reclaim anonymous page is considering harder than reclaim file pages,
>> >> so we bias reclaiming file pages. If there are high speed swap storage, we are
>> >> considering doing swap more aggressively.
>> >> 8. Huge page swap. Huge page swap can solve a lot of problems above, but both
>> >> THP and hugetlbfs don't support swap.
>> >
>> > Could you tell me in which workload hugetlb/thp pages can't swapout
>> > influence your performance? Is it worth?
>> >
>>
>> I'm also very interesting in this workload.
>> I think hugetlb/thp pages can be a potential user of zprojects like
>> zswap/zcache.
>> We can try to compress those pages before breaking them to normal pages.
>
>I don't have particular workload and don't have data for obvious reason. What I
>expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and
>improve IO pattern.

Hi Shaohua and Bob,

I'm doing this work currently. :-)

Regards,
Wanpeng Li

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx170.postini.com [74.125.245.170]) by kanga.kvack.org (Postfix) with SMTP id 957166B0005 for ; Thu, 4 Apr 2013 20:17:07 -0400 (EDT)
Received: by mail-pa0-f45.google.com with SMTP id kl13so1750109pab.32 for ; Thu, 04 Apr 2013 17:17:06 -0700 (PDT)
Message-ID: <515E17FC.9050008@gmail.com>
Date: Fri, 05 Apr 2013 08:17:00 +0800
From: Simon Jeons
MIME-Version: 1.0
Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD
References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop>
In-Reply-To: <20130123075808.GH2723@blaptop>
Content-Type: multipart/alternative; boundary="------------090906050906080805000905"
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: Shaohua Li, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel

This is a multi-part message in MIME format.
--------------090906050906080805000905
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi Minchan,
On 01/23/2013 03:58 PM, Minchan Kim wrote:
> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
>> Hi,
>>
>> Because of high density, low power and low price, flash storage (SSD) is a good
>> candidate to partially replace DRAM. A quick answer for this is using SSD as
>> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
>> challenges to efficiently use SSD for swap:
> Many of below item could be applied in in-memory swap like zram, zcache.
>
>> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
>> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
>> overhead is very high even in a normal 2-socket machine.
>> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
>> which makes swap IO pattern is interleave. Block layer isn't always efficient
>> to do request merge. Such IO pattern also makes swap prefetch hard.
> Agreed.
>
>> 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is
>> very inefficient, especially if swap storage is fast.
> Agreed.
>
>> 5. SSD related optimization, mainly discard support
>> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
>> aren't always in LRU list adjacently, so page reclaim will not swap such pages
>> in adjacent storage sectors. This makes swap prefetch hard.
> One of problem is LRU churning and I wanted to try to fix it.
> http://marc.info/?l=linux-mm&m=130978831028952&w=4

I'm interested in this feature, why it didn't merged? what's the
fatal issue in your patchset?
http://lwn.net/Articles/449866/
You mentioned test script and all-at-once patch, but I can't get
them from the URL, could you tell me how to get it?

>
>> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>> Currently reclaim anonymous page is considering harder than reclaim file pages,
>> so we bias reclaiming file pages. If there are high speed swap storage, we are
>> considering doing swap more aggressively.
> Yeb. We need it. I tried it with extending vm_swappiness to 200.
>
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, 3 Dec 2012 16:21:00 +0900
> Subject: [PATCH] mm: increase swappiness to 200
>
> We have thought swap out cost is very high but it's not true
> if we use fast device like swap-over-zram. Nonetheless, we can
> swap out 1:1 ratio of anon and page cache at most.
> It's not enough to use swap device fully so we encounter OOM kill
> while there are many free space in zram swap device. It's never
> what we want.
>
> This patch makes swap out aggressively.
>
> Cc: Luigi Semenzato <semenzato@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  kernel/sysctl.c |    3 ++-
>  mm/vmscan.c     |    6 ++++--
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 693e0ed..f1dbd9d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
>  static int __maybe_unused three = 3;
>  static unsigned long one_ul = 1;
>  static int one_hundred = 100;
> +extern int max_swappiness;
>  #ifdef CONFIG_PRINTK
>  static int ten_thousand = 10000;
>  #endif
> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec_minmax,
>                 .extra1         = &zero,
> -               .extra2         = &one_hundred,
> +               .extra2         = &max_swappiness,
>         },
>  #ifdef CONFIG_HUGETLB_PAGE
>         {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 53dcde9..64f3c21 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -53,6 +53,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/vmscan.h>
>
> +int max_swappiness = 200;
> +
>  struct scan_control {
>         /* Incremented by the number of inactive pages that were scanned */
>         unsigned long nr_scanned;
> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc)
>         return mem_cgroup_swappiness(sc->target_mem_cgroup);
>  }
>
> +
>  /*
>   * Determine how aggressively the anon and file LRU lists should be
>   * scanned.  The relative value of each set of LRU lists is determined
> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>         }
>
>         /*
> -        * With swappiness at 100, anonymous and file have the same priority.
>          * This scanning priority is essentially the inverse of IO cost.
>          */
>         anon_prio = vmscan_swappiness(sc);
> -       file_prio = 200 - anon_prio;
> +       file_prio = max_swappiness - anon_prio;
>
>         /*
>          * OK, so we have swap space and a fair amount of page cache

--------------090906050906080805000905
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Hi Minchan,
On 01/23/2013 03:58 PM, Minchan Kim wrote:
> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
>> Hi,
>>
>> Because of high density, low power and low price, flash storage (SSD) is a good
>> candidate to partially replace DRAM. A quick answer for this is using SSD as
>> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
>> challenges to efficiently use SSD for swap:
> Many of below item could be applied in in-memory swap like zram, zcache.
>
>> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
>> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
>> overhead is very high even in a normal 2-socket machine.
>> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
>> which makes swap IO pattern is interleave. Block layer isn't always efficient
>> to do request merge. Such IO pattern also makes swap prefetch hard.
> Agreed.
>
>> 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is
>> very inefficient, especially if swap storage is fast.
> Agreed.
>
>> 5. SSD related optimization, mainly discard support
>> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
>> aren't always in LRU list adjacently, so page reclaim will not swap such pages
>> in adjacent storage sectors. This makes swap prefetch hard.
> One of problem is LRU churning and I wanted to try to fix it.
> http://marc.info/?l=linux-mm&m=130978831028952&w=4

I'm interested in this feature, why it didn't merged? what's the fatal issue in your patchset?
http://lwn.net/Articles/449866/
You mentioned test script and all-at-once patch, but I can't get them from the URL, could you tell me how to get it?

>
>> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>> Currently reclaim anonymous page is considering harder than reclaim file pages,
>> so we bias reclaiming file pages. If there are high speed swap storage, we are
>> considering doing swap more aggressively.
> Yeb. We need it. I tried it with extending vm_swappiness to 200.

From: Minchan Kim <minchan@kernel.org>
Date: Mon, 3 Dec 2012 16:21:00 +0900
Subject: [PATCH] mm: increase swappiness to 200

We have thought swap out cost is very high but it's not true
if we use fast device like swap-over-zram. Nonetheless, we can
swap out 1:1 ratio of anon and page cache at most.
It's not enough to use swap device fully so we encounter OOM kill
while there are many free space in zram swap device. It's never
what we want.

This patch makes swap out aggressively.

Cc: Luigi Semenzato <semenzato@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 kernel/sysctl.c |    3 ++-
 mm/vmscan.c     |    6 ++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 693e0ed..f1dbd9d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
 static int __maybe_unused three = 3;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
+extern int max_swappiness;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
 #endif
@@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &zero,
-               .extra2         = &one_hundred,
+               .extra2         = &max_swappiness,
        },
 #ifdef CONFIG_HUGETLB_PAGE
        {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 53dcde9..64f3c21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -53,6 +53,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+int max_swappiness = 200;
+
 struct scan_control {
        /* Incremented by the number of inactive pages that were scanned */
        unsigned long nr_scanned;
@@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc)
        return mem_cgroup_swappiness(sc->target_mem_cgroup);
 }
 
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
        }
 
        /*
-        * With swappiness at 100, anonymous and file have the same priority.
         * This scanning priority is essentially the inverse of IO cost.
         */
        anon_prio = vmscan_swappiness(sc);
-       file_prio = 200 - anon_prio;
+       file_prio = max_swappiness - anon_prio;
 
        /*
         * OK, so we have swap space and a fair amount of page cache
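
The priority arithmetic in the patch above can be looked at in isolation. What follows is a minimal standalone C sketch, not kernel code: `scan_prio_for` and `anon_share_percent` are hypothetical helper names introduced only for illustration, and the sketch assumes a simplified model that ignores the recent scanned/rotated feedback which get_scan_count() applies after computing anon_prio and file_prio. It only shows how the static anon/file split moves as swappiness approaches the ceiling.

```c
#include <assert.h>

/* Hypothetical illustration type, not a kernel structure. */
struct scan_prio {
	int anon;	/* priority given to the anonymous LRU lists */
	int file;	/* priority given to the file LRU lists */
};

/*
 * Mirrors the two assignments touched by the patch:
 *   anon_prio = swappiness;
 *   file_prio = max_swappiness - anon_prio;
 */
static inline struct scan_prio scan_prio_for(int swappiness, int max_swappiness)
{
	struct scan_prio p;

	assert(max_swappiness > 0);
	assert(swappiness >= 0 && swappiness <= max_swappiness);

	p.anon = swappiness;
	p.file = max_swappiness - swappiness;
	return p;
}

/*
 * Share of scan priority aimed at anonymous pages, in percent.
 * Simplifying assumption: no reclaim-history feedback is applied.
 */
static inline int anon_share_percent(int swappiness, int max_swappiness)
{
	struct scan_prio p = scan_prio_for(swappiness, max_swappiness);

	/* p.anon + p.file always equals max_swappiness, which is > 0. */
	return 100 * p.anon / (p.anon + p.file);
}
```

In this simplified model, a sysctl ceiling of 100 combined with file_prio = 200 - anon_prio means the anon share can never exceed 50%, which matches the "1:1 ratio of anon and page cache at most" in the commit message. Raising the ceiling to max_swappiness = 200 lets anon_share_percent(200, 200) reach 100.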

--------------090906050906080805000905--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156]) by kanga.kvack.org (Postfix) with SMTP id 1C8346B0027 for ; Fri, 5 Apr 2013 04:08:19 -0400 (EDT)
Date: Fri, 5 Apr 2013 17:08:17 +0900
From: Minchan Kim
Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD
Message-ID: <20130405080817.GC32126@blaptop>
References: <20130122065341.GA1850@kernel.org> <20130123075808.GH2723@blaptop> <515E17FC.9050008@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <515E17FC.9050008@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Shaohua Li, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Rik van Riel

On Fri, Apr 05, 2013 at 08:17:00AM +0800, Simon Jeons wrote:
> Hi Minchan,
> On 01/23/2013 03:58 PM, Minchan Kim wrote:
> >On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> >>Hi,
> >>
> >>Because of high density, low power and low price, flash storage (SSD) is a good
> >>candidate to partially replace DRAM. A quick answer for this is using SSD as
> >>swap. But Linux swap is designed for slow hard disk storage. There are a lot of
> >>challenges to efficiently use SSD for swap:
> >Many of below item could be applied in in-memory swap like zram, zcache.
> >
> >>1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> >>2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
> >>overhead is very high even in a normal 2-socket machine.
> >>3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> >>which makes swap IO pattern is interleave. Block layer isn't always efficient
> >>to do request merge. Such IO pattern also makes swap prefetch hard.
> >Agreed.
> >
> >>4. Swap map scan overhead. Swap in-memory map scan scans an array, which is
> >>very inefficient, especially if swap storage is fast.
> >Agreed.
> >
> >>5. SSD related optimization, mainly discard support
> >>6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
> >>aren't always in LRU list adjacently, so page reclaim will not swap such pages
> >>in adjacent storage sectors. This makes swap prefetch hard.
> >One of problem is LRU churning and I wanted to try to fix it.
> >http://marc.info/?l=linux-mm&m=130978831028952&w=4
>
> I'm interested in this feature, why it didn't merged? what's the
> fatal issue in your patchset?
> http://lwn.net/Articles/449866/

There wasn't any fatal issue, AFAIRC, but some people were concerned about
the balance between code complexity and benefit. The discussion dragged on
for a long time and I lost interest.

> You mentioned test script and all-at-once patch, but I can't get
> them from the URL, could you tell me how to get it?

You can google it and Google will find it in a few seconds.
http://www.filewatcher.com/b/ftp/ftp.cs.huji.ac.il/mirror/linux/kernel/linux/kernel/people/minchan/inorder_putback/v4-0.html

-- 
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx172.postini.com [74.125.245.172]) by kanga.kvack.org (Postfix) with SMTP id 9C90B6B0032 for ; Sun, 28 Apr 2013 04:12:16 -0400 (EDT)
Received: by mail-pa0-f46.google.com with SMTP id ld11so885650pab.19 for ; Sun, 28 Apr 2013 01:12:15 -0700 (PDT)
Message-ID: <517CD9DB.5010702@gmail.com>
Date: Sun, 28 Apr 2013 16:12:11 +0800
From: Simon Jeons
MIME-Version: 1.0
Subject: Re: [LSF/MM TOPIC]swap improvements for fast SSD
References: <20130122065341.GA1850@kernel.org>
In-Reply-To: <20130122065341.GA1850@kernel.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Shaohua Li
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Shaohua Li

Hi Shaohua,
On 01/22/2013 02:53 PM, Shaohua Li wrote:
> Hi,
>
> Because of high density, low power and low price, flash storage (SSD) is a good
> candidate to partially replace DRAM. A quick answer for this is using SSD as
> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
> challenges to efficiently use SSD for swap:
>
> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
> overhead is very high even in a normal 2-socket machine.

Why at least 2 TLB flushes instead of one?

> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> which makes swap IO pattern is interleave. Block layer isn't always efficient
> to do request merge. Such IO pattern also makes swap prefetch hard.
> 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is
> very inefficient, especially if swap storage is fast.
> 5. SSD related optimization, mainly discard support
> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
> aren't always in LRU list adjacently, so page reclaim will not swap such pages
> in adjacent storage sectors. This makes swap prefetch hard.
> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
> Currently reclaim anonymous page is considering harder than reclaim file pages,
> so we bias reclaiming file pages. If there are high speed swap storage, we are
> considering doing swap more aggressively.
> 8. Huge page swap. Huge page swap can solve a lot of problems above, but both
> THP and hugetlbfs don't support swap.
>
> I had some progresses in these areas recently:
> http://marc.info/?l=linux-mm&m=134665691021172&w=2
> http://marc.info/?l=linux-mm&m=135336039115191&w=2
> http://marc.info/?l=linux-mm&m=135882182225444&w=2
> http://marc.info/?l=linux-mm&m=135754636926984&w=2
> http://marc.info/?l=linux-mm&m=135754634526979&w=2
> But a lot of problems remain. I'd like to discuss the issues at the meeting.
>
> Thanks,
> Shaohua