From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx193.postini.com [74.125.245.193]) by kanga.kvack.org (Postfix) with SMTP id B3EC66B0032 for ; Sun, 18 Aug 2013 12:14:46 -0400 (EDT) Received: by mail-ob0-f193.google.com with SMTP id dn14so1614045obc.8 for ; Sun, 18 Aug 2013 09:14:45 -0700 (PDT) MIME-Version: 1.0 Date: Mon, 19 Aug 2013 00:14:45 +0800 Message-ID: Subject: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: sjenning@linux.vnet.ibm.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.magenheimer@oracle.com, bob.liu@oracle.com I found a few bugs in zswap when I review Linux-3.11-rc5, and I have also some questions about it, described as following: BUG: 1. A race condition when reclaim a page when a handle alloced from zbud, zbud considers this handle is used validly by upper(zswap) and can be a candidate for reclaim. But zswap has to initialize it such as setting swapentry and addding it to rbtree. so there is a race condition, such as: thread 0: obtain handle x from zbud_alloc thread 1: zbud_reclaim_page is called thread 1: callback zswap_writeback_entry to reclaim handle x thread 1: get swpentry from handle x (it is random value now) thread 1: bad thing may happen thread 0: initialize handle x with swapentry Of course, this situation almost never happen, it is a "theoretical race condition" issue. 2. Pollute swapcache data by add a pre-invalided swap page when a swap_entry is invalidated, it will be reused by other anon page. At the same time, zswap is reclaiming old page, pollute swapcache of new page as a result, because old page and new page use the same swap_entry, such as: thread 1: zswap reclaim entry x thread 0: zswap_frontswap_invalidate_page entry x thread 0: entry x reused by other anon page thread 1: add old data to swapcache of entry x thread 0: swapcache of entry x is polluted Of course, this situation almost never happen, it is another "theoretical race condition" issue. 3. Frontswap uses frontswap_map bitmap to track page in "backend" implementation, when zswap reclaim a page, the corresponding bitmap record is not cleared. 4. zswap_tree is not freed when swapoff, and it got re-kzalloc in swapon, memory leak occurs. questions: 1. How about SetPageReclaim befor __swap_writepage, so that move it to the tail of the inactive list? 2. zswap uses GFP_KERNEL flag to alloc things in store and reclaim function, does this lead to these function called recursively? 3. for reclaiming one zbud page which contains two buddies, zswap needs to alloc two pages. Does this reclaim cost-efficient? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 4CEC26B0033 for ; Sun, 18 Aug 2013 22:17:52 -0400 (EDT) Message-ID: <52118042.30101@oracle.com> Date: Mon, 19 Aug 2013 10:17:38 +0800 From: Bob Liu MIME-Version: 1.0 Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.magenheimer@oracle.com Hi Weijie, On 08/19/2013 12:14 AM, Weijie Yang wrote: > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have > also some questions about it, described as following: > > BUG: > 1. A race condition when reclaim a page > when a handle alloced from zbud, zbud considers this handle is used > validly by upper(zswap) and can be a candidate for reclaim. > But zswap has to initialize it such as setting swapentry and addding > it to rbtree. so there is a race condition, such as: > thread 0: obtain handle x from zbud_alloc > thread 1: zbud_reclaim_page is called > thread 1: callback zswap_writeback_entry to reclaim handle x > thread 1: get swpentry from handle x (it is random value now) > thread 1: bad thing may happen > thread 0: initialize handle x with swapentry Yes, this may happen potentially but in rare case. Because we have a LRU list for page frames, after Thread 0 called zbud_alloc the corresponding page will be add to the head of LRU list,While zbud_reclaim_page(Thread 1 called) is started from the tail of LRU list. > Of course, this situation almost never happen, it is a "theoretical > race condition" issue. > > 2. Pollute swapcache data by add a pre-invalided swap page > when a swap_entry is invalidated, it will be reused by other anon > page. At the same time, zswap is reclaiming old page, pollute > swapcache of new page as a result, because old page and new page use > the same swap_entry, such as: > thread 1: zswap reclaim entry x > thread 0: zswap_frontswap_invalidate_page entry x > thread 0: entry x reused by other anon page > thread 1: add old data to swapcache of entry x I didn't get your idea here, why thread1 will add old data to entry x? > thread 0: swapcache of entry x is polluted > Of course, this situation almost never happen, it is another > "theoretical race condition" issue. > > 3. Frontswap uses frontswap_map bitmap to track page in "backend" > implementation, when zswap reclaim a > page, the corresponding bitmap record is not cleared. > That's true, but I don't think it's a big problem. Only waste little time to search rbtree during zswap_frontswap_load(). > 4. zswap_tree is not freed when swapoff, and it got re-kzalloc in > swapon, memory leak occurs. Nice catch! I think it should be freed in zswap_frontswap_invalidate_area(). > > questions: > 1. How about SetPageReclaim befor __swap_writepage, so that move it to > the tail of the inactive list? It will be added to inactive now. > 2. zswap uses GFP_KERNEL flag to alloc things in store and reclaim > function, does this lead to these function called recursively? Yes, that's a potential problem. > 3. for reclaiming one zbud page which contains two buddies, zswap > needs to alloc two pages. Does this reclaim cost-efficient? > Yes, that's a problem too. And that's why we use zbud as the default allocator instead of zsmalloc. I think improving the write back path of zswap is the next important step for zswap. -- Regards, -Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx160.postini.com [74.125.245.160]) by kanga.kvack.org (Postfix) with SMTP id C21A96B0032 for ; Mon, 19 Aug 2013 01:47:19 -0400 (EDT) Date: Mon, 19 Aug 2013 14:47:42 +0900 From: Minchan Kim Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130819054742.GA28062@bbox> References: <52118042.30101@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <52118042.30101@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Bob Liu Cc: Weijie Yang , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.magenheimer@oracle.com On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote: > Hi Weijie, > > On 08/19/2013 12:14 AM, Weijie Yang wrote: > > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have > > also some questions about it, described as following: > > > > BUG: > > 1. A race condition when reclaim a page > > when a handle alloced from zbud, zbud considers this handle is used > > validly by upper(zswap) and can be a candidate for reclaim. > > But zswap has to initialize it such as setting swapentry and addding > > it to rbtree. so there is a race condition, such as: > > thread 0: obtain handle x from zbud_alloc > > thread 1: zbud_reclaim_page is called > > thread 1: callback zswap_writeback_entry to reclaim handle x > > thread 1: get swpentry from handle x (it is random value now) > > thread 1: bad thing may happen > > thread 0: initialize handle x with swapentry Nice catch! > > Yes, this may happen potentially but in rare case. > Because we have a LRU list for page frames, after Thread 0 called > zbud_alloc the corresponding page will be add to the head of LRU > list,While zbud_reclaim_page(Thread 1 called) is started from the tail > of LRU list. > > > Of course, this situation almost never happen, it is a "theoretical > > race condition" issue. But it's doable and we should prevent that although you feel it's rare because system could go hang. When I look at the code, Why should zbud have LRU logic instead of zswap? If I missed some history, sorry about that. But at least to me, zbud is just allocator so it should have a role to handle alloc/free object and how client of the allocator uses objects depends on the upper layer so zbud should handle LRU. If so, we wouldn't encounter this problem, either. > > > > 2. Pollute swapcache data by add a pre-invalided swap page > > when a swap_entry is invalidated, it will be reused by other anon > > page. At the same time, zswap is reclaiming old page, pollute > > swapcache of new page as a result, because old page and new page use > > the same swap_entry, such as: > > thread 1: zswap reclaim entry x > > thread 0: zswap_frontswap_invalidate_page entry x > > thread 0: entry x reused by other anon page > > thread 1: add old data to swapcache of entry x > > I didn't get your idea here, why thread1 will add old data to entry x? > > > thread 0: swapcache of entry x is polluted > > Of course, this situation almost never happen, it is another > > "theoretical race condition" issue. Don't swapcache_prepare close the race? > > > > 3. Frontswap uses frontswap_map bitmap to track page in "backend" > > implementation, when zswap reclaim a > > page, the corresponding bitmap record is not cleared. > > > > That's true, but I don't think it's a big problem. > Only waste little time to search rbtree during zswap_frontswap_load(). > > > 4. zswap_tree is not freed when swapoff, and it got re-kzalloc in > > swapon, memory leak occurs. > > Nice catch! I think it should be freed in zswap_frontswap_invalidate_area(). > > > > > questions: > > 1. How about SetPageReclaim befor __swap_writepage, so that move it to > > the tail of the inactive list? It's a good idea to avoid unnecessary page scanning. > > It will be added to inactive now. > > > 2. zswap uses GFP_KERNEL flag to alloc things in store and reclaim > > function, does this lead to these function called recursively? > > Yes, that's a potential problem. It should use GFP_NOIO. > > > 3. for reclaiming one zbud page which contains two buddies, zswap > > needs to alloc two pages. Does this reclaim cost-efficient? It would be better to evict zpage which is a compressed sequence of PAGE_SIZE bytes rather than decompresesed PAGE_SIZE bytes a page when we are about to reclaim the page but it's hard part from frontswap API. > > > > Yes, that's a problem too. And that's why we use zbud as the default > allocator instead of zsmalloc. > I think improving the write back path of zswap is the next important > step for zswap. > > -- > Regards, > -Bob > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx189.postini.com [74.125.245.189]) by kanga.kvack.org (Postfix) with SMTP id CFB9D6B0032 for ; Tue, 20 Aug 2013 11:22:15 -0400 (EDT) Received: by mail-ob0-f180.google.com with SMTP id up14so807528obb.39 for ; Tue, 20 Aug 2013 08:22:14 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130819054742.GA28062@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> Date: Tue, 20 Aug 2013 23:22:14 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Bob Liu , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org 2013/8/19 Minchan Kim : > On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote: >> Hi Weijie, >> >> On 08/19/2013 12:14 AM, Weijie Yang wrote: >> > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have >> > also some questions about it, described as following: >> > >> > BUG: >> > 1. A race condition when reclaim a page >> > when a handle alloced from zbud, zbud considers this handle is used >> > validly by upper(zswap) and can be a candidate for reclaim. >> > But zswap has to initialize it such as setting swapentry and addding >> > it to rbtree. so there is a race condition, such as: >> > thread 0: obtain handle x from zbud_alloc >> > thread 1: zbud_reclaim_page is called >> > thread 1: callback zswap_writeback_entry to reclaim handle x >> > thread 1: get swpentry from handle x (it is random value now) >> > thread 1: bad thing may happen >> > thread 0: initialize handle x with swapentry > > Nice catch! > >> >> Yes, this may happen potentially but in rare case. >> Because we have a LRU list for page frames, after Thread 0 called >> zbud_alloc the corresponding page will be add to the head of LRU >> list,While zbud_reclaim_page(Thread 1 called) is started from the tail >> of LRU list. >> >> > Of course, this situation almost never happen, it is a "theoretical >> > race condition" issue. > > But it's doable and we should prevent that although you feel it's rare > because system could go hang. When I look at the code, Why should zbud > have LRU logic instead of zswap? If I missed some history, sorry about that. > But at least to me, zbud is just allocator so it should have a role > to handle alloc/free object and how client of the allocator uses objects > depends on the upper layer so zbud should handle LRU. If so, we wouldn't > encounter this problem, either. > >> > >> > 2. Pollute swapcache data by add a pre-invalided swap page >> > when a swap_entry is invalidated, it will be reused by other anon >> > page. At the same time, zswap is reclaiming old page, pollute >> > swapcache of new page as a result, because old page and new page use >> > the same swap_entry, such as: >> > thread 1: zswap reclaim entry x >> > thread 0: zswap_frontswap_invalidate_page entry x >> > thread 0: entry x reused by other anon page >> > thread 1: add old data to swapcache of entry x >> >> I didn't get your idea here, why thread1 will add old data to entry x? >> >> > thread 0: swapcache of entry x is polluted >> > Of course, this situation almost never happen, it is another >> > "theoretical race condition" issue. > > Don't swapcache_prepare close the race? > Yes, I made a mistake, there is not a race here. However, I find another bug here after my more careful review. It is not only "theoretical", it will happen really. as: thread 1: zswap reclaim entry x (get the refcount, but not call zswap_get_swap_cache_page yet) thread 0: zswap_frontswap_invalidate_page entry x (finished, entry x and its zbud is not freed as its refcount != 0) now, the swap_map[x] = 0 thread 1: zswap_get_swap_cache_page called, swapcache_prepare return -ENOENT because entry x is not used any more zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM zswap_writeback_entry do nothing except put refcount now, the memory of zswap_entry x leaks and its zpage become a zombie Best Regards, Weijie Yang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx130.postini.com [74.125.245.130]) by kanga.kvack.org (Postfix) with SMTP id B0F676B0034 for ; Tue, 20 Aug 2013 11:30:30 -0400 (EDT) Received: by mail-oa0-f49.google.com with SMTP id n10so1020394oag.8 for ; Tue, 20 Aug 2013 08:30:29 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130819054742.GA28062@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> Date: Tue, 20 Aug 2013 23:30:29 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Bob Liu , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org 2013/8/19 Minchan Kim : > On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote: >> Hi Weijie, >> >> On 08/19/2013 12:14 AM, Weijie Yang wrote: >> > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have >> > also some questions about it, described as following: >> > >> > BUG: >> > 1. A race condition when reclaim a page >> > when a handle alloced from zbud, zbud considers this handle is used >> > validly by upper(zswap) and can be a candidate for reclaim. >> > But zswap has to initialize it such as setting swapentry and addding >> > it to rbtree. so there is a race condition, such as: >> > thread 0: obtain handle x from zbud_alloc >> > thread 1: zbud_reclaim_page is called >> > thread 1: callback zswap_writeback_entry to reclaim handle x >> > thread 1: get swpentry from handle x (it is random value now) >> > thread 1: bad thing may happen >> > thread 0: initialize handle x with swapentry > > Nice catch! > >> >> Yes, this may happen potentially but in rare case. >> Because we have a LRU list for page frames, after Thread 0 called >> zbud_alloc the corresponding page will be add to the head of LRU >> list,While zbud_reclaim_page(Thread 1 called) is started from the tail >> of LRU list. >> >> > Of course, this situation almost never happen, it is a "theoretical >> > race condition" issue. > > But it's doable and we should prevent that although you feel it's rare > because system could go hang. When I look at the code, Why should zbud > have LRU logic instead of zswap? If I missed some history, sorry about that. > But at least to me, zbud is just allocator so it should have a role > to handle alloc/free object and how client of the allocator uses objects > depends on the upper layer so zbud should handle LRU. If so, we wouldn't > encounter this problem, either. > >> > >> > 2. Pollute swapcache data by add a pre-invalided swap page >> > when a swap_entry is invalidated, it will be reused by other anon >> > page. At the same time, zswap is reclaiming old page, pollute >> > swapcache of new page as a result, because old page and new page use >> > the same swap_entry, such as: >> > thread 1: zswap reclaim entry x >> > thread 0: zswap_frontswap_invalidate_page entry x >> > thread 0: entry x reused by other anon page >> > thread 1: add old data to swapcache of entry x >> >> I didn't get your idea here, why thread1 will add old data to entry x? >> >> > thread 0: swapcache of entry x is polluted >> > Of course, this situation almost never happen, it is another >> > "theoretical race condition" issue. > > Don't swapcache_prepare close the race? Yes, I made a mistake, there is not a race here. However, I find another bug here after my more careful review. It is not only "theoretical", it will happen really. as: thread 1: zswap reclaim entry x (get the refcount, but not call zswap_get_swap_cache_page yet) thread 0: zswap_frontswap_invalidate_page entry x (finished, entry x and its zbud is not freed as its refcount != 0) now, the swap_map[x] = 0 thread 1: zswap_get_swap_cache_page called, swapcache_prepare return -ENOENT because entry x is not used any more zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM zswap_writeback_entry do nothing except put refcount now, the memory of zswap_entry x leaks and its zpage become a zombie Best Regards, Weijie Yang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 62EEE6B0034 for ; Wed, 21 Aug 2013 03:49:11 -0400 (EDT) Date: Wed, 21 Aug 2013 16:49:39 +0900 From: Minchan Kim Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130821074939.GE3022@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: Bob Liu , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Hello, On Tue, Aug 20, 2013 at 11:30:29PM +0800, Weijie Yang wrote: > 2013/8/19 Minchan Kim : > > On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote: > >> Hi Weijie, > >> > >> On 08/19/2013 12:14 AM, Weijie Yang wrote: > >> > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have > >> > also some questions about it, described as following: > >> > > >> > BUG: > >> > 1. A race condition when reclaim a page > >> > when a handle alloced from zbud, zbud considers this handle is used > >> > validly by upper(zswap) and can be a candidate for reclaim. > >> > But zswap has to initialize it such as setting swapentry and addding > >> > it to rbtree. so there is a race condition, such as: > >> > thread 0: obtain handle x from zbud_alloc > >> > thread 1: zbud_reclaim_page is called > >> > thread 1: callback zswap_writeback_entry to reclaim handle x > >> > thread 1: get swpentry from handle x (it is random value now) > >> > thread 1: bad thing may happen > >> > thread 0: initialize handle x with swapentry > > > > Nice catch! > > > >> > >> Yes, this may happen potentially but in rare case. > >> Because we have a LRU list for page frames, after Thread 0 called > >> zbud_alloc the corresponding page will be add to the head of LRU > >> list,While zbud_reclaim_page(Thread 1 called) is started from the tail > >> of LRU list. > >> > >> > Of course, this situation almost never happen, it is a "theoretical > >> > race condition" issue. > > > > But it's doable and we should prevent that although you feel it's rare > > because system could go hang. When I look at the code, Why should zbud > > have LRU logic instead of zswap? If I missed some history, sorry about that. > > But at least to me, zbud is just allocator so it should have a role > > to handle alloc/free object and how client of the allocator uses objects > > depends on the upper layer so zbud should handle LRU. If so, we wouldn't > > encounter this problem, either. > > > >> > > >> > 2. Pollute swapcache data by add a pre-invalided swap page > >> > when a swap_entry is invalidated, it will be reused by other anon > >> > page. At the same time, zswap is reclaiming old page, pollute > >> > swapcache of new page as a result, because old page and new page use > >> > the same swap_entry, such as: > >> > thread 1: zswap reclaim entry x > >> > thread 0: zswap_frontswap_invalidate_page entry x > >> > thread 0: entry x reused by other anon page > >> > thread 1: add old data to swapcache of entry x > >> > >> I didn't get your idea here, why thread1 will add old data to entry x? > >> > >> > thread 0: swapcache of entry x is polluted > >> > Of course, this situation almost never happen, it is another > >> > "theoretical race condition" issue. > > > > Don't swapcache_prepare close the race? > > Yes, I made a mistake, there is not a race here. > However, I find another bug here after my more careful review. It is > not only "theoretical", it will happen really. as: > thread 1: zswap reclaim entry x (get the refcount, but not call > zswap_get_swap_cache_page yet) > thread 0: zswap_frontswap_invalidate_page entry x (finished, entry x > and its zbud is not freed as its refcount != 0) > now, the swap_map[x] = 0 > thread 1: zswap_get_swap_cache_page called, swapcache_prepare return > -ENOENT because entry x is not used any more > zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM > zswap_writeback_entry do nothing except put refcount > now, the memory of zswap_entry x leaks and its zpage become a zombie It makes sense to me. Maybe we should introduce ZSWAP_SWAPCACHE_NOENT and free the entry in the case. Would you mind to send patches on the problems you found? > > Best Regards, > Weijie Yang > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx130.postini.com [74.125.245.130]) by kanga.kvack.org (Postfix) with SMTP id 648636B009F for ; Wed, 21 Aug 2013 08:11:36 -0400 (EDT) Message-ID: <5214AE6B.9030804@oracle.com> Date: Wed, 21 Aug 2013 20:11:23 +0800 From: Bob Liu MIME-Version: 1.0 Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: Minchan Kim , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org On 08/20/2013 11:22 PM, Weijie Yang wrote: > 2013/8/19 Minchan Kim : >> On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote: >>> Hi Weijie, >>> >>> On 08/19/2013 12:14 AM, Weijie Yang wrote: >>>> I found a few bugs in zswap when I review Linux-3.11-rc5, and I have >>>> also some questions about it, described as following: >>>> >>>> BUG: >>>> 1. A race condition when reclaim a page >>>> when a handle alloced from zbud, zbud considers this handle is used >>>> validly by upper(zswap) and can be a candidate for reclaim. >>>> But zswap has to initialize it such as setting swapentry and addding >>>> it to rbtree. so there is a race condition, such as: >>>> thread 0: obtain handle x from zbud_alloc >>>> thread 1: zbud_reclaim_page is called >>>> thread 1: callback zswap_writeback_entry to reclaim handle x >>>> thread 1: get swpentry from handle x (it is random value now) >>>> thread 1: bad thing may happen >>>> thread 0: initialize handle x with swapentry >> >> Nice catch! >> >>> >>> Yes, this may happen potentially but in rare case. >>> Because we have a LRU list for page frames, after Thread 0 called >>> zbud_alloc the corresponding page will be add to the head of LRU >>> list,While zbud_reclaim_page(Thread 1 called) is started from the tail >>> of LRU list. >>> >>>> Of course, this situation almost never happen, it is a "theoretical >>>> race condition" issue. >> >> But it's doable and we should prevent that although you feel it's rare >> because system could go hang. When I look at the code, Why should zbud >> have LRU logic instead of zswap? If I missed some history, sorry about that. >> But at least to me, zbud is just allocator so it should have a role >> to handle alloc/free object and how client of the allocator uses objects >> depends on the upper layer so zbud should handle LRU. If so, we wouldn't >> encounter this problem, either. >> >>>> >>>> 2. Pollute swapcache data by add a pre-invalided swap page >>>> when a swap_entry is invalidated, it will be reused by other anon >>>> page. At the same time, zswap is reclaiming old page, pollute >>>> swapcache of new page as a result, because old page and new page use >>>> the same swap_entry, such as: >>>> thread 1: zswap reclaim entry x >>>> thread 0: zswap_frontswap_invalidate_page entry x >>>> thread 0: entry x reused by other anon page >>>> thread 1: add old data to swapcache of entry x >>> >>> I didn't get your idea here, why thread1 will add old data to entry x? >>> >>>> thread 0: swapcache of entry x is polluted >>>> Of course, this situation almost never happen, it is another >>>> "theoretical race condition" issue. >> >> Don't swapcache_prepare close the race? >> > > Yes, I made a mistake, there is not a race here. > However, I find another bug here after my more careful review. It is > not only "theoretical", it will happen really. as: > thread 1: zswap reclaim entry x (get the refcount, but not call > zswap_get_swap_cache_page yet) > thread 0: zswap_frontswap_invalidate_page entry x (finished, entry x > and its zbud is not freed as its refcount != 0) > now, the swap_map[x] = 0 > thread 1: zswap_get_swap_cache_page called, swapcache_prepare return > -ENOENT because entry x is not used any more > zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM > zswap_writeback_entry do nothing except put refcount > now, the memory of zswap_entry x leaks and its zpage become a zombie > Nice catch! How about fix like this? @@ -612,7 +612,10 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle) fail: spin_lock(&tree->lock); - zswap_entry_put(entry); + refcount = zswap_entry_put(entry); + if (refcount <= 0) + /* invalidate happened */ + zswap_free_entry(tree, entry); spin_unlock(&tree->lock); return ret; -- Regards, -Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f182.google.com (mail-pd0-f182.google.com [209.85.192.182]) by kanga.kvack.org (Postfix) with ESMTP id DD6C76B0031 for ; Wed, 25 Sep 2013 04:09:07 -0400 (EDT) Received: by mail-pd0-f182.google.com with SMTP id r10so5797563pdi.27 for ; Wed, 25 Sep 2013 01:09:07 -0700 (PDT) Received: by mail-ie0-f180.google.com with SMTP id u16so10338518iet.11 for ; Wed, 25 Sep 2013 01:09:05 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130821074939.GE3022@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Wed, 25 Sep 2013 16:09:04 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Bob Liu , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org I think I find a new issue, for integrity of this mail thread, I reply to this mail. It is a concurrence issue either, when duplicate store and reclaim concurrentlly. zswap entry x with offset A is already stored in zswap backend. Consider the following scenario: thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) thread 1: store new page with the same offset A, alloc a new zswap entry y. store finished. shrink_page_list() call __remove_mapping(), and now it is not in swap_cache thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache Now, swap cache has old data rather than new data for offset A. error will happen If do_swap_page() get page from swap_cache. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f48.google.com (mail-pb0-f48.google.com [209.85.160.48]) by kanga.kvack.org (Postfix) with ESMTP id A6ACD6B0031 for ; Wed, 25 Sep 2013 04:31:11 -0400 (EDT) Received: by mail-pb0-f48.google.com with SMTP id ma3so5718175pbc.35 for ; Wed, 25 Sep 2013 01:31:11 -0700 (PDT) Received: by mail-ve0-f181.google.com with SMTP id oy12so4516570veb.12 for ; Wed, 25 Sep 2013 01:31:08 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Wed, 25 Sep 2013 16:31:08 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Bob Liu Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: > I think I find a new issue, for integrity of this mail thread, I reply > to this mail. > > It is a concurrence issue either, when duplicate store and reclaim > concurrentlly. > > zswap entry x with offset A is already stored in zswap backend. > Consider the following scenario: > > thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) > > thread 1: store new page with the same offset A, alloc a new zswap entry y. > store finished. shrink_page_list() call __remove_mapping(), and now > it is not in swap_cache > But I don't think swap layer will call zswap with the same offset A. > thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache > > Now, swap cache has old data rather than new data for offset A. > error will happen If do_swap_page() get page from swap_cache. > -- Regards, --Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f46.google.com (mail-pa0-f46.google.com [209.85.220.46]) by kanga.kvack.org (Postfix) with ESMTP id AC91C6B0031 for ; Wed, 25 Sep 2013 05:33:46 -0400 (EDT) Received: by mail-pa0-f46.google.com with SMTP id fa1so6246213pad.33 for ; Wed, 25 Sep 2013 02:33:46 -0700 (PDT) Received: by mail-ie0-f173.google.com with SMTP id ar20so10430626iec.18 for ; Wed, 25 Sep 2013 02:33:44 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Wed, 25 Sep 2013 17:33:43 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Bob Liu Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >> I think I find a new issue, for integrity of this mail thread, I reply >> to this mail. >> >> It is a concurrence issue either, when duplicate store and reclaim >> concurrentlly. >> >> zswap entry x with offset A is already stored in zswap backend. >> Consider the following scenario: >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. >> store finished. shrink_page_list() call __remove_mapping(), and now >> it is not in swap_cache >> > > But I don't think swap layer will call zswap with the same offset A. 1. store page of offset A in zswap 2. some time later, pagefault occur, load page data from zswap. But notice that zswap entry x is still in zswap because it is not frontswap_tmem_exclusive_gets_enabled. this page is with PageSwapCache(page) and page_private(page) = entry.val 3. change this page data, and it become dirty 4. some time later again, swap this page on the same offset A. so, a duplicate store happens. what I can think is that use flags and CAS to protect store and reclaim on the same offset happens concurrentlly. >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache >> >> Now, swap cache has old data rather than new data for offset A. >> error will happen If do_swap_page() get page from swap_cache. >> > > -- > Regards, > --Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id 2E16F6B0031 for ; Wed, 25 Sep 2013 06:02:18 -0400 (EDT) Received: by mail-pa0-f53.google.com with SMTP id kq14so5005481pab.12 for ; Wed, 25 Sep 2013 03:02:17 -0700 (PDT) Received: by mail-ve0-f171.google.com with SMTP id pa12so4623916veb.30 for ; Wed, 25 Sep 2013 03:02:15 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Wed, 25 Sep 2013 18:02:15 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Bob Liu Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel On Wed, Sep 25, 2013 at 5:33 PM, Weijie Yang wrote: > On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >>> I think I find a new issue, for integrity of this mail thread, I reply >>> to this mail. >>> >>> It is a concurrence issue either, when duplicate store and reclaim >>> concurrentlly. >>> >>> zswap entry x with offset A is already stored in zswap backend. >>> Consider the following scenario: >>> >>> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >>> >>> thread 1: store new page with the same offset A, alloc a new zswap entry y. >>> store finished. shrink_page_list() call __remove_mapping(), and now >>> it is not in swap_cache >>> >> >> But I don't think swap layer will call zswap with the same offset A. > > 1. store page of offset A in zswap > 2. some time later, pagefault occur, load page data from zswap. > But notice that zswap entry x is still in zswap because it is not Sorry I didn't notice that zswap_frontswap_load() doesn't call rb_erase(). > frontswap_tmem_exclusive_gets_enabled. > this page is with PageSwapCache(page) and page_private(page) = entry.val > 3. change this page data, and it become dirty > 4. some time later again, swap this page on the same offset A. > > so, a duplicate store happens. > Then I think we should erase the entry from rbtree in zswap_frontswap_load(). After the page is decompressed and loaded from zswap, still storing the compressed data in zswap is meanless. -- Regards, --Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f44.google.com (mail-pb0-f44.google.com [209.85.160.44]) by kanga.kvack.org (Postfix) with ESMTP id B1D566B0031 for ; Wed, 25 Sep 2013 22:06:42 -0400 (EDT) Received: by mail-pb0-f44.google.com with SMTP id xa7so458069pbc.17 for ; Wed, 25 Sep 2013 19:06:42 -0700 (PDT) Received: by mail-ie0-f169.google.com with SMTP id tp5so612859ieb.0 for ; Wed, 25 Sep 2013 19:06:40 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Thu, 26 Sep 2013 10:06:39 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Bob Liu Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel On Wed, Sep 25, 2013 at 6:02 PM, Bob Liu wrote: > On Wed, Sep 25, 2013 at 5:33 PM, Weijie Yang wrote: >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >>> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >>>> I think I find a new issue, for integrity of this mail thread, I reply >>>> to this mail. >>>> >>>> It is a concurrence issue either, when duplicate store and reclaim >>>> concurrentlly. >>>> >>>> zswap entry x with offset A is already stored in zswap backend. >>>> Consider the following scenario: >>>> >>>> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >>>> >>>> thread 1: store new page with the same offset A, alloc a new zswap entry y. >>>> store finished. shrink_page_list() call __remove_mapping(), and now >>>> it is not in swap_cache >>>> >>> >>> But I don't think swap layer will call zswap with the same offset A. >> >> 1. store page of offset A in zswap >> 2. some time later, pagefault occur, load page data from zswap. >> But notice that zswap entry x is still in zswap because it is not > > Sorry I didn't notice that zswap_frontswap_load() doesn't call rb_erase(). > >> frontswap_tmem_exclusive_gets_enabled. >> this page is with PageSwapCache(page) and page_private(page) = entry.val >> 3. change this page data, and it become dirty >> 4. some time later again, swap this page on the same offset A. >> >> so, a duplicate store happens. >> > > Then I think we should erase the entry from rbtree in zswap_frontswap_load(). > After the page is decompressed and loaded from zswap, still storing > the compressed data in zswap is meanless. Of cause, erasing the entry after load() can resolve this. However, this problem is not simple and interesting. If we drop the entry after load(), we should SetPageDirty, it will generate more swap even if we don't change this page data. I think that is why frontswap has two load mode: default(used now) and exclusive_gets. I don't have test data, but I think which mode is better is decided by corresponding workload. It's better to realize these two modes, I will try it. > -- > Regards, > --Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f44.google.com (mail-pa0-f44.google.com [209.85.220.44]) by kanga.kvack.org (Postfix) with ESMTP id 20D706B0031 for ; Thu, 26 Sep 2013 01:57:31 -0400 (EDT) Received: by mail-pa0-f44.google.com with SMTP id lf10so830780pab.17 for ; Wed, 25 Sep 2013 22:57:30 -0700 (PDT) Date: Thu, 26 Sep 2013 14:58:02 +0900 From: Minchan Kim Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130926055802.GA20634@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Hello Weigie, On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: > On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: > > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: > >> I think I find a new issue, for integrity of this mail thread, I reply > >> to this mail. > >> > >> It is a concurrence issue either, when duplicate store and reclaim > >> concurrentlly. > >> > >> zswap entry x with offset A is already stored in zswap backend. > >> Consider the following scenario: > >> > >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) > >> > >> thread 1: store new page with the same offset A, alloc a new zswap entry y. > >> store finished. shrink_page_list() call __remove_mapping(), and now > >> it is not in swap_cache > >> > > > > But I don't think swap layer will call zswap with the same offset A. > > 1. store page of offset A in zswap > 2. some time later, pagefault occur, load page data from zswap. > But notice that zswap entry x is still in zswap because it is not > frontswap_tmem_exclusive_gets_enabled. frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff between CPU burining by frequent swapout and memory footprint by duplicate copy in swap cache and frontswap backend so it shouldn't affect the stability. > this page is with PageSwapCache(page) and page_private(page) = entry.val > 3. change this page data, and it become dirty If non-shared swapin page become redirty, it should remove the page from swapcache. If shared swapin page become redirty, it should do CoW so it's a new page so that it doesn't live in swap cache. It means it should have new offset which is different with old's one for swap out. What's wrong with that? > 4. some time later again, swap this page on the same offset A. > > so, a duplicate store happens. > > what I can think is that use flags and CAS to protect store and reclaim on > the same offset happens concurrentlly. > > >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache > >> > >> Now, swap cache has old data rather than new data for offset A. > >> error will happen If do_swap_page() get page from swap_cache. > >> > > > > -- > > Regards, > > --Bob > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f50.google.com (mail-pb0-f50.google.com [209.85.160.50]) by kanga.kvack.org (Postfix) with ESMTP id 5A43A6B0031 for ; Thu, 26 Sep 2013 03:26:36 -0400 (EDT) Received: by mail-pb0-f50.google.com with SMTP id uo5so748545pbc.23 for ; Thu, 26 Sep 2013 00:26:36 -0700 (PDT) Received: by mail-ie0-f172.google.com with SMTP id x13so893190ief.3 for ; Thu, 26 Sep 2013 00:26:33 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130926055802.GA20634@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> Date: Thu, 26 Sep 2013 15:26:33 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: > Hello Weigie, > > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >> >> I think I find a new issue, for integrity of this mail thread, I reply >> >> to this mail. >> >> >> >> It is a concurrence issue either, when duplicate store and reclaim >> >> concurrentlly. >> >> >> >> zswap entry x with offset A is already stored in zswap backend. >> >> Consider the following scenario: >> >> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >> >> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. >> >> store finished. shrink_page_list() call __remove_mapping(), and now >> >> it is not in swap_cache >> >> >> > >> > But I don't think swap layer will call zswap with the same offset A. >> >> 1. store page of offset A in zswap >> 2. some time later, pagefault occur, load page data from zswap. >> But notice that zswap entry x is still in zswap because it is not >> frontswap_tmem_exclusive_gets_enabled. > > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff > between CPU burining by frequent swapout and memory footprint by duplicate > copy in swap cache and frontswap backend so it shouldn't affect the stability. Thanks for explain this. I don't mean to say this option affects the stability, but that zswap only realize one option. Maybe it's better to realize both options for different workloads. >> this page is with PageSwapCache(page) and page_private(page) = entry.val >> 3. change this page data, and it become dirty > > If non-shared swapin page become redirty, it should remove the page from > swapcache. If shared swapin page become redirty, it should do CoW so it's a > new page so that it doesn't live in swap cache. It means it should have new > offset which is different with old's one for swap out. > > What's wrong with that? It is really not a right scene for duplicate store. And I can not think out one. If duplicate store is impossible, How about delete the handle code in zswap? If it does exist, I think there is a potential issue as I described. >> 4. some time later again, swap this page on the same offset A. >> >> so, a duplicate store happens. >> >> what I can think is that use flags and CAS to protect store and reclaim on >> the same offset happens concurrentlly. >> >> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache >> >> >> >> Now, swap cache has old data rather than new data for offset A. >> >> error will happen If do_swap_page() get page from swap_cache. >> >> >> > >> > -- >> > Regards, >> > --Bob >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: email@kvack.org > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id 17AD46B0031 for ; Thu, 26 Sep 2013 03:56:54 -0400 (EDT) Received: by mail-pa0-f53.google.com with SMTP id kq14so965173pab.12 for ; Thu, 26 Sep 2013 00:56:53 -0700 (PDT) Date: Thu, 26 Sep 2013 16:57:25 +0900 From: Minchan Kim Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130926075725.GA22339@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote: > On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: > > Hello Weigie, > > > > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: > >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: > >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: > >> >> I think I find a new issue, for integrity of this mail thread, I reply > >> >> to this mail. > >> >> > >> >> It is a concurrence issue either, when duplicate store and reclaim > >> >> concurrentlly. > >> >> > >> >> zswap entry x with offset A is already stored in zswap backend. > >> >> Consider the following scenario: > >> >> > >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) > >> >> > >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. > >> >> store finished. shrink_page_list() call __remove_mapping(), and now > >> >> it is not in swap_cache > >> >> > >> > > >> > But I don't think swap layer will call zswap with the same offset A. > >> > >> 1. store page of offset A in zswap > >> 2. some time later, pagefault occur, load page data from zswap. > >> But notice that zswap entry x is still in zswap because it is not > >> frontswap_tmem_exclusive_gets_enabled. > > > > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff > > between CPU burining by frequent swapout and memory footprint by duplicate > > copy in swap cache and frontswap backend so it shouldn't affect the stability. > > Thanks for explain this. > I don't mean to say this option affects the stability, but that zswap > only realize > one option. Maybe it's better to realize both options for different workloads. "zswap only relize one option" What does it mena? Sorry. I couldn't parse your intention. :) You mean zswap should do something special to support frontswap_tmem_exclusive_gets? > > >> this page is with PageSwapCache(page) and page_private(page) = entry.val > >> 3. change this page data, and it become dirty > > > > If non-shared swapin page become redirty, it should remove the page from > > swapcache. If shared swapin page become redirty, it should do CoW so it's a > > new page so that it doesn't live in swap cache. It means it should have new > > offset which is different with old's one for swap out. > > > > What's wrong with that? > > It is really not a right scene for duplicate store. And I can not think out one. > If duplicate store is impossible, How about delete the handle code in zswap? > If it does exist, I think there is a potential issue as I described. You mean "zswap_duplicate_entry"? AFAIR, I already had a question to Seth when zswap was born but AFAIRC, he said that he didn't know exact reason but he saw that case during experiement so copy the code peice from zcache. Do you see the case, too? Anyway, we need to dive into that to know what happens and then open our eyes for clear solution before dumping meaningless patch. I hope Seth or Bob already know it. > > >> 4. some time later again, swap this page on the same offset A. > >> > >> so, a duplicate store happens. > >> > >> what I can think is that use flags and CAS to protect store and reclaim on > >> the same offset happens concurrentlly. > >> > >> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache > >> >> > >> >> Now, swap cache has old data rather than new data for offset A. > >> >> error will happen If do_swap_page() get page from swap_cache. > >> >> > >> > > >> > -- > >> > Regards, > >> > --Bob > >> > >> -- > >> To unsubscribe, send a message with 'unsubscribe linux-mm' in > >> the body to majordomo@kvack.org. For more info on Linux MM, > >> see: http://www.linux-mm.org/ . > >> Don't email: email@kvack.org > > > > -- > > Kind regards, > > Minchan Kim > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f45.google.com (mail-pa0-f45.google.com [209.85.220.45]) by kanga.kvack.org (Postfix) with ESMTP id F02A46B0031 for ; Thu, 26 Sep 2013 04:48:08 -0400 (EDT) Received: by mail-pa0-f45.google.com with SMTP id rd3so1010951pab.18 for ; Thu, 26 Sep 2013 01:48:08 -0700 (PDT) Received: by mail-ie0-f169.google.com with SMTP id tp5so987281ieb.28 for ; Thu, 26 Sep 2013 01:48:04 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130926075725.GA22339@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> <20130926075725.GA22339@bbox> Date: Thu, 26 Sep 2013 16:48:03 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim wrote: > On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote: >> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: >> > Hello Weigie, >> > >> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: >> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >> >> >> I think I find a new issue, for integrity of this mail thread, I reply >> >> >> to this mail. >> >> >> >> >> >> It is a concurrence issue either, when duplicate store and reclaim >> >> >> concurrentlly. >> >> >> >> >> >> zswap entry x with offset A is already stored in zswap backend. >> >> >> Consider the following scenario: >> >> >> >> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >> >> >> >> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. >> >> >> store finished. shrink_page_list() call __remove_mapping(), and now >> >> >> it is not in swap_cache >> >> >> >> >> > >> >> > But I don't think swap layer will call zswap with the same offset A. >> >> >> >> 1. store page of offset A in zswap >> >> 2. some time later, pagefault occur, load page data from zswap. >> >> But notice that zswap entry x is still in zswap because it is not >> >> frontswap_tmem_exclusive_gets_enabled. >> > >> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff >> > between CPU burining by frequent swapout and memory footprint by duplicate >> > copy in swap cache and frontswap backend so it shouldn't affect the stability. >> >> Thanks for explain this. >> I don't mean to say this option affects the stability, but that zswap >> only realize >> one option. Maybe it's better to realize both options for different workloads. > > "zswap only relize one option" > What does it mena? Sorry. I couldn't parse your intention. :) > You mean zswap should do something special to support frontswap_tmem_exclusive_gets? Yes. But I am not sure whether it is worth. >> >> >> this page is with PageSwapCache(page) and page_private(page) = entry.val >> >> 3. change this page data, and it become dirty >> > >> > If non-shared swapin page become redirty, it should remove the page from >> > swapcache. If shared swapin page become redirty, it should do CoW so it's a >> > new page so that it doesn't live in swap cache. It means it should have new >> > offset which is different with old's one for swap out. >> > >> > What's wrong with that? >> >> It is really not a right scene for duplicate store. And I can not think out one. >> If duplicate store is impossible, How about delete the handle code in zswap? >> If it does exist, I think there is a potential issue as I described. > > You mean "zswap_duplicate_entry"? > AFAIR, I already had a question to Seth when zswap was born but AFAIRC, > he said that he didn't know exact reason but he saw that case during > experiement so copy the code peice from zcache. > > Do you see the case, too? Yes, I mean duplicate store. I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores", but I am still confused. I wrote a zcache varietas which swap out compressed page to swapfile. I did see that case when I test it on andorid smartphone(arm v7), and it happens rarely and occasionally. In one test, only 1 duplicate store occur in about 3157162 times stores. > Anyway, we need to dive into that to know what happens and then open > our eyes for clear solution before dumping meaningless patch. > > I hope Seth or Bob already know it. > >> >> >> 4. some time later again, swap this page on the same offset A. >> >> >> >> so, a duplicate store happens. >> >> >> >> what I can think is that use flags and CAS to protect store and reclaim on >> >> the same offset happens concurrentlly. >> >> >> >> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache >> >> >> >> >> >> Now, swap cache has old data rather than new data for offset A. >> >> >> error will happen If do_swap_page() get page from swap_cache. >> >> >> >> >> > >> >> > -- >> >> > Regards, >> >> > --Bob >> >> >> >> -- >> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> >> the body to majordomo@kvack.org. For more info on Linux MM, >> >> see: http://www.linux-mm.org/ . >> >> Don't email: email@kvack.org >> > >> > -- >> > Kind regards, >> > Minchan Kim >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: email@kvack.org > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f182.google.com (mail-pd0-f182.google.com [209.85.192.182]) by kanga.kvack.org (Postfix) with ESMTP id 39D4D6B0031 for ; Thu, 26 Sep 2013 05:56:43 -0400 (EDT) Received: by mail-pd0-f182.google.com with SMTP id r10so928073pdi.41 for ; Thu, 26 Sep 2013 02:56:42 -0700 (PDT) Received: by mail-ve0-f173.google.com with SMTP id cz12so679175veb.4 for ; Thu, 26 Sep 2013 02:56:40 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> <20130926075725.GA22339@bbox> Date: Thu, 26 Sep 2013 17:56:40 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Bob Liu Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel On Thu, Sep 26, 2013 at 4:48 PM, Weijie Yang wrote: > On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim wrote: >> On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote: >>> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: >>> > Hello Weigie, >>> > >>> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: >>> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >>> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >>> >> >> I think I find a new issue, for integrity of this mail thread, I reply >>> >> >> to this mail. >>> >> >> >>> >> >> It is a concurrence issue either, when duplicate store and reclaim >>> >> >> concurrentlly. >>> >> >> >>> >> >> zswap entry x with offset A is already stored in zswap backend. >>> >> >> Consider the following scenario: >>> >> >> >>> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >>> >> >> >>> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. >>> >> >> store finished. shrink_page_list() call __remove_mapping(), and now >>> >> >> it is not in swap_cache >>> >> >> >>> >> > >>> >> > But I don't think swap layer will call zswap with the same offset A. >>> >> >>> >> 1. store page of offset A in zswap >>> >> 2. some time later, pagefault occur, load page data from zswap. >>> >> But notice that zswap entry x is still in zswap because it is not >>> >> frontswap_tmem_exclusive_gets_enabled. >>> > >>> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff >>> > between CPU burining by frequent swapout and memory footprint by duplicate >>> > copy in swap cache and frontswap backend so it shouldn't affect the stability. >>> >>> Thanks for explain this. >>> I don't mean to say this option affects the stability, but that zswap >>> only realize >>> one option. Maybe it's better to realize both options for different workloads. >> >> "zswap only relize one option" >> What does it mena? Sorry. I couldn't parse your intention. :) >> You mean zswap should do something special to support frontswap_tmem_exclusive_gets? > > Yes. But I am not sure whether it is worth. > >>> >>> >> this page is with PageSwapCache(page) and page_private(page) = entry.val >>> >> 3. change this page data, and it become dirty >>> > >>> > If non-shared swapin page become redirty, it should remove the page from >>> > swapcache. If shared swapin page become redirty, it should do CoW so it's a >>> > new page so that it doesn't live in swap cache. It means it should have new >>> > offset which is different with old's one for swap out. >>> > >>> > What's wrong with that? >>> >>> It is really not a right scene for duplicate store. And I can not think out one. >>> If duplicate store is impossible, How about delete the handle code in zswap? >>> If it does exist, I think there is a potential issue as I described. >> >> You mean "zswap_duplicate_entry"? >> AFAIR, I already had a question to Seth when zswap was born but AFAIRC, >> he said that he didn't know exact reason but he saw that case during >> experiement so copy the code peice from zcache. >> >> Do you see the case, too? > > Yes, I mean duplicate store. > I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores", > but I am still confused. > > I wrote a zcache varietas which swap out compressed page to swapfile. > I did see that case when I test it on andorid smartphone(arm v7), Why not test it based on zswap directly? My suggestion is do more and more stress testing against current zswap, if there is really any bug triggered or unusual performance issue then we dig it out and fix it. -- Regards, --Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f172.google.com (mail-pd0-f172.google.com [209.85.192.172]) by kanga.kvack.org (Postfix) with ESMTP id A8ACD6B0032 for ; Fri, 27 Sep 2013 00:58:08 -0400 (EDT) Received: by mail-pd0-f172.google.com with SMTP id z10so2091449pdj.17 for ; Thu, 26 Sep 2013 21:58:08 -0700 (PDT) Date: Fri, 27 Sep 2013 13:58:42 +0900 From: Minchan Kim Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130927045842.GB22339@bbox> References: <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> <20130926075725.GA22339@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Weijie Yang Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Hello Weijie, On Thu, Sep 26, 2013 at 04:48:03PM +0800, Weijie Yang wrote: > On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim wrote: > > On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote: > >> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: > >> > Hello Weigie, > >> > > >> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: > >> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: > >> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: > >> >> >> I think I find a new issue, for integrity of this mail thread, I reply > >> >> >> to this mail. > >> >> >> > >> >> >> It is a concurrence issue either, when duplicate store and reclaim > >> >> >> concurrentlly. > >> >> >> > >> >> >> zswap entry x with offset A is already stored in zswap backend. > >> >> >> Consider the following scenario: > >> >> >> > >> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) > >> >> >> > >> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. > >> >> >> store finished. shrink_page_list() call __remove_mapping(), and now > >> >> >> it is not in swap_cache > >> >> >> > >> >> > > >> >> > But I don't think swap layer will call zswap with the same offset A. > >> >> > >> >> 1. store page of offset A in zswap > >> >> 2. some time later, pagefault occur, load page data from zswap. > >> >> But notice that zswap entry x is still in zswap because it is not > >> >> frontswap_tmem_exclusive_gets_enabled. > >> > > >> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff > >> > between CPU burining by frequent swapout and memory footprint by duplicate > >> > copy in swap cache and frontswap backend so it shouldn't affect the stability. > >> > >> Thanks for explain this. > >> I don't mean to say this option affects the stability, but that zswap > >> only realize > >> one option. Maybe it's better to realize both options for different workloads. > > > > "zswap only relize one option" > > What does it mena? Sorry. I couldn't parse your intention. :) > > You mean zswap should do something special to support frontswap_tmem_exclusive_gets? > > Yes. But I am not sure whether it is worth. > > >> > >> >> this page is with PageSwapCache(page) and page_private(page) = entry.val > >> >> 3. change this page data, and it become dirty > >> > > >> > If non-shared swapin page become redirty, it should remove the page from > >> > swapcache. If shared swapin page become redirty, it should do CoW so it's a > >> > new page so that it doesn't live in swap cache. It means it should have new > >> > offset which is different with old's one for swap out. > >> > > >> > What's wrong with that? > >> > >> It is really not a right scene for duplicate store. And I can not think out one. > >> If duplicate store is impossible, How about delete the handle code in zswap? > >> If it does exist, I think there is a potential issue as I described. > > > > You mean "zswap_duplicate_entry"? > > AFAIR, I already had a question to Seth when zswap was born but AFAIRC, > > he said that he didn't know exact reason but he saw that case during > > experiement so copy the code peice from zcache. > > > > Do you see the case, too? > > Yes, I mean duplicate store. > I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores", > but I am still confused. It seems that there are two Minchan in LKML. Other Minchan, not me who have a horrible memory, already was first to figure it out a few month ago. https://lkml.org/lkml/2013/1/31/3 /me slaps self. I'd like to look into that issue more but now I don't have a time. Just FYI. ;-) -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753449Ab3HRQOr (ORCPT ); Sun, 18 Aug 2013 12:14:47 -0400 Received: from mail-oa0-f66.google.com ([209.85.219.66]:52582 "EHLO mail-oa0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752805Ab3HRQOq (ORCPT ); Sun, 18 Aug 2013 12:14:46 -0400 MIME-Version: 1.0 Date: Mon, 19 Aug 2013 00:14:45 +0800 Message-ID: Subject: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang To: sjenning@linux.vnet.ibm.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.magenheimer@oracle.com, bob.liu@oracle.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I found a few bugs in zswap when I review Linux-3.11-rc5, and I have also some questions about it, described as following: BUG: 1. A race condition when reclaim a page when a handle alloced from zbud, zbud considers this handle is used validly by upper(zswap) and can be a candidate for reclaim. But zswap has to initialize it such as setting swapentry and addding it to rbtree. so there is a race condition, such as: thread 0: obtain handle x from zbud_alloc thread 1: zbud_reclaim_page is called thread 1: callback zswap_writeback_entry to reclaim handle x thread 1: get swpentry from handle x (it is random value now) thread 1: bad thing may happen thread 0: initialize handle x with swapentry Of course, this situation almost never happen, it is a "theoretical race condition" issue. 2. Pollute swapcache data by add a pre-invalided swap page when a swap_entry is invalidated, it will be reused by other anon page. At the same time, zswap is reclaiming old page, pollute swapcache of new page as a result, because old page and new page use the same swap_entry, such as: thread 1: zswap reclaim entry x thread 0: zswap_frontswap_invalidate_page entry x thread 0: entry x reused by other anon page thread 1: add old data to swapcache of entry x thread 0: swapcache of entry x is polluted Of course, this situation almost never happen, it is another "theoretical race condition" issue. 3. Frontswap uses frontswap_map bitmap to track page in "backend" implementation, when zswap reclaim a page, the corresponding bitmap record is not cleared. 4. zswap_tree is not freed when swapoff, and it got re-kzalloc in swapon, memory leak occurs. questions: 1. How about SetPageReclaim befor __swap_writepage, so that move it to the tail of the inactive list? 2. zswap uses GFP_KERNEL flag to alloc things in store and reclaim function, does this lead to these function called recursively? 3. for reclaiming one zbud page which contains two buddies, zswap needs to alloc two pages. Does this reclaim cost-efficient? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753459Ab3HSCR4 (ORCPT ); Sun, 18 Aug 2013 22:17:56 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:25466 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752410Ab3HSCRy (ORCPT ); Sun, 18 Aug 2013 22:17:54 -0400 Message-ID: <52118042.30101@oracle.com> Date: Mon, 19 Aug 2013 10:17:38 +0800 From: Bob Liu User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: Weijie Yang CC: sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.magenheimer@oracle.com Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Source-IP: acsinet22.oracle.com [141.146.126.238] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Weijie, On 08/19/2013 12:14 AM, Weijie Yang wrote: > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have > also some questions about it, described as following: > > BUG: > 1. A race condition when reclaim a page > when a handle alloced from zbud, zbud considers this handle is used > validly by upper(zswap) and can be a candidate for reclaim. > But zswap has to initialize it such as setting swapentry and addding > it to rbtree. so there is a race condition, such as: > thread 0: obtain handle x from zbud_alloc > thread 1: zbud_reclaim_page is called > thread 1: callback zswap_writeback_entry to reclaim handle x > thread 1: get swpentry from handle x (it is random value now) > thread 1: bad thing may happen > thread 0: initialize handle x with swapentry Yes, this may happen potentially but in rare case. Because we have a LRU list for page frames, after Thread 0 called zbud_alloc the corresponding page will be add to the head of LRU list,While zbud_reclaim_page(Thread 1 called) is started from the tail of LRU list. > Of course, this situation almost never happen, it is a "theoretical > race condition" issue. > > 2. Pollute swapcache data by add a pre-invalided swap page > when a swap_entry is invalidated, it will be reused by other anon > page. At the same time, zswap is reclaiming old page, pollute > swapcache of new page as a result, because old page and new page use > the same swap_entry, such as: > thread 1: zswap reclaim entry x > thread 0: zswap_frontswap_invalidate_page entry x > thread 0: entry x reused by other anon page > thread 1: add old data to swapcache of entry x I didn't get your idea here, why thread1 will add old data to entry x? > thread 0: swapcache of entry x is polluted > Of course, this situation almost never happen, it is another > "theoretical race condition" issue. > > 3. Frontswap uses frontswap_map bitmap to track page in "backend" > implementation, when zswap reclaim a > page, the corresponding bitmap record is not cleared. > That's true, but I don't think it's a big problem. Only waste little time to search rbtree during zswap_frontswap_load(). > 4. zswap_tree is not freed when swapoff, and it got re-kzalloc in > swapon, memory leak occurs. Nice catch! I think it should be freed in zswap_frontswap_invalidate_area(). > > questions: > 1. How about SetPageReclaim befor __swap_writepage, so that move it to > the tail of the inactive list? It will be added to inactive now. > 2. zswap uses GFP_KERNEL flag to alloc things in store and reclaim > function, does this lead to these function called recursively? Yes, that's a potential problem. > 3. for reclaiming one zbud page which contains two buddies, zswap > needs to alloc two pages. Does this reclaim cost-efficient? > Yes, that's a problem too. And that's why we use zbud as the default allocator instead of zsmalloc. I think improving the write back path of zswap is the next important step for zswap. -- Regards, -Bob From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750836Ab3HSFrV (ORCPT ); Mon, 19 Aug 2013 01:47:21 -0400 Received: from LGEMRELSE6Q.lge.com ([156.147.1.121]:53223 "EHLO LGEMRELSE6Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750792Ab3HSFrT (ORCPT ); Mon, 19 Aug 2013 01:47:19 -0400 X-AuditID: 9c930179-b7c0bae0000040ac-7e-5211b165994e Date: Mon, 19 Aug 2013 14:47:42 +0900 From: Minchan Kim To: Bob Liu Cc: Weijie Yang , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.magenheimer@oracle.com Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130819054742.GA28062@bbox> References: <52118042.30101@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <52118042.30101@oracle.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote: > Hi Weijie, > > On 08/19/2013 12:14 AM, Weijie Yang wrote: > > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have > > also some questions about it, described as following: > > > > BUG: > > 1. A race condition when reclaim a page > > when a handle alloced from zbud, zbud considers this handle is used > > validly by upper(zswap) and can be a candidate for reclaim. > > But zswap has to initialize it such as setting swapentry and addding > > it to rbtree. so there is a race condition, such as: > > thread 0: obtain handle x from zbud_alloc > > thread 1: zbud_reclaim_page is called > > thread 1: callback zswap_writeback_entry to reclaim handle x > > thread 1: get swpentry from handle x (it is random value now) > > thread 1: bad thing may happen > > thread 0: initialize handle x with swapentry Nice catch! > > Yes, this may happen potentially but in rare case. > Because we have a LRU list for page frames, after Thread 0 called > zbud_alloc the corresponding page will be add to the head of LRU > list,While zbud_reclaim_page(Thread 1 called) is started from the tail > of LRU list. > > > Of course, this situation almost never happen, it is a "theoretical > > race condition" issue. But it's doable and we should prevent that although you feel it's rare because system could go hang. When I look at the code, Why should zbud have LRU logic instead of zswap? If I missed some history, sorry about that. But at least to me, zbud is just allocator so it should have a role to handle alloc/free object and how client of the allocator uses objects depends on the upper layer so zbud should handle LRU. If so, we wouldn't encounter this problem, either. > > > > 2. Pollute swapcache data by add a pre-invalided swap page > > when a swap_entry is invalidated, it will be reused by other anon > > page. At the same time, zswap is reclaiming old page, pollute > > swapcache of new page as a result, because old page and new page use > > the same swap_entry, such as: > > thread 1: zswap reclaim entry x > > thread 0: zswap_frontswap_invalidate_page entry x > > thread 0: entry x reused by other anon page > > thread 1: add old data to swapcache of entry x > > I didn't get your idea here, why thread1 will add old data to entry x? > > > thread 0: swapcache of entry x is polluted > > Of course, this situation almost never happen, it is another > > "theoretical race condition" issue. Don't swapcache_prepare close the race? > > > > 3. Frontswap uses frontswap_map bitmap to track page in "backend" > > implementation, when zswap reclaim a > > page, the corresponding bitmap record is not cleared. > > > > That's true, but I don't think it's a big problem. > Only waste little time to search rbtree during zswap_frontswap_load(). > > > 4. zswap_tree is not freed when swapoff, and it got re-kzalloc in > > swapon, memory leak occurs. > > Nice catch! I think it should be freed in zswap_frontswap_invalidate_area(). > > > > > questions: > > 1. How about SetPageReclaim befor __swap_writepage, so that move it to > > the tail of the inactive list? It's a good idea to avoid unnecessary page scanning. > > It will be added to inactive now. > > > 2. zswap uses GFP_KERNEL flag to alloc things in store and reclaim > > function, does this lead to these function called recursively? > > Yes, that's a potential problem. It should use GFP_NOIO. > > > 3. for reclaiming one zbud page which contains two buddies, zswap > > needs to alloc two pages. Does this reclaim cost-efficient? It would be better to evict zpage which is a compressed sequence of PAGE_SIZE bytes rather than decompresesed PAGE_SIZE bytes a page when we are about to reclaim the page but it's hard part from frontswap API. > > > > Yes, that's a problem too. And that's why we use zbud as the default > allocator instead of zsmalloc. > I think improving the write back path of zswap is the next important > step for zswap. > > -- > Regards, > -Bob > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751624Ab3HTPac (ORCPT ); Tue, 20 Aug 2013 11:30:32 -0400 Received: from mail-ob0-f182.google.com ([209.85.214.182]:62528 "EHLO mail-ob0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751582Ab3HTPaa (ORCPT ); Tue, 20 Aug 2013 11:30:30 -0400 MIME-Version: 1.0 In-Reply-To: <20130819054742.GA28062@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> Date: Tue, 20 Aug 2013 23:30:29 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang To: Minchan Kim Cc: Bob Liu , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 2013/8/19 Minchan Kim : > On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote: >> Hi Weijie, >> >> On 08/19/2013 12:14 AM, Weijie Yang wrote: >> > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have >> > also some questions about it, described as following: >> > >> > BUG: >> > 1. A race condition when reclaim a page >> > when a handle alloced from zbud, zbud considers this handle is used >> > validly by upper(zswap) and can be a candidate for reclaim. >> > But zswap has to initialize it such as setting swapentry and addding >> > it to rbtree. so there is a race condition, such as: >> > thread 0: obtain handle x from zbud_alloc >> > thread 1: zbud_reclaim_page is called >> > thread 1: callback zswap_writeback_entry to reclaim handle x >> > thread 1: get swpentry from handle x (it is random value now) >> > thread 1: bad thing may happen >> > thread 0: initialize handle x with swapentry > > Nice catch! > >> >> Yes, this may happen potentially but in rare case. >> Because we have a LRU list for page frames, after Thread 0 called >> zbud_alloc the corresponding page will be add to the head of LRU >> list,While zbud_reclaim_page(Thread 1 called) is started from the tail >> of LRU list. >> >> > Of course, this situation almost never happen, it is a "theoretical >> > race condition" issue. > > But it's doable and we should prevent that although you feel it's rare > because system could go hang. When I look at the code, Why should zbud > have LRU logic instead of zswap? If I missed some history, sorry about that. > But at least to me, zbud is just allocator so it should have a role > to handle alloc/free object and how client of the allocator uses objects > depends on the upper layer so zbud should handle LRU. If so, we wouldn't > encounter this problem, either. > >> > >> > 2. Pollute swapcache data by add a pre-invalided swap page >> > when a swap_entry is invalidated, it will be reused by other anon >> > page. At the same time, zswap is reclaiming old page, pollute >> > swapcache of new page as a result, because old page and new page use >> > the same swap_entry, such as: >> > thread 1: zswap reclaim entry x >> > thread 0: zswap_frontswap_invalidate_page entry x >> > thread 0: entry x reused by other anon page >> > thread 1: add old data to swapcache of entry x >> >> I didn't get your idea here, why thread1 will add old data to entry x? >> >> > thread 0: swapcache of entry x is polluted >> > Of course, this situation almost never happen, it is another >> > "theoretical race condition" issue. > > Don't swapcache_prepare close the race? Yes, I made a mistake, there is not a race here. However, I find another bug here after my more careful review. It is not only "theoretical", it will happen really. as: thread 1: zswap reclaim entry x (get the refcount, but not call zswap_get_swap_cache_page yet) thread 0: zswap_frontswap_invalidate_page entry x (finished, entry x and its zbud is not freed as its refcount != 0) now, the swap_map[x] = 0 thread 1: zswap_get_swap_cache_page called, swapcache_prepare return -ENOENT because entry x is not used any more zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM zswap_writeback_entry do nothing except put refcount now, the memory of zswap_entry x leaks and its zpage become a zombie Best Regards, Weijie Yang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752548Ab3HUHtM (ORCPT ); Wed, 21 Aug 2013 03:49:12 -0400 Received: from lgeamrelo01.lge.com ([156.147.1.125]:53455 "EHLO LGEAMRELO01.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752527Ab3HUHtL (ORCPT ); Wed, 21 Aug 2013 03:49:11 -0400 X-AuditID: 9c93017d-b7cdfae0000026c0-9f-521470f44596 Date: Wed, 21 Aug 2013 16:49:39 +0900 From: Minchan Kim To: Weijie Yang Cc: Bob Liu , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130821074939.GE3022@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Tue, Aug 20, 2013 at 11:30:29PM +0800, Weijie Yang wrote: > 2013/8/19 Minchan Kim : > > On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote: > >> Hi Weijie, > >> > >> On 08/19/2013 12:14 AM, Weijie Yang wrote: > >> > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have > >> > also some questions about it, described as following: > >> > > >> > BUG: > >> > 1. A race condition when reclaim a page > >> > when a handle alloced from zbud, zbud considers this handle is used > >> > validly by upper(zswap) and can be a candidate for reclaim. > >> > But zswap has to initialize it such as setting swapentry and addding > >> > it to rbtree. so there is a race condition, such as: > >> > thread 0: obtain handle x from zbud_alloc > >> > thread 1: zbud_reclaim_page is called > >> > thread 1: callback zswap_writeback_entry to reclaim handle x > >> > thread 1: get swpentry from handle x (it is random value now) > >> > thread 1: bad thing may happen > >> > thread 0: initialize handle x with swapentry > > > > Nice catch! > > > >> > >> Yes, this may happen potentially but in rare case. > >> Because we have a LRU list for page frames, after Thread 0 called > >> zbud_alloc the corresponding page will be add to the head of LRU > >> list,While zbud_reclaim_page(Thread 1 called) is started from the tail > >> of LRU list. > >> > >> > Of course, this situation almost never happen, it is a "theoretical > >> > race condition" issue. > > > > But it's doable and we should prevent that although you feel it's rare > > because system could go hang. When I look at the code, Why should zbud > > have LRU logic instead of zswap? If I missed some history, sorry about that. > > But at least to me, zbud is just allocator so it should have a role > > to handle alloc/free object and how client of the allocator uses objects > > depends on the upper layer so zbud should handle LRU. If so, we wouldn't > > encounter this problem, either. > > > >> > > >> > 2. Pollute swapcache data by add a pre-invalided swap page > >> > when a swap_entry is invalidated, it will be reused by other anon > >> > page. At the same time, zswap is reclaiming old page, pollute > >> > swapcache of new page as a result, because old page and new page use > >> > the same swap_entry, such as: > >> > thread 1: zswap reclaim entry x > >> > thread 0: zswap_frontswap_invalidate_page entry x > >> > thread 0: entry x reused by other anon page > >> > thread 1: add old data to swapcache of entry x > >> > >> I didn't get your idea here, why thread1 will add old data to entry x? > >> > >> > thread 0: swapcache of entry x is polluted > >> > Of course, this situation almost never happen, it is another > >> > "theoretical race condition" issue. > > > > Don't swapcache_prepare close the race? > > Yes, I made a mistake, there is not a race here. > However, I find another bug here after my more careful review. It is > not only "theoretical", it will happen really. as: > thread 1: zswap reclaim entry x (get the refcount, but not call > zswap_get_swap_cache_page yet) > thread 0: zswap_frontswap_invalidate_page entry x (finished, entry x > and its zbud is not freed as its refcount != 0) > now, the swap_map[x] = 0 > thread 1: zswap_get_swap_cache_page called, swapcache_prepare return > -ENOENT because entry x is not used any more > zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM > zswap_writeback_entry do nothing except put refcount > now, the memory of zswap_entry x leaks and its zpage become a zombie It makes sense to me. Maybe we should introduce ZSWAP_SWAPCACHE_NOENT and free the entry in the case. Would you mind to send patches on the problems you found? > > Best Regards, > Weijie Yang > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754228Ab3IYIJJ (ORCPT ); Wed, 25 Sep 2013 04:09:09 -0400 Received: from mail-ie0-f177.google.com ([209.85.223.177]:57923 "EHLO mail-ie0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751719Ab3IYIJG (ORCPT ); Wed, 25 Sep 2013 04:09:06 -0400 MIME-Version: 1.0 In-Reply-To: <20130821074939.GE3022@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Wed, 25 Sep 2013 16:09:04 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang To: Minchan Kim Cc: Bob Liu , sjenning@linux.vnet.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I think I find a new issue, for integrity of this mail thread, I reply to this mail. It is a concurrence issue either, when duplicate store and reclaim concurrentlly. zswap entry x with offset A is already stored in zswap backend. Consider the following scenario: thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) thread 1: store new page with the same offset A, alloc a new zswap entry y. store finished. shrink_page_list() call __remove_mapping(), and now it is not in swap_cache thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache Now, swap cache has old data rather than new data for offset A. error will happen If do_swap_page() get page from swap_cache. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753330Ab3IYIbL (ORCPT ); Wed, 25 Sep 2013 04:31:11 -0400 Received: from mail-vc0-f179.google.com ([209.85.220.179]:44727 "EHLO mail-vc0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750774Ab3IYIbJ (ORCPT ); Wed, 25 Sep 2013 04:31:09 -0400 MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Wed, 25 Sep 2013 16:31:08 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Bob Liu To: Weijie Yang Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: > I think I find a new issue, for integrity of this mail thread, I reply > to this mail. > > It is a concurrence issue either, when duplicate store and reclaim > concurrentlly. > > zswap entry x with offset A is already stored in zswap backend. > Consider the following scenario: > > thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) > > thread 1: store new page with the same offset A, alloc a new zswap entry y. > store finished. shrink_page_list() call __remove_mapping(), and now > it is not in swap_cache > But I don't think swap layer will call zswap with the same offset A. > thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache > > Now, swap cache has old data rather than new data for offset A. > error will happen If do_swap_page() get page from swap_cache. > -- Regards, --Bob From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754949Ab3IYJdq (ORCPT ); Wed, 25 Sep 2013 05:33:46 -0400 Received: from mail-ie0-f181.google.com ([209.85.223.181]:53142 "EHLO mail-ie0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754880Ab3IYJdo (ORCPT ); Wed, 25 Sep 2013 05:33:44 -0400 MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Wed, 25 Sep 2013 17:33:43 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang To: Bob Liu Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >> I think I find a new issue, for integrity of this mail thread, I reply >> to this mail. >> >> It is a concurrence issue either, when duplicate store and reclaim >> concurrentlly. >> >> zswap entry x with offset A is already stored in zswap backend. >> Consider the following scenario: >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. >> store finished. shrink_page_list() call __remove_mapping(), and now >> it is not in swap_cache >> > > But I don't think swap layer will call zswap with the same offset A. 1. store page of offset A in zswap 2. some time later, pagefault occur, load page data from zswap. But notice that zswap entry x is still in zswap because it is not frontswap_tmem_exclusive_gets_enabled. this page is with PageSwapCache(page) and page_private(page) = entry.val 3. change this page data, and it become dirty 4. some time later again, swap this page on the same offset A. so, a duplicate store happens. what I can think is that use flags and CAS to protect store and reclaim on the same offset happens concurrentlly. >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache >> >> Now, swap cache has old data rather than new data for offset A. >> error will happen If do_swap_page() get page from swap_cache. >> > > -- > Regards, > --Bob From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754963Ab3IYKCR (ORCPT ); Wed, 25 Sep 2013 06:02:17 -0400 Received: from mail-ve0-f175.google.com ([209.85.128.175]:43902 "EHLO mail-ve0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753371Ab3IYKCQ (ORCPT ); Wed, 25 Sep 2013 06:02:16 -0400 MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Wed, 25 Sep 2013 18:02:15 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Bob Liu To: Weijie Yang Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Sep 25, 2013 at 5:33 PM, Weijie Yang wrote: > On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >>> I think I find a new issue, for integrity of this mail thread, I reply >>> to this mail. >>> >>> It is a concurrence issue either, when duplicate store and reclaim >>> concurrentlly. >>> >>> zswap entry x with offset A is already stored in zswap backend. >>> Consider the following scenario: >>> >>> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >>> >>> thread 1: store new page with the same offset A, alloc a new zswap entry y. >>> store finished. shrink_page_list() call __remove_mapping(), and now >>> it is not in swap_cache >>> >> >> But I don't think swap layer will call zswap with the same offset A. > > 1. store page of offset A in zswap > 2. some time later, pagefault occur, load page data from zswap. > But notice that zswap entry x is still in zswap because it is not Sorry I didn't notice that zswap_frontswap_load() doesn't call rb_erase(). > frontswap_tmem_exclusive_gets_enabled. > this page is with PageSwapCache(page) and page_private(page) = entry.val > 3. change this page data, and it become dirty > 4. some time later again, swap this page on the same offset A. > > so, a duplicate store happens. > Then I think we should erase the entry from rbtree in zswap_frontswap_load(). After the page is decompressed and loaded from zswap, still storing the compressed data in zswap is meanless. -- Regards, --Bob From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755442Ab3IZCGn (ORCPT ); Wed, 25 Sep 2013 22:06:43 -0400 Received: from mail-ie0-f175.google.com ([209.85.223.175]:43173 "EHLO mail-ie0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752857Ab3IZCGm (ORCPT ); Wed, 25 Sep 2013 22:06:42 -0400 MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> Date: Thu, 26 Sep 2013 10:06:39 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang To: Bob Liu Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Sep 25, 2013 at 6:02 PM, Bob Liu wrote: > On Wed, Sep 25, 2013 at 5:33 PM, Weijie Yang wrote: >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >>> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >>>> I think I find a new issue, for integrity of this mail thread, I reply >>>> to this mail. >>>> >>>> It is a concurrence issue either, when duplicate store and reclaim >>>> concurrentlly. >>>> >>>> zswap entry x with offset A is already stored in zswap backend. >>>> Consider the following scenario: >>>> >>>> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >>>> >>>> thread 1: store new page with the same offset A, alloc a new zswap entry y. >>>> store finished. shrink_page_list() call __remove_mapping(), and now >>>> it is not in swap_cache >>>> >>> >>> But I don't think swap layer will call zswap with the same offset A. >> >> 1. store page of offset A in zswap >> 2. some time later, pagefault occur, load page data from zswap. >> But notice that zswap entry x is still in zswap because it is not > > Sorry I didn't notice that zswap_frontswap_load() doesn't call rb_erase(). > >> frontswap_tmem_exclusive_gets_enabled. >> this page is with PageSwapCache(page) and page_private(page) = entry.val >> 3. change this page data, and it become dirty >> 4. some time later again, swap this page on the same offset A. >> >> so, a duplicate store happens. >> > > Then I think we should erase the entry from rbtree in zswap_frontswap_load(). > After the page is decompressed and loaded from zswap, still storing > the compressed data in zswap is meanless. Of cause, erasing the entry after load() can resolve this. However, this problem is not simple and interesting. If we drop the entry after load(), we should SetPageDirty, it will generate more swap even if we don't change this page data. I think that is why frontswap has two load mode: default(used now) and exclusive_gets. I don't have test data, but I think which mode is better is decided by corresponding workload. It's better to realize these two modes, I will try it. > -- > Regards, > --Bob From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752889Ab3IZF5a (ORCPT ); Thu, 26 Sep 2013 01:57:30 -0400 Received: from LGEMRELSE6Q.lge.com ([156.147.1.121]:45578 "EHLO LGEMRELSE6Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751384Ab3IZF52 (ORCPT ); Thu, 26 Sep 2013 01:57:28 -0400 X-AuditID: 9c930179-b7c8bae000006c65-cb-5243ccc77d12 Date: Thu, 26 Sep 2013 14:58:02 +0900 From: Minchan Kim To: Weijie Yang Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130926055802.GA20634@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Weigie, On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: > On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: > > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: > >> I think I find a new issue, for integrity of this mail thread, I reply > >> to this mail. > >> > >> It is a concurrence issue either, when duplicate store and reclaim > >> concurrentlly. > >> > >> zswap entry x with offset A is already stored in zswap backend. > >> Consider the following scenario: > >> > >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) > >> > >> thread 1: store new page with the same offset A, alloc a new zswap entry y. > >> store finished. shrink_page_list() call __remove_mapping(), and now > >> it is not in swap_cache > >> > > > > But I don't think swap layer will call zswap with the same offset A. > > 1. store page of offset A in zswap > 2. some time later, pagefault occur, load page data from zswap. > But notice that zswap entry x is still in zswap because it is not > frontswap_tmem_exclusive_gets_enabled. frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff between CPU burining by frequent swapout and memory footprint by duplicate copy in swap cache and frontswap backend so it shouldn't affect the stability. > this page is with PageSwapCache(page) and page_private(page) = entry.val > 3. change this page data, and it become dirty If non-shared swapin page become redirty, it should remove the page from swapcache. If shared swapin page become redirty, it should do CoW so it's a new page so that it doesn't live in swap cache. It means it should have new offset which is different with old's one for swap out. What's wrong with that? > 4. some time later again, swap this page on the same offset A. > > so, a duplicate store happens. > > what I can think is that use flags and CAS to protect store and reclaim on > the same offset happens concurrentlly. > > >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache > >> > >> Now, swap cache has old data rather than new data for offset A. > >> error will happen If do_swap_page() get page from swap_cache. > >> > > > > -- > > Regards, > > --Bob > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755816Ab3IZH0f (ORCPT ); Thu, 26 Sep 2013 03:26:35 -0400 Received: from mail-ie0-f173.google.com ([209.85.223.173]:61024 "EHLO mail-ie0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754775Ab3IZH0e (ORCPT ); Thu, 26 Sep 2013 03:26:34 -0400 MIME-Version: 1.0 In-Reply-To: <20130926055802.GA20634@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> Date: Thu, 26 Sep 2013 15:26:33 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang To: Minchan Kim Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: > Hello Weigie, > > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >> >> I think I find a new issue, for integrity of this mail thread, I reply >> >> to this mail. >> >> >> >> It is a concurrence issue either, when duplicate store and reclaim >> >> concurrentlly. >> >> >> >> zswap entry x with offset A is already stored in zswap backend. >> >> Consider the following scenario: >> >> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >> >> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. >> >> store finished. shrink_page_list() call __remove_mapping(), and now >> >> it is not in swap_cache >> >> >> > >> > But I don't think swap layer will call zswap with the same offset A. >> >> 1. store page of offset A in zswap >> 2. some time later, pagefault occur, load page data from zswap. >> But notice that zswap entry x is still in zswap because it is not >> frontswap_tmem_exclusive_gets_enabled. > > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff > between CPU burining by frequent swapout and memory footprint by duplicate > copy in swap cache and frontswap backend so it shouldn't affect the stability. Thanks for explain this. I don't mean to say this option affects the stability, but that zswap only realize one option. Maybe it's better to realize both options for different workloads. >> this page is with PageSwapCache(page) and page_private(page) = entry.val >> 3. change this page data, and it become dirty > > If non-shared swapin page become redirty, it should remove the page from > swapcache. If shared swapin page become redirty, it should do CoW so it's a > new page so that it doesn't live in swap cache. It means it should have new > offset which is different with old's one for swap out. > > What's wrong with that? It is really not a right scene for duplicate store. And I can not think out one. If duplicate store is impossible, How about delete the handle code in zswap? If it does exist, I think there is a potential issue as I described. >> 4. some time later again, swap this page on the same offset A. >> >> so, a duplicate store happens. >> >> what I can think is that use flags and CAS to protect store and reclaim on >> the same offset happens concurrentlly. >> >> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache >> >> >> >> Now, swap cache has old data rather than new data for offset A. >> >> error will happen If do_swap_page() get page from swap_cache. >> >> >> > >> > -- >> > Regards, >> > --Bob >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: email@kvack.org > > -- > Kind regards, > Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756224Ab3IZH4x (ORCPT ); Thu, 26 Sep 2013 03:56:53 -0400 Received: from LGEMRELSE6Q.lge.com ([156.147.1.121]:61985 "EHLO LGEMRELSE6Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752238Ab3IZH4v (ORCPT ); Thu, 26 Sep 2013 03:56:51 -0400 X-AuditID: 9c930179-b7c8bae000006c65-f3-5243e8c1571c Date: Thu, 26 Sep 2013 16:57:25 +0900 From: Minchan Kim To: Weijie Yang Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130926075725.GA22339@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote: > On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: > > Hello Weigie, > > > > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: > >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: > >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: > >> >> I think I find a new issue, for integrity of this mail thread, I reply > >> >> to this mail. > >> >> > >> >> It is a concurrence issue either, when duplicate store and reclaim > >> >> concurrentlly. > >> >> > >> >> zswap entry x with offset A is already stored in zswap backend. > >> >> Consider the following scenario: > >> >> > >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) > >> >> > >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. > >> >> store finished. shrink_page_list() call __remove_mapping(), and now > >> >> it is not in swap_cache > >> >> > >> > > >> > But I don't think swap layer will call zswap with the same offset A. > >> > >> 1. store page of offset A in zswap > >> 2. some time later, pagefault occur, load page data from zswap. > >> But notice that zswap entry x is still in zswap because it is not > >> frontswap_tmem_exclusive_gets_enabled. > > > > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff > > between CPU burining by frequent swapout and memory footprint by duplicate > > copy in swap cache and frontswap backend so it shouldn't affect the stability. > > Thanks for explain this. > I don't mean to say this option affects the stability, but that zswap > only realize > one option. Maybe it's better to realize both options for different workloads. "zswap only relize one option" What does it mena? Sorry. I couldn't parse your intention. :) You mean zswap should do something special to support frontswap_tmem_exclusive_gets? > > >> this page is with PageSwapCache(page) and page_private(page) = entry.val > >> 3. change this page data, and it become dirty > > > > If non-shared swapin page become redirty, it should remove the page from > > swapcache. If shared swapin page become redirty, it should do CoW so it's a > > new page so that it doesn't live in swap cache. It means it should have new > > offset which is different with old's one for swap out. > > > > What's wrong with that? > > It is really not a right scene for duplicate store. And I can not think out one. > If duplicate store is impossible, How about delete the handle code in zswap? > If it does exist, I think there is a potential issue as I described. You mean "zswap_duplicate_entry"? AFAIR, I already had a question to Seth when zswap was born but AFAIRC, he said that he didn't know exact reason but he saw that case during experiement so copy the code peice from zcache. Do you see the case, too? Anyway, we need to dive into that to know what happens and then open our eyes for clear solution before dumping meaningless patch. I hope Seth or Bob already know it. > > >> 4. some time later again, swap this page on the same offset A. > >> > >> so, a duplicate store happens. > >> > >> what I can think is that use flags and CAS to protect store and reclaim on > >> the same offset happens concurrentlly. > >> > >> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache > >> >> > >> >> Now, swap cache has old data rather than new data for offset A. > >> >> error will happen If do_swap_page() get page from swap_cache. > >> >> > >> > > >> > -- > >> > Regards, > >> > --Bob > >> > >> -- > >> To unsubscribe, send a message with 'unsubscribe linux-mm' in > >> the body to majordomo@kvack.org. For more info on Linux MM, > >> see: http://www.linux-mm.org/ . > >> Don't email: email@kvack.org > > > > -- > > Kind regards, > > Minchan Kim > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751486Ab3IZIsJ (ORCPT ); Thu, 26 Sep 2013 04:48:09 -0400 Received: from mail-ie0-f178.google.com ([209.85.223.178]:33065 "EHLO mail-ie0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750821Ab3IZIsF (ORCPT ); Thu, 26 Sep 2013 04:48:05 -0400 MIME-Version: 1.0 In-Reply-To: <20130926075725.GA22339@bbox> References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> <20130926075725.GA22339@bbox> Date: Thu, 26 Sep 2013 16:48:03 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Weijie Yang To: Minchan Kim Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim wrote: > On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote: >> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: >> > Hello Weigie, >> > >> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: >> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >> >> >> I think I find a new issue, for integrity of this mail thread, I reply >> >> >> to this mail. >> >> >> >> >> >> It is a concurrence issue either, when duplicate store and reclaim >> >> >> concurrentlly. >> >> >> >> >> >> zswap entry x with offset A is already stored in zswap backend. >> >> >> Consider the following scenario: >> >> >> >> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >> >> >> >> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. >> >> >> store finished. shrink_page_list() call __remove_mapping(), and now >> >> >> it is not in swap_cache >> >> >> >> >> > >> >> > But I don't think swap layer will call zswap with the same offset A. >> >> >> >> 1. store page of offset A in zswap >> >> 2. some time later, pagefault occur, load page data from zswap. >> >> But notice that zswap entry x is still in zswap because it is not >> >> frontswap_tmem_exclusive_gets_enabled. >> > >> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff >> > between CPU burining by frequent swapout and memory footprint by duplicate >> > copy in swap cache and frontswap backend so it shouldn't affect the stability. >> >> Thanks for explain this. >> I don't mean to say this option affects the stability, but that zswap >> only realize >> one option. Maybe it's better to realize both options for different workloads. > > "zswap only relize one option" > What does it mena? Sorry. I couldn't parse your intention. :) > You mean zswap should do something special to support frontswap_tmem_exclusive_gets? Yes. But I am not sure whether it is worth. >> >> >> this page is with PageSwapCache(page) and page_private(page) = entry.val >> >> 3. change this page data, and it become dirty >> > >> > If non-shared swapin page become redirty, it should remove the page from >> > swapcache. If shared swapin page become redirty, it should do CoW so it's a >> > new page so that it doesn't live in swap cache. It means it should have new >> > offset which is different with old's one for swap out. >> > >> > What's wrong with that? >> >> It is really not a right scene for duplicate store. And I can not think out one. >> If duplicate store is impossible, How about delete the handle code in zswap? >> If it does exist, I think there is a potential issue as I described. > > You mean "zswap_duplicate_entry"? > AFAIR, I already had a question to Seth when zswap was born but AFAIRC, > he said that he didn't know exact reason but he saw that case during > experiement so copy the code peice from zcache. > > Do you see the case, too? Yes, I mean duplicate store. I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores", but I am still confused. I wrote a zcache varietas which swap out compressed page to swapfile. I did see that case when I test it on andorid smartphone(arm v7), and it happens rarely and occasionally. In one test, only 1 duplicate store occur in about 3157162 times stores. > Anyway, we need to dive into that to know what happens and then open > our eyes for clear solution before dumping meaningless patch. > > I hope Seth or Bob already know it. > >> >> >> 4. some time later again, swap this page on the same offset A. >> >> >> >> so, a duplicate store happens. >> >> >> >> what I can think is that use flags and CAS to protect store and reclaim on >> >> the same offset happens concurrentlly. >> >> >> >> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache >> >> >> >> >> >> Now, swap cache has old data rather than new data for offset A. >> >> >> error will happen If do_swap_page() get page from swap_cache. >> >> >> >> >> > >> >> > -- >> >> > Regards, >> >> > --Bob >> >> >> >> -- >> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> >> the body to majordomo@kvack.org. For more info on Linux MM, >> >> see: http://www.linux-mm.org/ . >> >> Don't email: email@kvack.org >> > >> > -- >> > Kind regards, >> > Minchan Kim >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: email@kvack.org > > -- > Kind regards, > Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756619Ab3IZJ44 (ORCPT ); Thu, 26 Sep 2013 05:56:56 -0400 Received: from mail-ve0-f182.google.com ([209.85.128.182]:48309 "EHLO mail-ve0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753314Ab3IZJ4m (ORCPT ); Thu, 26 Sep 2013 05:56:42 -0400 MIME-Version: 1.0 In-Reply-To: References: <52118042.30101@oracle.com> <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> <20130926075725.GA22339@bbox> Date: Thu, 26 Sep 2013 17:56:40 +0800 Message-ID: Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues From: Bob Liu To: Weijie Yang Cc: Minchan Kim , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Sep 26, 2013 at 4:48 PM, Weijie Yang wrote: > On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim wrote: >> On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote: >>> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: >>> > Hello Weigie, >>> > >>> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: >>> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: >>> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: >>> >> >> I think I find a new issue, for integrity of this mail thread, I reply >>> >> >> to this mail. >>> >> >> >>> >> >> It is a concurrence issue either, when duplicate store and reclaim >>> >> >> concurrentlly. >>> >> >> >>> >> >> zswap entry x with offset A is already stored in zswap backend. >>> >> >> Consider the following scenario: >>> >> >> >>> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) >>> >> >> >>> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. >>> >> >> store finished. shrink_page_list() call __remove_mapping(), and now >>> >> >> it is not in swap_cache >>> >> >> >>> >> > >>> >> > But I don't think swap layer will call zswap with the same offset A. >>> >> >>> >> 1. store page of offset A in zswap >>> >> 2. some time later, pagefault occur, load page data from zswap. >>> >> But notice that zswap entry x is still in zswap because it is not >>> >> frontswap_tmem_exclusive_gets_enabled. >>> > >>> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff >>> > between CPU burining by frequent swapout and memory footprint by duplicate >>> > copy in swap cache and frontswap backend so it shouldn't affect the stability. >>> >>> Thanks for explain this. >>> I don't mean to say this option affects the stability, but that zswap >>> only realize >>> one option. Maybe it's better to realize both options for different workloads. >> >> "zswap only relize one option" >> What does it mena? Sorry. I couldn't parse your intention. :) >> You mean zswap should do something special to support frontswap_tmem_exclusive_gets? > > Yes. But I am not sure whether it is worth. > >>> >>> >> this page is with PageSwapCache(page) and page_private(page) = entry.val >>> >> 3. change this page data, and it become dirty >>> > >>> > If non-shared swapin page become redirty, it should remove the page from >>> > swapcache. If shared swapin page become redirty, it should do CoW so it's a >>> > new page so that it doesn't live in swap cache. It means it should have new >>> > offset which is different with old's one for swap out. >>> > >>> > What's wrong with that? >>> >>> It is really not a right scene for duplicate store. And I can not think out one. >>> If duplicate store is impossible, How about delete the handle code in zswap? >>> If it does exist, I think there is a potential issue as I described. >> >> You mean "zswap_duplicate_entry"? >> AFAIR, I already had a question to Seth when zswap was born but AFAIRC, >> he said that he didn't know exact reason but he saw that case during >> experiement so copy the code peice from zcache. >> >> Do you see the case, too? > > Yes, I mean duplicate store. > I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores", > but I am still confused. > > I wrote a zcache varietas which swap out compressed page to swapfile. > I did see that case when I test it on andorid smartphone(arm v7), Why not test it based on zswap directly? My suggestion is do more and more stress testing against current zswap, if there is really any bug triggered or unusual performance issue then we dig it out and fix it. -- Regards, --Bob From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751226Ab3I0E6J (ORCPT ); Fri, 27 Sep 2013 00:58:09 -0400 Received: from lgeamrelo01.lge.com ([156.147.1.125]:48343 "EHLO LGEAMRELO01.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750821Ab3I0E6E (ORCPT ); Fri, 27 Sep 2013 00:58:04 -0400 X-AuditID: 9c93017d-b7c2dae000000ea6-a7-5245105bf26b Date: Fri, 27 Sep 2013 13:58:42 +0900 From: Minchan Kim To: Weijie Yang Cc: Bob Liu , Bob Liu , Seth Jennings , Linux-MM , Linux-Kernel Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues Message-ID: <20130927045842.GB22339@bbox> References: <20130819054742.GA28062@bbox> <20130821074939.GE3022@bbox> <20130926055802.GA20634@bbox> <20130926075725.GA22339@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Weijie, On Thu, Sep 26, 2013 at 04:48:03PM +0800, Weijie Yang wrote: > On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim wrote: > > On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote: > >> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote: > >> > Hello Weigie, > >> > > >> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote: > >> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote: > >> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote: > >> >> >> I think I find a new issue, for integrity of this mail thread, I reply > >> >> >> to this mail. > >> >> >> > >> >> >> It is a concurrence issue either, when duplicate store and reclaim > >> >> >> concurrentlly. > >> >> >> > >> >> >> zswap entry x with offset A is already stored in zswap backend. > >> >> >> Consider the following scenario: > >> >> >> > >> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) > >> >> >> > >> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y. > >> >> >> store finished. shrink_page_list() call __remove_mapping(), and now > >> >> >> it is not in swap_cache > >> >> >> > >> >> > > >> >> > But I don't think swap layer will call zswap with the same offset A. > >> >> > >> >> 1. store page of offset A in zswap > >> >> 2. some time later, pagefault occur, load page data from zswap. > >> >> But notice that zswap entry x is still in zswap because it is not > >> >> frontswap_tmem_exclusive_gets_enabled. > >> > > >> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff > >> > between CPU burining by frequent swapout and memory footprint by duplicate > >> > copy in swap cache and frontswap backend so it shouldn't affect the stability. > >> > >> Thanks for explain this. > >> I don't mean to say this option affects the stability, but that zswap > >> only realize > >> one option. Maybe it's better to realize both options for different workloads. > > > > "zswap only relize one option" > > What does it mena? Sorry. I couldn't parse your intention. :) > > You mean zswap should do something special to support frontswap_tmem_exclusive_gets? > > Yes. But I am not sure whether it is worth. > > >> > >> >> this page is with PageSwapCache(page) and page_private(page) = entry.val > >> >> 3. change this page data, and it become dirty > >> > > >> > If non-shared swapin page become redirty, it should remove the page from > >> > swapcache. If shared swapin page become redirty, it should do CoW so it's a > >> > new page so that it doesn't live in swap cache. It means it should have new > >> > offset which is different with old's one for swap out. > >> > > >> > What's wrong with that? > >> > >> It is really not a right scene for duplicate store. And I can not think out one. > >> If duplicate store is impossible, How about delete the handle code in zswap? > >> If it does exist, I think there is a potential issue as I described. > > > > You mean "zswap_duplicate_entry"? > > AFAIR, I already had a question to Seth when zswap was born but AFAIRC, > > he said that he didn't know exact reason but he saw that case during > > experiement so copy the code peice from zcache. > > > > Do you see the case, too? > > Yes, I mean duplicate store. > I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores", > but I am still confused. It seems that there are two Minchan in LKML. Other Minchan, not me who have a horrible memory, already was first to figure it out a few month ago. https://lkml.org/lkml/2013/1/31/3 /me slaps self. I'd like to look into that issue more but now I don't have a time. Just FYI. ;-) -- Kind regards, Minchan Kim