* [PATCH v3 0/3] Fix zsmalloc crash problem
@ 2016-11-25  8:35 Minchan Kim
  2016-11-25  8:35 ` [PATCH v3 1/3] mm: support anonymous stable page Minchan Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Minchan Kim @ 2016-11-25  8:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Sergey Senozhatsky, Takashi Iwai, Hyeoncheol Lee,
	yjay.kim, Sangseok Lee, Hugh Dickins, Minchan Kim

While developing asynchronous writeback for zram swap, I found strange
corruption of compressed pages.

Modules linked in: zram(E)
CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G            E   4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff88007620b840 task.stack: ffff880078090000
RIP: 0010:[<ffffffff811d6f3d>]  [<ffffffff811d6f3d>] set_freeobj.part.43+0x1c/0x1f
RSP: 0018:ffff880078093ca8  EFLAGS: 00010246
RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
Stack:
 ffffea00016f10c0 ffff880078093d08 ffffffff811d43cb ffff88005bc43030
 ffff88005bc43030 0000000000000001 ffff88007aa17f68 ffff88007819d1b8
 000000000200020a 0000000002000200 ffff880076798d88 ffff88007aa17f68
Call Trace:
 [<ffffffff811d43cb>] obj_malloc+0x22b/0x260
 [<ffffffff811d4be4>] zs_malloc+0x1e4/0x580
 [<ffffffff81355490>] ? lz4_compress_crypto+0x30/0x50
 [<ffffffffa000269d>] zram_bvec_rw+0x4cd/0x830 [zram]
 [<ffffffffa000356c>] page_requests_rw+0x9c/0x130 [zram]
 [<ffffffffa0003600>] ? page_requests_rw+0x130/0x130 [zram]
 [<ffffffffa00036e6>] zram_thread+0xe6/0x173 [zram]
 [<ffffffff810b67e0>] ? wake_atomic_t_function+0x60/0x60
 [<ffffffff81094cfa>] kthread+0xca/0xe0
 [<ffffffff81094c30>] ? kthread_park+0x60/0x60
 [<ffffffff817ae775>] ret_from_fork+0x25/0x30

Investigation revealed that stable pages currently do not cover
anonymous pages.  IOW, reuse_swap_page() can reuse a page without
waiting for writeback completion, so it can overwrite a page zram is
still compressing.

Unfortunately, zram has used the per-cpu compression stream feature
since v4.7. It aims to increase the cache hit ratio of the scratch
buffer used for compression. The downside of that approach is that
zram must allocate the memory space for the compressed page in per-cpu
context, which requires a restricted gfp flag that can fail. If it
fails, zram retries the allocation outside of per-cpu context, where
it can get memory this time, compresses the data again, and copies it
into that memory space.

In this scenario, zram assumes the data never changes between the two
compressions, but that is not true without stable-page support. If
the data is changed under us, zram can overrun the buffer because the
second compression size can be bigger than the one we got in the
previous trial, and zram blindly copies the bigger object into the
smaller buffer. The overrun breaks zsmalloc's free-object chaining,
so the system crashes as shown above.
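
The racy flow, as a simplified sketch (paraphrased from the 4.8-era
zram_bvec_write(); names abbreviated and error handling dropped, so
treat it as illustration rather than the exact code):

compress_again:
	zcomp_compress(zstrm, uncmem, &clen);	/* first pass: clen = N */
	if (!handle)
		/* fast path in per-cpu context: direct reclaim is not
		 * allowed, so this allocation can fail */
		handle = zs_malloc(meta->mem_pool, clen,
				__GFP_KSWAPD_RECLAIM | __GFP_NOWARN |
				__GFP_HIGHMEM | __GFP_MOVABLE);
	if (!handle) {
		/* slow path: leave per-cpu context and retry with
		 * direct reclaim, still sized for the first clen (N) */
		zcomp_stream_put(zram->comp);
		handle = zs_malloc(meta->mem_pool, clen,
				GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
		if (handle)
			goto compress_again;	/* the page may have changed
						 * meanwhile, so the second
						 * clen can be bigger than N */
	}
	cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
	memcpy(cmem, src, clen);	/* copies the new clen bytes into an
					 * object sized for N: buffer overrun */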

I believe the report below is the same problem:
https://bugzilla.suse.com/show_bug.cgi?id=997574

This patchset fixes the problem.
[1/3] adds support for anonymous stable pages.
[2/3] is a preparation step for zram stable-write support.
[3/3] adds stable-write support to zram.

This patchset should go to stable for [4.7+].

Minchan Kim (3):
  [1/3] mm: support anonymous stable page
  [2/3] zram: revalidate disk under init_lock
  [3/3] zram: support BDI_CAP_STABLE_WRITES

 drivers/block/zram/zram_drv.c | 18 ++++++++++--------
 include/linux/swap.h          |  3 ++-
 mm/swapfile.c                 | 20 +++++++++++++++++++-
 3 files changed, 31 insertions(+), 10 deletions(-)

-- 
2.7.4


* [PATCH v3 1/3] mm: support anonymous stable page
  2016-11-25  8:35 [PATCH v3 0/3] Fix zsmalloc crash problem Minchan Kim
@ 2016-11-25  8:35 ` Minchan Kim
  2016-11-27 13:19   ` Sergey Senozhatsky
  2016-11-25  8:35 ` [PATCH v3 2/3] zram: revalidate disk under init_lock Minchan Kim
  2016-11-25  8:35 ` [PATCH v3 3/3] zram: support BDI_CAP_STABLE_WRITES Minchan Kim
  2 siblings, 1 reply; 11+ messages in thread
From: Minchan Kim @ 2016-11-25  8:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Sergey Senozhatsky, Takashi Iwai, Hyeoncheol Lee,
	yjay.kim, Sangseok Lee, Hugh Dickins, Minchan Kim, linux-mm,
	Darrick J . Wong, stable

While developing asynchronous writeback for zram swap, I found strange
corruption of compressed pages.

Modules linked in: zram(E)
CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G            E   4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff88007620b840 task.stack: ffff880078090000
RIP: 0010:[<ffffffff811d6f3d>]  [<ffffffff811d6f3d>] set_freeobj.part.43+0x1c/0x1f
RSP: 0018:ffff880078093ca8  EFLAGS: 00010246
RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
Stack:
 ffffea00016f10c0 ffff880078093d08 ffffffff811d43cb ffff88005bc43030
 ffff88005bc43030 0000000000000001 ffff88007aa17f68 ffff88007819d1b8
 000000000200020a 0000000002000200 ffff880076798d88 ffff88007aa17f68
Call Trace:
 [<ffffffff811d43cb>] obj_malloc+0x22b/0x260
 [<ffffffff811d4be4>] zs_malloc+0x1e4/0x580
 [<ffffffff81355490>] ? lz4_compress_crypto+0x30/0x50
 [<ffffffffa000269d>] zram_bvec_rw+0x4cd/0x830 [zram]
 [<ffffffffa000356c>] page_requests_rw+0x9c/0x130 [zram]
 [<ffffffffa0003600>] ? page_requests_rw+0x130/0x130 [zram]
 [<ffffffffa00036e6>] zram_thread+0xe6/0x173 [zram]
 [<ffffffff810b67e0>] ? wake_atomic_t_function+0x60/0x60
 [<ffffffff81094cfa>] kthread+0xca/0xe0
 [<ffffffff81094c30>] ? kthread_park+0x60/0x60
 [<ffffffff817ae775>] ret_from_fork+0x25/0x30

Investigation revealed that stable pages currently do not cover
anonymous pages.  IOW, reuse_swap_page() can reuse a page without
waiting for writeback completion, so it can overwrite a page zram is
still compressing.

Unfortunately, zram has used the per-cpu compression stream feature
since v4.7. It aims to increase the cache hit ratio of the scratch
buffer used for compression. The downside of that approach is that
zram must allocate the memory space for the compressed page in per-cpu
context, which requires a restricted gfp flag that can fail. If it
fails, zram retries the allocation outside of per-cpu context, where
it can get memory this time, compresses the data again, and copies it
into that memory space.

In this scenario, zram assumes the data never changes between the two
compressions, but that is not true without stable-page support. If
the data is changed under us, zram can overrun the buffer because the
second compression size can be bigger than the one we got in the
previous trial, and zram blindly copies the bigger object into the
smaller buffer. The overrun breaks zsmalloc's free-object chaining,
so the system crashes as shown above.

I believe the report below is the same problem:
https://bugzilla.suse.com/show_bug.cgi?id=997574

Unfortunately, reuse_swap_page() must be atomic, so we cannot wait on
writeback there. The approach in this patch is to simply return false
if we find the page needs stable writes.  Although this temporarily
increases the memory footprint, it happens rarely and the extra pages
should be reclaimed easily when it does happen.  It is also better
than waiting for I/O completion, which is on the critical path for
application latency.

Fixes: da9556a2367c ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: stable@vger.kernel.org
---
 include/linux/swap.h |  3 ++-
 mm/swapfile.c        | 20 +++++++++++++++++++-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 09f4be179ff3..7f47b7098b1b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -150,8 +150,9 @@ enum {
 	SWP_FILE	= (1 << 7),	/* set after swap_activate success */
 	SWP_AREA_DISCARD = (1 << 8),	/* single-time swap area discards */
 	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
+	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 10),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 11),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f30438970cd1..d76b2a18f044 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -943,11 +943,25 @@ bool reuse_swap_page(struct page *page, int *total_mapcount)
 	count = page_trans_huge_mapcount(page, total_mapcount);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
-		if (count == 1 && !PageWriteback(page)) {
+		if (count != 1)
+			goto out;
+		if (!PageWriteback(page)) {
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
+		} else {
+			swp_entry_t entry;
+			struct swap_info_struct *p;
+
+			entry.val = page_private(page);
+			p = swap_info_get(entry);
+			if (p->flags & SWP_STABLE_WRITES) {
+				spin_unlock(&p->lock);
+				return false;
+			}
+			spin_unlock(&p->lock);
 		}
 	}
+out:
 	return count <= 1;
 }
 
@@ -2449,6 +2463,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		error = -ENOMEM;
 		goto bad_swap;
 	}
+
+	if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
+		p->flags |= SWP_STABLE_WRITES;
+
 	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
 		int cpu;
 
-- 
2.7.4


* [PATCH v3 2/3] zram: revalidate disk under init_lock
  2016-11-25  8:35 [PATCH v3 0/3] Fix zsmalloc crash problem Minchan Kim
  2016-11-25  8:35 ` [PATCH v3 1/3] mm: support anonymous stable page Minchan Kim
@ 2016-11-25  8:35 ` Minchan Kim
  2016-11-26  6:38   ` Sergey Senozhatsky
  2016-11-25  8:35 ` [PATCH v3 3/3] zram: support BDI_CAP_STABLE_WRITES Minchan Kim
  2 siblings, 1 reply; 11+ messages in thread
From: Minchan Kim @ 2016-11-25  8:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Sergey Senozhatsky, Takashi Iwai, Hyeoncheol Lee,
	yjay.kim, Sangseok Lee, Hugh Dickins, Minchan Kim, stable

[1] moved the revalidate_disk() call out of init_lock to avoid a
false-positive lockdep splat. However, [2] removed init_lock from the
I/O path, so there is no longer any worry about a lockdep splat, and
we can restore the original ordering. The next patch needs this so
that it can set BDI_CAP_STABLE_WRITES atomically with the disk
revalidation.

[1] b4c5c60920e3: zram: avoid lockdep splat by revalidate_disk
[2] 08eee69fcf6b: zram: remove init_lock in zram_make_request

Fixes: da9556a2367c ("zram: user per-cpu compression streams")
Cc: stable@vger.kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 5163c8f918cb..d93a4b2135c2 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1094,14 +1094,8 @@ static ssize_t disksize_store(struct device *dev,
 	zram->comp = comp;
 	zram->disksize = disksize;
 	set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT);
-	up_write(&zram->init_lock);
-
-	/*
-	 * Revalidate disk out of the init_lock to avoid lockdep splat.
-	 * It's okay because disk's capacity is protected by init_lock
-	 * so that revalidate_disk always sees up-to-date capacity.
-	 */
 	revalidate_disk(zram->disk);
+	up_write(&zram->init_lock);
 
 	return len;
 
-- 
2.7.4


* [PATCH v3 3/3] zram: support BDI_CAP_STABLE_WRITES
  2016-11-25  8:35 [PATCH v3 0/3] Fix zsmalloc crash problem Minchan Kim
  2016-11-25  8:35 ` [PATCH v3 1/3] mm: support anonymous stable page Minchan Kim
  2016-11-25  8:35 ` [PATCH v3 2/3] zram: revalidate disk under init_lock Minchan Kim
@ 2016-11-25  8:35 ` Minchan Kim
  2016-11-26  6:37   ` Sergey Senozhatsky
  2 siblings, 1 reply; 11+ messages in thread
From: Minchan Kim @ 2016-11-25  8:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Sergey Senozhatsky, Takashi Iwai, Hyeoncheol Lee,
	yjay.kim, Sangseok Lee, Hugh Dickins, Minchan Kim, stable

zram has used the per-cpu compression stream feature since v4.7. It
aims to increase the cache hit ratio of the scratch buffer used for
compression. The downside of that approach is that zram must allocate
the memory space for the compressed page in per-cpu context, which
requires a restricted gfp flag that can fail. If it fails, zram
retries the allocation outside of per-cpu context, where it can get
memory this time, compresses the data again, and copies it into that
memory space.

In this scenario, zram assumes the data never changes between the two
compressions, but that is not true without stable-page support. If
the data is changed under us, zram can overrun the buffer, which
breaks the zsmalloc free-object chain and crashes the system, as in
https://bugzilla.suse.com/show_bug.cgi?id=997574

This patch adds BDI_CAP_STABLE_WRITES to zram to declare "I am a
block device that needs *stable writes*".

Fixes: da9556a2367c ("zram: user per-cpu compression streams")
Cc: stable@vger.kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d93a4b2135c2..9d7f83f9a388 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -30,6 +30,7 @@
 #include <linux/err.h>
 #include <linux/idr.h>
 #include <linux/sysfs.h>
+#include <linux/backing-dev.h>
 
 #include "zram_drv.h"
 
@@ -111,6 +112,13 @@ static inline bool is_partial_io(struct bio_vec *bvec)
 	return bvec->bv_len != PAGE_SIZE;
 }
 
+static void zram_revalidate_disk(struct zram *zram)
+{
+	revalidate_disk(zram->disk);
+	zram->disk->queue->backing_dev_info.capabilities |=
+		BDI_CAP_STABLE_WRITES;
+}
+
 /*
  * Check if request is within bounds and aligned on zram logical blocks.
  */
@@ -1094,7 +1102,7 @@ static ssize_t disksize_store(struct device *dev,
 	zram->comp = comp;
 	zram->disksize = disksize;
 	set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT);
-	revalidate_disk(zram->disk);
+	zram_revalidate_disk(zram);
 	up_write(&zram->init_lock);
 
 	return len;
@@ -1142,7 +1150,7 @@ static ssize_t reset_store(struct device *dev,
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
 	zram_reset_device(zram);
-	revalidate_disk(zram->disk);
+	zram_revalidate_disk(zram);
 	bdput(bdev);
 
 	mutex_lock(&bdev->bd_mutex);
-- 
2.7.4


* Re: [PATCH v3 3/3] zram: support BDI_CAP_STABLE_WRITES
  2016-11-25  8:35 ` [PATCH v3 3/3] zram: support BDI_CAP_STABLE_WRITES Minchan Kim
@ 2016-11-26  6:37   ` Sergey Senozhatsky
  2016-11-26 14:41     ` Minchan Kim
  0 siblings, 1 reply; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-11-26  6:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Sergey Senozhatsky, Takashi Iwai,
	Hyeoncheol Lee, yjay.kim, Sangseok Lee, Hugh Dickins, stable

Hello Minchan,

On (11/25/16 17:35), Minchan Kim wrote:
[..]
> +static void zram_revalidate_disk(struct zram *zram)
> +{
> +	revalidate_disk(zram->disk);
> +	zram->disk->queue->backing_dev_info.capabilities |=
> +		BDI_CAP_STABLE_WRITES;
> +}
> +
>  /*
>   * Check if request is within bounds and aligned on zram logical blocks.
>   */
> @@ -1094,7 +1102,7 @@ static ssize_t disksize_store(struct device *dev,
>  	zram->comp = comp;
>  	zram->disksize = disksize;
>  	set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT);
> -	revalidate_disk(zram->disk);
> +	zram_revalidate_disk(zram);
>  	up_write(&zram->init_lock);
>  
>  	return len;
> @@ -1142,7 +1150,7 @@ static ssize_t reset_store(struct device *dev,
>  	/* Make sure all the pending I/O are finished */
>  	fsync_bdev(bdev);
>  	zram_reset_device(zram);
> -	revalidate_disk(zram->disk);
> +	zram_revalidate_disk(zram);
>  	bdput(bdev);
>  
>  	mutex_lock(&bdev->bd_mutex);

why not set it just once, in zram_add(), when we allocate the
queue/disk and configure both of them?

	queue->backing_dev_info.capabilities |= BDI_CAP_CGROUP_WRITEBACK;

	-ss


* Re: [PATCH v3 2/3] zram: revalidate disk under init_lock
  2016-11-25  8:35 ` [PATCH v3 2/3] zram: revalidate disk under init_lock Minchan Kim
@ 2016-11-26  6:38   ` Sergey Senozhatsky
  0 siblings, 0 replies; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-11-26  6:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Sergey Senozhatsky, Takashi Iwai,
	Hyeoncheol Lee, yjay.kim, Sangseok Lee, Hugh Dickins, stable


Hi,

On (11/25/16 17:35), Minchan Kim wrote:
> [1] moved revalidate_disk call out of init_lock to avoid lockdep
> false-positive splat. However, [2] remove init_lock in IO path
> so there is no worry about lockdep splat. So, let's restore it.
> This patch need to set BDI_CAP_STABLE_WRITES atomically in
> next patch.

can we break that dependency on the next patch if we set
BDI_CAP_STABLE_WRITES when we allocate the queue?

	queue->backing_dev_info.capabilities |= BDI_CAP_CGROUP_WRITEBACK;

	-ss


* Re: [PATCH v3 3/3] zram: support BDI_CAP_STABLE_WRITES
  2016-11-26  6:37   ` Sergey Senozhatsky
@ 2016-11-26 14:41     ` Minchan Kim
  2016-11-27 13:01       ` Sergey Senozhatsky
  0 siblings, 1 reply; 11+ messages in thread
From: Minchan Kim @ 2016-11-26 14:41 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Minchan Kim, Andrew Morton, linux-kernel, Takashi Iwai,
	Hyeoncheol Lee, yjay.kim, Sangseok Lee, Hugh Dickins, stable

Hi Sergey,

On Sat, Nov 26, 2016 at 03:37:01PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (11/25/16 17:35), Minchan Kim wrote:
> [..]
> > +static void zram_revalidate_disk(struct zram *zram)
> > +{
> > +	revalidate_disk(zram->disk);
> > +	zram->disk->queue->backing_dev_info.capabilities |=
> > +		BDI_CAP_STABLE_WRITES;
> > +}
> > +
> >  /*
> >   * Check if request is within bounds and aligned on zram logical blocks.
> >   */
> > @@ -1094,7 +1102,7 @@ static ssize_t disksize_store(struct device *dev,
> >  	zram->comp = comp;
> >  	zram->disksize = disksize;
> >  	set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT);
> > -	revalidate_disk(zram->disk);
> > +	zram_revalidate_disk(zram);
> >  	up_write(&zram->init_lock);
> >  
> >  	return len;
> > @@ -1142,7 +1150,7 @@ static ssize_t reset_store(struct device *dev,
> >  	/* Make sure all the pending I/O are finished */
> >  	fsync_bdev(bdev);
> >  	zram_reset_device(zram);
> > -	revalidate_disk(zram->disk);
> > +	zram_revalidate_disk(zram);
> >  	bdput(bdev);
> >  
> >  	mutex_lock(&bdev->bd_mutex);
> 
> why not set it just once, when we allocate queue/disk and configure both
> of them:  in zram_add()

I should have mentioned the reason:
revalidate_disk() resets BDI_CAP_STABLE_WRITES.

Thanks.


* Re: [PATCH v3 3/3] zram: support BDI_CAP_STABLE_WRITES
  2016-11-26 14:41     ` Minchan Kim
@ 2016-11-27 13:01       ` Sergey Senozhatsky
  0 siblings, 0 replies; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-11-27 13:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Takashi Iwai,
	Hyeoncheol Lee, yjay.kim, Sangseok Lee, Hugh Dickins, stable

On (11/26/16 23:41), Minchan Kim wrote:
[..]
> > >  	mutex_lock(&bdev->bd_mutex);
> > 
> > why not set it just once, when we allocate queue/disk and configure both
> > of them:  in zram_add()
> 
> I should have mentioned the reason.
> The revalidate_disk reset the BDI_CAP_STABLE_WRITES.

aha. either sets or clears it in blk_integrity_revalidate(),
now I see it.
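
for reference, that function looks roughly like this (quoted from
memory, block/blk-integrity.c around v4.8/v4.9, so approximate):

	void blk_integrity_revalidate(struct gendisk *disk)
	{
		struct blk_integrity *bi = &disk->queue->integrity;

		if (!(disk->flags & GENHD_FL_UP))
			return;

		if (bi->profile)
			disk->queue->backing_dev_info.capabilities |=
				BDI_CAP_STABLE_WRITES;
		else
			disk->queue->backing_dev_info.capabilities &=
				~BDI_CAP_STABLE_WRITES;
	}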

	-ss


* Re: [PATCH v3 1/3] mm: support anonymous stable page
  2016-11-25  8:35 ` [PATCH v3 1/3] mm: support anonymous stable page Minchan Kim
@ 2016-11-27 13:19   ` Sergey Senozhatsky
  2016-11-28  0:41     ` Minchan Kim
  0 siblings, 1 reply; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-11-27 13:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Sergey Senozhatsky, Takashi Iwai,
	Hyeoncheol Lee, yjay.kim, Sangseok Lee, Hugh Dickins, linux-mm,
	Darrick J . Wong, stable, Sergey Senozhatsky

Hi,

On (11/25/16 17:35), Minchan Kim wrote:
[..]
> Unfortunately, zram has used the per-cpu compression stream feature
> since v4.7. It aims to increase the cache hit ratio of the scratch
> buffer used for compression. The downside of that approach is that
> zram must allocate the memory space for the compressed page in per-cpu
> context, which requires a restricted gfp flag that can fail. If it
> fails, zram retries the allocation outside of per-cpu context, where
> it can get memory this time, compresses the data again, and copies it
> into that memory space.
> 
> In this scenario, zram assumes the data never changes between the two
> compressions, but that is not true without stable-page support. If
> the data is changed under us, zram can overrun the buffer because the
> second compression size can be bigger than the one we got in the
> previous trial, and zram blindly copies the bigger object into the
> smaller buffer. The overrun breaks zsmalloc's free-object chaining,
> so the system crashes as shown above.

very interesting find! didn't see this coming.

> Unfortunately, reuse_swap_page() must be atomic, so we cannot wait on
> writeback there. The approach in this patch is to simply return false
> if we find the page needs stable writes.  Although this temporarily
> increases the memory footprint, it happens rarely and the extra pages
> should be reclaimed easily when it does happen.  It is also better
> than waiting for I/O completion, which is on the critical path for
> application latency.

wondering - how many pages can it hold? we are in low memory, that's
why we failed to zsmalloc in the fast path, so how likely is this to
worsen memory pressure? just asking. in async zram the window between
zram_rw_page() and the actual write of a page is even bigger, isn't it?

we *probably* and *maybe* can try to handle it in zram:

-- store the previous clen before re-compression
-- check if the new clen > saved_clen and if it is - we can't use the
   previously allocated handle and need to allocate a new one again. if
   it's less than or equal to the saved one - store the object (wasting
   some space, yes. but we are in low mem). see the rough sketch after
   this list.

-- we maybe can also try harder in zsmalloc. once we detect that
   zsmalloc has failed, we can declare an emergency and store objects
   of size X in higher classes (assuming that there is a bigger size
   class available with an allocated and unused object).
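
a very rough, untested sketch of that first idea (saved_clen is a
hypothetical name; the rest follows the zram_bvec_write() retry flow):

	saved_clen = clen;		/* clen from the first compression */
	handle = zs_malloc(meta->mem_pool, clen,
			GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
	if (handle) {
		zcomp_compress(zstrm, uncmem, &clen);	/* second pass */
		if (clen > saved_clen) {
			/* the page changed and grew: the handle we hold
			 * is too small, free it and start over */
			zs_free(meta->mem_pool, handle);
			handle = 0;
			goto compress_again;
		}
		/* clen <= saved_clen: safe to store, wasting up to
		 * (saved_clen - clen) bytes */
	}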

	-ss


* Re: [PATCH v3 1/3] mm: support anonymous stable page
  2016-11-27 13:19   ` Sergey Senozhatsky
@ 2016-11-28  0:41     ` Minchan Kim
  2016-11-28  5:38       ` Sergey Senozhatsky
  0 siblings, 1 reply; 11+ messages in thread
From: Minchan Kim @ 2016-11-28  0:41 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Takashi Iwai, Hyeoncheol Lee,
	yjay.kim, Sangseok Lee, Hugh Dickins, linux-mm, Darrick J . Wong,
	stable, Sergey Senozhatsky

Hi Sergey,

I'm going on a long vacation, so forgive me if I respond slowly. :)

On Sun, Nov 27, 2016 at 10:19:10PM +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> On (11/25/16 17:35), Minchan Kim wrote:
> [..]
> > Unfortunately, zram has used the per-cpu compression stream feature
> > since v4.7. It aims to increase the cache hit ratio of the scratch
> > buffer used for compression. The downside of that approach is that
> > zram must allocate the memory space for the compressed page in
> > per-cpu context, which requires a restricted gfp flag that can fail.
> > If it fails, zram retries the allocation outside of per-cpu context,
> > where it can get memory this time, compresses the data again, and
> > copies it into that memory space.
> > 
> > In this scenario, zram assumes the data never changes between the
> > two compressions, but that is not true without stable-page support.
> > If the data is changed under us, zram can overrun the buffer because
> > the second compression size can be bigger than the one we got in the
> > previous trial, and zram blindly copies the bigger object into the
> > smaller buffer. The overrun breaks zsmalloc's free-object chaining,
> > so the system crashes as shown above.
> 
> very interesting find! didn't see this coming.
> 
> > Unfortunately, reuse_swap_page() must be atomic, so we cannot wait
> > on writeback there. The approach in this patch is to simply return
> > false if we find the page needs stable writes.  Although this
> > temporarily increases the memory footprint, it happens rarely and
> > the extra pages should be reclaimed easily when it does happen.  It
> > is also better than waiting for I/O completion, which is on the
> > critical path for application latency.
> 
> > wondering - how many pages can it hold? we are in low memory, that's
> > why we failed to zsmalloc in the fast path, so how likely is this to
> > worsen memory pressure?

Actually, I don't have a real number to give, but one thing I can say
for sure is that it's really hard to hit in the normal stress tests I
have run until now. That's why it took a long time to find (i.e., I
would encounter the bug once every two days). But once I understood
the problem, I could reproduce it in 15 minutes.

About memory pressure: my testing was already under severe memory
pressure (i.e., many allocation failures and frequent OOM kills), so
it doesn't make any meaningful difference before and after.

> just asking. in async zram the window between zram_rw_page() and the
> actual write of a page is even bigger, isn't it?

Yes. That's why I found the problem with that feature enabled. Lucky. ;)

> 
> we *probably* and *maybe* can try to handle it in zram:
> 
> -- store the previous clen before re-compression
> -- check if the new clen > saved_clen and if it is - we can't use the
>    previously allocated handle and need to allocate a new one again.
>    if it's less than or equal to the saved one - store the object
>    (wasting some space, yes. but we are in low mem).

It was my first attempt but I changed my mind. It can protect against
the crash, but broken data could still go to the disk (i.e., zram).
If someone wants to read the block directly (e.g., open /dev/zram0
and read it, or via DIO), they can never read the data until someone
writes stable data into those sectors; instead, they will see many
decompression failure messages. It's weird.

I believe the stable-page problem should be solved by the generic
layer, not by the driver itself.

> 
> -- we maybe can also try harder in zsmalloc. once we detect that
>    zsmalloc has failed, we can declare an emergency and store objects
>    of size X in higher classes (assuming that there is a bigger size
>    class available with an allocated and unused object).

It cannot solve the problem I mentioned above either, and I don't want
to make zram more complicated to solve that problem. :(



> 
> 	-ss


* Re: [PATCH v3 1/3] mm: support anonymous stable page
  2016-11-28  0:41     ` Minchan Kim
@ 2016-11-28  5:38       ` Sergey Senozhatsky
  0 siblings, 0 replies; 11+ messages in thread
From: Sergey Senozhatsky @ 2016-11-28  5:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Takashi Iwai,
	Hyeoncheol Lee, yjay.kim, Sangseok Lee, Hugh Dickins, linux-mm,
	Darrick J . Wong, stable, Sergey Senozhatsky

Hello,

On (11/28/16 09:41), Minchan Kim wrote:
> 
> I'm going on a long vacation so forgive if I respond slowly. :)

no prob. have a good one!


> On Sun, Nov 27, 2016 at 10:19:10PM +0900, Sergey Senozhatsky wrote:
[..]
> > wondering - how many pages can it hold? we are in low memory, that's
> > why we failed to zsmalloc in the fast path, so how likely is this to
> > worsen memory pressure?
> 
> Actually, I don't have a real number to give, but one thing I can say
> for sure is that it's really hard to hit in the normal stress tests I
> have run until now. That's why it took a long time to find (i.e., I
> would encounter the bug once every two days). But once I understood
> the problem, I could reproduce it in 15 minutes.
> 
> About memory pressure: my testing was already under severe memory
> pressure (i.e., many allocation failures and frequent OOM kills), so
> it doesn't make any meaningful difference before and after.
> 
> > just asking. in async zram the window between zram_rw_page() and the
> > actual write of a page is even bigger, isn't it?
> 
> Yes. That's why I found the problem with that feature enabled. Lucky. ;)

I see. just curious, the worst case is deflate compression (which can
be 8-9x slower than lz4) and sync zram, right? are we speaking
of megabytes here?

> > we *probably* and *maybe* can try to handle it in zram:
> > 
> > -- store the previous clen before re-compression
> > -- check if the new clen > saved_clen and if it is - we can't use the
> >    previously allocated handle and need to allocate a new one again.
> >    if it's less than or equal to the saved one - store the object
> >    (wasting some space, yes. but we are in low mem).
> 
> It was my first attempt but I changed my mind. It can protect against
> the crash, but broken data could still go to the disk (i.e., zram).
> If someone wants to read the block directly (e.g., open /dev/zram0
> and read it, or via DIO), they can never read the data until someone
> writes stable data into those sectors; instead, they will see many
> decompression failure messages. It's weird.
>
> I believe the stable-page problem should be solved by the generic
> layer, not by the driver itself.

yeah, I'm fine with this. at the same time, there is a chance, I
suspect, that we may see some 'regressions'. well, just maybe.

	-ss

