Linux block layer


* Re: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Martin K. Petersen @ 2017-04-05 11:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg, linux-block, linux-scsi, drbd-dev, dm-devel,
	linux-raid
In-Reply-To: <20170331163313.31821-1-hch@lst.de>

Christoph Hellwig <hch@lst.de> writes:

Christoph,

> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.

Very, very nice. I think this is the correct approach.

I'm going to send two follow-up patches that allow us to use UNMAP for
discards and WRITE SAME w/ UNMAP for zeroout. That appears to be the
preferred configuration for most storage devices.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
From: Andrey Ryabinin @ 2017-04-05 11:40 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman, Johannes Weiner,
	linux-block, nbd-general, open-iscsi, linux-scsi, netdev, stable
In-Reply-To: <20170405074700.29871-2-vbabka@suse.cz>

On 04/05/2017 10:46 AM, Vlastimil Babka wrote:
> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
> deadlock during page migration by lock_page() (see the comment in
> __unmap_and_move()). Then it unconditionally clears the flag, which can clear a
> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
> compaction handling in slowpath"), because direct compaction was called only
> after direct reclaim, which was skipped when PF_MEMALLOC flag was set.
> 
> Even now it's only a theoretical issue, as the new callsite of
> __alloc_pages_direct_compact() is reached only for costly orders and when
> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
                           is false			

> gfp_flags or in_interrupt() is true. There is no such known context, but let's
> play it safe and make __alloc_pages_direct_compact() robust for cases where
> PF_MEMALLOC is already set.
> 
> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
> Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589f8be53be..b84e6ffbe756 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		enum compact_priority prio, enum compact_result *compact_result)
>  {
>  	struct page *page;
> +	unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
>  
>  	if (!order)
>  		return NULL;
> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	current->flags |= PF_MEMALLOC;
>  	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>  									prio);
> -	current->flags &= ~PF_MEMALLOC;
> +	current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;

Perhaps this would look better:

	tsk_restore_flags(current, noreclaim_flag, PF_MEMALLOC);

?
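
For comparison, here is a minimal userspace sketch of the two restore forms, not kernel code. The tsk_restore_flags() body follows the kernel helper as I understand it; struct task_model, PF_OTHER, restore_open_coded() and restores_agree() are hypothetical names used only for this illustration, and the flag values are illustrative.

```c
#include <assert.h>

#define PF_MEMALLOC 0x0800u /* illustrative value for this model */
#define PF_OTHER    0x0001u /* hypothetical unrelated flag */

/* Stand-in for the part of task_struct that matters here. */
struct task_model {
	unsigned int flags;
};

/* Restore only the bits selected by 'flags' to their state in 'orig_flags'. */
static void tsk_restore_flags(struct task_model *task,
			      unsigned long orig_flags, unsigned long flags)
{
	task->flags &= ~flags;
	task->flags |= orig_flags & flags;
}

/* The open-coded form used in the patch. */
static void restore_open_coded(struct task_model *task,
			       unsigned int noreclaim_flag)
{
	task->flags = (task->flags & ~PF_MEMALLOC) | noreclaim_flag;
}

/* Returns 1 if both forms leave the same flags word behind. */
static int restores_agree(unsigned int initial)
{
	struct task_model a = { initial };
	struct task_model b = { initial };
	unsigned int saved = initial & PF_MEMALLOC;

	/* Simulate the critical section; PF_OTHER changes meanwhile. */
	a.flags |= PF_MEMALLOC | PF_OTHER;
	b.flags |= PF_MEMALLOC | PF_OTHER;

	tsk_restore_flags(&a, saved, PF_MEMALLOC);
	restore_open_coded(&b, saved);

	/* Both must keep PF_OTHER and restore PF_MEMALLOC to 'saved'. */
	assert((a.flags & PF_OTHER) && (b.flags & PF_OTHER));
	return a.flags == b.flags;
}
```

In this model the two forms are interchangeable as long as the saved value contains only the PF_MEMALLOC bit, which is how noreclaim_flag is computed in the patch.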

>  	if (*compact_result <= COMPACT_INACTIVE)
>  		return NULL;
> 


* Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()
From: Vlastimil Babka @ 2017-04-05 11:39 UTC (permalink / raw)
  To: Richard Weinberger, Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner, linux-block, nbd-general, open-iscsi, linux-scsi,
	netdev, Boris Brezillon, Adrian Hunter
In-Reply-To: <ee6649ed-b0e8-1c59-c193-d1688fdfe7f5@nod.at>

On 04/05/2017 01:36 PM, Richard Weinberger wrote:
> Michal,
> 
> Am 05.04.2017 um 13:31 schrieb Michal Hocko:
>> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
>>> Nandsim has own functions set_memalloc() and clear_memalloc() for robust
>>> setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers.
>>> No functional change.
>>
>> This one smells like an abuser. Why the hell should read/write path
>> touch memory reserves at all!
> 
> Could be. Let's ask Adrian, AFAIK he wrote that code.
> Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?

I was thinking about it and concluded that, since the simulator can be
used as a block device that reclaimed pages are written to, writing the
data out is a memalloc operation. Reading can be called as part of a
read-modify-write cycle, so the same applies to reading. But it would be
great if somebody more knowledgeable confirmed this.

> Thanks,
> //richard
> 


* Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()
From: Richard Weinberger @ 2017-04-05 11:36 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner, linux-block, nbd-general, open-iscsi, linux-scsi,
	netdev, Boris Brezillon, Adrian Hunter
In-Reply-To: <20170405113157.GM6035@dhcp22.suse.cz>

Michal,

Am 05.04.2017 um 13:31 schrieb Michal Hocko:
> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
>> Nandsim has own functions set_memalloc() and clear_memalloc() for robust
>> setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers.
>> No functional change.
> 
> This one smells like an abuser. Why the hell should read/write path
> touch memory reserves at all!

Could be. Let's ask Adrian, AFAIK he wrote that code.
Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?

Thanks,
//richard


* Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()
From: Michal Hocko @ 2017-04-05 11:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner, linux-block, nbd-general, open-iscsi, linux-scsi,
	netdev, Boris Brezillon, Richard Weinberger
In-Reply-To: <20170405074700.29871-5-vbabka@suse.cz>

On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
> Nandsim has own functions set_memalloc() and clear_memalloc() for robust
> setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers.
> No functional change.

This one smells like an abuser. Why the hell should read/write path
touch memory reserves at all!

> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
> Cc: Richard Weinberger <richard@nod.at>
> ---
>  drivers/mtd/nand/nandsim.c | 29 +++++++++--------------------
>  1 file changed, 9 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
> index cef818f535ed..03a0d057bf2f 100644
> --- a/drivers/mtd/nand/nandsim.c
> +++ b/drivers/mtd/nand/nandsim.c
> @@ -40,6 +40,7 @@
>  #include <linux/list.h>
>  #include <linux/random.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/fs.h>
>  #include <linux/pagemap.h>
>  #include <linux/seq_file.h>
> @@ -1368,31 +1369,18 @@ static int get_pages(struct nandsim *ns, struct file *file, size_t count, loff_t
>  	return 0;
>  }
>  
> -static int set_memalloc(void)
> -{
> -	if (current->flags & PF_MEMALLOC)
> -		return 0;
> -	current->flags |= PF_MEMALLOC;
> -	return 1;
> -}
> -
> -static void clear_memalloc(int memalloc)
> -{
> -	if (memalloc)
> -		current->flags &= ~PF_MEMALLOC;
> -}
> -
>  static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos)
>  {
>  	ssize_t tx;
> -	int err, memalloc;
> +	int err;
> +	unsigned int noreclaim_flag;
>  
>  	err = get_pages(ns, file, count, pos);
>  	if (err)
>  		return err;
> -	memalloc = set_memalloc();
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	tx = kernel_read(file, pos, buf, count);
> -	clear_memalloc(memalloc);
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  	put_pages(ns);
>  	return tx;
>  }
> @@ -1400,14 +1388,15 @@ static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_
>  static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos)
>  {
>  	ssize_t tx;
> -	int err, memalloc;
> +	int err;
> +	unsigned int noreclaim_flag;
>  
>  	err = get_pages(ns, file, count, pos);
>  	if (err)
>  		return err;
> -	memalloc = set_memalloc();
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	tx = kernel_write(file, buf, count, pos);
> -	clear_memalloc(memalloc);
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  	put_pages(ns);
>  	return tx;
>  }
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers
From: Michal Hocko @ 2017-04-05 11:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner, linux-block, nbd-general, open-iscsi, linux-scsi,
	netdev, Josef Bacik, Lee Duncan, Chris Leech, David S. Miller,
	Eric Dumazet
In-Reply-To: <20170405074700.29871-4-vbabka@suse.cz>

On Wed 05-04-17 09:46:59, Vlastimil Babka wrote:
> We now have memalloc_noreclaim_{save,restore} helpers for robust setting and
> clearing of PF_MEMALLOC. Let's convert the code which was using the generic
> tsk_restore_flags(). No functional change.

It would be really great to revisit why those places outside of the mm
proper really need this flag. I know this is a painful exercise but I
wouldn't be surprised if there were abusers there.

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: Lee Duncan <lduncan@suse.com>
> Cc: Chris Leech <cleech@redhat.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  drivers/block/nbd.c      | 7 ++++---
>  drivers/scsi/iscsi_tcp.c | 7 ++++---
>  net/core/dev.c           | 7 ++++---
>  net/core/sock.c          | 7 ++++---
>  4 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 03ae72985c79..929fc548c7fb 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -18,6 +18,7 @@
>  #include <linux/module.h>
>  #include <linux/init.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/fs.h>
>  #include <linux/bio.h>
>  #include <linux/stat.h>
> @@ -210,7 +211,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
>  	struct socket *sock = nbd->socks[index]->sock;
>  	int result;
>  	struct msghdr msg;
> -	unsigned long pflags = current->flags;
> +	unsigned int noreclaim_flag;
>  
>  	if (unlikely(!sock)) {
>  		dev_err_ratelimited(disk_to_dev(nbd->disk),
> @@ -221,7 +222,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
>  
>  	msg.msg_iter = *iter;
>  
> -	current->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	do {
>  		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
>  		msg.msg_name = NULL;
> @@ -244,7 +245,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
>  			*sent += result;
>  	} while (msg_data_left(&msg));
>  
> -	tsk_restore_flags(current, pflags, PF_MEMALLOC);
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  
>  	return result;
>  }
> diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
> index 4228aba1f654..4842fc0e809d 100644
> --- a/drivers/scsi/iscsi_tcp.c
> +++ b/drivers/scsi/iscsi_tcp.c
> @@ -30,6 +30,7 @@
>  #include <linux/types.h>
>  #include <linux/inet.h>
>  #include <linux/slab.h>
> +#include <linux/sched/mm.h>
>  #include <linux/file.h>
>  #include <linux/blkdev.h>
>  #include <linux/delay.h>
> @@ -371,10 +372,10 @@ static inline int iscsi_sw_tcp_xmit_qlen(struct iscsi_conn *conn)
>  static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
>  {
>  	struct iscsi_conn *conn = task->conn;
> -	unsigned long pflags = current->flags;
> +	unsigned int noreclaim_flag;
>  	int rc = 0;
>  
> -	current->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  
>  	while (iscsi_sw_tcp_xmit_qlen(conn)) {
>  		rc = iscsi_sw_tcp_xmit(conn);
> @@ -387,7 +388,7 @@ static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
>  		rc = 0;
>  	}
>  
> -	tsk_restore_flags(current, pflags, PF_MEMALLOC);
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  	return rc;
>  }
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fde8b3f7136b..e0705a126b24 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -81,6 +81,7 @@
>  #include <linux/hash.h>
>  #include <linux/slab.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/mutex.h>
>  #include <linux/string.h>
>  #include <linux/mm.h>
> @@ -4227,7 +4228,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>  	int ret;
>  
>  	if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
> -		unsigned long pflags = current->flags;
> +		unsigned int noreclaim_flag;
>  
>  		/*
>  		 * PFMEMALLOC skbs are special, they should
> @@ -4238,9 +4239,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>  		 * Use PF_MEMALLOC as this saves us from propagating the allocation
>  		 * context down to all allocation sites.
>  		 */
> -		current->flags |= PF_MEMALLOC;
> +		noreclaim_flag = memalloc_noreclaim_save();
>  		ret = __netif_receive_skb_core(skb, true);
> -		tsk_restore_flags(current, pflags, PF_MEMALLOC);
> +		memalloc_noreclaim_restore(noreclaim_flag);
>  	} else
>  		ret = __netif_receive_skb_core(skb, false);
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 392f9b6f96e2..0b2d06b4c308 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -102,6 +102,7 @@
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/timer.h>
>  #include <linux/string.h>
>  #include <linux/sockios.h>
> @@ -372,14 +373,14 @@ EXPORT_SYMBOL_GPL(sk_clear_memalloc);
>  int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
>  {
>  	int ret;
> -	unsigned long pflags = current->flags;
> +	unsigned int noreclaim_flag;
>  
>  	/* these should have been dropped before queueing */
>  	BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
>  
> -	current->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	ret = sk->sk_backlog_rcv(sk, skb);
> -	tsk_restore_flags(current, pflags, PF_MEMALLOC);
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  
>  	return ret;
>  }
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/4] mm: introduce memalloc_noreclaim_{save,restore}
From: Michal Hocko @ 2017-04-05 11:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner, linux-block, nbd-general, open-iscsi, linux-scsi,
	netdev
In-Reply-To: <20170405074700.29871-3-vbabka@suse.cz>

On Wed 05-04-17 09:46:58, Vlastimil Babka wrote:
> The previous patch has shown that simply setting and clearing PF_MEMALLOC in
> current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag
> and potentially lead to recursive reclaim. Let's introduce helpers that support
> proper nesting by saving the previous state of the flag, similar to the existing
> memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing
> of PF_MEMALLOC within mm to the new helpers.
> 
> There are no known issues with the converted code, but the change makes it more
> robust.
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

One could argue that tsk_restore_flags() could be extended to provide
tsk_set_flags() and use it for all allocation-related PF flags. I do not
have a strong opinion on that, but an explicit API sounds a bit better to
me because it is easier to follow (at least for me). If others think that
a generic API would be better, then I won't have any objections. Anyway,
this looks good to me.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/sched/mm.h | 12 ++++++++++++
>  mm/page_alloc.c          | 11 ++++++-----
>  mm/vmscan.c              | 17 +++++++++++------
>  3 files changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 9daabe138c99..2b24a6974847 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -191,4 +191,16 @@ static inline void memalloc_nofs_restore(unsigned int flags)
>  	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
>  }
>  
> +static inline unsigned int memalloc_noreclaim_save(void)
> +{
> +	unsigned int flags = current->flags & PF_MEMALLOC;
> +	current->flags |= PF_MEMALLOC;
> +	return flags;
> +}
> +
> +static inline void memalloc_noreclaim_restore(unsigned int flags)
> +{
> +	current->flags = (current->flags & ~PF_MEMALLOC) | flags;
> +}
> +
>  #endif /* _LINUX_SCHED_MM_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b84e6ffbe756..037e32dccd7a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3288,15 +3288,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		enum compact_priority prio, enum compact_result *compact_result)
>  {
>  	struct page *page;
> -	unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
> +	unsigned int noreclaim_flag;
>  
>  	if (!order)
>  		return NULL;
>  
> -	current->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>  									prio);
> -	current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  
>  	if (*compact_result <= COMPACT_INACTIVE)
>  		return NULL;
> @@ -3443,12 +3443,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
>  {
>  	struct reclaim_state reclaim_state;
>  	int progress;
> +	unsigned int noreclaim_flag;
>  
>  	cond_resched();
>  
>  	/* We now go into synchronous reclaim */
>  	cpuset_memory_pressure_bump();
> -	current->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	lockdep_set_current_reclaim_state(gfp_mask);
>  	reclaim_state.reclaimed_slab = 0;
>  	current->reclaim_state = &reclaim_state;
> @@ -3458,7 +3459,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>  	current->reclaim_state = NULL;
>  	lockdep_clear_current_reclaim_state();
> -	current->flags &= ~PF_MEMALLOC;
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  
>  	cond_resched();
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 58615bb27f2f..ff63b91a0f48 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2992,6 +2992,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  	struct zonelist *zonelist;
>  	unsigned long nr_reclaimed;
>  	int nid;
> +	unsigned int noreclaim_flag;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>  		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
> @@ -3018,9 +3019,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					    sc.gfp_mask,
>  					    sc.reclaim_idx);
>  
> -	current->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> -	current->flags &= ~PF_MEMALLOC;
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  
>  	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>  
> @@ -3544,8 +3545,9 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
>  	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
>  	struct task_struct *p = current;
>  	unsigned long nr_reclaimed;
> +	unsigned int noreclaim_flag;
>  
> -	p->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	lockdep_set_current_reclaim_state(sc.gfp_mask);
>  	reclaim_state.reclaimed_slab = 0;
>  	p->reclaim_state = &reclaim_state;
> @@ -3554,7 +3556,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
>  
>  	p->reclaim_state = NULL;
>  	lockdep_clear_current_reclaim_state();
> -	p->flags &= ~PF_MEMALLOC;
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  
>  	return nr_reclaimed;
>  }
> @@ -3719,6 +3721,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  	struct task_struct *p = current;
>  	struct reclaim_state reclaim_state;
>  	int classzone_idx = gfp_zone(gfp_mask);
> +	unsigned int noreclaim_flag;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>  		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
> @@ -3736,7 +3739,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  	 * and we also need to be able to write out pages for RECLAIM_WRITE
>  	 * and RECLAIM_UNMAP.
>  	 */
> -	p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
> +	noreclaim_flag = memalloc_noreclaim_save();
> +	p->flags |= PF_SWAPWRITE;
>  	lockdep_set_current_reclaim_state(gfp_mask);
>  	reclaim_state.reclaimed_slab = 0;
>  	p->reclaim_state = &reclaim_state;
> @@ -3752,7 +3756,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  	}
>  
>  	p->reclaim_state = NULL;
> -	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
> +	current->flags &= ~PF_SWAPWRITE;
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  	lockdep_clear_current_reclaim_state();
>  	return sc.nr_reclaimed >= nr_pages;
>  }
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
From: Michal Hocko @ 2017-04-05 11:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner, linux-block, nbd-general, open-iscsi, linux-scsi,
	netdev, stable, Andrey Ryabinin
In-Reply-To: <20170405074700.29871-2-vbabka@suse.cz>

On Wed 05-04-17 09:46:57, Vlastimil Babka wrote:
> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
> deadlock during page migration by lock_page() (see the comment in
> __unmap_and_move()). Then it unconditionally clears the flag, which can clear a
> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
> compaction handling in slowpath"), because direct compaction was called only
> after direct reclaim, which was skipped when PF_MEMALLOC flag was set.
> 
> Even now it's only a theoretical issue, as the new callsite of
> __alloc_pages_direct_compact() is reached only for costly orders and when
> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
> gfp_flags or in_interrupt() is true. There is no such known context, but let's
> play it safe and make __alloc_pages_direct_compact() robust for cases where
> PF_MEMALLOC is already set.
> 
> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
> Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: <stable@vger.kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589f8be53be..b84e6ffbe756 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		enum compact_priority prio, enum compact_result *compact_result)
>  {
>  	struct page *page;
> +	unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
>  
>  	if (!order)
>  		return NULL;
> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	current->flags |= PF_MEMALLOC;
>  	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>  									prio);
> -	current->flags &= ~PF_MEMALLOC;
> +	current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
>  
>  	if (*compact_result <= COMPACT_INACTIVE)
>  		return NULL;
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs


* [PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers
From: Vlastimil Babka @ 2017-04-05  7:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman, Johannes Weiner,
	linux-block, nbd-general, open-iscsi, linux-scsi, netdev,
	Vlastimil Babka, Josef Bacik, Lee Duncan, Chris Leech,
	David S. Miller, Eric Dumazet
In-Reply-To: <20170405074700.29871-1-vbabka@suse.cz>

We now have memalloc_noreclaim_{save,restore} helpers for robust setting and
clearing of PF_MEMALLOC. Let's convert the code which was using the generic
tsk_restore_flags(). No functional change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Chris Leech <cleech@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/block/nbd.c      | 7 ++++---
 drivers/scsi/iscsi_tcp.c | 7 ++++---
 net/core/dev.c           | 7 ++++---
 net/core/sock.c          | 7 ++++---
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 03ae72985c79..929fc548c7fb 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -18,6 +18,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/fs.h>
 #include <linux/bio.h>
 #include <linux/stat.h>
@@ -210,7 +211,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 	struct socket *sock = nbd->socks[index]->sock;
 	int result;
 	struct msghdr msg;
-	unsigned long pflags = current->flags;
+	unsigned int noreclaim_flag;
 
 	if (unlikely(!sock)) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
@@ -221,7 +222,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 
 	msg.msg_iter = *iter;
 
-	current->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 	do {
 		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
 		msg.msg_name = NULL;
@@ -244,7 +245,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 			*sent += result;
 	} while (msg_data_left(&msg));
 
-	tsk_restore_flags(current, pflags, PF_MEMALLOC);
+	memalloc_noreclaim_restore(noreclaim_flag);
 
 	return result;
 }
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 4228aba1f654..4842fc0e809d 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -30,6 +30,7 @@
 #include <linux/types.h>
 #include <linux/inet.h>
 #include <linux/slab.h>
+#include <linux/sched/mm.h>
 #include <linux/file.h>
 #include <linux/blkdev.h>
 #include <linux/delay.h>
@@ -371,10 +372,10 @@ static inline int iscsi_sw_tcp_xmit_qlen(struct iscsi_conn *conn)
 static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
 {
 	struct iscsi_conn *conn = task->conn;
-	unsigned long pflags = current->flags;
+	unsigned int noreclaim_flag;
 	int rc = 0;
 
-	current->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 
 	while (iscsi_sw_tcp_xmit_qlen(conn)) {
 		rc = iscsi_sw_tcp_xmit(conn);
@@ -387,7 +388,7 @@ static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
 		rc = 0;
 	}
 
-	tsk_restore_flags(current, pflags, PF_MEMALLOC);
+	memalloc_noreclaim_restore(noreclaim_flag);
 	return rc;
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fde8b3f7136b..e0705a126b24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -81,6 +81,7 @@
 #include <linux/hash.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/mutex.h>
 #include <linux/string.h>
 #include <linux/mm.h>
@@ -4227,7 +4228,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	int ret;
 
 	if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
-		unsigned long pflags = current->flags;
+		unsigned int noreclaim_flag;
 
 		/*
 		 * PFMEMALLOC skbs are special, they should
@@ -4238,9 +4239,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
 		 * Use PF_MEMALLOC as this saves us from propagating the allocation
 		 * context down to all allocation sites.
 		 */
-		current->flags |= PF_MEMALLOC;
+		noreclaim_flag = memalloc_noreclaim_save();
 		ret = __netif_receive_skb_core(skb, true);
-		tsk_restore_flags(current, pflags, PF_MEMALLOC);
+		memalloc_noreclaim_restore(noreclaim_flag);
 	} else
 		ret = __netif_receive_skb_core(skb, false);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 392f9b6f96e2..0b2d06b4c308 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -102,6 +102,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/timer.h>
 #include <linux/string.h>
 #include <linux/sockios.h>
@@ -372,14 +373,14 @@ EXPORT_SYMBOL_GPL(sk_clear_memalloc);
 int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
 	int ret;
-	unsigned long pflags = current->flags;
+	unsigned int noreclaim_flag;
 
 	/* these should have been dropped before queueing */
 	BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
 
-	current->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 	ret = sk->sk_backlog_rcv(sk, skb);
-	tsk_restore_flags(current, pflags, PF_MEMALLOC);
+	memalloc_noreclaim_restore(noreclaim_flag);
 
 	return ret;
 }
-- 
2.12.2


* [PATCH 0/4] more robust PF_MEMALLOC handling
From: Vlastimil Babka @ 2017-04-05  7:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman, Johannes Weiner,
	linux-block, nbd-general, open-iscsi, linux-scsi, netdev,
	Vlastimil Babka, Andrey Ryabinin, Boris Brezillon, Chris Leech,
	David S. Miller, Eric Dumazet, Josef Bacik, Lee Duncan,
	Michal Hocko, Richard Weinberger, stable

Hi,

this series aims to unify the setting and clearing of PF_MEMALLOC, which
prevents recursive reclaim. There are some places that clear the flag
unconditionally from current->flags, which may result in clearing a
pre-existing flag. This already resulted in a bug report that Patch 1 fixes
(without the new helpers, to make backporting easier). Patch 2 introduces the
new helpers, modelled after existing memalloc_noio_* and memalloc_nofs_*
helpers, and converts mm core to use them. Patches 3 and 4 convert non-mm code.

Based on next-20170404.
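
To illustrate the nesting problem described above, here is a minimal userspace sketch, not kernel code. task_flags stands in for current->flags and the PF_MEMALLOC value is illustrative; section_unconditional(), section_save_restore(), outer_flag_survives() and check_nesting() are hypothetical names introduced only for this illustration. The two helpers mirror the bodies added in patch 2.

```c
#include <assert.h>

#define PF_MEMALLOC 0x0800u /* illustrative value for this model */

static unsigned int task_flags; /* stands in for current->flags */

/* Old pattern: set the flag, then clear it unconditionally. */
static void section_unconditional(void)
{
	task_flags |= PF_MEMALLOC;
	/* ... work that must not recurse into reclaim ... */
	task_flags &= ~PF_MEMALLOC;
}

/* Remember whether the flag was already set, then set it. */
static unsigned int memalloc_noreclaim_save(void)
{
	unsigned int flags = task_flags & PF_MEMALLOC;

	task_flags |= PF_MEMALLOC;
	return flags;
}

/* Put the flag back to the remembered state. */
static void memalloc_noreclaim_restore(unsigned int flags)
{
	task_flags = (task_flags & ~PF_MEMALLOC) | flags;
}

/* New pattern: the inner section leaves the outer state untouched. */
static void section_save_restore(void)
{
	unsigned int noreclaim_flag = memalloc_noreclaim_save();

	/* ... work that must not recurse into reclaim ... */
	memalloc_noreclaim_restore(noreclaim_flag);
}

/*
 * Run 'inner' inside an outer PF_MEMALLOC section and report whether
 * the flag is still set once 'inner' returns.
 */
static int outer_flag_survives(void (*inner)(void))
{
	unsigned int outer;
	int survived;

	task_flags = 0;
	outer = memalloc_noreclaim_save();
	inner();
	survived = (task_flags & PF_MEMALLOC) != 0;
	memalloc_noreclaim_restore(outer);
	return survived;
}

static void check_nesting(void)
{
	/* The unconditional clear drops the outer section's flag. */
	assert(!outer_flag_survives(section_unconditional));
	/* Save/restore keeps it set until the outer section ends. */
	assert(outer_flag_survives(section_save_restore));
}
```

Calling check_nesting() from a small main() exercises both patterns. The model ignores everything else in the flags word, so it only shows the nesting behaviour, not any reclaim logic.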

Vlastimil Babka (4):
  mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
  mm: introduce memalloc_noreclaim_{save,restore}
  treewide: convert PF_MEMALLOC manipulations to new helpers
  mtd: nand: nandsim: convert to memalloc_noreclaim_*()

 drivers/block/nbd.c        |  7 ++++---
 drivers/mtd/nand/nandsim.c | 29 +++++++++--------------------
 drivers/scsi/iscsi_tcp.c   |  7 ++++---
 include/linux/sched/mm.h   | 12 ++++++++++++
 mm/page_alloc.c            | 10 ++++++----
 mm/vmscan.c                | 17 +++++++++++------
 net/core/dev.c             |  7 ++++---
 net/core/sock.c            |  7 ++++---
 8 files changed, 54 insertions(+), 42 deletions(-)

-- 
2.12.2


* [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
From: Vlastimil Babka @ 2017-04-05  7:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman, Johannes Weiner,
	linux-block, nbd-general, open-iscsi, linux-scsi, netdev,
	Vlastimil Babka, stable, Andrey Ryabinin
In-Reply-To: <20170405074700.29871-1-vbabka@suse.cz>

The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
deadlock during page migration by lock_page() (see the comment in
__unmap_and_move()). Then it unconditionally clears the flag, which can clear a
pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
compaction handling in slowpath"), because direct compaction was called only
after direct reclaim, which was skipped when PF_MEMALLOC flag was set.

Even now it's only a theoretical issue, as the new callsite of
__alloc_pages_direct_compact() is reached only for costly orders and when
gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
gfp_flags or in_interrupt() is true. There is no such known context, but let's
play it safe and make __alloc_pages_direct_compact() robust for cases where
PF_MEMALLOC is already set.

Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589f8be53be..b84e6ffbe756 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
+	unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;

 	if (!order)
 		return NULL;
@@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
-	current->flags &= ~PF_MEMALLOC;
+	current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;

 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
-- 
2.12.2

^ permalink raw reply related

* [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()
From: Vlastimil Babka @ 2017-04-05  7:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman, Johannes Weiner,
	linux-block, nbd-general, open-iscsi, linux-scsi, netdev,
	Vlastimil Babka, Boris Brezillon, Richard Weinberger
In-Reply-To: <20170405074700.29871-1-vbabka@suse.cz>

Nandsim has its own functions set_memalloc() and clear_memalloc() for robust
setting and clearing of PF_MEMALLOC. Replace them with the new generic helpers.
No functional change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Richard Weinberger <richard@nod.at>
---
 drivers/mtd/nand/nandsim.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
index cef818f535ed..03a0d057bf2f 100644
--- a/drivers/mtd/nand/nandsim.c
+++ b/drivers/mtd/nand/nandsim.c
@@ -40,6 +40,7 @@
 #include <linux/list.h>
 #include <linux/random.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
 #include <linux/seq_file.h>
@@ -1368,31 +1369,18 @@ static int get_pages(struct nandsim *ns, struct file *file, size_t count, loff_t
 	return 0;
 }
 
-static int set_memalloc(void)
-{
-	if (current->flags & PF_MEMALLOC)
-		return 0;
-	current->flags |= PF_MEMALLOC;
-	return 1;
-}
-
-static void clear_memalloc(int memalloc)
-{
-	if (memalloc)
-		current->flags &= ~PF_MEMALLOC;
-}
-
 static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos)
 {
 	ssize_t tx;
-	int err, memalloc;
+	int err;
+	unsigned int noreclaim_flag;
 
 	err = get_pages(ns, file, count, pos);
 	if (err)
 		return err;
-	memalloc = set_memalloc();
+	noreclaim_flag = memalloc_noreclaim_save();
 	tx = kernel_read(file, pos, buf, count);
-	clear_memalloc(memalloc);
+	memalloc_noreclaim_restore(noreclaim_flag);
 	put_pages(ns);
 	return tx;
 }
@@ -1400,14 +1388,15 @@ static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_
 static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos)
 {
 	ssize_t tx;
-	int err, memalloc;
+	int err;
+	unsigned int noreclaim_flag;
 
 	err = get_pages(ns, file, count, pos);
 	if (err)
 		return err;
-	memalloc = set_memalloc();
+	noreclaim_flag = memalloc_noreclaim_save();
 	tx = kernel_write(file, buf, count, pos);
-	clear_memalloc(memalloc);
+	memalloc_noreclaim_restore(noreclaim_flag);
 	put_pages(ns);
 	return tx;
 }
-- 
2.12.2

^ permalink raw reply related

* [PATCH 2/4] mm: introduce memalloc_noreclaim_{save,restore}
From: Vlastimil Babka @ 2017-04-05  7:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman, Johannes Weiner,
	linux-block, nbd-general, open-iscsi, linux-scsi, netdev,
	Vlastimil Babka, Michal Hocko
In-Reply-To: <20170405074700.29871-1-vbabka@suse.cz>

The previous patch has shown that simply setting and clearing PF_MEMALLOC in
current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag
and potentially lead to recursive reclaim. Let's introduce helpers that support
proper nesting by saving the previous state of the flag, similar to the existing
memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing
of PF_MEMALLOC within mm to the new helpers.

There are no known issues with the converted code, but the change makes it more
robust.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/sched/mm.h | 12 ++++++++++++
 mm/page_alloc.c          | 11 ++++++-----
 mm/vmscan.c              | 17 +++++++++++------
 3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 9daabe138c99..2b24a6974847 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -191,4 +191,16 @@ static inline void memalloc_nofs_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
 }
 
+static inline unsigned int memalloc_noreclaim_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC;
+	current->flags |= PF_MEMALLOC;
+	return flags;
+}
+
+static inline void memalloc_noreclaim_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC) | flags;
+}
+
 #endif /* _LINUX_SCHED_MM_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b84e6ffbe756..037e32dccd7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3288,15 +3288,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
-	unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
+	unsigned int noreclaim_flag;
 
 	if (!order)
 		return NULL;
 
-	current->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
-	current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
+	memalloc_noreclaim_restore(noreclaim_flag);
 
 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
@@ -3443,12 +3443,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 {
 	struct reclaim_state reclaim_state;
 	int progress;
+	unsigned int noreclaim_flag;
 
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
-	current->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 	lockdep_set_current_reclaim_state(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
 	current->reclaim_state = &reclaim_state;
@@ -3458,7 +3459,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	current->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
-	current->flags &= ~PF_MEMALLOC;
+	memalloc_noreclaim_restore(noreclaim_flag);
 
 	cond_resched();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 58615bb27f2f..ff63b91a0f48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2992,6 +2992,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
 	int nid;
+	unsigned int noreclaim_flag;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
@@ -3018,9 +3019,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
-	current->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
-	current->flags &= ~PF_MEMALLOC;
+	memalloc_noreclaim_restore(noreclaim_flag);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3544,8 +3545,9 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 	struct task_struct *p = current;
 	unsigned long nr_reclaimed;
+	unsigned int noreclaim_flag;
 
-	p->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 	lockdep_set_current_reclaim_state(sc.gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
@@ -3554,7 +3556,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 
 	p->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
-	p->flags &= ~PF_MEMALLOC;
+	memalloc_noreclaim_restore(noreclaim_flag);
 
 	return nr_reclaimed;
 }
@@ -3719,6 +3721,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
 	int classzone_idx = gfp_zone(gfp_mask);
+	unsigned int noreclaim_flag;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
@@ -3736,7 +3739,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	 * and we also need to be able to write out pages for RECLAIM_WRITE
 	 * and RECLAIM_UNMAP.
 	 */
-	p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	noreclaim_flag = memalloc_noreclaim_save();
+	p->flags |= PF_SWAPWRITE;
 	lockdep_set_current_reclaim_state(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
@@ -3752,7 +3756,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	}
 
 	p->reclaim_state = NULL;
-	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+	current->flags &= ~PF_SWAPWRITE;
+	memalloc_noreclaim_restore(noreclaim_flag);
 	lockdep_clear_current_reclaim_state();
 	return sc.nr_reclaimed >= nr_pages;
 }
-- 
2.12.2

^ permalink raw reply related
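
The nesting problem that patch 2/4 above addresses can be modelled in
userspace. The following is only a minimal sketch, not the kernel code: the
global task_flags stands in for current->flags, the bit value is illustrative,
and every function name here is hypothetical. It contrasts an inner section
that clears the flag unconditionally with one that saves and restores it.

```c
/* Illustrative bit; only its identity matters in this model. */
#define PF_MEMALLOC 0x00000800u

/* Hypothetical stand-in for current->flags. */
static unsigned int task_flags;

/* Set the flag and return its previous state (modelled on the new helper). */
static unsigned int noreclaim_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC;

	task_flags |= PF_MEMALLOC;
	return old;
}

/* Put the flag back to whatever the matching save observed. */
static void noreclaim_restore(unsigned int old)
{
	task_flags = (task_flags & ~PF_MEMALLOC) | old;
}

/* Inner section written the old way: clears the flag unconditionally. */
static void inner_unconditional(void)
{
	task_flags |= PF_MEMALLOC;
	task_flags &= ~PF_MEMALLOC;
}

/* Inner section using save/restore: leaves an outer flag untouched. */
static void inner_nested(void)
{
	unsigned int old = noreclaim_save();

	noreclaim_restore(old);
}

/*
 * Run an inner section inside an outer no-reclaim section and report
 * whether the outer section still sees the flag afterwards.
 */
static int flag_survives(void (*inner)(void))
{
	unsigned int outer;
	int survived;

	task_flags = 0;
	outer = noreclaim_save();
	inner();
	survived = (task_flags & PF_MEMALLOC) != 0;
	noreclaim_restore(outer);
	return survived;
}
```

In this model flag_survives(inner_unconditional) reports that the outer flag
was lost, while flag_survives(inner_nested) reports that it was preserved.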

* Re: [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Michal Hocko @ 2017-04-05  7:32 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jens Axboe, linux-block, linux-mm, LKML, Ming Lei
In-Reply-To: <20170405071927.GA7258@dhcp22.suse.cz>

On Wed 05-04-17 09:19:27, Michal Hocko wrote:
> On Wed 05-04-17 14:33:50, NeilBrown wrote:
[...]
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 0ecb6461ed81..44b3506fd086 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -852,6 +852,7 @@ static int loop_prepare_queue(struct loop_device *lo)
> >  	if (IS_ERR(lo->worker_task))
> >  		return -ENOMEM;
> >  	set_user_nice(lo->worker_task, MIN_NICE);
> > +	lo->worker_task->flags |= PF_LESS_THROTTLE;
> >  	return 0;
> 
> As mentioned elsewhere, PF flags should be updated only on the current
> task, otherwise there is a potential rmw race. Is this safe? The code runs
> concurrently with the worker thread.

I believe you need something like this instead
---
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f347285c67ec..07b2a909e4fb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -844,10 +844,16 @@ static void loop_unprepare_queue(struct loop_device *lo)
 	kthread_stop(lo->worker_task);
 }
 
+int loop_kthread_worker_fn(void *worker_ptr)
+{
+	current->flags |= PF_LESS_THROTTLE;
+	return kthread_worker_fn(worker_ptr);
+}
+
 static int loop_prepare_queue(struct loop_device *lo)
 {
 	kthread_init_worker(&lo->worker);
-	lo->worker_task = kthread_run(kthread_worker_fn,
+	lo->worker_task = kthread_run(loop_kthread_worker_fn,
 			&lo->worker, "loop%d", lo->lo_number);
 	if (IS_ERR(lo->worker_task))
 		return -ENOMEM;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related
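
The read-modify-write concern raised in the reply above can be illustrated
with a deterministic userspace model. This is a sketch under assumptions: the
bit values are illustrative, the function names are hypothetical, and the
racy interleaving is written out by hand rather than produced by real
threads. It shows one way an update from the creating task can overwrite a
bit the worker set on its own flags word.

```c
/* Illustrative bits; only their distinctness matters in this model. */
#define PF_MEMALLOC      0x00000800u
#define PF_LESS_THROTTLE 0x00100000u

/*
 * Hand-written interleaving of two non-atomic "flags |= bit" updates on the
 * same word: the creator sets PF_LESS_THROTTLE on the worker while the
 * worker sets PF_MEMALLOC on itself.
 */
static unsigned int racy_update(void)
{
	unsigned int flags = 0;
	unsigned int creator_copy, worker_copy;

	creator_copy = flags;                 /* creator: load */
	worker_copy = flags;                  /* worker: load */
	worker_copy |= PF_MEMALLOC;           /* worker: modify */
	flags = worker_copy;                  /* worker: store */
	creator_copy |= PF_LESS_THROTTLE;     /* creator: modify stale copy */
	flags = creator_copy;                 /* creator: store, worker bit lost */
	return flags;
}

/*
 * Same two updates, but both performed by the worker itself, as in the
 * suggested loop_kthread_worker_fn() wrapper. A single writer cannot
 * interleave with itself, so nothing is lost.
 */
static unsigned int owner_only_update(void)
{
	unsigned int flags = 0;

	flags |= PF_LESS_THROTTLE;            /* worker: first thing it does */
	flags |= PF_MEMALLOC;                 /* worker: later update */
	return flags;
}
```

Whether the real loop worker ever hits such an interleaving depends on timing;
the model only shows why restricting updates to the owning task removes the
possibility.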

* Re: [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Michal Hocko @ 2017-04-05  7:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jens Axboe, linux-block, linux-mm, LKML, Ming Lei
In-Reply-To: <87wpazh3rl.fsf@notabene.neil.brown.name>

On Wed 05-04-17 14:33:50, NeilBrown wrote:
> 
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file.  This double-throttling can trigger
> positive feedback loops that create significant delays.  The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
> 
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server.  It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
> 
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
> time taken.
> 
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable.  52-460 seconds.  Half below 100 seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
> 
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> 
> I moved where the flag is set, thanks to suggestion from
> Ming Lei.
> I've preserved the *-by: tags I was offered despite the code
> being different, as the concept is identical.
> 
> Thanks,
> NeilBrown
> 
> 
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..44b3506fd086 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -852,6 +852,7 @@ static int loop_prepare_queue(struct loop_device *lo)
>  	if (IS_ERR(lo->worker_task))
>  		return -ENOMEM;
>  	set_user_nice(lo->worker_task, MIN_NICE);
> +	lo->worker_task->flags |= PF_LESS_THROTTLE;
>  	return 0;

As mentioned elsewhere, PF flags should be updated only on the current
task, otherwise there is a potential rmw race. Is this safe? The code runs
concurrently with the worker thread.


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 0/10 v5] block: Fix block device shutdown related races
From: Jan Kara @ 2017-04-05  7:15 UTC (permalink / raw)
  To: Thiago Jung Bauermann
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Lekshmi C. Pillai
In-Reply-To: <1727592.C3UmHNa1QZ@morokweng>

On Tue 04-04-17 14:09:51, Thiago Jung Bauermann wrote:
> Hello,
> 
> Am Donnerstag, 23. März 2017, 01:36:52 BRT schrieb Jan Kara:
> > this is a series with the remaining patches (on top of 4.11-rc2) to fix
> > several different races and issues I've found when testing device shutdown
> > and reuse. The first patch fixes possible (theoretical) problems when
> > opening of a block device races with shutdown of a gendisk structure.
> > Patches 2-8 fix oops that is triggered by __blkdev_put() calling
> > inode_detach_wb() too early (the problem reported by Thiago). Patches 9 and
> > 10 fix oops due to a bug in gendisk code where get_gendisk() can return
> > already freed gendisk structure (again triggered by Omar's stress test).
> > 
> > All patches got reviewed by Tejun and also tested by Thiago (thanks!). Jens,
> > can you please queue these fixes for the next merge window? Thanks!
> 
> Lekshmi tested these patches on top of v4.11-rc4 and they worked fine.
> Sorry for the delay, it takes at least 72h of running time to be sure we're 
> not hitting the race condition, and I had trouble with v4.11-rc2.

Great. Thanks for testing!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 9/9] bio-integrity: Restore original iterator on verify stage
From: Hannes Reinecke @ 2017-04-05  6:41 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-10-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> Currently ->verify_fn does not work at all because at the moment it is called
> bio->bi_iter.bi_size == 0, so we do not iterate integrity bvecs at all.
> 
> In order to perform verification we need to know the original data vector;
> with the new bvec rewind API this is trivial.
> 
> testcase: https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  block/bio-integrity.c | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH 8/9] bio: add bvec_iter rewind API
From: Hannes Reinecke @ 2017-04-05  6:39 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-9-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> Some ->bi_end_io handlers (for example: pi_verify or decrypt handlers)
> need to know the original data vector, but after a bio traverses the I/O
> stack it may be advanced, split and relocated many times, so it is hard to
> guess the original iterator. Let's add a 'bi_done' counter which accounts
> for the number of bytes the iterator was advanced during its evolution.
> Later an end_io handler may easily restore the original iterator by
> rewinding the iterator by iter->bi_done.
> 
> Note: this change makes sizeof (struct bvec_iter) a multiple of 8
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  include/linux/bio.h  | 21 +++++++++++++++++++--
>  include/linux/bvec.h | 26 ++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply
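
The 'bi_done' idea from patch 8/9 above can be sketched in userspace. This is
a simplified model under assumptions, not the patch itself: struct seg and
struct seg_iter are hypothetical stand-ins for struct bio_vec and struct
bvec_iter, the iterator is assumed to start at segment 0 with nothing
consumed, and the field 'done' plays the role of bi_done.

```c
/* Hypothetical stand-ins for struct bio_vec and struct bvec_iter. */
struct seg {
	unsigned int len;            /* bytes in this segment */
};

struct seg_iter {
	unsigned int size;           /* bytes remaining */
	unsigned int idx;            /* current segment */
	unsigned int seg_done;       /* bytes consumed in current segment */
	unsigned int done;           /* total bytes advanced so far */
};

/* Advance by 'bytes', accounting the distance travelled in it->done. */
static int seg_iter_advance(const struct seg *segs, struct seg_iter *it,
			    unsigned int bytes)
{
	if (bytes > it->size)
		return -1;

	while (bytes) {
		unsigned int left = segs[it->idx].len - it->seg_done;
		unsigned int len = bytes < left ? bytes : left;

		bytes -= len;
		it->size -= len;
		it->seg_done += len;
		it->done += len;
		if (it->seg_done == segs[it->idx].len) {
			it->seg_done = 0;
			it->idx++;
		}
	}
	return 0;
}

/* Walk backwards by 'bytes'; refuses to go before the starting point. */
static int seg_iter_rewind(const struct seg *segs, struct seg_iter *it,
			   unsigned int bytes)
{
	if (bytes > it->done)
		return -1;

	while (bytes) {
		unsigned int len;

		if (it->seg_done == 0) {
			/* At a segment start: step into the previous one. */
			it->idx--;
			it->seg_done = segs[it->idx].len;
			continue;
		}
		len = bytes < it->seg_done ? bytes : it->seg_done;
		bytes -= len;
		it->size += len;
		it->seg_done -= len;
		it->done -= len;
	}
	return 0;
}
```

A completion handler in this model would call seg_iter_rewind(segs, &it,
it.done) to get back to the iterator it started with, regardless of how many
separate advances happened in between.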

* Re: [PATCH 7/9] Guard bvec iteration logic v3
From: Hannes Reinecke @ 2017-04-05  6:39 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-8-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> Currently if someone tries to advance a bvec beyond its size we simply
> dump a WARN_ONCE and continue to iterate beyond the bvec array boundaries.
> This simply means that we end up dereferencing/corrupting a random memory
> region.
> 
> A sane reaction would be to propagate the error back to the calling context,
> but bvec_iter_advance's calling context is not always good for error
> handling. For safety reasons let's truncate the iterator size to zero, which
> will break the external iteration loop and prevent unpredictable
> memory range corruption. And even if the caller ignores the error, it will
> corrupt its own bvecs, not others'.
> 
> This patch does:
> - Return the error back to the caller in the hope that it will react to it
> - Truncate iterator size
> 
> Code was added long time ago here 4550dd6c, luckily no one hit it
> in real life :)
> 
> changes since V1:
>  - Replace  BUG_ON with error logic.
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  drivers/nvdimm/blk.c |  4 +++-
>  drivers/nvdimm/btt.c |  4 +++-
>  include/linux/bio.h  |  8 ++++++--
>  include/linux/bvec.h | 11 ++++++++---
>  4 files changed, 20 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply
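
The guard described in patch 7/9 above (return an error and truncate the
iterator so that the caller's loop terminates) can be sketched as follows.
This is a userspace model under assumptions: struct seg and struct seg_iter
are hypothetical stand-ins for struct bio_vec and struct bvec_iter, and
-EINVAL is chosen here for illustration only.

```c
#include <errno.h>

/* Hypothetical stand-ins for struct bio_vec and struct bvec_iter. */
struct seg {
	unsigned int len;            /* bytes in this segment */
};

struct seg_iter {
	unsigned int size;           /* bytes remaining */
	unsigned int idx;            /* current segment */
	unsigned int seg_done;       /* bytes consumed in current segment */
};

/*
 * Advance by 'bytes'. An attempt to go past the end is reported to the
 * caller, and the remaining size is forced to zero so that a loop of the
 * form "while (it.size)" stops even if the caller ignores the error.
 */
static int seg_iter_advance(const struct seg *segs, struct seg_iter *it,
			    unsigned int bytes)
{
	if (bytes > it->size) {
		it->size = 0;
		return -EINVAL;
	}

	while (bytes) {
		unsigned int left = segs[it->idx].len - it->seg_done;
		unsigned int len = bytes < left ? bytes : left;

		bytes -= len;
		it->size -= len;
		it->seg_done += len;
		if (it->seg_done == segs[it->idx].len) {
			it->seg_done = 0;
			it->idx++;
		}
	}
	return 0;
}
```

The bounds check happens before any segment is touched, so in this model an
oversized advance never indexes past the end of the segment array.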

* Re: [PATCH 6/9] T10: Move opencoded contants to common header
From: Hannes Reinecke @ 2017-04-05  6:37 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-7-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  block/t10-pi.c                   | 9 +++------
>  drivers/scsi/lpfc/lpfc_scsi.c    | 5 +++--
>  drivers/scsi/qla2xxx/qla_isr.c   | 8 ++++----
>  drivers/target/target_core_sbc.c | 2 +-
>  include/linux/t10-pi.h           | 2 ++
>  5 files changed, 13 insertions(+), 13 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH 5/9] bio-integrity: fold bio_integrity_enabled to bio_integrity_prep
From: Hannes Reinecke @ 2017-04-05  6:37 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-6-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> Currently all integrity prep hooks are open-coded, and if prepare fails
> we ignore its error code and fail the bio with EIO. Let's return the real
> error to the upper layer, so the caller may react accordingly.
> 
> In fact no one wants to use bio_integrity_prep() w/o bio_integrity_enabled,
> so it is reasonable to fold them into one function.
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  Documentation/block/data-integrity.txt |  3 --
>  block/bio-integrity.c                  | 88 ++++++++++++++++++----------------
>  block/blk-core.c                       |  5 +-
>  block/blk-mq.c                         |  8 +---
>  drivers/nvdimm/blk.c                   | 13 +----
>  drivers/nvdimm/btt.c                   | 13 +----
>  include/linux/bio.h                    |  6 ---
>  7 files changed, 55 insertions(+), 81 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH 4/9] bio-integrity: fix interface for bio_integrity_trim v2
From: Hannes Reinecke @ 2017-04-05  6:36 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-5-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> bio_integrity_trim inherits its interface from bio_trim and accepts
> offset and size, but this API is error prone because the data offset
> must always be in sync with the bio's data offset. That is why we have
> an integrity update hook in bio_advance().
> 
> So the only meaningful values are: offset == 0, sectors == bio_sectors(bio).
> Let's just remove them completely.
> 
> changes from v1:
>  - remove 'sectors' arguments
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  block/bio-integrity.c | 11 ++---------
>  block/bio.c           |  4 ++--
>  drivers/md/dm.c       |  2 +-
>  include/linux/bio.h   |  5 ++---
>  4 files changed, 7 insertions(+), 15 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH 3/9] bio-integrity: bio_integrity_advance must update integrity seed
From: Hannes Reinecke @ 2017-04-05  6:34 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-4-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> SCSI drivers do care about bip_seed so we must update it accordingly.
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  block/bio-integrity.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index b5009a8..82a6ffb 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -425,6 +425,7 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
>  	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
>  	unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
>  
> +	bip->bip_iter.bi_sector += bytes_done >> 9;
>  	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
>  }
>  EXPORT_SYMBOL(bio_integrity_advance);
> 
Odd that we've missed that one...

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: F. Imendï¿½rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nï¿½rnberg)

^ permalink raw reply
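
The arithmetic behind the one-line fix in patch 3/9 above can be sketched in
userspace. This is a model under assumptions, not the kernel code: the struct
and function names are hypothetical, 'seed' stands in for
bip->bip_iter.bi_sector, and the profile values used in any example (512-byte
or 4096-byte protection intervals, 8 bytes of integrity data per interval)
are illustrative. It shows that advancing the data by bytes_done moves the
seed by bytes_done >> 9 sectors, while the integrity buffer moves by one
tuple per protection interval.

```c
/* Hypothetical integrity profile: one tuple per protection interval. */
struct integrity_profile {
	unsigned int interval_exp;   /* log2 of the interval size in bytes */
	unsigned int tuple_size;     /* integrity bytes per interval */
};

/* Hypothetical model of the integrity iterator state. */
struct integrity_iter {
	unsigned long long seed;     /* models bip->bip_iter.bi_sector */
	unsigned int meta_done;      /* integrity bytes consumed so far */
};

/* Integrity bytes that cover 'sectors' 512-byte sectors of data. */
static unsigned int integrity_bytes(const struct integrity_profile *p,
				    unsigned int sectors)
{
	return (sectors >> (p->interval_exp - 9)) * p->tuple_size;
}

/* Advance both the seed and the integrity position for bytes_done of data. */
static void integrity_advance(const struct integrity_profile *p,
			      struct integrity_iter *it,
			      unsigned int bytes_done)
{
	unsigned int sectors = bytes_done >> 9;

	it->seed += sectors;         /* the step the patch adds */
	it->meta_done += integrity_bytes(p, sectors);
}
```

Without the seed update, the integrity position in this model would move
forward while the seed stayed at the original starting sector, so the two
would no longer describe the same data.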

* Re: [PATCH 2/9] bio-integrity: bio_trim should truncate integrity vector accordingly
From: Hannes Reinecke @ 2017-04-05  6:32 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-3-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  block/bio.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index e75878f..fa84323 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1907,6 +1907,10 @@ void bio_trim(struct bio *bio, int offset, int size)
>  	bio_advance(bio, offset << 9);
>  
>  	bio->bi_iter.bi_size = size;
> +
> +	if (bio_integrity(bio))
> +		bio_integrity_trim(bio, 0, size);
> +
>  }
>  EXPORT_SYMBOL_GPL(bio_trim);
>  
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH 1/9] bio-integrity: Do not allocate integrity context for bio w/o data
From: Hannes Reinecke @ 2017-04-05  6:32 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-2-git-send-email-dmonakhov@openvz.org>

On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> If a bio has no data, such as ones from blkdev_issue_flush(),
> then we have nothing to protect.
> 
> This patch prevents a BUG like the following:
> 
> kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
> kernel BUG at mm/slab.c:2773!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: bcache
> CPU: 0 PID: 4428 Comm: xfs_io Tainted: G        W       4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
> Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
> task: ffff880137786440 task.stack: ffffc90000ba8000
> RIP: 0010:kfree_debugcheck+0x25/0x2a
> RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
> RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
> RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
> R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
> FS:  00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
> Call Trace:
>  kfree+0xc8/0x1b3
>  bio_integrity_free+0xc3/0x16b
>  bio_free+0x25/0x66
>  bio_put+0x14/0x26
>  blkdev_issue_flush+0x7a/0x85
>  blkdev_fsync+0x35/0x42
>  vfs_fsync_range+0x8e/0x9f
>  vfs_fsync+0x1c/0x1e
>  do_fsync+0x31/0x4a
>  SyS_fsync+0x10/0x14
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  block/bio-integrity.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 5384713..b5009a8 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
>  	if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
>  		return false;
>  
> +	if (!bio_sectors(bio))
> +		return false;
> +
>  	/* Already protected? */
>  	if (bio_integrity(bio))
>  		return false;
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply
