Re: [PATCH -mm] mm, swap: Fix race between swap count continuation operations

From: "Huang, Ying" <ying.huang@intel.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Johannes Weiner <hannes@cmpxchg.org>,
	Shaohua Li <shli@kernel.org>, Tim Chen <tim.c.chen@intel.com>,
	Aaron Lu <aaron.lu@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Andi Kleen <ak@linux.intel.com>, Minchan Kim <minchan@kernel.org>
Subject: Re: [PATCH -mm] mm, swap: Fix race between swap count continuation operations
Date: Tue, 17 Oct 2017 21:51:12 +0800
Message-ID: <87zi8p7ui7.fsf@yhuang-dev.intel.com>
In-Reply-To: <20171017084827.bochtr5ufsgylvkd@dhcp22.suse.cz> (Michal Hocko's message of "Tue, 17 Oct 2017 10:48:27 +0200")

Michal Hocko <mhocko@kernel.org> writes:

> On Tue 17-10-17 16:13:20, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> One page may store a set of entries of the
>> sis->swap_map (swap_info_struct->swap_map) that belong to multiple
>> swap clusters.  If any of the entries has sis->swap_map[offset] >
>> SWAP_MAP_MAX, multiple pages are used to store that set of
>> sis->swap_map entries, and the pages are linked through page->lru.
>> This is called swap count continuation.  Previously, sis->lock was
>> used to serialize access to the pages that store the set of
>> sis->swap_map entries.  But to improve the scalability of
>> __swap_duplicate(), the swap cluster lock may now be used in
>> swap_count_continued().  This may race with
>> add_swap_count_continuation() operating on a nearby swap cluster
>> whose sis->swap_map entries are stored in the same page.
>
> So what is the result of the race? Is a user able to trigger it?

Yes.  It is possible for a user to trigger it.  With the race, the
reference count of swap slots may become wrong, which may cause
swap slot leaks, infinite loops in the kernel, etc.
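
As a rough illustration of why the count can go wrong (a user-space
sketch, not the kernel code): one slot's count is spread over a byte
in the swap_map page plus the byte at the same offset in each
continuation page, so an increment or decrement is a multi-step
carry or borrow that must look atomic to any other walker of that
page list.  The struct and the model_*() helpers below are
hypothetical names, the COUNT_CONTINUED flag bit is left out for
brevity, and the constants are the values as I recall them from
include/linux/swap.h, so treat them as assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values, recalled from include/linux/swap.h of that era. */
#define SWAP_MAP_MAX	0x3e	/* largest count kept in swap_map[] itself */
#define SWAP_CONT_MAX	0x7f	/* largest count kept in a continuation byte */

#define MODEL_PAGES	4	/* hypothetical: swap_map page + 3 continuation pages */

/* digit[0] models swap_map[offset]; digit[i] models continuation page i. */
struct model_count {
	unsigned char digit[MODEL_PAGES];
};

static unsigned digit_max(size_t i)
{
	return i == 0 ? SWAP_MAP_MAX : SWAP_CONT_MAX;
}

/*
 * Increment with carry.  Returns false when every modelled page is
 * full, which is where the kernel would need another continuation page.
 */
static bool model_inc(struct model_count *c)
{
	size_t i = 0;

	while (i < MODEL_PAGES && c->digit[i] == digit_max(i))
		i++;
	if (i == MODEL_PAGES)
		return false;
	c->digit[i]++;
	while (i-- > 0)			/* lower digits wrap to zero */
		c->digit[i] = 0;
	return true;
}

/* Decrement with borrow.  Returns false if the count is already zero. */
static bool model_dec(struct model_count *c)
{
	size_t i = 0;

	while (i < MODEL_PAGES && c->digit[i] == 0)
		i++;
	if (i == MODEL_PAGES)
		return false;
	c->digit[i]--;
	while (i-- > 0)			/* lower digits wrap to their maximum */
		c->digit[i] = digit_max(i);
	return true;
}

/* Total count: a mixed-radix number, base 0x3f first, then base 0x80. */
static unsigned long model_total(const struct model_count *c)
{
	unsigned long total = 0, weight = 1;

	for (size_t i = 0; i < MODEL_PAGES; i++) {
		total += c->digit[i] * weight;
		weight *= digit_max(i) + 1;
	}
	return total;
}
```

For example, 1000 increments from zero leave digit[0] == 55 and
digit[1] == 15 in this model, since 1000 = 15 * 63 + 55.  If a carry
is lost or applied twice because two walkers interleave, the total no
longer matches the number of references; a total that is too high
never drops back to zero, so the slot is never freed.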

>> To fix the race, a new spin lock called cont_lock is added to struct
>> swap_info_struct to protect the swap count continuation page list.
>> This is a lock at the swap device level, so its scalability isn't very
>> good.  But it is still much better than the original sis->lock,
>> because it is only acquired/released when swap count continuation is
>> used, which is considered rare in practice.  If it turns out that the
>> scalability becomes an issue for some workloads, we can split the lock
>> into several finer-grained locks.
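
The locking scheme in the quoted paragraph can be modelled in user
space roughly as follows.  This is only a sketch under stated
assumptions: pthread mutexes stand in for the kernel spinlocks, a
singly linked list stands in for the page->lru list, and
model_swap_device, worker() and run_model() are hypothetical names.
Each thread updates its own cluster under a per-cluster lock and
only takes the device-wide cont_lock on the rare path that touches
the shared continuation list.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NR_CLUSTERS	4
#define OPS_PER_THREAD	10000
#define CONT_EVERY	64	/* hypothetical: how rarely continuation is hit */

struct cont_node {
	struct cont_node *next;
};

struct model_swap_device {
	pthread_mutex_t cluster_lock[NR_CLUSTERS];	/* fine grained */
	pthread_mutex_t cont_lock;	/* device wide, guards cont_list only */
	struct cont_node *cont_list;	/* shared by all clusters */
	unsigned long cluster_ops[NR_CLUSTERS];
};

struct worker_arg {
	struct model_swap_device *dev;
	int cluster;
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;
	struct model_swap_device *dev = arg->dev;

	for (int i = 0; i < OPS_PER_THREAD; i++) {
		struct cont_node *n = NULL;

		if (i % CONT_EVERY == 0) {	/* allocate before locking */
			n = malloc(sizeof(*n));
			if (!n)
				abort();
		}
		pthread_mutex_lock(&dev->cluster_lock[arg->cluster]);
		dev->cluster_ops[arg->cluster]++;
		if (n) {
			/* rare path: shared list, so take the device lock */
			pthread_mutex_lock(&dev->cont_lock);
			n->next = dev->cont_list;
			dev->cont_list = n;
			pthread_mutex_unlock(&dev->cont_lock);
		}
		pthread_mutex_unlock(&dev->cluster_lock[arg->cluster]);
	}
	return NULL;
}

/* Runs one thread per cluster and returns the final list length. */
static unsigned long run_model(void)
{
	struct model_swap_device dev = { .cont_list = NULL };
	struct worker_arg arg[NR_CLUSTERS];
	pthread_t tid[NR_CLUSTERS];
	unsigned long len = 0;

	pthread_mutex_init(&dev.cont_lock, NULL);
	for (int c = 0; c < NR_CLUSTERS; c++)
		pthread_mutex_init(&dev.cluster_lock[c], NULL);
	for (int c = 0; c < NR_CLUSTERS; c++) {
		arg[c].dev = &dev;
		arg[c].cluster = c;
		if (pthread_create(&tid[c], NULL, worker, &arg[c]) != 0)
			abort();
	}
	for (int c = 0; c < NR_CLUSTERS; c++)
		pthread_join(tid[c], NULL);

	while (dev.cont_list) {		/* count and free the nodes */
		struct cont_node *n = dev.cont_list;

		dev.cont_list = n->next;
		free(n);
		len++;
	}
	return len;
}
```

In this model, dropping cont_lock would let two threads that hold
different cluster locks modify cont_list at the same time, which is
the same shape as the race described above.  The cost of the extra
lock is only paid on the CONT_EVERY path.
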
>
> Is this stable material? Could you think of the appropriate Fixes:
> tags?

Sure.  Will add it.

Best Regards,
Huang, Ying

>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Tim Chen <tim.c.chen@intel.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Aaron Lu <aaron.lu@intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  include/linux/swap.h |  4 ++++
>>  mm/swapfile.c        | 23 +++++++++++++++++------
>>  2 files changed, 21 insertions(+), 6 deletions(-)
>> 
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 1f5c52313890..9e8e11be7e0b 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -266,6 +266,10 @@ struct swap_info_struct {
>>  					 * both locks need hold, hold swap_lock
>>  					 * first.
>>  					 */
>> +	spinlock_t cont_lock;		/*
>> +					 * protect swap count continuation page
>> +					 * list.
>> +					 */
>>  	struct work_struct discard_work; /* discard worker */
>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>  };
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index d67715ffc194..3074b02eaa09 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -2876,6 +2876,7 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  	p->flags = SWP_USED;
>>  	spin_unlock(&swap_lock);
>>  	spin_lock_init(&p->lock);
>> +	spin_lock_init(&p->cont_lock);
>>  
>>  	return p;
>>  }
>> @@ -3558,6 +3559,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>>  	head = vmalloc_to_page(si->swap_map + offset);
>>  	offset &= ~PAGE_MASK;
>>  
>> +	spin_lock(&si->cont_lock);
>>  	/*
>>  	 * Page allocation does not initialize the page's lru field,
>>  	 * but it does always reset its private field.
>> @@ -3577,7 +3579,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>>  		 * a continuation page, free our allocation and use this one.
>>  		 */
>>  		if (!(count & COUNT_CONTINUED))
>> -			goto out;
>> +			goto out_unlock_cont;
>>  
>>  		map = kmap_atomic(list_page) + offset;
>>  		count = *map;
>> @@ -3588,11 +3590,13 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>>  		 * free our allocation and use this one.
>>  		 */
>>  		if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
>> -			goto out;
>> +			goto out_unlock_cont;
>>  	}
>>  
>>  	list_add_tail(&page->lru, &head->lru);
>>  	page = NULL;			/* now it's attached, don't free it */
>> +out_unlock_cont:
>> +	spin_unlock(&si->cont_lock);
>>  out:
>>  	unlock_cluster(ci);
>>  	spin_unlock(&si->lock);
>> @@ -3617,6 +3621,7 @@ static bool swap_count_continued(struct swap_info_struct *si,
>>  	struct page *head;
>>  	struct page *page;
>>  	unsigned char *map;
>> +	bool ret;
>>  
>>  	head = vmalloc_to_page(si->swap_map + offset);
>>  	if (page_private(head) != SWP_CONTINUED) {
>> @@ -3624,6 +3629,7 @@ static bool swap_count_continued(struct swap_info_struct *si,
>>  		return false;		/* need to add count continuation */
>>  	}
>>  
>> +	spin_lock(&si->cont_lock);
>>  	offset &= ~PAGE_MASK;
>>  	page = list_entry(head->lru.next, struct page, lru);
>>  	map = kmap_atomic(page) + offset;
>> @@ -3644,8 +3650,10 @@ static bool swap_count_continued(struct swap_info_struct *si,
>>  		if (*map == SWAP_CONT_MAX) {
>>  			kunmap_atomic(map);
>>  			page = list_entry(page->lru.next, struct page, lru);
>> -			if (page == head)
>> -				return false;	/* add count continuation */
>> +			if (page == head) {
>> +				ret = false;	/* add count continuation */
>> +				goto out;
>> +			}
>>  			map = kmap_atomic(page) + offset;
>>  init_map:		*map = 0;		/* we didn't zero the page */
>>  		}
>> @@ -3658,7 +3666,7 @@ init_map:		*map = 0;		/* we didn't zero the page */
>>  			kunmap_atomic(map);
>>  			page = list_entry(page->lru.prev, struct page, lru);
>>  		}
>> -		return true;			/* incremented */
>> +		ret = true;			/* incremented */
>>  
>>  	} else {				/* decrementing */
>>  		/*
>> @@ -3684,8 +3692,11 @@ init_map:		*map = 0;		/* we didn't zero the page */
>>  			kunmap_atomic(map);
>>  			page = list_entry(page->lru.prev, struct page, lru);
>>  		}
>> -		return count == COUNT_CONTINUED;
>> +		ret = count == COUNT_CONTINUED;
>>  	}
>> +out:
>> +	spin_unlock(&si->cont_lock);
>> +	return ret;
>>  }
>>  
>>  /*
>> -- 
>> 2.14.2
>> 
