From: Andrew Morton <akpm@linux-foundation.org>
To: Shaohua Li <shli@kernel.org>
Cc: linux-mm@kvack.org, riel@redhat.com, minchan@kernel.org,
kmpark@infradead.org, hughd@google.com, aquini@redhat.com
Subject: Re: [patch 1/4 v4]swap: change block allocation algorithm for SSD
Date: Wed, 12 Jun 2013 15:21:22 -0700 [thread overview]
Message-ID: <20130612152122.1f18457bbf6fc096b70eea94@linux-foundation.org> (raw)
In-Reply-To: <20130326053706.GA19646@kernel.org>
On Tue, 26 Mar 2013 13:37:06 +0800 Shaohua Li <shli@kernel.org> wrote:
> I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to 20~30%
> CPU time (when cluster is hard to find, the CPU time can be up to 80%), which
> becomes a bottleneck. scan_swap_map() scans a byte array to search a 256 page
> cluster, which is very slow.
>
> Here I introduced a simple algorithm to search cluster. Since we only care
> about 256 pages cluster, we can just use a counter to track if a cluster is
> free. Every 256 pages use one int to store the counter. If the counter of a
> cluster is 0, the cluster is free. All free clusters will be added to a list,
> so searching cluster is very efficient. With this, scap_swap_map() overhead
> disappears.
>
> Since searching cluster with a list is easy, we can easily implement a per-cpu
> cluster algorithm to do block allocation, which can make swapout more
> efficient. This is in my TODO list.
>
> This might help low end SD card swap too. Because if the cluster is aligned, SD
> firmware can do flash erase more efficiently.
>
> We only enable the algorithm for SSD. Hard disk swap isn't fast enough and has
> downside with the algorithm which might introduce regression (see below).
>
> The patch slightly changes which cluster is choosen. It always adds free
> cluster to list tail. This can help wear leveling for low end SSD too. And if
> no cluster found, the scan_swap_map() will do search from the end of last
> cluster. So if no cluster found, the scan_swap_map() will do search from the
> end of last free cluster, which is random. For SSD, this isn't a problem at
> all.
>
> Another downside is the cluster must be aligned to 256 pages, which will reduce
> the chance to find a cluster. I would expect this isn't a big problem for SSD
> because of the non-seek penality. (And this is the reason I only enable the
> algorithm for SSD).
>
> ...
>
> +/*
> + * cluster info is a unsigned int, the highest 8 bits stores flags, the low 24
> + * bits stores next cluster if the cluster is free or cluster counter otherwise
> + */
> +#define CLUSTER_FLAG_FREE (1 << 0) /* This cluster is free */
> +#define CLUSTER_FLAG_NEXT_NULL (1 << 1) /* This cluster has no next cluster */
> +#define CLUSTER_NULL (CLUSTER_FLAG_NEXT_NULL << 24)
> +static inline unsigned int cluster_flag(unsigned int info)
> +{
> + return info >> 24;
> +}
> +
> +static inline void cluster_set_flag(unsigned int *info, unsigned int flag)
> +{
> + *info = ((*info) & 0xffffff) | (flag << 24);
> +}
> +
> +static inline unsigned int cluster_count(unsigned int info)
> +{
> + return info & 0xffffff;
> +}
> +
> +static inline void cluster_set_count(unsigned int *info, unsigned int c)
> +{
> + *info = (cluster_flag(*info) << 24) | c;
> +}
> +
> +static inline unsigned int cluster_next(unsigned int info)
> +{
> + return info & 0xffffff;
> +}
> +
> +static inline void cluster_set_next(unsigned int *info, unsigned int n)
> +{
> + *info = (cluster_flag(*info) << 24) | n;
> +}
> +
> +static inline bool cluster_is_free(unsigned int info)
> +{
> + return cluster_flag(info) & CLUSTER_FLAG_FREE;
> +}
This is all a bit gruesome and might generate inefficient code.
It may look a bit better if we were to do
#define CLUSTER_FLAG_FREE (1 << 24) /* This cluster is free */
#define CLUSTER_FLAG_NEXT_NULL (2 << 24)
However I suspect it would work out very nicely if the code were to use
C bitfields?
> +static inline void inc_cluster_info_page(struct swap_info_struct *p,
> + unsigned int *cluster_info, unsigned long page_nr)
> +{
> + unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> +
> + if (!cluster_info)
> + return;
> + if (cluster_is_free(cluster_info[idx])) {
> + VM_BUG_ON(p->free_cluster_head != idx);
> + p->free_cluster_head = cluster_next(cluster_info[idx]);
> + if (p->free_cluster_tail == idx) {
> + p->free_cluster_tail = CLUSTER_NULL;
> + p->free_cluster_head = CLUSTER_NULL;
> + }
> + cluster_set_flag(&cluster_info[idx], 0);
> + cluster_set_count(&cluster_info[idx], 0);
> + }
> +
> + VM_BUG_ON(cluster_count(cluster_info[idx]) >= SWAPFILE_CLUSTER);
> + cluster_set_count(&cluster_info[idx],
> + cluster_count(cluster_info[idx]) + 1);
> +}
> +
> +static inline void dec_cluster_info_page(struct swap_info_struct *p,
> + unsigned int *cluster_info, unsigned long page_nr)
> +{
> + unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> +
> + if (!cluster_info)
> + return;
> +
> + VM_BUG_ON(cluster_count(cluster_info[idx]) == 0);
> + cluster_set_count(&cluster_info[idx],
> + cluster_count(cluster_info[idx]) - 1);
> +
> + if (cluster_count(cluster_info[idx]) == 0) {
> + cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> + if (p->free_cluster_head == CLUSTER_NULL) {
> + p->free_cluster_head = idx;
> + p->free_cluster_tail = idx;
> + } else {
> + cluster_set_next(&cluster_info[p->free_cluster_tail],
> + idx);
> + p->free_cluster_tail = idx;
> + }
> + }
> +}
I'd remove the 'inline' keywords here - the compiler will work it out
for us.
> +/*
> + * It's possible scan_swap_map() uses a free cluster in the middle of free
> + * cluster list. Avoiding such abuse to avoid list corruption.
> + */
> +static inline bool scan_swap_map_recheck_cluster(struct swap_info_struct *si,
> + unsigned long offset)
> +{
> + offset /= SWAPFILE_CLUSTER;
> + return si->free_cluster_head != CLUSTER_NULL &&
> + offset != si->free_cluster_head &&
> + cluster_is_free(si->cluster_info[offset]);
> +}
> +
> static unsigned long scan_swap_map(struct swap_info_struct *si,
> unsigned char usage)
> {
>
> ...
>
> @@ -2102,13 +2277,28 @@ SYSCALL_DEFINE2(swapon, const char __use
> error = -ENOMEM;
> goto bad_swap;
> }
> + if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
> + p->flags |= SWP_SOLIDSTATE;
> + /*
> + * select a random position to start with to help wear leveling
> + * SSD
> + */
> + p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
> +
> + cluster_info = vzalloc(DIV_ROUND_UP(maxpages,
> + SWAPFILE_CLUSTER) * sizeof(*cluster_info));
Why vmalloc()? How large can this allocation be?
> + if (!cluster_info) {
> + error = -ENOMEM;
> + goto bad_swap;
> + }
> + }
>
> error = swap_cgroup_swapon(p->type, maxpages);
> if (error)
> goto bad_swap;
>
> nr_extents = setup_swap_map_and_extents(p, swap_header, swap_map,
> - maxpages, &span);
> + cluster_info, maxpages, &span);
> if (unlikely(nr_extents < 0)) {
> error = nr_extents;
> goto bad_swap;
> ...
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2013-06-12 22:21 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-03-26 5:37 [patch 1/4 v4]swap: change block allocation algorithm for SSD Shaohua Li
2013-03-29 2:00 ` Rafael Aquini
2013-03-29 2:50 ` Shaohua Li
2013-03-29 3:13 ` Rafael Aquini
2013-04-29 22:17 ` Rafael Aquini
2013-06-12 22:21 ` Andrew Morton
2013-06-12 22:21 ` Andrew Morton [this message]
2013-06-13 10:54 ` Shaohua Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130612152122.1f18457bbf6fc096b70eea94@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=aquini@redhat.com \
--cc=hughd@google.com \
--cc=kmpark@infradead.org \
--cc=linux-mm@kvack.org \
--cc=minchan@kernel.org \
--cc=riel@redhat.com \
--cc=shli@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).