Re: Recent kernel "mount" slow

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jeff Chua <jeff.chua.linux@gmail.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, lkml <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Recent kernel "mount" slow
Date: Tue, 27 Nov 2012 17:45:40 +0900	[thread overview]
Message-ID: <CAAJw_Zsg7AW5+8h8pXZ-dPhBOZ2oechWA6N9eb1fea-vTj=vCw@mail.gmail.com> (raw)
In-Reply-To: <50B46F79.5010606@kernel.dk>

Jens,

Limited access now at Incheon Airport. Will try the patch out when I arrived.

Thanks,
Jeff

On 11/27/12, Jens Axboe <axboe@kernel.dk> wrote:
> On 2012-11-27 08:38, Jens Axboe wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <jeff.chua.linux@gmail.com>
>>> wrote:
>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <mpatocka@redhat.com>
>>>> wrote:
>>>>> So it's better to slow down mount.
>>>>
>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>> you're saying that we need to do this semaphore check at boot up. By
>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
>>
>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>> tree. It would be a good data point it you could test that, too.
>>
>> In any case, the slow down definitely isn't acceptable. Fixing an
>> obscure issue like block sizes changing while O_DIRECT is in flight
>> definitely does NOT warrant a mount slow down.
>
> Here's Olegs patch, might be easier for you than switching to
> linux-next. Please try that.
>
> From: Oleg Nesterov <oleg@redhat.com>
> Subject: percpu_rw_semaphore: reimplement to not block the readers
> unnecessarily
>
> Currently the writer does msleep() plus synchronize_sched() 3 times to
> acquire/release the semaphore, and during this time the readers are
> blocked completely.  Even if the "write" section was not actually started
> or if it was already finished.
>
> With this patch down_write/up_write does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use the
> slow path.
>
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
>
> Also.  With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> Cc: Anton Arapov <anton@redhat.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  include/linux/percpu-rwsem.h |   85 +++-------------------
>  lib/Makefile                 |    2
>  lib/percpu-rwsem.c           |  123 +++++++++++++++++++++++++++++++++
>  3 files changed, 138 insertions(+), 72 deletions(-)
>
> diff -puN
> include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> include/linux/percpu-rwsem.h
> ---
> a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> +++ a/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
>
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
>
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	unsigned int __percpu	*fast_read_ctr;
> +	struct mutex		writer_mutex;
> +	struct rw_semaphore	rw_sem;
> +	atomic_t		slow_read_ctr;
> +	wait_queue_head_t	write_waitq;
>  };
>
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
>
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with
> D */
> -}
> -
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired
> with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -	mutex_lock(&p->mtx);
> -	p->locked = true;
> -	synchronize_sched(); /* make sure that all readers exit the
> rcu_read_lock_sched region */
> -	while (__percpu_count(p->counters))
> -		msleep(1);
> -	heavy_mb(); /* C, between read of p->counter and write to data, paired
> with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -	heavy_mb(); /* D, between write to data and write to p->locked, paired
> with A */
> -	p->locked = false;
> -	mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	p->counters = alloc_percpu(unsigned);
> -	if (unlikely(!p->counters))
> -		return -ENOMEM;
> -	p->locked = false;
> -	mutex_init(&p->mtx);
> -	return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	free_percpu(p->counters);
> -	p->counters = NULL; /* catch use after free bugs */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
> +
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
>
>  #endif
> diff -puN
> lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> lib/Makefile
> ---
> a/lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> +++ a/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
>  	 idr.o int_sqrt.o extable.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o earlycpio.o
> +	 is_single_threaded.o plist.o decompress.o earlycpio.o percpu-rwsem.o
>
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff -puN /dev/null lib/percpu-rwsem.c
> --- /dev/null
> +++ a/lib/percpu-rwsem.c
> @@ -0,0 +1,123 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +
> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int
> val)
> +{
> +	bool success = false;
> +
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		success = true;
> +	}
> +	preempt_enable();
> +
> +	return success;
> +}
> +
> +/*
> + * Like the normal down_read() this is not recursive, the writer can
> + * come after the first percpu_down_read() and create the deadlock.
> + */
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, +1)))
> +		return;
> +
> +	down_read(&brw->rw_sem);
> +	atomic_inc(&brw->slow_read_ctr);
> +	up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, -1)))
> +		return;
> +
> +	/* false-positive is possible but harmless */
> +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> +		wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	unsigned int sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr
> counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new
> readers,
> + * then waits until the slow counter becomes zero.
> + */
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);
> +
> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();
> +
> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);
> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();
> +
> +	mutex_unlock(&brw->writer_mutex);
> +}
>
> --
> Jens Axboe
>
>

next prev parent reply	other threads:[~2012-11-27  8:45 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAAJw_ZtbhE5Jtd4PsWx8a23QdFTW7aMrKBmRf-bo5Wrean9Xhg@mail.gmail.com>
2012-11-20 18:09 ` Recent kernel "mount" slow Jan Kara
2012-11-21 15:46   ` Jeff Chua
2012-11-22 14:30     ` Jeff Chua
2012-11-22 19:21       ` Linus Torvalds
2012-11-23 13:24         ` Jens Axboe
2012-11-23 22:21           ` Jeff Chua
2012-11-23 23:31             ` Jeff Chua
2012-11-23 23:48               ` Jeff Chua
2012-11-24 21:09             ` Mikulas Patocka
2012-11-24 23:23               ` Jeff Chua
2012-11-27  5:57                 ` Jeff Chua
2012-11-27  7:38                   ` Jens Axboe
2012-11-27  7:44                     ` Jens Axboe
2012-11-27  8:45                       ` Jeff Chua [this message]
2012-11-27 10:06                     ` Jeff Chua
2012-11-27 12:33                       ` Jens Axboe
2012-11-28  3:57                         ` Mikulas Patocka
2012-11-28  8:33                           ` Jens Axboe
2012-11-28 13:05                             ` Jeff Chua
2012-11-28 17:25                             ` [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow) Mikulas Patocka
2012-11-28 19:15                               ` Linus Torvalds
2012-11-28 19:43                                 ` Al Viro
2012-11-28 19:53                                   ` Linus Torvalds
2012-11-28 22:01                                   ` [PATCH v2] Do a proper locking for mmap and block size change Mikulas Patocka
2012-11-29 17:19                                     ` Linus Torvalds
2012-11-29 18:23                                       ` Mikulas Patocka
2012-11-29 18:46                                         ` Linus Torvalds
2012-11-29 19:02                                       ` Linus Torvalds
2012-11-29 19:15                                         ` Chris Mason
2012-11-29 19:26                                           ` Linus Torvalds
2012-11-29 19:48                                             ` Chris Mason
2012-11-29 19:55                                               ` Linus Torvalds
2012-11-29 20:10                                                 ` Linus Torvalds
2012-11-29 20:52                                               ` Linus Torvalds
2012-11-29 21:29                                                 ` Chris Mason
2012-11-29 22:16                                                   ` Linus Torvalds
2012-11-29 22:36                                                     ` Linus Torvalds
2012-11-30  1:16                                                       ` Chris Mason
2012-11-30  2:13                                                         ` Linus Torvalds
2012-11-30  2:27                                                           ` Chris Mason
2012-11-30  2:49                                                     ` Dave Chinner
2012-11-30 14:31                                                       ` Chris Mason
2012-11-30 16:42                                                         ` Linus Torvalds
2012-11-30 16:36                                                       ` Christoph Hellwig
2012-11-30 22:40                                                         ` Dave Chinner
2012-11-30 23:09                                                           ` Christoph Hellwig
2012-11-29 19:50                                             ` Linus Torvalds
2012-11-28 19:50                                 ` [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow) Mikulas Patocka
2012-11-28 20:03                                   ` Linus Torvalds
2012-11-28 20:13                                     ` Linus Torvalds
2012-11-28 20:32                                       ` Linus Torvalds
2012-11-28 20:47                                         ` Linus Torvalds
2012-11-28 22:10                                           ` Mikulas Patocka
2012-11-28 21:29                                       ` Mikulas Patocka
2012-11-28 22:52                                         ` Linus Torvalds
2012-11-28 23:13                                           ` Linus Torvalds
2012-11-29  1:20                                             ` Mikulas Patocka
2012-11-29  0:38                                           ` Mikulas Patocka
2012-11-29  2:04                                             ` Linus Torvalds
2012-11-29  2:58                                               ` Linus Torvalds
2012-11-29  6:16                                                 ` Linus Torvalds
2012-11-29  6:25                                                   ` Al Viro
2012-11-29  6:30                                                     ` Al Viro
2012-11-29  6:37                                                       ` Linus Torvalds
2012-11-29  6:45                                                         ` Al Viro
2012-11-29 10:57                                                           ` Jeff Chua
2012-11-29  6:33                                                     ` Linus Torvalds
2012-11-29 14:12                                                   ` Chris Mason
2012-11-29 17:26                                                     ` Chris Mason
2012-11-29 17:26                                                     ` Linus Torvalds
2012-11-29 17:51                                                       ` Chris Mason
2012-11-29 18:12                                                         ` Linus Torvalds
2012-11-28  3:59                       ` [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited Mikulas Patocka
2012-11-28  4:01                         ` [PATCH 2/2] block_dev: don't take the write lock if block size doesn't change Mikulas Patocka
2012-11-28 14:24                           ` Jeff Chua
2012-11-28 22:03                             ` Mikulas Patocka
2012-11-28 14:19                         ` [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited Jeff Chua
2012-11-30  0:06                         ` Andrew Morton
2012-11-30  3:00                           ` Mikulas Patocka
2012-11-30 13:42                             ` Paul E. McKenney
2012-11-30 18:57                           ` Oleg Nesterov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAAJw_Zsg7AW5+8h8pXZ-dPhBOZ2oechWA6N9eb1fea-vTj=vCw@mail.gmail.com' \
    --to=jeff.chua.linux@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=jack@suse.cz \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).