From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Linux-RT <linux-rt-users@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] locking/rwbase: Prevent indefinite writer starvation
Date: Tue, 17 Jan 2023 15:22:30 +0100
Message-ID: <Y8avJm1FQI9vB9cv@linutronix.de>
In-Reply-To: <20230117083817.togfwc5cy4g67e5r@techsingularity.net>

On 2023-01-17 08:38:17 [+0000], Mel Gorman wrote:
> rw_semaphore and rwlock are explicitly unfair to writers in the presence
> of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
> ("locking/rt: Add base code for RT rw_semaphore and rwlock") notes;
> 
>         The implementation is writer unfair, as it is not feasible to do
>         priority inheritance on multiple readers, but experience has shown
>         that real-time workloads are not the typical workloads which are
>         sensitive to writer starvation.
> 
> While atypical, it's also trivial to block writers with PREEMPT_RT
> indefinitely without ever making forward progress. Since LTP-20220121,
> the dio_truncate test case went from having 1 reader to having 16 readers
> and the number of readers is sufficient to prevent the down_write ever
> succeeding while readers exist. Eventually the test is killed after 30
> minutes as a failure.
> 
> dio_truncate is not a realtime application but indefinite writer starvation

If so, then PI boosting would not work even if we had it ;)

> is undesirable. The test case has one writer appending and truncating files
> A and B while multiple readers read file A. The readers and writer are
> contending for one file's inode lock which never succeeds as the readers
> keep reading until the writer is done which never happens.

This tests the rwsem/rwlock implementation to ensure that it is not
writer unfair.

> This patch records a timestamp when the first writer is blocked if no
> deadline or realtime task has recently acquired the lock for read. If
> dt/rt tasks are involved, then reader bias is preserved. For other tasks,
DL/RT. Would it work to use capital letters when this refers to the
scheduling class?

> reader bias is allowed until the first writer has been blocked for a minimum
> of 4ms or 1 tick. The cutoff time is arbitrary on the assumption that a
> normal application contending for 4ms also does not need PREEMPT_RT. On
> a test machine, the test completed in 88 seconds.

I would go for one second just because it _usually_ does not matter
since none of the important locks rely on that (as stated in the commit
message). But then why not use the 4ms/1 tick as you suggest? This is
after all what the non-PREEMPT_RT implementation uses to ensure that
the writer is not stalled indefinitely. The RWLOCK implementation is
already writer unfair.

Side note: If the test case gets updated to use RT readers which hold
the lock (the whole time) then they will block the writer again :)

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/rwbase_rt.h  |  3 ++
>  kernel/locking/rwbase_rt.c | 84 +++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 82 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
> index 1d264dd08625..05c4dc74b8bd 100644
> --- a/include/linux/rwbase_rt.h
> +++ b/include/linux/rwbase_rt.h
> @@ -10,12 +10,14 @@
>  
>  struct rwbase_rt {
>  	atomic_t		readers;
> +	unsigned long		waiter_blocked;
>  	struct rt_mutex_base	rtmutex;
>  };
>  
>  #define __RWBASE_INITIALIZER(name)				\
>  {								\
>  	.readers = ATOMIC_INIT(READER_BIAS),			\
> +	.waiter_blocked = 0,					\
>  	.rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex),	\
>  }
>  
> @@ -23,6 +25,7 @@ struct rwbase_rt {
>  	do {							\
>  		rt_mutex_base_init(&(rwbase)->rtmutex);		\
>  		atomic_set(&(rwbase)->readers, READER_BIAS);	\
> +		(rwbase)->waiter_blocked = 0;			\
>  	} while (0)
>  
>  
> diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> index c201aadb9301..db2f6accf49f 100644
> --- a/kernel/locking/rwbase_rt.c
> +++ b/kernel/locking/rwbase_rt.c
> @@ -39,7 +39,11 @@
>   * major surgery for a very dubious value.
>   *
>   * The risk of writer starvation is there, but the pathological use cases
> - * which trigger it are not necessarily the typical RT workloads.
> + * which trigger it are not necessarily the typical RT workloads. The worst
> + * case of indefinite starvation of a writer will force readers into the
> + * slow path if a writer is blocked for more than RW_CONTENTION_THRESHOLD
> + * jiffies unless dl/rt tasks have taken a read lock since the last write
DL/RT please.

> + * unlock.
>   *
>   * Fast-path orderings:
>   * The lock/unlock of readers can run in fast paths: lock and unlock are only
> @@ -65,6 +69,61 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
>  	return 0;
>  }
>  
> +/*
> + * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.

    * This matches RWSEM_WAIT_TIMEOUT for the generic RWSEM
    * implementation.

> + * The granularity is not exact as the lowest bit in rwbase_rt->waiter_blocked
> + * is used to detect recent rt/dl tasks taking a read lock.
> + */
> +#define RW_CONTENTION_THRESHOLD (HZ/250+1)
				   DIV_ROUND_UP(HZ, 250)
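
To illustrate how the two expressions differ, here is a minimal userspace
sketch. DIV_ROUND_UP() is redefined locally as a stand-in for the kernel
macro, and the threshold_*() helper names are made up for this example.
The HZ value is passed as a parameter instead of coming from the kernel
configuration:

```c
#include <assert.h>

/* Userspace stand-in for the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Threshold as written in the patch: HZ/250+1 (hypothetical helper). */
static unsigned long threshold_patch(unsigned long hz)
{
	assert(hz > 0);
	return hz / 250 + 1;
}

/* Threshold as suggested in the review (hypothetical helper). */
static unsigned long threshold_round_up(unsigned long hz)
{
	assert(hz > 0);
	return DIV_ROUND_UP(hz, 250);
}
```

With HZ=1000 the patch variant yields 5 jiffies while the rounded-up
variant yields 4, which is exactly 4ms. With HZ=250 the values are 2 and
1 jiffies. With HZ=100 both yield 1 jiffy.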

> +static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
> +{
> +	/* No update required if dl/rt tasks already identified. */
> +	if (rwb->waiter_blocked & 1)
> +		return;
> +
> +	/*
> +	 * Record a dl/rt task acquiring the lock for read. This may result
DL/RT
> +	 * in indefinite writer starvation but dl/rt tasks should avoid such
> +	 * behaviour.
> +	 */
> +	if (dl_task(current) || rt_task(current)) {

There is also task_is_realtime(). But using only rt_task() should work
since it also covers dl_task().
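
A minimal userspace sketch of why that should hold, assuming the
priority layout as I remember it from around the time of this thread
(MAX_DL_PRIO = 0, MAX_RT_PRIO = 100, deadline tasks running at priority
-1). The MODEL_* constants and model_*() helpers are made up for
illustration and are not the kernel functions themselves:

```c
#include <stdbool.h>

/* Assumed values, mirroring the kernel's priority constants. */
#define MODEL_MAX_DL_PRIO 0
#define MODEL_MAX_RT_PRIO 100

/* Models a deadline-class check: priority below MAX_DL_PRIO. */
static bool model_dl_prio(int prio)
{
	return prio < MODEL_MAX_DL_PRIO;
}

/*
 * Models an RT-class check: priority below MAX_RT_PRIO. Every deadline
 * priority is also below MAX_RT_PRIO, so this check covers DL as well.
 */
static bool model_rt_prio(int prio)
{
	return prio < MODEL_MAX_RT_PRIO;
}
```

In this model, priority -1 (DL) passes both checks, priority 50 (RT)
passes only the RT check, and priority 120 (a normal task) passes
neither.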

> +		struct rt_mutex_base *rtm = &rwb->rtmutex;
> +		unsigned long flags;
> +
> +		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> +		rwb->waiter_blocked |= 1;
> +		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> +	}
> +}
> +
> +/* rtmutex->wait_lock must be held. */
> +static void __sched set_writer_blocked(struct rwbase_rt *rwb)
> +{
> +	/*
> +	 * Lowest bit preserved to identify recent rt/dl tasks acquiring
> +	 * the lock for read so guarantee at least one tick delay.
> +	 */
> +	rwb->waiter_blocked |= (jiffies + 2) & ~1UL;

I'm unsure what |= means in terms of multiple writers. It seems to
extend the wait period and the second writer has none after the first
one leaves.
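
To make the concern concrete, here is a small userspace model of the
quoted set_writer_blocked() expression. The function name and the
explicit "now" parameter (standing in for jiffies) are assumptions for
this sketch only:

```c
/*
 * Hypothetical model of set_writer_blocked(): OR the new, even-rounded
 * timestamp into the old value, keeping bit 0 as the DL/RT marker.
 */
static unsigned long model_set_writer_blocked(unsigned long waiter_blocked,
					      unsigned long now)
{
	return waiter_blocked | ((now + 2) & ~1UL);
}
```

Starting from 0 at now=1000 this stores 1002. A second call at now=1005
yields 1002 | 1006 = 1006, and a call at now=1022 yields 1002 | 1024 =
2026. So repeated calls never keep the first timestamp. The OR moves the
value forward, sometimes far beyond either timestamp.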

> +}
> +
> +static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
> +{
> +	/*
> +	 * Allow reader bias if a dl or rt task took the lock for read
> +	 * since the last write unlock. Such tasks should be designed
> +	 * to avoid heavy writer contention or indefinite starvation.
> +	 */
> +	if (rwb->waiter_blocked & 1)
> +		return true;
> +
> +	/*
> +	 * Allow reader bias unless a writer has been blocked for more
> +	 * than RW_CONTENTION_THRESHOLD jiffies.
> +	 */
> +	return jiffies - rwb->waiter_blocked < RW_CONTENTION_THRESHOLD;

If you set
	rwb->waiter_blocked = jiffies + RW_CONTENTION_THRESHOLD

then you could use
	time_after(jiffies, rwb->waiter_blocked)

and we could name it timeout. So the first writer sets it, and my guess
would be that each RT reader ignores this delay while every non-RT
reader tries to acquire the lock as long as the timeout has not occurred.
Then they back off and wait for one writer to acquire the lock.
I don't know what we do with a possible second writer, but I guess the
first writer should reset the timeout on unlock for the next writer. So
we again have reader followed by writer.
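
A rough userspace sketch of that idea, not a proposal for the actual
code. The struct, the model_*() helpers and the explicit "now" parameter
(standing in for jiffies) are made up for illustration. time_after() is
redefined locally in the same wrap-safe style as the kernel macro, and
the sketch only models the case where a writer is already waiting:

```c
#include <stdbool.h>

/* Local, wrap-safe stand-in for the kernel's time_after(a, b). */
#define time_after(a, b) ((long)((b) - (a)) < 0)

/* Hypothetical reduced lock state: only the writer timeout. */
struct model_rwb {
	unsigned long timeout;
};

/* First blocked writer arms the timeout. */
static void model_writer_blocked(struct model_rwb *rwb, unsigned long now,
				 unsigned long threshold)
{
	rwb->timeout = now + threshold;
}

/*
 * RT readers ignore the timeout. Non-RT readers keep their bias only
 * until the timeout has passed, then they have to back off.
 */
static bool model_allow_reader_bias(const struct model_rwb *rwb,
				    unsigned long now, bool reader_is_rt)
{
	if (reader_is_rt)
		return true;
	return !time_after(now, rwb->timeout);
}
```

With a writer blocked at now=1000 and a threshold of 4, a non-RT reader
is still allowed at now=1004 and refused at now=1005, while an RT reader
is allowed at both. Because the comparison uses a signed difference, it
keeps working when the counter wraps around.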

> +}
> +
>  static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
>  				      unsigned int state)
>  {
> @@ -74,9 +133,11 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
>  	raw_spin_lock_irq(&rtm->wait_lock);
>  	/*
>  	 * Allow readers, as long as the writer has not completely
> -	 * acquired the semaphore for write.
> +	 * acquired the semaphore for write and reader bias is still
> +	 * allowed.
>  	 */
> -	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
> +	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
> +	    rwbase_allow_reader_bias(rwb)) {
>  		atomic_inc(&rwb->readers);
>  		raw_spin_unlock_irq(&rtm->wait_lock);
>  		return 0;
> @@ -140,10 +201,18 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
>  static __always_inline int rwbase_read_lock(struct rwbase_rt *rwb,
>  					    unsigned int state)
>  {
> +	int ret;
> +
>  	if (rwbase_read_trylock(rwb))
> -		return 0;
> +		ret = 0;
> +	else
> +		ret = __rwbase_read_lock(rwb, state);
> +
> +	/* Record if the current task acquiring the lock is a dl/rt task. */
> +	if (!ret)
> +		update_dlrt_reader(rwb);
>  
> -	return __rwbase_read_lock(rwb, state);
> +	return ret;
>  }
>  
>  static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
> @@ -264,12 +333,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>  		if (__rwbase_write_trylock(rwb))
>  			break;
>  
> +		/* Record first new read/write contention. */
> +		set_writer_blocked(rwb);
> +
>  		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
>  		rwbase_schedule();
>  		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
>  
>  		set_current_state(state);
>  	}
> +
> +	rwb->waiter_blocked = 0;
>  	rwbase_restore_current_state();
>  	trace_contention_end(rwb, 0);
>  

Sebastian
