All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Khalid Aziz <khalid.aziz@oracle.com>
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	peterz@infradead.org, andi.kleen@intel.com, rob@landley.net,
	viro@zeniv.linux.org.uk, oleg@redhat.com,
	gnomes@lxorguk.ukuu.org.uk, riel@redhat.com, snorcht@gmail.com,
	dhowells@redhat.com, luto@amacapital.net, daeseok.youn@gmail.com,
	ebiederm@xmission.com, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org
Subject: Re: [PATCH v2] Pre-emption control for userspace
Date: Tue, 25 Mar 2014 10:44:10 -0700	[thread overview]
Message-ID: <20140325104410.25e06c4d.akpm@linux-foundation.org> (raw)
In-Reply-To: <1395767870-28053-1-git-send-email-khalid.aziz@oracle.com>

On Tue, 25 Mar 2014 11:17:50 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:

> 
> This patch adds a way for a thread to request additional timeslice from
> the scheduler if it is about to be preempted, so it could complete any
> critical task it is in the middle of. This functionality helps with
> performance on databases and has been used for many years on other OSs
> by the databases. This functionality helps in situation where a thread
> acquires a lock before performing a critical operation on the database,
> happens to get preempted before it completes its task and releases the
> lock.  This lock causes all other threads that also acquire the same
> lock to perform their critical operation on the database to start
> queueing up and causing large number of context switches. This queueing
> problem can be avoided if the thread that acquires lock first could
> request scheduler to grant it an additional timeslice once it enters its
> critical section and hence allow it to complete its critical sectiona
> without causing queueing problem. If critical section completes before
> the thread is due for preemption, the thread can simply desassert its
> request. A thread sends the scheduler this request by setting a flag in
> a memory location it has shared with the kernel.  Kernel uses bytes in
> the same memory location to let the thread know when its request for
> amnesty from preemption has been granted. Thread should yield the
> processor at the end of its critical section if it was granted amnesty
> to play nice with other threads. If thread fails to yield processor, it
> gets penalized by having its next amnesty request turned down by
> scheduler.  Documentation file included in this patch contains further
> details on how to use this functionality and conditions associated with
> its use. This patch also adds a new field in scheduler statistics which
> keeps track of how many times was a thread granted amnesty from
> preemption. This feature and its usage are documented in
> Documentation/scheduler/sched-preempt-delay.txt and this patch includes
> a test for this feature under tools/testing/selftests/preempt-delay

What a long paragraph ;)

The feature makes sense and sounds useful, but I'd like to see some
real world performance testing results so we can understand what the
benefit will be to our users?

>
> ...
>
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +static int
> +tid_preempt_delay_show(struct seq_file *m, void *v)
> +{
> +	struct inode *inode = m->private;
> +	struct task_struct *task = get_proc_task(inode);
> +	unsigned char *delay_req;
> +
> +	if (!task)
> +		return -ENOENT;
> +
> +	delay_req = (unsigned char *)task->sched_preempt_delay.delay_req;
> +	seq_printf(m, "0x%-p\n", delay_req);
> +
> +	put_task_struct(task);
> +	return 0;
> +}
> +
> +static ssize_t
> +tid_preempt_delay_write(struct file *file, const char __user *buf,
> +			  size_t count, loff_t *offset)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct task_struct *task = get_proc_task(inode);
> +	u32 __user *delay_req;
> +	int retval;
> +
> +	if (!task) {
> +		retval = -ENOENT;
> +		goto out;
> +	}
> +
> +	/*
> +	 * A thread can write only to its corresponding preempt_delay
> +	 * proc file
> +	 */
> +	if (current != task) {
> +		retval =  -EPERM;
> +		goto out;
> +	}
> +
> +	delay_req = *(u32 __user **)buf;
> +
> +	/*
> +	 * Do not allow write if pointer is currently set
> +	 */
> +	if (task->sched_preempt_delay.delay_req && (delay_req != NULL)) {
> +		retval = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Validate the pointer.
> +	 */
> +	if (unlikely(!access_ok(rw, delay_req, sizeof(u32)))) {
> +		retval = -EFAULT;
> +		goto out;
> +	}
> +
> +	task->sched_preempt_delay.delay_req = delay_req;
> +
> +	/* zero out flags */
> +	put_user(0, delay_req);
> +
> +	retval = count;
> +
> +out:
> +	put_task_struct(task);
> +	return retval;
> +}

So the procfs file is written in binary format and is read back in
ascii format.  Seems odd.

Perhaps this should all be done as a new syscall rather than some
procfs thing.

>
> ...
>
> @@ -1250,6 +1251,13 @@ struct task_struct {
>  	/* Revert to default priority/policy when forking */
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	struct preempt_delay {
> +		u32 __user *delay_req;		/* delay request flag pointer */
> +		unsigned char delay_granted:1;	/* currently in delay */
> +		unsigned char yield_penalty:1;	/* failure to yield penalty */
> +	} sched_preempt_delay;

The problem with bitfields is that a write to one bitfield can corrupt
a concurrent write to the other one.  So it's your responsibility to
provide locking and/or to describe how this race is avoided.  A comment
here in the definition would be a suitable way of addressing this.

> +#endif
>  
>  	pid_t pid;
>  	pid_t tgid;
>
> ...
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4055,6 +4055,14 @@ SYSCALL_DEFINE0(sched_yield)
>  {
>  	struct rq *rq = this_rq_lock();
>  
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	/*
> +	 * Clear the penalty flag for current task to reward it for
> +	 * palying by the rules

"playing"

> +	 */
> +	current->sched_preempt_delay.yield_penalty = 0;
> +#endif
> +
>  	schedstat_inc(rq, yld_count);
>  	current->sched_class->yield_task(rq);
>  
> ...
>
> +static void
> +delay_resched_task(struct task_struct *curr)
> +{
> +	struct sched_entity *se;
> +	int cpu = task_cpu(curr);
> +	u32 __user *delay_req;
> +	unsigned int delay_req_flag;
> +	unsigned char *delay_flag;
> +
> +	/*
> +	 * Check if task is using pre-emption delay feature. If address
> +	 * for preemption delay request flag is not set, this task is
> +	 * not using preemption delay feature, we can reschedule without
> +	 * any delay
> +	 */
> +	delay_req = curr->sched_preempt_delay.delay_req;
> +
> +	if ((delay_req == NULL) || (cpu != smp_processor_id()))

Presumably we don't get "smp_processor_id() used in preemptible code"
warnings here, so called-with-preempt-disabled is a secret prerequisite
for delay_resched_task().

> +		goto resched_now;
> +
> +	/*
> +	 * Pre-emption delay will  be granted only once. If this task
> +	 * has already been granted delay, rechedule now
> +	 */
> +	if (curr->sched_preempt_delay.delay_granted) {
> +		curr->sched_preempt_delay.delay_granted = 0;
> +		goto resched_now;
> +	}
> +
> +	/*
> +	 * Get the value of preemption delay request flag from userspace.
> +	 * Task had already passed us the address where the flag is stored
> +	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> +	 * futex, leverage the futex code here to read the flag. If there
> +	 * is a page fault accessing this flag in userspace, that means
> +	 * userspace has not touched this flag recently and we can
> +	 * assume no preemption delay is needed.
> +	 *
> +	 * If task is not requesting additional timeslice, resched now
> +	 */
> +	if (delay_req) {
> +		int ret;
> +
> +		pagefault_disable();
> +		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
> +				sizeof(u32));
> +		pagefault_enable();

This all looks rather hacky and unneccesary.  Can't we somehow use
plain old get_user() and avoid such fuss?

> +		delay_flag = &delay_req_flag;
> +		if (ret || !delay_flag[0])
> +			goto resched_now;
> +	} else {
> +		goto resched_now;
> +	}
> +
> +	/*
> +	 * Current thread has requested preemption delay and has not
> +	 * been granted an extension yet. If this thread failed to yield
> +	 * processor after being granted amnesty last time, penalize it
> +	 * by not granting this delay request, otherwise give it an extra
> +	 * timeslice.
> +	 */
> +	if (curr->sched_preempt_delay.yield_penalty) {
> +		curr->sched_preempt_delay.yield_penalty = 0;
> +		goto resched_now;
> +	}
> +
> +	se = &curr->se;
> +	curr->sched_preempt_delay.delay_granted = 1;
> +
> +	/*
> +	 * Set the penalty flag for failing to yield the processor after
> +	 * being granted immunity. This flag will be cleared in
> +	 * sched_yield() if the thread indeed calls sched_yield
> +	 */
> +	curr->sched_preempt_delay.yield_penalty = 1;
> +
> +	/*
> +	 * Let the thread know it got amnesty and it should call
> +	 * sched_yield() when it is done to avoid penalty next time
> +	 * it wants amnesty. We need to write to userspace location.
> +	 * Since we just read from this location, chances are extremley
> +	 * low we might page fault. If we do page fault, we will ignore
> +	 * it and accept the cost of failed write in form of unnecessary
> +	 * penalty for userspace task for not yielding processor.
> +	 * This is a highly unlikely scenario.
> +	 */
> +	delay_flag[0] = 0;
> +	delay_flag[1] = 1;
> +	pagefault_disable();
> +	__copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
> +	pagefault_enable();

and put_user() here.

> +	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
> +	return;
> +
> +resched_now:
> +	resched_task(curr);
> +}
> +#else
> +#define delay_resched_task(curr) resched_task(curr)

This could have been implemented in C...

> +#endif /* CONFIG_SCHED_PREEMPT_DELAY */
> +
>
> ...
>

  reply	other threads:[~2014-03-25 17:44 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
2014-03-03 21:51 ` Davidlohr Bueso
2014-03-03 23:29   ` Khalid Aziz
2014-03-04 13:56 ` Oleg Nesterov
2014-03-04 17:44   ` Khalid Aziz
2014-03-04 18:38     ` Al Viro
2014-03-04 19:01       ` Khalid Aziz
2014-03-04 19:03     ` Oleg Nesterov
2014-03-04 20:14       ` Khalid Aziz
2014-03-05 14:38         ` Oleg Nesterov
2014-03-05 16:12           ` Oleg Nesterov
2014-03-05 17:10             ` Khalid Aziz
2014-03-04 21:12 ` H. Peter Anvin
2014-03-04 21:39   ` Khalid Aziz
2014-03-04 22:23     ` One Thousand Gnomes
2014-03-04 22:44       ` Khalid Aziz
2014-03-05  0:39         ` Thomas Gleixner
2014-03-05  0:51           ` Andi Kleen
2014-03-05 11:10             ` Peter Zijlstra
2014-03-05 17:29               ` Khalid Aziz
2014-03-05 19:58               ` Khalid Aziz
2014-03-06  9:57                 ` Peter Zijlstra
2014-03-06 16:08                   ` Khalid Aziz
2014-03-06 11:14                 ` Thomas Gleixner
2014-03-06 16:32                   ` Khalid Aziz
2014-03-05 14:54             ` Oleg Nesterov
2014-03-05 15:56               ` Andi Kleen
2014-03-05 16:36                 ` Oleg Nesterov
2014-03-05 17:22                   ` Khalid Aziz
2014-03-05 23:13                     ` David Lang
2014-03-05 23:48                       ` Khalid Aziz
2014-03-05 23:56                         ` H. Peter Anvin
2014-03-06  0:02                           ` Khalid Aziz
2014-03-06  0:13                             ` H. Peter Anvin
2014-03-05 23:59                         ` David Lang
2014-03-06  0:17                           ` Khalid Aziz
2014-03-06  0:36                             ` David Lang
2014-03-06  1:22                               ` Khalid Aziz
2014-03-06 14:23                                 ` David Lang
2014-03-06 12:13             ` Kevin Easton
2014-03-06 13:59               ` Peter Zijlstra
2014-03-06 22:41                 ` Andi Kleen
2014-03-06 14:25               ` David Lang
2014-03-06 16:12                 ` Khalid Aziz
2014-03-06 13:24   ` Rasmus Villemoes
2014-03-06 13:34     ` Peter Zijlstra
2014-03-06 13:45       ` Rasmus Villemoes
2014-03-06 14:02         ` Peter Zijlstra
2014-03-06 14:33           ` Thomas Gleixner
2014-03-06 14:34             ` H. Peter Anvin
2014-03-06 14:04         ` Thomas Gleixner
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
2014-03-25 17:44   ` Andrew Morton [this message]
2014-03-25 17:56     ` Khalid Aziz
2014-03-25 18:14       ` Andrew Morton
2014-03-25 17:46   ` Oleg Nesterov
2014-03-25 17:59     ` Khalid Aziz
2014-03-25 18:20   ` Andi Kleen
2014-03-25 18:47     ` Khalid Aziz
2014-03-25 19:47       ` Andi Kleen
2014-03-25 18:59   ` Eric W. Biederman
2014-03-25 19:15     ` Khalid Aziz
2014-03-25 20:31       ` Eric W. Biederman
2014-03-25 21:37         ` Khalid Aziz
2014-03-26  6:03     ` Mike Galbraith
2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
2014-03-25 23:29   ` Khalid Aziz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140325104410.25e06c4d.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=andi.kleen@intel.com \
    --cc=daeseok.youn@gmail.com \
    --cc=dhowells@redhat.com \
    --cc=ebiederm@xmission.com \
    --cc=gnomes@lxorguk.ukuu.org.uk \
    --cc=hpa@zytor.com \
    --cc=khalid.aziz@oracle.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mingo@redhat.com \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=rob@landley.net \
    --cc=snorcht@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.