* [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
@ 2001-11-22 8:59 Ingo Molnar
2001-11-22 20:22 ` Davide Libenzi
2001-11-22 23:45 ` Robert Love
0 siblings, 2 replies; 20+ messages in thread
From: Ingo Molnar @ 2001-11-22 8:59 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-smp
[-- Attachment #1: Type: TEXT/PLAIN, Size: 2186 bytes --]
the attached set-affinity-A1 patch is relative to the scheduler
fixes/cleanups in 2.4.15-pre9. It implements the following two
new system calls:
asmlinkage int sys_sched_set_affinity(pid_t pid, unsigned int mask_len,
unsigned long *new_mask_ptr);
asmlinkage int sys_sched_get_affinity(pid_t pid, unsigned int
*user_mask_len_ptr, unsigned long *user_mask_ptr);
as a testcase, softirq.c is updated to use this mechanism; see also the
attached loop_affine.c code.
the sched_set_affinity() syscall also ensures that the target process will
run on the right CPU (or CPUs).
I think this interface is the right way to expose user-selectable affinity
to user-space - there are more complex affinity interfaces in existence,
but i believe that discovery of the actual caching hierarchy is, and should
be, up to a different mechanism; i don't think it should be mixed into the
affinity syscalls. Using a mask of linear CPU IDs is IMO sufficient to
express user-space affinity wishes.
There are no security issues wrt. cpus_allowed, so these syscalls are
available to every process. (there are permission restrictions of course,
similar to those of existing scheduler syscalls.)
sched_get_affinity(pid, &mask_len, NULL) can be used to query the kernel's
supported CPU bitmask length. This should help us in achieving a stable
libc interface once we get over the 32/64 CPUs limit.
the attached loop_affine.c code tests both syscalls:
mars:~> ./loop_affine
current process's affinity: 4 bytes mask, value 000000ff.
trying to set process: affinity to 00000001.
current process's affinity: 4 bytes mask, value 00000001.
speed: 2162052 loops.
speed: 2162078 loops.
[...]
i've tested the patch on both SMP and UP systems. On UP the syscalls are
pretty pointless, but they show that the internal state of the scheduler
folds nicely into the UP case as well:
mars:~> ./loop_affine
current process's affinity: 4 bytes mask, value 00000001.
trying to set process: affinity to 00000001.
current process's affinity: 4 bytes mask, value 00000001.
speed: 2160880 loops.
speed: 2160511 loops.
[...]
comments? Is there any reason to do a more complex interface than this?
Ingo
[-- Attachment #2: Type: TEXT/PLAIN, Size: 5310 bytes --]
--- linux/kernel/sched.c.orig Wed Nov 21 11:12:05 2001
+++ linux/kernel/sched.c Wed Nov 21 11:44:41 2001
@@ -1112,6 +1112,132 @@
return retval;
}
+/*
+ * sys_sched_set_affinity - Set the CPU affinity mask.
+ *
+ * @pid: the PID of the process
+ * @mask_len: length of the bitfield
+ * @new_mask_ptr: user-space pointer to the new CPU mask bitfield
+ */
+asmlinkage int sys_sched_set_affinity(pid_t pid, unsigned int mask_len, unsigned long *new_mask_ptr)
+{
+ int ret, reschedule = 0;
+ unsigned long new_mask;
+ struct task_struct *p;
+
+ /*
+ * Right now we support an 'unsigned long' bitmask - this can
+ * be extended without changing the syscall interface.
+ */
+ if (mask_len < sizeof(new_mask))
+ return -EINVAL;
+
+ if (copy_from_user(&new_mask, new_mask_ptr, sizeof(new_mask)))
+ return -EFAULT;
+
+ new_mask &= cpu_online_map;
+ if (!new_mask)
+ return -EINVAL;
+
+ read_lock_irq(&tasklist_lock);
+ spin_lock(&runqueue_lock);
+
+ ret = -ESRCH;
+ p = find_process_by_pid(pid);
+ if (!p)
+ goto out_unlock;
+
+ ret = -EPERM;
+ if ((current->euid != p->euid) && (current->euid != p->uid) &&
+ !capable(CAP_SYS_NICE))
+ goto out_unlock;
+ p->cpus_allowed = new_mask;
+ if (!(p->cpus_runnable & p->cpus_allowed)) {
+ if (p == current)
+ reschedule = 1;
+#ifdef CONFIG_SMP
+ else {
+ /*
+ * If running on a different CPU then
+ * trigger a reschedule to get the process
+ * moved to a legal CPU:
+ */
+ p->need_resched = 1;
+ smp_send_reschedule(p->processor);
+ }
+#endif
+ }
+ ret = 0;
+out_unlock:
+ spin_unlock(&runqueue_lock);
+ read_unlock_irq(&tasklist_lock);
+
+ /*
+ * Reschedule once if the current CPU is not in
+ * the affinity mask. (do the reschedule here so
+ * that kernel internal processes can call this
+ * interface as well.)
+ */
+ if (reschedule)
+ schedule();
+
+ return ret;
+}
+
+/*
+ * sys_sched_get_affinity - Get the CPU affinity mask.
+ *
+ * @pid: the PID of the process
+ * @mask_len_ptr: user-space pointer to the length of the bitfield
+ * @new_mask_ptr: user-space pointer to the CPU mask bitfield
+ */
+asmlinkage int sys_sched_get_affinity(pid_t pid, unsigned int *user_mask_len_ptr, unsigned long *user_mask_ptr)
+{
+ unsigned int mask_len, user_mask_len;
+ unsigned long mask;
+ struct task_struct *p;
+ int ret;
+
+ mask_len = sizeof(mask);
+
+ if (copy_from_user(&user_mask_len, user_mask_len_ptr, sizeof(user_mask_len)))
+ return -EFAULT;
+ if (copy_to_user(user_mask_len_ptr, &mask_len, sizeof(mask_len)))
+ return -EFAULT;
+ /*
+ * Exit if we cannot copy the full bitmask into user-space.
+ * But above we have copied the desired mask length to user-space
+ * already, so user-space has a chance to fix up.
+ */
+ if (user_mask_len < mask_len)
+ return -EINVAL;
+
+ read_lock_irq(&tasklist_lock);
+ spin_lock(&runqueue_lock);
+
+ ret = -ESRCH;
+ p = find_process_by_pid(pid);
+ if (!p)
+ goto out_unlock;
+
+ ret = -EPERM;
+ if ((current->euid != p->euid) && (current->euid != p->uid) &&
+ !capable(CAP_SYS_NICE))
+ goto out_unlock;
+
+ mask = p->cpus_allowed & cpu_online_map;
+ ret = 0;
+out_unlock:
+ spin_unlock(&runqueue_lock);
+ read_unlock_irq(&tasklist_lock);
+
+ if (ret)
+ return ret;
+ if (copy_to_user(user_mask_ptr, &mask, sizeof(mask)))
+ return -EFAULT;
+ return 0;
+}
+
static void show_task(struct task_struct * p)
{
unsigned long free = 0;
--- linux/kernel/softirq.c.orig Wed Nov 21 11:12:05 2001
+++ linux/kernel/softirq.c Wed Nov 21 11:24:10 2001
@@ -363,15 +363,17 @@
{
int bind_cpu = (int) (long) __bind_cpu;
int cpu = cpu_logical_map(bind_cpu);
+ unsigned long cpu_mask = 1UL << cpu;
daemonize();
current->nice = 19;
sigfillset(&current->blocked);
/* Migrate to the right CPU */
- current->cpus_allowed = 1UL << cpu;
- while (smp_processor_id() != cpu)
- schedule();
+ if (sys_sched_set_affinity(0, sizeof(cpu_mask), &cpu_mask))
+ BUG();
+ if (smp_processor_id() != cpu)
+ BUG();
sprintf(current->comm, "ksoftirqd_CPU%d", bind_cpu);
--- linux/include/linux/sched.h.orig Wed Nov 21 11:19:56 2001
+++ linux/include/linux/sched.h Wed Nov 21 11:39:36 2001
@@ -589,6 +589,8 @@
#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, nr)
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
+asmlinkage int sys_sched_set_affinity(pid_t pid, unsigned int mask_len, unsigned long *new_mask_ptr);
+asmlinkage int sys_sched_get_affinity(pid_t pid, unsigned int *user_mask_len_ptr, unsigned long *user_mask_ptr);
extern int in_group_p(gid_t);
extern int in_egroup_p(gid_t);
--- linux/arch/i386/kernel/entry.S.orig Wed Nov 21 11:12:36 2001
+++ linux/arch/i386/kernel/entry.S Wed Nov 21 11:35:24 2001
@@ -622,6 +622,8 @@
.long SYMBOL_NAME(sys_ni_syscall) /* Reserved for Security */
.long SYMBOL_NAME(sys_gettid)
.long SYMBOL_NAME(sys_readahead) /* 225 */
+ .long SYMBOL_NAME(sys_sched_set_affinity)
+ .long SYMBOL_NAME(sys_sched_get_affinity)
.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
[-- Attachment #3: Type: TEXT/PLAIN, Size: 1458 bytes --]
/*
* Simple loop testing the CPU-affinity syscall.
*/
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/unistd.h>
#define __NR_sched_set_affinity 226
_syscall3 (int, sched_set_affinity, pid_t, pid, unsigned int, mask_len, unsigned long *, mask)
#define __NR_sched_get_affinity 227
_syscall3 (int, sched_get_affinity, pid_t, pid, unsigned int *, mask_len, unsigned long *, mask)
int main (void)
{
int ret;
unsigned int now, count, mask_len, iteration;
unsigned long mask, new_mask = (1 << 0);
ret = sched_get_affinity(0, &mask_len, &mask);
if (ret) {
printf("sched_get_affinity returned %d, exiting.\n", ret);
return -1;
}
printf("current process's affinity: %d bytes mask, value %08lx.\n",
mask_len, mask);
printf("trying to set process: affinity to %08lx.\n", new_mask);
ret = sched_set_affinity(0, sizeof(new_mask), &new_mask);
if (ret) {
printf("sched_set_affinity returned %d, exiting.\n", ret);
return -1;
}
ret = sched_get_affinity(0, &mask_len, &mask);
if (ret) {
printf("sched_get_affinity returned %d, exiting.\n", ret);
return -1;
}
printf("current process's affinity: %d bytes mask, value %08lx.\n",
mask_len, mask);
iteration = 0;
repeat:
now = time(0);
count = 0;
for (;;) {
count++;
if (time(0) != now)
break;
}
if (iteration)
printf("speed: %d loops.\n", count);
iteration++;
goto repeat;
return 0;
}
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-22 8:59 Ingo Molnar
@ 2001-11-22 20:22 ` Davide Libenzi
2001-11-22 23:45 ` Robert Love
1 sibling, 0 replies; 20+ messages in thread
From: Davide Libenzi @ 2001-11-22 20:22 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, linux-smp
On Thu, 22 Nov 2001, Ingo Molnar wrote:
>
> the attached set-affinity-A1 patch is relative to the scheduler
> fixes/cleanups in 2.4.15-pre9. It implements the following two
> new system calls:
>
> asmlinkage int sys_sched_set_affinity(pid_t pid, unsigned int mask_len,
> unsigned long *new_mask_ptr);
>
> asmlinkage int sys_sched_get_affinity(pid_t pid, unsigned int
> *user_mask_len_ptr, unsigned long *user_mask_ptr);
I think that maybe it's better to have a new type:
typedef whatever-is-appropriate cpu_affinity_t;
with a set of macros:
CPU_AFFINITY_INIT(aptr)
CPU_AFFINITY_SET(aptr, n)
CPU_AFFINITY_ISSET(aptr, n)
so that we can simplify the interface:
asmlinkage int sys_sched_set_affinity(pid_t pid, cpu_affinity_t *aptr);
asmlinkage int sys_sched_get_affinity(pid_t pid, cpu_affinity_t *aptr);
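As a rough sketch of what such a type could look like - modeled on the fd_set/FD_SET pattern, with all names taken from Davide's proposal rather than any existing kernel API:

```c
#include <assert.h>
#include <string.h>

#define CPU_AFFINITY_NCPUS  1024	/* arbitrary compile-time bound, like FD_SETSIZE */
#define CPU_AFFINITY_LBITS  (8 * sizeof(unsigned long))

typedef struct {
	unsigned long bits[CPU_AFFINITY_NCPUS / (8 * sizeof(unsigned long))];
} cpu_affinity_t;

/* Clear the whole mask. */
#define CPU_AFFINITY_INIT(aptr) \
	memset((aptr), 0, sizeof(cpu_affinity_t))
/* Set the bit for CPU n. */
#define CPU_AFFINITY_SET(aptr, n) \
	((aptr)->bits[(n) / CPU_AFFINITY_LBITS] |= 1UL << ((n) % CPU_AFFINITY_LBITS))
/* Test the bit for CPU n. */
#define CPU_AFFINITY_ISSET(aptr, n) \
	(((aptr)->bits[(n) / CPU_AFFINITY_LBITS] >> ((n) % CPU_AFFINITY_LBITS)) & 1UL)
```

The fixed compile-time bound is exactly the weakness Ingo's length-negotiating interface avoids, which is the main trade-off between the two proposals.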
- Davide
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-22 8:59 Ingo Molnar
2001-11-22 20:22 ` Davide Libenzi
@ 2001-11-22 23:45 ` Robert Love
2001-11-23 0:20 ` Ryan Cumming
2001-11-23 11:02 ` Ingo Molnar
1 sibling, 2 replies; 20+ messages in thread
From: Robert Love @ 2001-11-22 23:45 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel, linux-smp
On Thu, 2001-11-22 at 03:59, Ingo Molnar wrote:
> the attached set-affinity-A1 patch is relative to the scheduler
> fixes/cleanups in 2.4.15-pre9. It implements the following two
> new system calls: [...]
Ingo, I like your implementation, particularly the use of the
cpu_online_map, although I am not sure all arches implement it yet. I
am curious, however, what you would think of using a /proc interface
instead of a set of syscalls?
I.e., we would have a /proc/<pid>/cpu_affinity which is the same as your
`unsigned long * user_mask_ptr'. Reading and writing of the proc
interface would correspond to your get and set syscalls. Besides the
relevance and useful abstraction of putting the affinity in
procfs, it eliminates any sizeof(cpus_allowed) problem, since the read
string is the size in characters of cpus_allowed.
I would use your syscall code, though -- just reimplement it as a procfs
file. This would mean adding a proc_write function, since the _actual_
procfs (the proc part) only has a read method, but that is simple.
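One way the string format might behave - these helpers and the /proc/<pid>/cpu_affinity path are illustrative of the proposal, not existing kernel code - is a plain hex round-trip whose width tracks sizeof(cpus_allowed):

```c
#include <assert.h>
#include <stdio.h>

/* What the proc read method might emit: fixed-width hex, one character
 * per four bits of cpus_allowed, so the string length itself tells
 * userspace how wide the kernel's mask is. */
static void mask_to_string(unsigned long mask, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "%0*lx\n", (int)(sizeof(mask) * 2), mask);
}

/* What the proc write method would parse, e.g. the result of
 * "echo 0xf0 > /proc/<pid>/cpu_affinity". */
static int string_to_mask(const char *buf, unsigned long *mask)
{
	return sscanf(buf, "%lx", mask) == 1 ? 0 : -1;
}
```

If cpus_allowed later grows, the read string simply gets longer; old parsers using %lx-style scanning keep working, which is the size-independence being argued for here.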
Thoughts?
Robert Love
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-22 23:45 ` Robert Love
@ 2001-11-23 0:20 ` Ryan Cumming
2001-11-23 0:36 ` Mark Hahn
` (3 more replies)
2001-11-23 11:02 ` Ingo Molnar
1 sibling, 4 replies; 20+ messages in thread
From: Ryan Cumming @ 2001-11-23 0:20 UTC (permalink / raw)
To: Robert Love; +Cc: linux-kernel
On November 22, 2001 15:45, Robert Love wrote:
>
> Ie, we would have a /proc/<pid>/cpu_affinity which is the same as your
> `unsigned long * user_mask_ptr'. Reading and writing of the proc
> interface would correspond to your get and set syscalls. Besides the
> sort of relevancy and useful abstraction of putting the affinity in the
> procfs, it eliminates any sizeof(cpus_allowed) problem since the read
> string is the size in characters of cpus_allowed.
>
> I would use your syscall code, though -- just reimplement it as a procfs
> file. This would mean adding a proc_write function, since the _actual_
> procfs (the proc part) only has a read method, but that is simple.
>
> Thoughts?
Hear, hear - I was just thinking "Well, I like the CPU affinity idea, but I
loathe syscall creep... I hope this Robert Love fellow says something about
that" as I read your email's header.
In addition to keeping the syscall table from being filled with very
specific, non-standard, and use-once syscalls, a /proc interface would allow
me to change the CPU affinity of processes that aren't {get, set}_affinity
aware (i.e., all Linux applications written up to this point). This isn't
very different from how it's possible to change a process's other scheduling
properties (priority, scheduler) from another process. Imagine if renice(8)
had to be implemented as attaching to a process and calling nice(2)... ick.
Also, as an application developer, I try to avoid conditionally compiled,
system-specific calls. I would have much weaker "cleanliness" objections
to testing for the /proc/<pid>/cpu_affinity file's existence and
conditionally writing to it. Compare this to the hacks some network servers
use to try to detect sendfile(2)'s presence at runtime, and you'll see what I
mean. Remember, everything is a file ;)
And one final thing... what sort of benefit does CPU affinity have if we
have the scheduler take CPU migration costs into account correctly? I can think
of a lot of corner cases, but in general, it seems to me that it's a lot more
sane to have the scheduler decide where processes belong. What if an
application with n threads, where n is less than the number of CPUs, has to
decide which CPUs to bind its threads to? What if a similar app, or another
instance of the same app, already decided to bind against the same set of
CPUs? The scheduler is stuck with an unfair scheduling load on those poor
CPUs, because the scheduling decision was moved away from where it really
should take place: the scheduler. I'm sure I'm missing something, though.
-Ryan
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 0:20 ` Ryan Cumming
@ 2001-11-23 0:36 ` Mark Hahn
2001-11-23 11:46 ` Ingo Molnar
2001-11-23 0:51 ` Robert Love
` (2 subsequent siblings)
3 siblings, 1 reply; 20+ messages in thread
From: Mark Hahn @ 2001-11-23 0:36 UTC (permalink / raw)
To: Ryan Cumming; +Cc: linux-kernel
> CPUs, because the scheduling decision was moved away from where it really
> should take place: the scheduler. I'm sure I'm missing something, though.
only that it's nontrivial to estimate the migration costs, I think.
at one point, around 2.3.3*, there was some effort at doing this -
or something like it. specifically, the scheduler kept track of
how long a process ran on average, and was slightly more willing
to migrate a short-slice process than a long-slice. "short" was
defined relative to cache size and a WAG at dram bandwidth.
the rationale was that if you run for only 100 us, you probably
don't have a huge working set. that justification is pretty thin,
and perhaps that's why the code gradually disappeared.
hmm, you really want to monitor things like paging and cache misses,
but both might be tricky to measure, and would be tricky to use sanely.
a really simple and appealing heuristic is to migrate a process
that hasn't run for a long while - any cache state it may have had
is probably gone by now, so there *should* be no affinity.
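That last heuristic is simple enough to sketch as a pure predicate (the struct and parameter names here are illustrative, not the 2.4 scheduler's actual code):

```c
#include <assert.h>

/* Minimal per-task state for the sketch: when it last ran, in jiffies. */
struct task_sketch {
	unsigned long last_run;
};

/* A task that hasn't run for longer than the cache-decay window has
 * probably lost its cache state, so migrating it should be nearly free. */
static int probably_cache_cold(const struct task_sketch *p,
			       unsigned long now, unsigned long decay_ticks)
{
	return now - p->last_run > decay_ticks;
}
```

The hard part, as the rest of the thread points out, is choosing decay_ticks: it depends on cache size and memory bandwidth, which is the same estimation problem that sank the earlier avg_slice code.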
regards, mark hahn.
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 0:20 ` Ryan Cumming
2001-11-23 0:36 ` Mark Hahn
@ 2001-11-23 0:51 ` Robert Love
2001-11-23 1:11 ` Andreas Dilger
2001-11-23 11:36 ` Ingo Molnar
2001-11-27 3:39 ` Robert Love
3 siblings, 1 reply; 20+ messages in thread
From: Robert Love @ 2001-11-23 0:51 UTC (permalink / raw)
To: Ryan Cumming; +Cc: linux-kernel, mingo
On Thu, 2001-11-22 at 19:20, Ryan Cumming wrote:
> Hear, hear - I was just thinking "Well, I like the CPU affinity idea, but I
> loathe syscall creep... I hope this Robert Love fellow says something about
> that" as I read your email's header.
Ah, we think the same way. The reason I spoke up, though, is that in
addition to disliking the syscall way and liking the proc way, I thought
Ingo's implementation was nicely done. In particular, the use of
cpu_online_map and forcing the reschedule were things I probably
wouldn't have thought of.
> In addition to keeping the syscall table from being filled with very
> specific, non-standard, and use-once syscalls, a /proc interface would allow
> me to change the CPU affinity of processes that aren't {get, set}_affinity
> aware (i.e., all Linux applications written up to this point). This isn't
> very different from how it's possible to change a process's other scheduling
> properties (priority, scheduler) from another process. Imagine if renice(8)
> had to be implemented as attaching to a process and calling nice(2)... ick.
Heh, this seems like the strongest argument yet, and I didn't even
mention it. Note, however, that there is a pid_t field in the syscall
and from glossing over the code it seems you can set the affinity of any
arbitrary task given you have the right permissions. Thus, we could
make a binary that took in a pid and a cpu mask, and set the affinity.
But I still think "echo 0xffffffff > /proc/768/cpu_affinity" is nicer.
This opens up the issue of permissions with my proc suggestion, and we
have some options:
Users can set the affinity of their own task, root can set anything.
One needs a CAP capability to set affinity (which root of course has).
Everyone can set anything, or only root can set affinity.
I would suggest letting users set their own affinity (since it only
lessens what they can do) and letting a capability dictate whether non-root
users can set other users' tasks' affinities. CAP_SYS_ADMIN would do fine.
> Also, as an application developer, I try to avoid conditionally compiled,
> system-specific calls. I would have much less "cleanliness" objections
> towards testing for the /proc/<pid>/cpu_affinity files existance and
> conditionally writing to it. Compare this to the hacks some network servers
> use to try to detect sendfile(2)'s presence at runtime, and you'll see what I
> mean. Remember, everything is a file ;)
Agreed. This:
sprintf(p, "/proc/%d/cpu_affinity", getpid());
f = open(p, O_RDWR);
if (f < 0) /* no cpu_affinity ... */
Is a very simple check vs. the sort of magic hackery that I see to find
out if a syscall is supported at run-time.
Again I mention that we can now grow cpus_allowed to any size, and even
support old sizes, since size is a non-issue with a string.
> And one final thing... what sort of benefit does CPU affinity have if we
> have the scheduler take CPU migration costs into account correctly? I can think
> of a lot of corner cases, but in general, it seems to me that it's a lot more
> sane to have the scheduler decide where processes belong. What if an
> application with n threads, where n is less than the number of CPUs, has to
> decide which CPUs to bind its threads to? What if a similar app, or another
> instance of the same app, already decided to bind against the same set of
> CPUs? The scheduler is stuck with an unfair scheduling load on those poor
> CPUs, because the scheduling decision was moved away from where it really
> should take place: the scheduler. I'm sure I'm missing something, though.
It is typically preferred not to force a specific CPU affinity. Solaris
and NT both allow it, however, for varying reasons. One would be setting
aside a processor for a set of tasks and disallowing all other tasks from
running there. This is common with RT (and usually accompanied by
disabling interrupt processing on that CPU and using a fully preemptible
kernel -- i.e. you get complete freedom for the real-time task on the
affined CPU). This is what Solaris's processor sets try to accomplish,
but IMO they are too heavy and this is why I like Ingo's proposal. We
already have the cpus_allowed property which we respect, we just need to
let userspace set it. The question is how?
Robert Love
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 0:51 ` Robert Love
@ 2001-11-23 1:11 ` Andreas Dilger
2001-11-23 1:16 ` Robert Love
0 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2001-11-23 1:11 UTC (permalink / raw)
To: Robert Love; +Cc: Ryan Cumming, linux-kernel, mingo
On Nov 22, 2001 19:51 -0500, Robert Love wrote:
> I would suggest letting users set their own affinity (since it only
> lessens what they can do) and let a capability dictate if non-root users
> can set other user's tasks affinities. CAP_SYS_ADMIN would do fine.
Rather use something else, like CAP_SYS_NICE. It ties in with the idea
of scheduling, and doesn't further abuse the CAP_SYS_ADMIN capability.
CAP_SYS_ADMIN, while it has a good name, has become the catch-all of
capabilities, and if you have it, it is nearly the keys to the kingdom,
just like root.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 1:11 ` Andreas Dilger
@ 2001-11-23 1:16 ` Robert Love
0 siblings, 0 replies; 20+ messages in thread
From: Robert Love @ 2001-11-23 1:16 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Ryan Cumming, linux-kernel, mingo
On Thu, 2001-11-22 at 20:11, Andreas Dilger wrote:
> Rather use something else, like CAP_SYS_NICE. It ties in with the idea
> of scheduling, and doesn't further abuse the CAP_SYS_ADMIN capability.
> CAP_SYS_ADMIN, while it has a good name, has become the catch-all of
> capabilities, and if you have it, it is nearly the keys to the kingdom,
> just like root.
Ah, forgot about CAP_SYS_NICE ... indeed, a better idea. I suppose if
people want it a CAP_SYS_CPU_AFFINITY could do, but this is a simple and
rare enough task that we are better off sticking it under something
else.
Robert Love
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-22 23:45 ` Robert Love
2001-11-23 0:20 ` Ryan Cumming
@ 2001-11-23 11:02 ` Ingo Molnar
1 sibling, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2001-11-23 11:02 UTC (permalink / raw)
To: Robert Love; +Cc: linux-kernel, linux-smp
On 22 Nov 2001, Robert Love wrote:
> > the attached set-affinity-A1 patch is relative to the scheduler
> > fixes/cleanups in 2.4.15-pre9. It implements the following two
> > new system calls: [...]
>
> Ingo, I like your implementation, particularly the use of the
> cpu_online_map, although I am not sure all arch's implement it yet.
> [...]
cpu_online_map is (or should be) a standard component of the kernel, eg.
generic SMP code in init/main.c uses it. But this area can be changed in
whatever direction is needed - we should always keep an eye on CPU
hot-swapping's architectural needs.
> [...] I am curious, however, what you would think of using a /proc
> interface instead of a set of syscalls ?
to compare this to a similar situation: i made
/proc/irq/N/smp_affinity a /proc thing because it appeared to be an
architecture-specific and nongeneric feature, seldom used by ordinary
processes and generally an admin thing. But i think setting affinity is a
natural extension of the existing sched_* class of system calls. It could
be used by userspace webservers, for example.
one issue is that /proc does not necessarily have to be mounted. But i
don't have any strong feelings either way - the syscall variant simply
looks a bit more correct.
Ingo
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 0:20 ` Ryan Cumming
2001-11-23 0:36 ` Mark Hahn
2001-11-23 0:51 ` Robert Love
@ 2001-11-23 11:36 ` Ingo Molnar
2001-11-24 2:01 ` Davide Libenzi
2001-11-27 3:39 ` Robert Love
3 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2001-11-23 11:36 UTC (permalink / raw)
To: Ryan Cumming; +Cc: Robert Love, linux-kernel, linux-smp
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1765 bytes --]
On Thu, 22 Nov 2001, Ryan Cumming wrote:
> [...] a /proc interface would allow me to change the CPU affinity of
> processes that aren't {get, set}_affinity aware (i.e., all Linux
> applications written up to this point). [...]
had you read my patch then you'd perhaps have noticed how easy it
actually is. I've attached a simple utility called 'chaff' (change affinity)
that allows changing the affinity of unaware processes:
mars:~> ./chaff 714 0xf0
pid 714's old affinity: 000000ff.
pid 714's new affinity: 000000f0.
> And one final thing... what sort of benefit does CPU affinity have if
> we have the scheduler take CPU migration costs into account correctly?
> [...]
you are right that the scheduler can and should guess lots of
things, but it cannot guess some things. Eg. it has no idea whether a
particular process' workload is related to any IRQ source or not. And if
we bind IRQ sources for performance reasons, then the scheduler has no
chance of finding the right CPU for the process. (I attempted to
implement such a generic mechanism a few months ago but quickly realized
that nothing like that will ever be accepted in the mainline kernel -
there is simply no way to establish any reliable link between IRQ load and
process activities.)
So i implemented the smp_affinity and ->cpus_allowed mechanisms to allow
specific applications (which know the kind of load they generate) to bind to
specific CPUs, and to bind IRQs to CPUs. Obviously we still want the
scheduler to make good decisions - but linking IRQ load and scheduling
activity is too expensive. (i have a scheduler improvement patch that does
some of this work at wakeup time, and which benefits Apache, but
this is still not enough to get the 'best' affinity.)
Ingo
[-- Attachment #2: Type: TEXT/PLAIN, Size: 1301 bytes --]
/*
 * Simple utility ('chaff') to change a process's CPU affinity.
 */
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/unistd.h>
#define __NR_sched_set_affinity 226
_syscall3 (int, sched_set_affinity, pid_t, pid, unsigned int, mask_len, unsigned long *, mask)
#define __NR_sched_get_affinity 227
_syscall3 (int, sched_get_affinity, pid_t, pid, unsigned int *, mask_len, unsigned long *, mask)
int main (int argc, char **argv)
{
int pid, ret;
unsigned int mask_len;
unsigned long mask, new_mask;
if (argc != 3) {
printf("usage: chaff <pid> <hex_mask>\n");
exit(-1);
}
pid = atol(argv[1]);
sscanf(argv[2], "%lx", &new_mask);
printf("pid: %d. new_mask: (%s) %08lx.\n", pid, argv[2], new_mask);
ret = sched_get_affinity(pid, &mask_len, &mask);
if (ret) {
printf("could not get pid %d's affinity.\n", pid);
return -1;
}
printf("pid %d's old affinity: %08lx.\n", pid, mask);
ret = sched_set_affinity(pid, sizeof(new_mask), &new_mask);
if (ret) {
printf("could not set pid %d's affinity.\n", pid);
return -1;
}
ret = sched_get_affinity(pid, &mask_len, &mask);
if (ret) {
printf("sched_get_affinity returned %d, exiting.\n", ret);
return -1;
}
printf("pid %d's new affinity: %08lx.\n", pid, mask);
return 0;
}
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 0:36 ` Mark Hahn
@ 2001-11-23 11:46 ` Ingo Molnar
2001-11-24 22:44 ` Davide Libenzi
0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2001-11-23 11:46 UTC (permalink / raw)
To: Mark Hahn; +Cc: Ryan Cumming, linux-kernel
On Thu, 22 Nov 2001, Mark Hahn wrote:
> only that it's nontrivial to estimate the migration costs, I think. at
> one point, around 2.3.3*, there was some effort at doing this - or
> something like it. specifically, the scheduler kept track of how long
> a process ran on average, and was slightly more willing to migrate a
> short-slice process than a long-slice. "short" was defined relative
> to cache size and a WAG at dram bandwidth.
yes. I added the avg_slice code, and i removed it as well - it was
hopeless to get it right and it was causing bad performance for certain
application loads. Current CPUs simply do not support any good way of
tracking the cache footprint of processes. There are methods that are an
approximation (eg. uninterrupted runtime and cache footprint are in a
monotonic relationship), but none of the methods (including cache-traffic
machine counters) are good enough to cover all the important corner cases,
due to cache aliasing, MESI invalidation and other effects.
> the rationale was that if you run for only 100 us, you probably don't
> have a huge working set. that justification is pretty thin, and
> perhaps that's why the code gradually disappeared.
yes.
> hmm, you really want to monitor things like paging and cache misses,
> but both might be tricky, and would be tricky to use sanely. a really
> simple, and appealing heuristic is to migrate a process that hasn't
> run for a long while - any cache state it may have had is probably
> gone by now, so there *should* be no affinity.
well, it doesn't take much for a process to populate the whole L1 cache with
dirty cachelines. (which then have to be cross-invalidated if this process
is moved to another CPU.)
Ingo
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 11:36 ` Ingo Molnar
@ 2001-11-24 2:01 ` Davide Libenzi
0 siblings, 0 replies; 20+ messages in thread
From: Davide Libenzi @ 2001-11-24 2:01 UTC (permalink / raw)
To: Ingo Molnar; +Cc: lkml, linux-smp
On Fri, 23 Nov 2001, Ingo Molnar wrote:
[...]
Isn't it better to expose "number" cpu masks instead of
"logical" ones?
Right now you set the raw cpus_allowed field, which is a "logical" cpu
bitmask.
By using "number" masks the user can use 0..N-1 without having to
know the internal cpu mapping.
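The translation being asked for could be done once at the syscall boundary; here is a sketch with a made-up four-CPU mapping table standing in for the kernel's cpu_logical_map():

```c
#include <assert.h>

/* Made-up "number" -> internal-id mapping for four CPUs; in the real
 * kernel this would come from cpu_logical_map(). */
static const int cpu_map[4] = { 0, 2, 1, 3 };

/* Rewrite a user-visible mask over CPUs 0..N-1 into the internal
 * ("logical") bitmask layout, so userspace never needs to know the
 * kernel's numbering. */
static unsigned long number_to_internal_mask(unsigned long user_mask)
{
	unsigned long internal = 0;
	int n;

	for (n = 0; n < 4; n++)
		if (user_mask & (1UL << n))
			internal |= 1UL << cpu_map[n];
	return internal;
}
```

On most machines the mapping is the identity, so the translation is free; it only matters on boxes where logical and physical numbering diverge.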
- Davide
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 11:46 ` Ingo Molnar
@ 2001-11-24 22:44 ` Davide Libenzi
0 siblings, 0 replies; 20+ messages in thread
From: Davide Libenzi @ 2001-11-24 22:44 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Mark Hahn, lkml
On Fri, 23 Nov 2001, Ingo Molnar wrote:
>
> On Thu, 22 Nov 2001, Mark Hahn wrote:
>
> > only that it's nontrivial to estimate the migration costs, I think. at
> > one point, around 2.3.3*, there was some effort at doing this - or
> > something like it. specifically, the scheduler kept track of how long
> > a process ran on average, and was slightly more willing to migrate a
> > short-slice process than a long-slice. "short" was defined relative
> > to cache size and a WAG at dram bandwidth.
>
> yes. I added the avg_slice code, and i removed it as well - it was
> hopeless to get it right and it was causing bad performance for certain
> application loads. Current CPUs simply do not support any good way of
> tracking cache footprint of processes. There are methods that are an
> approximation (eg. uninterrupted runtime and cache footprint are in a
> monotonic relationship), but none of the methods (including cache traffic
> machine counters) are good enough to cover all the important corner cases,
> due to cache aliasing, MESI-invalidation and other effects.
Uninterrupted run-time is a good approximation of a task's cache footprint.
True, it's not 100% successful - processes like:
for (;;);
are incorrectly classified - but it's still way better than the method
we're currently using (PROC_CHANGE_PENALTY).
Taking the average run-time in jiffies as:
AVG = (AVG + LAST) >> 1;
is 1) fast, 2) has a nice hysteresis property, and 3) gives you a pretty
good estimate of the "nature" of the task.
I'm currently using it 1) as a classification for load balancing between
CPUs and 2) as the task's watermark value for your counter decay patch:
[kernel/timer.c]
	if (p->counter > p->avg_jrun)
		--p->counter;
	else if (++p->timer_ticks >= p->counter) {
		p->counter = 0;
		p->timer_ticks = 0;
		p->need_resched = 1;
	}
In this way I/O-bound tasks get a counter decay behavior like the
standard scheduler, while CPU-bound ones remain priority-inversion
proof.
- Davide
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-23 0:20 ` Ryan Cumming
` (2 preceding siblings ...)
2001-11-23 11:36 ` Ingo Molnar
@ 2001-11-27 3:39 ` Robert Love
2001-11-27 7:13 ` Joe Korty
2001-11-27 8:40 ` Ingo Molnar
3 siblings, 2 replies; 20+ messages in thread
From: Robert Love @ 2001-11-27 3:39 UTC (permalink / raw)
To: Ryan Cumming, mingo; +Cc: linux-kernel
On Thu, 2001-11-22 at 19:20, Ryan Cumming wrote:
> Hear, hear. I was just thinking "Well, I like the CPU affinity idea, but I
> loathe syscall creep... I hope this Robert Love fellow says something about
> that" as I read your email's header.
I did a procfs-based implementation of a user interface for setting CPU
affinity. It implements much the same features as Ingo's patch, with
the difference that it is, obviously, a procfs entry and not a set of
syscalls.
It is readable and writable via /proc/<pid>/affinity
I posted a patch to lkml moments ago, but it is also available at
ftp://ftp.kernel.org/pub/linux/kernel/people/rml/cpu-affinity
(please use a mirror).
Comments, suggestions, et cetera welcome -- if possible under the new thread.
Robert Love
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-27 3:39 ` Robert Love
@ 2001-11-27 7:13 ` Joe Korty
2001-11-27 20:53 ` Robert Love
2001-11-27 8:40 ` Ingo Molnar
1 sibling, 1 reply; 20+ messages in thread
From: Joe Korty @ 2001-11-27 7:13 UTC (permalink / raw)
To: mingo, Robert Love; +Cc: Ryan Cumming, linux-kernel
At 09:40 AM 11/27/01 +0100, Ingo Molnar wrote:
> > This patch comes about as an alternative to Ingo Molnar's
> > syscall-implemented version. Ingo's code is nice; however I and
> > others expressed discontent as yet another syscall. [...]
>
> i do express discontent over yet another piece of procfs bloat. What if
> procfs is not mounted in a high security installation? Are affinities
> suddenly unavailable? Such dependencies are unacceptable IMO - if we want
> to export the setting of affinities to user-space, then it should be a
> system call.
...
> > [...] Other benefits include the ease with which to set the affinity
> > of tasks that are unaware of the new interface [...]
I have not yet seen the patch, but one nice feature that a system call
interface could provide is the ability to *atomically* change the cpu
affinities of sets of processes -- for example, all processes with a
certain uid or gid. All that would be required is for the system call to
accept a command integer value defining what the argument integer value
means -- a pid, a gid, or a uid.
Joe
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
[not found] ` <5.0.2.1.2.20011127020817.009ed3d0@pop.mindspring.com.suse.lists.linux.kernel>
@ 2001-11-27 7:32 ` Andi Kleen
2001-11-27 21:01 ` Robert Love
0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2001-11-27 7:32 UTC (permalink / raw)
To: Joe Korty; +Cc: linux-kernel
Joe Korty <l-k@mindspring.com> writes:
>
> I have not yet seen the patch, but one nice feature that a system call
> interface could provide is the ability to *atomically* change the cpu
> affinities of sets of processes
Could you quickly explain a use case where it makes a difference whether
CPU affinity settings for multiple processes are done atomically or not?
The only way to make CPU affinity settings of processes truly atomic,
without a "consolidation window", is to do them before the process
starts up. This is easy when they're inherited -- just set them for the
parent before starting the other processes. This works with any
interface, proc based or not, as long as it inherits.
-Andi
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-27 3:39 ` Robert Love
2001-11-27 7:13 ` Joe Korty
@ 2001-11-27 8:40 ` Ingo Molnar
1 sibling, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2001-11-27 8:40 UTC (permalink / raw)
To: Robert Love; +Cc: Ryan Cumming, linux-kernel
your comments about syscall vs. procfs:
> This patch comes about as an alternative to Ingo Molnar's
> syscall-implemented version. Ingo's code is nice; however I and
> others expressed discontent as yet another syscall. [...]
i do express discontent over yet another piece of procfs bloat. What if
procfs is not mounted in a high security installation? Are affinities
suddenly unavailable? Such dependencies are unacceptable IMO - if we want
to export the setting of affinities to user-space, then it should be a
system call.
(Also, procfs is visibly slower than a system call - i can well imagine
this to be an issue in some sort of threaded environment that creates and
destroys threads at a high rate, and wants to have a different affinity
for every new thread.)
> [...] Other benefits include the ease with which to set the affinity
> of tasks that are unaware of the new interface [...]
this was a red herring - see chaff.c.
> [...] and that with this approach applications don't need to hackishly
> check for the existence of a syscall.
uhm, what check? A nonexistent system call does not have to be checked
for.
(so far no legitimate technical point has been made against the
syscall-based setting of affinities.)
Ingo
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-27 7:13 ` Joe Korty
@ 2001-11-27 20:53 ` Robert Love
2001-11-27 21:31 ` Nathan Dabney
0 siblings, 1 reply; 20+ messages in thread
From: Robert Love @ 2001-11-27 20:53 UTC (permalink / raw)
To: Joe Korty; +Cc: mingo, Ryan Cumming, linux-kernel
On Tue, 2001-11-27 at 02:13, Joe Korty wrote:
> I have not yet seen the patch, but one nice feature that a system call
> interface could provide is the ability to *atomically* change the cpu
> affinities of sets of processes -- for example, all processes with a
> certain uid or gid. All that would be required would be for the system
> call to accept a command integer value which would define what the
> argument integer value would mean -- a pid, a gid, or a uid.
Affecting all tasks matching a uid or some other filter is a little
beyond what either patch does. Note, however, that both interfaces have
atomicity.
You can open and write to proc from within a program ... very easily, in
fact.
Also, with some sed and grep magic, you can set the affinity of all
tasks via the proc interface pretty easily. Just a couple of lines.
Robert Love
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-27 7:32 ` [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9 Andi Kleen
@ 2001-11-27 21:01 ` Robert Love
0 siblings, 0 replies; 20+ messages in thread
From: Robert Love @ 2001-11-27 21:01 UTC (permalink / raw)
To: Andi Kleen; +Cc: Joe Korty, linux-kernel
On Tue, 2001-11-27 at 02:32, Andi Kleen wrote:
> Could you quickly explain an use case where it makes a difference if
> CPU affinity settings for multiple processes are done atomically or not ?
>
> The only way to make CPU affinity settings of processes really atomically
> without a "consolidation window" is to
> do them before the process starts up. This is easy when they're inherited --
> just set them for the parent before starting the other processes. This
> works with any interface; proc based or not as long as it inherits.
I assume he meant to prevent the case of setting affinity _after_ a
process forks. In other words, "atomically" in the sense that it occurs
prior to some action, in order to properly affect all children.
This could be done in a program by writing to the proc entry before
forking, or in a wrapper script (set the affinity of self, then exec
the new task).
cpus_allowed is inherited by all children so this works fine.
Robert Love
* Re: [patch] sched_[set|get]_affinity() syscall, 2.4.15-pre9
2001-11-27 20:53 ` Robert Love
@ 2001-11-27 21:31 ` Nathan Dabney
0 siblings, 0 replies; 20+ messages in thread
From: Nathan Dabney @ 2001-11-27 21:31 UTC (permalink / raw)
To: Robert Love; +Cc: Joe Korty, mingo, Ryan Cumming, linux-kernel
On Tue, Nov 27, 2001 at 03:53:04PM -0500, Robert Love wrote:
> Effecting all tasks matching a uid or some other filter is a little
> beyond what either patch does. Note however that both interfaces have
> atomicity.
I don't see a need for that either; inheritance and single-process
changes are the major abilities needed.
> You can open and write to proc from within a program ... very easily, in
> fact.
>
> Also, with some sed and grep magic, you can set the affinity of all
> tasks via the proc interface pretty easy. Just a couple lines.
From the admin point of view, this last ability is a good one.
A read-only entry in proc wouldn't do much good by itself. The writable /proc
entry is the one that sounds interesting.
-Nathan