public inbox for linux-kernel@vger.kernel.org
* Re: 2.4.19pre9aa1
  2002-05-30  1:01 2.4.19pre9aa1 Andrea Arcangeli
@ 2002-05-30  0:38 ` Marcelo Tosatti
  2002-05-30  1:43   ` 2.4.19pre9aa1 Andrea Arcangeli
  2002-05-30  1:32 ` 2.4.19pre9aa1 William Lee Irwin III
  1 sibling, 1 reply; 6+ messages in thread
From: Marcelo Tosatti @ 2002-05-30  0:38 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel



On Thu, 30 May 2002, Andrea Arcangeli wrote:

> Only in 2.4.19pre8aa3: 00_get_pid-no-deadlock-and-boosted-3
> Only in 2.4.19pre9aa1: 10_get_pid-no-deadlock-and-boosted-4
>
> 	Discard the inferior attempt in pre9 and rediff (as Ihno noticed,
> 	in practice the complexity dominates; if you fill the pid space the
> 	fix in mainline is useless anyway). I wonder why this much better
> 	fix hasn't been merged instead (it has been submitted for both 2.2
> 	and 2.4). This also fixes a longstanding fork race, present even in
> 	2.2, that can lead to two tasks getting the same pid.

Could you be more verbose in explaining the problems with the current
approach and the advantages of your patch?

Thanks


^ permalink raw reply	[flat|nested] 6+ messages in thread

* 2.4.19pre9aa1
@ 2002-05-30  1:01 Andrea Arcangeli
  2002-05-30  0:38 ` 2.4.19pre9aa1 Marcelo Tosatti
  2002-05-30  1:32 ` 2.4.19pre9aa1 William Lee Irwin III
  0 siblings, 2 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2002-05-30  1:01 UTC (permalink / raw)
  To: linux-kernel

NOTE: this release is highly experimental; while it has been solid so
far it's not well tested yet, so please don't use it in production
environments! (yet :)

The o1 scheduler integration will take a few weeks to settle and to
compile on all archs. I would suggest the big-iron folks give this
kernel a spin, in particular for o1, the shm-rmid fix, the p4/pmd fix
and the inode-leak fix. The only rejected feature is the node-affine
allocation of per-cpu data structures from the numa-sched (it matters
only for numa, but o1 is the more sensible optimization for numa
anyways). Currently only x86 and alpha compile and run as expected.
x86-64, ia64, ppc, s390* and sparc64 don't compile yet. uml, worst of
all, compiles but doesn't run correctly :), however it runs pretty well
too; it simply hangs sometimes, and you have to press a key in the
terminal and then it resumes as if nothing had happened.

URL:

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa1.gz
	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa1/

Diff between 2.4.19pre8aa3 and 2.4.19pre9aa1 besides migrating to pre9.

Only in 2.4.19pre8aa3: 00_block-highmem-all-18b-11.gz
Only in 2.4.19pre9aa1: 00_block-highmem-all-18b-12.gz
Only in 2.4.19pre8aa3: 00_x86-fast-pte-1
Only in 2.4.19pre9aa1: 10_x86-fast-pte-2
Only in 2.4.19pre8aa3: 20_pte-highmem-24
Only in 2.4.19pre9aa1: 20_pte-highmem-25
Only in 2.4.19pre8aa3: 30_x86_setup-boot-cleanup-3
Only in 2.4.19pre9aa1: 30_x86_setup-boot-cleanup-4
Only in 2.4.19pre8aa3: 90_init-survive-threaded-race-2
Only in 2.4.19pre9aa1: 90_init-survive-threaded-race-3
Only in 2.4.19pre8aa3: 91_zone_start_pfn-3
Only in 2.4.19pre9aa1: 91_zone_start_pfn-5
Only in 2.4.19pre8aa3: 93_NUMAQ-1
Only in 2.4.19pre9aa1: 93_NUMAQ-2

	Rediffed.

Only in 2.4.19pre8aa3: 00_compile-nfsroot-1
Only in 2.4.19pre8aa3: 00_initrd-free-2
Only in 2.4.19pre8aa3: 00_ufs-compile-1

	Merged in mainline.

Only in 2.4.19pre8aa3: 00_cpu-affinity-rml-3

	Dropped (collided with o1 and it's not needed).

Only in 2.4.19pre8aa3: 00_cpu-affinity-syscall-rml-2.4.19-pre7-1
Only in 2.4.19pre9aa1: 00_cpu-affinity-syscall-rml-2.4.19-pre9-1

	Ported to the o1 sched with the -ac patch in rml/sched.

Only in 2.4.19pre8aa3: 00_dnotify-cleanup-1
Only in 2.4.19pre9aa1: 00_dnotify-cleanup-2

	Just keep the leftover out deletion in -ac.

Only in 2.4.19pre9aa1: 00_ext2-ext3-warning-1

	Warn when mounting an ext3 filesystem as ext2; from Andrew Morton.

Only in 2.4.19pre9aa1: 00_free_pgtable-and-p4-tlb-race-fixes-1

	Fix the pagetable freeing races introduced by the speculative
	random userspace tlb fills of the p4; patch discussed on l-k.
	Also fix a definite kernel SMP race condition in free_pgtables.

Only in 2.4.19pre8aa3: 00_get_pid-no-deadlock-and-boosted-3
Only in 2.4.19pre9aa1: 10_get_pid-no-deadlock-and-boosted-4

	Discard the inferior attempt in pre9 and rediff (as Ihno noticed,
	in practice the complexity dominates; if you fill the pid space the
	fix in mainline is useless anyway). I wonder why this much better
	fix hasn't been merged instead (it has been submitted for both 2.2
	and 2.4). This also fixes a longstanding fork race, present even in
	2.2, that can lead to two tasks getting the same pid.

Only in 2.4.19pre9aa1: 00_negative-dentry-waste-ram-1

	Collect negative dentries after unlink or creat failure
	(discussed on l-k; for 2.5 a very-low-priority lru list could
	be implemented instead).

Only in 2.4.19pre8aa3: 00_rcu-poll-5
Only in 2.4.19pre9aa1: 10_rcu-poll-6

	Ported to the o1 scheduler. This has a fix in force_cpu_reschedule
	compared to the rcu-poll patch in 2.5, and it also saves cachelines
	by coalescing the per-cpu quiescent sequence number into the runqueue
	structure. The quiescent counter is increased at every schedule()
	call (i.e. every time we reach the quiescent point), so it is optimal
	to coalesce it into the same cacheline as the other fields later used
	by schedule(). The per-cpu quiescent++ is the _only_ fixed cost of
	rcu-poll; this has been a design choice to keep the fast-path
	overhead as low as possible.

Only in 2.4.19pre9aa1: 00_sched-O1-rml-2.4.19-pre9-1.gz

	2.5 O1 scheduler from Ingo Molnar, backported to 2.4 by Robert Love.

Only in 2.4.19pre9aa1: 00_shm_destroy-deadlock-1

	Fix an SMP deadlock due to scheduling with a spinlock held in
	IPC_RMID (fput is a blocking operation).

Only in 2.4.19pre9aa1: 00_vm86-pagetablelock-1

	Add pagetable lock to the vm86 pagetable walking,
	from Benjamin LaHaise.

Only in 2.4.19pre9aa1: 02_sched-19pre8ac5-1

	Use the wq locks so it stays generic and the switch in wait.h
	doesn't break. From the -ac o1 sched comparison.

Only in 2.4.19pre9aa1: 02_sched-alpha-1

	alpha updates to make o1 work. This is also a good tutorial
	for porting all the other archs.

Only in 2.4.19pre9aa1: 02_sched-sparc64-1

	A little attempt that only takes care of two bits; lots of stuff is
	still missing for sparc64 and x86-64, so they won't compile at the
	moment.

Only in 2.4.19pre9aa1: 02_sched-x86-1

	Additional fix for x86 (needed because of the rcu-poll changes
	to make the runqueue structure visible to the common code).

Only in 2.4.19pre8aa3: 05_vm_03_vm_tunables-1
Only in 2.4.19pre9aa1: 05_vm_03_vm_tunables-2
Only in 2.4.19pre8aa3: 05_vm_06_swap_out-1
Only in 2.4.19pre9aa1: 05_vm_06_swap_out-2
Only in 2.4.19pre8aa3: 05_vm_07_local_pages-1
Only in 2.4.19pre9aa1: 05_vm_07_local_pages-2
Only in 2.4.19pre8aa3: 05_vm_08_try_to_free_pages_nozone-1
Only in 2.4.19pre9aa1: 05_vm_08_try_to_free_pages_nozone-2
Only in 2.4.19pre8aa3: 05_vm_09_misc_junk-1
Only in 2.4.19pre9aa1: 05_vm_09_misc_junk-2
Only in 2.4.19pre8aa3: 05_vm_17_rest-4
Only in 2.4.19pre9aa1: 05_vm_17_rest-7

	Various random rediffing to sync with the o1 scheduler
	introduction.

Only in 2.4.19pre9aa1: 05_vm_18_buffer-page-uptodate-1

	Optimization to avoid losing the page-uptodate information
	while dropping the bh, from Andrew Morton.

Only in 2.4.19pre9aa1: 10_inode-highmem-1

	Avoid highmem pagecache pinning inodes in memory indefinitely;
	if we detect the "pinned inode" condition we shrink the cache.
	This should fix the last (known :) highmem vm unbalance problem.

Only in 2.4.19pre8aa3: 10_numa-sched-18

	Dropped in favour of the o1 scheduler, but it's not obsoleted
	by the o1 scheduler; for instance, all the runqueues should go into
	the zone-local memory so that even the spinlocks are local, etc.

Only in 2.4.19pre9aa1: 10_o1-sched-64-cpu-1

	Fix 64bit bug in o1 scheduler.

Only in 2.4.19pre9aa1: 10_o1-sched-fixes-1

	Other o1 sched fixes.

Only in 2.4.19pre9aa1: 10_o1-sched-nfs-1

	Fix nfs to compile with the o1 scheduler.

Only in 2.4.19pre9aa1: 10_parent-timeslice-10
Only in 2.4.19pre8aa3: 10_parent-timeslice-9

	Port this longstanding scheduler fix in the "share" timeslice
	algorithm on top of the o1 scheduler (with o1 the coding of the
	algorithm changed but the bug was still there). This bug has been
	noticed in real-life workloads that rendered the machine
	non-interactive during intensive bash/exec/wait workloads.

Only in 2.4.19pre8aa3: 10_tlb-state-2
Only in 2.4.19pre9aa1: 10_tlb-state-3

	The mainline patch is inferior; backed it out and resurrected this one.

Only in 2.4.19pre8aa3: 20_share-timeslice-2

	Obsoleted by the o1 scheduler; I had missed the code in
	wake_up_forked_process. I didn't verify that it really works as
	expected in practice with gdb, but the code looked ok. We'll see
	with the next round of benchmarks from Randy (the fact that 2.5
	wasn't forking as fast as previous -aa makes me think it might be
	related to a difference here).

Only in 2.4.19pre8aa3: 30_dyn-sched-6

	Obsoleted in favour of o1, which also provides similar
	dynamic priorities.

Only in 2.4.19pre8aa3: 50_uml-patch-2.4.18-25.gz
Only in 2.4.19pre9aa1: 50_uml-patch-2.4.18-30.gz

	New patch from Jeff. NOTE: this compiles with o1, but
	it hangs sometimes until you press a character on the
	console, and then it restarts as if it had never hanged.
	I checked that schedule_ticks keeps running and that
	rq->nr_running is zero, so I'm not sure what's wrong at the
	moment. It could even be a bug between -29 and -30; I tested
	only -30 with o1, OTOH -29 was working fine w/o o1. Jeff,
	could you give 2.4.19pre9aa1 a spin and see what's wrong with
	uml? (sounds like a problem with the uml internals that you
	certainly know better than anyone else :)

Only in 2.4.19pre8aa3: 51_uml-ac-to-aa-8
Only in 2.4.19pre9aa1: 51_uml-ac-to-aa-9
Only in 2.4.19pre9aa1: 51_uml-o1-1
Only in 2.4.19pre8aa3: 57_uml-dyn_sched-1
Only in 2.4.19pre8aa3: 59_uml-yield-1
Only in 2.4.19pre9aa1: 59_uml-yield-2

	Various updates to compile with o1.

Only in 2.4.19pre8aa3: 60_show-stack-1
Only in 2.4.19pre9aa1: 62_tux-dump-stack-1

	Use dump_stack instead.

Only in 2.4.19pre8aa3: 60_tux-exports-3
Only in 2.4.19pre9aa1: 60_tux-exports-4
Only in 2.4.19pre8aa3: 60_tux-kstat-3
Only in 2.4.19pre9aa1: 60_tux-kstat-4
Only in 2.4.19pre9aa1: 60_tux-o1-1

	Further o1 scheduler updates, this time for tux.

Only in 2.4.19pre8aa3: 70_xfs-1.1-1.gz
Only in 2.4.19pre9aa1: 70_xfs-1.1-2.gz
Only in 2.4.19pre9aa1: 75_compile-dmapi-1

	Rediffed.

Only in 2.4.19pre8aa3: 80_x86_64-common-code-3
Only in 2.4.19pre9aa1: 80_x86_64-common-code-4
Only in 2.4.19pre8aa3: 82_x86-64-compile-aa-5
Only in 2.4.19pre9aa1: 82_x86-64-compile-aa-6

	Some fixups for rejects and the first o1 bits, but not complete yet.

Andrea


* Re: 2.4.19pre9aa1
  2002-05-30  1:01 2.4.19pre9aa1 Andrea Arcangeli
  2002-05-30  0:38 ` 2.4.19pre9aa1 Marcelo Tosatti
@ 2002-05-30  1:32 ` William Lee Irwin III
  2002-05-30  1:40   ` 2.4.19pre9aa1 Andrea Arcangeli
  1 sibling, 1 reply; 6+ messages in thread
From: William Lee Irwin III @ 2002-05-30  1:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

On Thu, May 30, 2002 at 03:01:25AM +0200, Andrea Arcangeli wrote:
> NOTE: this release is highly experimental; while it has been solid so
> far it's not well tested yet, so please don't use it in production
> environments! (yet :)
> The o1 scheduler integration will take a few weeks to settle and to
> compile on all archs. I would suggest the big-iron folks give this
> kernel a spin, in particular for o1, the shm-rmid fix, the p4/pmd fix
> and the inode-leak fix. The only rejected feature is the node-affine
> allocation of per-cpu data structures from the numa-sched (it matters
> only for numa, but o1 is the more sensible optimization for numa
> anyways). Currently only x86 and alpha compile and run as expected.
> x86-64, ia64, ppc, s390* and sparc64 don't compile yet. uml, worst of
> all, compiles but doesn't run correctly :), however it runs pretty well
> too; it simply hangs sometimes, and you have to press a key in the
> terminal and then it resumes as if nothing had happened.

I noticed what looked like missed wakeups in tty code in early 2.4.x
ports of the O(1) scheduler, though I saw a somewhat different failure
mode, that is, the terminal echo would remain one character behind
forever (and if it happened again, more than one). I never got a real
answer to this, unfortunately, as it appeared to go away after a certain
revision of the scheduler. The failure mode you describe is slightly
different, but perhaps related.

And thanks for looking into shm, I understand that area is a bit
painful to work around, but fixes are certainly needed there.


Cheers,
Bill


* Re: 2.4.19pre9aa1
  2002-05-30  1:32 ` 2.4.19pre9aa1 William Lee Irwin III
@ 2002-05-30  1:40   ` Andrea Arcangeli
  2002-05-31 19:34     ` 2.4.19pre9aa1 Andrea Arcangeli
  0 siblings, 1 reply; 6+ messages in thread
From: Andrea Arcangeli @ 2002-05-30  1:40 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel

On Wed, May 29, 2002 at 06:32:00PM -0700, William Lee Irwin III wrote:
> On Thu, May 30, 2002 at 03:01:25AM +0200, Andrea Arcangeli wrote:
> > NOTE: this release is highly experimental; while it has been solid so
> > far it's not well tested yet, so please don't use it in production
> > environments! (yet :)
> > The o1 scheduler integration will take a few weeks to settle and to
> > compile on all archs. I would suggest the big-iron folks give this
> > kernel a spin, in particular for o1, the shm-rmid fix, the p4/pmd fix
> > and the inode-leak fix. The only rejected feature is the node-affine
> > allocation of per-cpu data structures from the numa-sched (it matters
> > only for numa, but o1 is the more sensible optimization for numa
> > anyways). Currently only x86 and alpha compile and run as expected.
> > x86-64, ia64, ppc, s390* and sparc64 don't compile yet. uml, worst of
> > all, compiles but doesn't run correctly :), however it runs pretty well
> > too; it simply hangs sometimes, and you have to press a key in the
> > terminal and then it resumes as if nothing had happened.
> 
> I noticed what looked like missed wakeups in tty code in early 2.4.x
> ports of the O(1) scheduler, though I saw a somewhat different failure
> mode, that is, the terminal echo would remain one character behind
> forever (and if it happened again, more than one). I never got a real
> answer to this, unfortunately, as it appeared to go away after a certain
> revision of the scheduler. The failure mode you describe is slightly
> different, but perhaps related.

interesting, a tty problem could probably explain it, but since it's
reproducible only with uml it should still be some uml internal that
broke, not a generic bug; there are no changes to the tty code, and it's
unlikely that only the tty code broke due to a generic o1 bug and that
additionally it is reproducible only in uml.

> 
> And thanks for looking into shm, I understand that area is a bit
> painful to work around, but fixes are certainly needed there.

you're very welcome.

Andrea


* Re: 2.4.19pre9aa1
  2002-05-30  0:38 ` 2.4.19pre9aa1 Marcelo Tosatti
@ 2002-05-30  1:43   ` Andrea Arcangeli
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2002-05-30  1:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]

On Wed, May 29, 2002 at 09:38:29PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Thu, 30 May 2002, Andrea Arcangeli wrote:
> 
> > Only in 2.4.19pre8aa3: 00_get_pid-no-deadlock-and-boosted-3
> > Only in 2.4.19pre9aa1: 10_get_pid-no-deadlock-and-boosted-4
> >
> > 	Discard the inferior attempt in pre9 and rediff (as Ihno noticed,
> > 	in practice the complexity dominates; if you fill the pid space the
> > 	fix in mainline is useless anyway). I wonder why this much better
> > 	fix hasn't been merged instead (it has been submitted for both 2.2
> > 	and 2.4). This also fixes a longstanding fork race, present even in
> > 	2.2, that can lead to two tasks getting the same pid.
> 
> Could you be more verbose in explaining the problems with the current
> approach and the advantages of your patch?

see the attached email (I got no reply IIRC); I sent it to you and Linus
some weeks ago (only for the commentary; the latest version of the patch
is 10_get_pid-no-deadlock-and-boosted-4, which btw probably won't apply
cleanly anymore against pristine pre9 mainline, because I put o1 at the
very top of the patch-chain, since it's the thing most likely to need
updates at the moment)

Andrea

[-- Attachment #2: Type: message/rfc822, Size: 7235 bytes --]

From: Andrea Arcangeli <andrea@suse.de>
To: Marcelo Tosatti <marcelo@conectiva.com.br>
Cc: linux-kernel@vger.kernel.org, Ihno Krumreich <ihno@suse.de>, Linus Torvalds <torvalds@transmeta.com>
Subject: get_pid fixes against 2.4.19pre7
Date: Fri, 26 Apr 2002 13:44:09 +0200
Message-ID: <20020426134409.C19278@dualathlon.random>

Hello,

Could you have a look at these get_pid fixes? Besides fixing the
deadlock while running out of pids and reducing the complexity of
get_pid from quadratic to linear, they also address a longstanding
non-trivial race, present in 2.2 too, that can lead to pid collisions
even on UP (noticed today while merging two more fixes from Ihno on the
other part). The fix reduces the scalability of simultaneous forks from
different cpus a bit, but it's obviously right at least. Putting
non-ready tasks into the tasklists asks for trouble (signals...). For
more details see the comment at the end of the patch. If you have
suggestions they're welcome, thanks.

Patch in pre7aa2 against 2.4.19pre7:

diff -urN 2.4.19pre7/include/linux/threads.h get_pid-1/include/linux/threads.h
--- 2.4.19pre7/include/linux/threads.h	Thu Apr 18 07:51:30 2002
+++ get_pid-1/include/linux/threads.h	Fri Apr 26 09:10:30 2002
@@ -19,6 +19,6 @@
 /*
  * This controls the maximum pid allocated to a process
  */
-#define PID_MAX 0x8000
+#define PID_NR 0x8000
 
 #endif
diff -urN 2.4.19pre7/kernel/fork.c get_pid-1/kernel/fork.c
--- 2.4.19pre7/kernel/fork.c	Tue Apr 16 08:12:09 2002
+++ get_pid-1/kernel/fork.c	Fri Apr 26 10:25:33 2002
@@ -37,6 +37,12 @@
 
 struct task_struct *pidhash[PIDHASH_SZ];
 
+/*
+ * Protects next_unsafe and last_pid, and avoids races
+ * between get_pid and SET_LINKS().
+ */
+static DECLARE_MUTEX(getpid_mutex);
+
 void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
 {
 	unsigned long flags;
@@ -79,51 +85,105 @@
 	init_task.rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
 }
 
-/* Protects next_safe and last_pid. */
-spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
-
+/*
+ *	Get the next free pid for a new process/thread.
+ *
+ *	Strategy: last_pid and next_unsafe (excluded) are an interval where all pids
+ *		  are free, so next pid is just last_pid + 1 if it's also < next_unsafe.
+ *		  If last_pid + 1 >= next_unsafe the interval is completely used.
+ *		  In this case a bitmap with all used pids/tgids/pgrps/sessions
+ *		  is created. This bitmap is searched for the next free pid and next_unsafe.
+ *		  If all pids are used, a kernel warning is issued.
+ */
 static int get_pid(unsigned long flags)
 {
-	static int next_safe = PID_MAX;
+	static int next_unsafe = PID_NR;
+#define PID_FIRST	2 /* pid 1 is init, first usable pid is 2 */
+#define PID_BITMAP_SIZE	((((PID_NR + 7) / 8) + sizeof(long) - 1 ) / (sizeof(long)))
+	/*
+	 * Even if this could be local per-thread, keep it static and protected by
+	 * the lock because we don't want to overflow the stack and we wouldn't
+	 * SMP scale better anyways. It doesn't waste disk space because it's in
+	 * the .bss.
+	 */
+	static unsigned long pid_bitmap[PID_BITMAP_SIZE];
+
+	/* from here the stuff on the stack */
 	struct task_struct *p;
-	int pid;
+	int pid, found_pid;
 
 	if (flags & CLONE_PID)
 		return current->pid;
 
-	spin_lock(&lastpid_lock);
-	if((++last_pid) & 0xffff8000) {
-		last_pid = 300;		/* Skip daemons etc. */
-		goto inside;
-	}
-	if(last_pid >= next_safe) {
-inside:
-		next_safe = PID_MAX;
+	pid = last_pid + 1;
+	if (pid >= next_unsafe) {
+		next_unsafe = PID_NR;
+		memset(pid_bitmap, 0, PID_BITMAP_SIZE*sizeof(long));
+
 		read_lock(&tasklist_lock);
-	repeat:
+		/*
+		 * Build the bitmap and calc next_unsafe.
+		 */
 		for_each_task(p) {
-			if(p->pid == last_pid	||
-			   p->pgrp == last_pid	||
-			   p->tgid == last_pid	||
-			   p->session == last_pid) {
-				if(++last_pid >= next_safe) {
-					if(last_pid & 0xffff8000)
-						last_pid = 300;
-					next_safe = PID_MAX;
+			set_bit(p->pid, pid_bitmap);
+			set_bit(p->pgrp, pid_bitmap);
+			set_bit(p->tgid, pid_bitmap);
+			set_bit(p->session, pid_bitmap);
+
+			if (next_unsafe > p->pid && p->pid > pid)
+				next_unsafe = p->pid;
+			if (next_unsafe > p->pgrp && p->pgrp > pid)
+				next_unsafe = p->pgrp;
+			if (next_unsafe > p->tgid && p->tgid > pid)
+				next_unsafe = p->tgid;
+			if (next_unsafe > p->session && p->session > pid)
+				next_unsafe = p->session;
+		}
+
+		/*
+		 * Release the tasklist_lock, after the unlock it may happen that
+		 * a pid is freed while it's still marked in use
+		 * in the pid_bitmap[].
+		 */
+		read_unlock(&tasklist_lock);
+
+		found_pid = find_next_zero_bit(pid_bitmap, PID_NR, pid);
+		if (found_pid >= PID_NR) {
+			next_unsafe = 0; /* depends on PID_FIRST > 0 */
+			found_pid = find_next_zero_bit(pid_bitmap, pid, PID_FIRST);
+			/* We scanned the whole bitmap without finding a free pid. */
+			if (found_pid >= pid) {
+				static long last_get_pid_warning;
+				if ((unsigned long) (jiffies - last_get_pid_warning) >= HZ) {
+					printk(KERN_NOTICE "No more PIDs (PID_NR = %d)\n", PID_NR);
+					last_get_pid_warning = jiffies;
 				}
-				goto repeat;
+				return -1;
+			}
+		}
+
+		pid = found_pid;
+
+		if (pid > next_unsafe) {
+			/* recalc next_unsafe by looking for the next bit set in the bitmap */
+			unsigned long * start = pid_bitmap;
+			unsigned long * p = start + (pid / (sizeof(long) * 8));
+			unsigned long * end = pid_bitmap + PID_BITMAP_SIZE;
+			unsigned long mask = ~((1UL << (pid & ((sizeof(long) * 8 - 1)))) - 1);
+
+			*p &= (mask << 1);
+
+			while (p < end) {
+				if (*p) {
+					next_unsafe = ffz(~*p) + (p - start) * sizeof(long) * 8;
+					break;
+				}
+				p++;
 			}
-			if(p->pid > last_pid && next_safe > p->pid)
-				next_safe = p->pid;
-			if(p->pgrp > last_pid && next_safe > p->pgrp)
-				next_safe = p->pgrp;
-			if(p->session > last_pid && next_safe > p->session)
-				next_safe = p->session;
 		}
-		read_unlock(&tasklist_lock);
 	}
-	pid = last_pid;
-	spin_unlock(&lastpid_lock);
+
+	last_pid = pid;
 
 	return pid;
 }
@@ -623,7 +683,10 @@
 	p->state = TASK_UNINTERRUPTIBLE;
 
 	copy_flags(clone_flags, p);
+	down(&getpid_mutex);
 	p->pid = get_pid(clone_flags);
+	if (p->pid < 0) /* valid pids are >= 0 */
+		goto bad_fork_cleanup;
 
 	p->run_list.next = NULL;
 	p->run_list.prev = NULL;
@@ -730,7 +793,17 @@
 		list_add(&p->thread_group, &current->thread_group);
 	}
 
+	/*
+	 * We must do the SET_LINKS() under the getpid_mutex, to prevent
+	 * another CPU from getting our same PID between the release of the
+	 * getpid_mutex and the SET_LINKS().
+	 *
+	 * In short to avoid SMP races the new child->pid must be just visible
+	 * in the tasklist by the time we drop the getpid_mutex.
+	 */
 	SET_LINKS(p);
+	up(&getpid_mutex);
+
 	hash_pid(p);
 	nr_threads++;
 	write_unlock_irq(&tasklist_lock);
@@ -757,6 +830,7 @@
 bad_fork_cleanup_files:
 	exit_files(p); /* blocking */
 bad_fork_cleanup:
+	up(&getpid_mutex);
 	put_exec_domain(p->exec_domain);
 	if (p->binfmt && p->binfmt->module)
 		__MOD_DEC_USE_COUNT(p->binfmt->module);

Andrea


* Re: 2.4.19pre9aa1
  2002-05-30  1:40   ` 2.4.19pre9aa1 Andrea Arcangeli
@ 2002-05-31 19:34     ` Andrea Arcangeli
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2002-05-31 19:34 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel

On Thu, May 30, 2002 at 03:40:09AM +0200, Andrea Arcangeli wrote:
> On Wed, May 29, 2002 at 06:32:00PM -0700, William Lee Irwin III wrote:
> > On Thu, May 30, 2002 at 03:01:25AM +0200, Andrea Arcangeli wrote:
> > > NOTE: this release is highly experimental; while it has been solid so
> > > far it's not well tested yet, so please don't use it in production
> > > environments! (yet :)
> > > The o1 scheduler integration will take a few weeks to settle and to
> > > compile on all archs. I would suggest the big-iron folks give this
> > > kernel a spin, in particular for o1, the shm-rmid fix, the p4/pmd fix
> > > and the inode-leak fix. The only rejected feature is the node-affine
> > > allocation of per-cpu data structures from the numa-sched (it matters
> > > only for numa, but o1 is the more sensible optimization for numa
> > > anyways). Currently only x86 and alpha compile and run as expected.
> > > x86-64, ia64, ppc, s390* and sparc64 don't compile yet. uml, worst of
> > > all, compiles but doesn't run correctly :), however it runs pretty well
> > > too; it simply hangs sometimes, and you have to press a key in the
> > > terminal and then it resumes as if nothing had happened.
> > 
> > I noticed what looked like missed wakeups in tty code in early 2.4.x
> > ports of the O(1) scheduler, though I saw a somewhat different failure
> > mode, that is, the terminal echo would remain one character behind
> > forever (and if it happened again, more than one). I never got a real
> > answer to this, unfortunately, as it appeared to go away after a certain
> > revision of the scheduler. The failure mode you describe is slightly
> > different, but perhaps related.
> 
> interesting, a tty problem could probably explain it, but since it's

JFYI: the uml hang went away with 2.4.19pre9aa2, not sure why. I'm
starting to wonder whether it happened because I didn't run a full
'make distclean' while updating it; maybe it was miscompiled.

> reproducible only with uml it should still be some uml internal that
> broke, not a generic bug; there are no changes to the tty code, and it's
> unlikely that only the tty code broke due to a generic o1 bug and that
> additionally it is reproducible only in uml.
> 
> > 
> > And thanks for looking into shm, I understand that area is a bit
> > painful to work around, but fixes are certainly needed there.
> 
> you're very welcome.
> 
> Andrea


Andrea

