From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934582Ab1IOTcm (ORCPT ); Thu, 15 Sep 2011 15:32:42 -0400 Received: from casper.infradead.org ([85.118.1.10]:45789 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934571Ab1IOTck convert rfc822-to-8bit (ORCPT ); Thu, 15 Sep 2011 15:32:40 -0400 Subject: Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme From: Peter Zijlstra To: Manfred Spraul Cc: Ingo Molnar , Thomas Gleixner , linux-kernel@vger.kernel.org, Steven Rostedt , Darren Hart , David Miller , Eric Dumazet , Mike Galbraith Date: Thu, 15 Sep 2011 21:32:15 +0200 In-Reply-To: <4E7235F6.1030303@colorfullife.com> References: <20110914133034.687048806@chello.nl> <20110914133750.916911903@chello.nl> <4E7235F6.1030303@colorfullife.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8BIT X-Mailer: Evolution 3.0.3- Message-ID: <1316115135.4060.19.camel@twins> Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2011-09-15 at 19:29 +0200, Manfred Spraul wrote: > Hi Peter, > What is broken? I'm not quite sure yet, but the results are that sembench doesn't complete properly; http://oss.oracle.com/~mason/sembench.c That seems to be happening is that we get spurious wakeups in the ipc/sem code resulting it semtimedop returning -EINTR, even though there's no pending signal. (there really should be a if (!signal_pending(current)) goto again thing in that semtimedop wait loop) Adding a loop in userspace like: again: ret = semtimedop(semid_lookup[l->id], &sb, 1, tvp); if (ret) { if (errno == EINTR) { l->spurious++; kill_tracer(); goto again; } perror("semtimedop"); } makes it complete again (although performance seems to suffer a lot compared to a kernel without this patch). It seems related to patch 2/3 converting the futex code, without that patch I can't seem to reproduce. All this is strange though, because if there were multiple wakeups on the same task wake_lists ought to result in less wakeups in total, not more. I've been trying to trace the thing but so far no luck.. when I enable too much tracing it goes away.. silly heisenbugger. > > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > > struct sem_queue *q, int error) > > { > > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > > > + get_task_struct(p); > > + q->status = error; > > + /* > > + * implies a full barrier > > + */ > > + wake_list_add(wake_list, p); > > + put_task_struct(p); > > } > I think the get_task_struct()/put_task_struct is not necessary: > Just do the wake_list_add() before writing q->status: > wake_list_add() is identical to list_add_tail(&q->simple_list, pt). > [except that it contains additional locking, which doesn't matter here] But the moment we write q->status, q can disappear right? Suppose the task gets a wakeup (say from a signal) right after we write q->status. Then p can disappear (do_exit) and we'd try to enqueue dead memory -> BOOM! > > +static void wake_up_sem_queue_do(struct wake_list_head *wake_list) > > { > > + wake_up_list(wake_list, TASK_ALL); > > } > > > wake_up_list() calls wake_up_state() that calls try_to_wake_up(). > try_to_wake_up() seems to return immediately when the state is TASK_DEAD. > > That leaves: Is it safe to call wake_up_list() in parallel with do_exit()? > The current implementation avoids that. Ah, wake_list_add() does get_task_struct() and wake_up_list() will first issue the wakeup and then drop the reference. Hrmm,. it looks like its all these atomic ops {get,put}_task_struct() that are causing the performance drop.. I just removed the ones in wake_up_sem_queue_prepare() just for kicks and I got about half my performance gap back.