From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Subject: Re: PROBLEM: relay - stale data copied to user space
From: Martin Peschke
In-Reply-To: <1237436347.7834.13.camel@charm-linux>
References: <1237388848.4084.64.camel@kitka.ibm.com> <1237436347.7834.13.camel@charm-linux>
Content-Type: text/plain
Date: Thu, 19 Mar 2009 18:50:40 +0100
Message-Id: <1237485040.4752.16.camel@kitka.ibm.com>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-Archive:
List-Post:
To: Tom Zanussi
Cc: linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org
List-ID:

On Wed, 2009-03-18 at 23:19 -0500, Tom Zanussi wrote:
> On Wed, 2009-03-18 at 16:07 +0100, Martin Peschke wrote
> > This is my theory:
> > Timing matters. It's a race caused by improper protection of critical
> > sections in a producer-consumer scenario. A bug in the bookkeeping
> > allows a reader to read at a position that is just being written to.
> >
>
> It does look consistent with a reader reading an event that's been
> reserved but not yet written, or partially written e.g. if an event
> being written on one cpu was read by another before the first one
> finished.

So this is part of relay's design, and it's up to user space to make
sure that reader and writer are on the same CPU?

> Can you see if the below patch to blktrace userspace helps?

It appears to fix it. I will give it more testing in a larger
environment.

> Or failing that, explicitly using gettid() in place of getpid() in
> sched_setaffinity(). Or, failing that, you had mentioned previously
> that you would try to reproduce the problem on your laptop - were you
> able to do that? If so, it would help in debugging it further...

This didn't work out. But then, it's a single-CPU machine.

Thanks,
Martin
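
For reference, a minimal sketch of the gettid() variant suggested in the
quoted text. This is not the blktrace patch, which is not quoted in this
mail, and the helper name pin_calling_thread() is hypothetical. It
assumes Linux, where sched_setaffinity() takes a thread ID: passing
getpid() from any thread other than the main one sets the main thread's
affinity, not the caller's.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from blktrace): restrict the calling
 * thread, and only that thread, to a single CPU.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int pin_calling_thread(int cpu)
{
	cpu_set_t mask;
	/* Use the raw syscall; older glibc versions have no gettid() wrapper. */
	pid_t tid = (pid_t)syscall(SYS_gettid);

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);

	/* A thread ID here affects just this thread. */
	return sched_setaffinity(tid, sizeof(mask), &mask);
}
```

A per-CPU reader thread would call pin_calling_thread(cpu) once before
it starts reading that CPU's relay buffer. Passing 0 as the first
argument to sched_setaffinity() also selects the calling thread.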