From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S268083AbUJGVLW (ORCPT ); Thu, 7 Oct 2004 17:11:22 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S268042AbUJGVJZ (ORCPT ); Thu, 7 Oct 2004 17:09:25 -0400 Received: from cantor.suse.de ([195.135.220.2]:40396 "EHLO Cantor.suse.de") by vger.kernel.org with ESMTP id S268283AbUJGUtk (ORCPT ); Thu, 7 Oct 2004 16:49:40 -0400 Date: Thu, 7 Oct 2004 22:46:53 +0200 From: Olaf Hering To: linux-kernel@vger.kernel.org Cc: Olaf Hering Subject: out of order execution of shell commands Message-ID: <20041007204653.GA26820@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-DOS: I got your 640K Real Mode Right Here Buddy! X-Homeland-Security: You are not supposed to read this line! You are a terrorist! User-Agent: Mutt und vi sind doch schneller als Notes (und GroupWise) Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org we have a weird bug where the open of the shell output redirection happens before the mkdir command returns. As a result, the redirection will fail. Unexpected data loss. This bug was seen on ppc64 (dual ppc970) and also on i386, single cpu 3GHz xeon with smp kernel. Running kernel was our SLES9 2.6.5. It did also happen on ppc64 with 2.6.9-rc2-mm4, but I havent verified that. It was not seen with just one script running, but with 2 (or more). The inner loop fails after 2-3 hours, sometimes earlier, sometimes never. Some tracing shows: sys32_open(2076) fs-stress-olh.s(2736):c1,j3193150 S /dev/zero sys32_open(2076) fs-stress-olh.s(2736):c1,j3193154 S 01/11/59/8/data sys32_open(2076) fs-stress-olh.s(2736):c1,j3193157 S ./01/11/59/9/DIRNAME sys_mkdir(1696) fs-stress-olh.s(2736):c1,j3193162 E 01/11/59/9 That means, a gap of up 5 jiffies betwen the start of the open syscall and the end of the mkdir syscall. In this case, everything happend on cpu1. The dd command does never fail. If I add more printk (start/end of sys_open, sys_mkdir), the bug will disappear. A quick look at the bash2 sources shows a wait(-1,..), I believe the bash does nothing unusual. Any ideas whats going on here, and how to debug it further? testcase: fs-stress-olh.sh #!/bin/bash export PATH= /bin/echo $$ > /dev/shm/bug42232-pid-$$ /bin/mkdir stress-root cd stress-root /bin/echo Setting up ... for i in 0 1 2 3 4 5 6 7 8 9; do for j in 0 1 2 3 4 5 6 7 8 9; do for k in 0 1 2 3 4 5 6 7 8 9; do /bin/echo -n "$i$j$k " >> onethousand done /bin/echo -n "$i$j " >> onehundret done /bin/echo -n "$i " >> ten done /bin/echo -n "0 1 2 3 4 5 6 7 8 9" > onethousand for h in 0 1 ; do for i in `/bin/cat onethousand`; do /bin/mkdir $h$i /bin/echo $h$i > ./$h$i/DIRNAME for j in `/bin/cat onehundret`; do /bin/mkdir $h$i/$j /bin/echo $h$i/$j > ./$h$i/$j/DIRNAME for k in `/bin/cat onehundret`; do /bin/mkdir $h$i/$j/$k /bin/echo $h$i/$j/$k > ./$h$i/$j/$k/DIRNAME for l in `/bin/cat ten`; do /bin/mkdir $h$i/$j/$k/$l /bin/echo $h$i/$j/$k/$l > ./$h$i/$j/$k/$l/DIRNAME || /bin/kill -STOP `/bin/cat /dev/shm/bug42232-pid-*` /bin/dd if=/dev/zero of=$h$i/$j/$k/$l/data bs=1024 count=1 &> /dev/zero done done done done done cd .. /bin/rm -rf stress-root /dev/shm/bug42232-pid-$$ start_test.sh #!/bin/bash set -ex cd /bug42232 test -x ~olh/bugs/42232/fs-stress-olh.sh numcpus=`grep -Ec '^cpu[0-9]' /proc/stat` for i in `seq 1 $numcpus` do (rm -rf $i ; mkdir -v $i; cd $i || exit 1 ; pwd ; nohup sh -c 'while date ; do ~olh/bugs/42232/fs-stress-olh.sh ; done' < /dev/null > logfile 2>&1 & ) done sleep $numcpus ( tail --follow=name --retry --lines=42 ?/logfile & ) -- USB is for mice, FireWire is for men! sUse lINUX ag, nÜRNBERG