From: NeilBrown
Subject: Re: PROBLEM: write to jbod with 3TB and 160GB drives hits BUG/oops
Date: Mon, 27 Apr 2015 11:11:59 +1000
Message-ID: <20150427111159.46b9781a@notabene.brown>
References: <55398462.1000202@cox.net> <553AA35A.5050300@cox.net>
In-Reply-To: <553AA35A.5050300@cox.net>
To: Charles Bertsch
Cc: linux-raid@vger.kernel.org, "BertschC@acm.org"

On Fri, 24 Apr 2015 13:11:06 -0700 Charles Bertsch wrote:

> On 04/23/2015 06:55 PM, NeilBrown wrote:
> >
> > By "jbod" I assume you mean "linear array".
> >
> > You say this happens without any filesystem on the array, yet the stack
> > traces clearly show ext2 in use.
> > Maybe some weird interaction is happening between the filesystem and the
> > linear array.
> > But please confirm that the stack trace happened when there was no
> > filesystem on the array you were testing, and report which filesystems
> > you do have that use ext2.
> >
> Neil --
> Yes, I do mean linear array.
>
> At the point of the stack trace, there was no file-system on the linear
> 2-drive array.  The test-jbod-2 script would create the array and then
> write directly to /dev/md0.  Any evidence of a previously existing
> file-system would have been obliterated by earlier runs copying
> /dev/zero everywhere.
>
> The file-systems in use --
> -- The rootfs is an initrd file, squashfs, and mounted read-only.
> -- An ext3 for configuration and logs is mounted RW on /flash
> -- An ext2 using 8MB of RAM is mounted RW on /var
> -- The file-server is derived from a much earlier design that required
> some RW directories within the root.  These entries appear in the mount
> command as ext2, but are part of /var (and not separate file systems) --
> -- mount --bind /var/hd /hd
> -- mount --bind /var/home /home
>
> -- A devtmpfs mounted on /dev, tmpfs on /dev/shm, proc on /proc, sysfs
> on /sys, and another mount --bind from within /flash for nfs.
>
> # mount
> /dev/root on / type squashfs (ro,relatime)
> devtmpfs on /dev type devtmpfs (rw,relatime,size=1002600k,nr_inodes=250650,mode=755)
> proc on /proc type proc (rw,relatime)
> sysfs on /sys type sysfs (rw,relatime)
> /dev/ram1 on /var type ext2 (rw,relatime,errors=continue)
> /dev/ram1 on /hd type ext2 (rw,relatime,errors=continue)
> /dev/ram1 on /home type ext2 (rw,relatime,errors=continue)
> tmpfs on /dev/shm type tmpfs (rw,relatime)
> /dev/sdb1 on /flash type ext3 (rw,noatime,errors=continue,commit=60,barrier=1,data=ordered)
> /dev/sdb1 on /var/lib/nfs type ext3 (rw,noatime,errors=continue,commit=60,barrier=1,data=ordered)
> nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
> #

Thanks for the details.

On the whole, I don't think it is likely that your problem is directly
related to md - just a coincidence that it happened while you were using md.
But one never knows until the actual cause is found.

>
> > Is there any chance you could use "git bisect" to find out exactly which
> > commit introduced the problem?  That is the most likely path to a
> > solution.
> >
>
> I am not familiar with "git bisect".  Would this be similar to
> downloading a series of kernel releases from linux-3.3.5 up to 3.18.5,
> using a binary search to find which release (rather than which commit)
> has the problem?

Similar, but some of the boring work is done for you.

It would be best to stick to mainline kernels for testing, i.e. just '3.x',
not '3.x.y'.  So presumably 3.3 works and 3.18 fails.  In that case:

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
  cd linux
  git bisect start
  git bisect good v3.3
  git bisect bad v3.18

That should get you started, except that it seems to take an incredibly long
time, so probably do the first few steps by hand.  e.g. run

  git checkout v3.10

and test that.  Then try v3.7 or v3.14.  Once you know which of those are
good or bad, run e.g.

  git bisect start
  git bisect good v3.7
  git bisect bad v3.10

and git will check out a kernel somewhere in the middle and tell you there
are 14 (or so) steps to go.  Then build and test that kernel.  If it is
good, run "git bisect good"; if bad, "git bisect bad".

If you can persist through testing over a dozen kernels (it takes some
patience!!), it should lead you to the commit that introduced the problem.

It is always best to be cautious before declaring a kernel 'good' - run the
test a few times.
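For what it's worth, each round of the bisection looks roughly like the
sketch below.  The build and install commands are only illustrative - adapt
the config handling, the install step, and the parallelism to your own
setup, and use your test-jbod-2 script as the test.

  # git bisect has already checked out the next candidate commit
  make oldconfig                 # carry your existing .config forward
  make -j4 bzImage modules
  make modules_install install   # or however you normally install a kernel
  # reboot into the new kernel, run the test a few times, then record the
  # result with exactly one of:
  git bisect good                # no oops seen
  git bisect bad                 # hit the BUG/oops
  # "git bisect log" shows what has been decided so far, and
  # "git bisect reset" returns you to your original branch when done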
>
> Thanks
>
> Charles Bertsch

NeilBrown