From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ethan Wilson
Subject: Re: strange problem with raid6 read errors on active non-degraded array
Date: Wed, 02 Jul 2014 18:35:09 +0200
Message-ID: <53B434BD.30301@shiftmail.org>
References: <20140702103241.Horde.iempNvYRo99Ts9G5Op7ionA@webmail.aeiou.pt>
 <20140702204502.6b538fa8@notabene.brown>
 <20140702125434.Horde.abbwKfYRo99Ts-L6UvsCEIA@webmail.aeiou.pt>
 <20140702152429.742a3e8ea8bd100f5b3bae1f@bbaw.de>
 <20140702151406.Horde.HZoGSPYRo99TtBOu1q6B-GA@webmail.aeiou.pt>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To: <20140702151406.Horde.HZoGSPYRo99TtBOu1q6B-GA@webmail.aeiou.pt>
Sender: linux-raid-owner@vger.kernel.org
To: Pedro Teixeira, Lars Täuber
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

You have multiple bad-block lists (an MD feature) that are already full
of sectors. These are earlier disk errors that were recorded in the MD
headers (one list per drive).

MD will no longer try to read from such sectors, and during reads it
returns an error to the upper layers immediately. This happens when the
stripe does not have enough good components left to read after excluding
the bad blocks. For example, raid5 can tolerate up to 1 disk with bad
blocks in a stripe (and raid6 up to 2), so with 2 bad blocks on 2
different disks in the same raid5 stripe MD returns a read error
immediately, without trying.

That's why you are seeing read errors from MD in dmesg but not from the
component devices.

Now the question is how so many bad blocks could have been recorded on
your array. It seems very unlikely that so many disks of your array are
in such bad shape. This might indicate an MD bug in the badblocks code.

I am thinking of some form of erroneous propagation of bad blocks, so
that e.g.
writing to an area where an MD bad block exists, instead of clearing
the bad block, could have propagated it to the other disks in the same
stripe. Something like that.

See if you can check that writing to a bad block clears it. It will be
difficult to compute the correct offset to write to, though. You might
want to do some trial and error with dd together with blktrace. If you
can do that, you might also want to check that it behaves correctly
when writing something that is not aligned to 512b or 4k. Obviously
this test is destructive wrt your data in that location.

Another, easier test is to try to read with dd from a component device
itself. If MD has recorded a bad block there (even if that happened a
long time ago), the direct read with dd should also hit it, return an
error and stop, because bad blocks on the surface of disks do not heal
by themselves with time.

Another test is to read from md0 with dd from an area where you see
that only 1 disk has bad blocks (this probably requires some trial and
error with blktrace, because the offsets of md0 are not equal to the
offsets of the component devices). If MD works correctly, such a read
should "heal" the bad block: recompute the data from parity and the
other disks, then write over the bad block. The MD bad block should
disappear.

The last 2 tests I described should not be destructive except in case
of MD bugs.

EW

On 02/07/2014 16:14, Pedro Teixeira wrote:
> Hi Lars,
>
> the output of those commands:
>
> root@nas3:/# cat /sys/block/sdb/queue/physical_block_size
> 4096
> root@nas3:/# cat /sys/block/md0/queue/physical_block_size
> 4096
> root@nas3:/#
>
> The strange thing here is that dmesg is not polluted with sata errors
> like it is usual when a hard disk has bad sectors or some other
> hardware problem. the only thing in dmesg that hints to why reading
> the md volume fails are from dm itself.
>
> Cheers
> Pedro
>
>
> Quoting Lars Täuber
>> Hi Pedro,
>>
>> maybe an issue with the logical/physical blocksize?
>> What do these commands say:
>>
>> cat /sys/block/sdb/queue/physical_block_size
>> cat /sys/block/md0/queue/physical_block_size
>>
>> Seagate says there are 4096 bytes/sector on these devices.
>>
>> Lars
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
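
The tolerance rule from the reply above (a read fails at once when more
members of a stripe are marked bad than the parity can cover) can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions, not MD's actual code: it assumes each per-device
bad-block list is text with one "<first_sector> <length>" pair per line,
that all members use the same data offset so equal sector numbers fall
in the same stripe, and that a 4k unit (8 sectors) is a useful
granularity. The function names, device names and sample numbers are
made up.

```python
from collections import defaultdict


def parse_bad_blocks(text):
    """Parse lines of '<first_sector> <length>' into (start, length) tuples.

    The two-column layout is an assumption about the per-device list;
    lines that do not have exactly two fields are skipped.
    """
    ranges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        ranges.append((int(parts[0]), int(parts[1])))
    return ranges


def unreadable_units(bad_by_dev, parity_disks, unit_sectors=8):
    """Return the sorted unit indices that parity can no longer cover.

    A unit is `unit_sectors` sectors at the same offset on every member
    (assumes identical data offsets). It is unreadable when more than
    `parity_disks` members have a bad block inside it.
    """
    devs_per_unit = defaultdict(set)
    for dev, ranges in bad_by_dev.items():
        for start, length in ranges:
            first = start // unit_sectors
            last = (start + length - 1) // unit_sectors
            for unit in range(first, last + 1):
                devs_per_unit[unit].add(dev)
    return sorted(
        unit for unit, devs in devs_per_unit.items() if len(devs) > parity_disks
    )


# Made-up sample lists for three hypothetical members.
sample = {
    "sdb": parse_bad_blocks("1024 8\n4096 16\n"),
    "sdc": parse_bad_blocks("1024 8\n"),
    "sdd": parse_bad_blocks("1024 8\n4100 4\n"),
}

print(unreadable_units(sample, parity_disks=2))  # raid6-style tolerance
print(unreadable_units(sample, parity_disks=1))  # raid5-style tolerance
```

With the sample data, the unit at sector 1024 is bad on three members
and is lost under either tolerance, while the unit at sector 4096 is bad
on two members and is lost only with a single parity disk. Running
something like this over the real lists would show how many of the
recorded bad blocks line up across members, which is what the suspected
propagation would produce.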