From mboxrd@z Thu Jan 1 00:00:00 1970
From: James J
Subject: Re: md failing mechanism
Date: Sat, 23 Jan 2016 20:02:00 +0100
Message-ID: <56A3CE28.9090901@shiftmail.org>
References: <56A26E11.2090703@yandex.ru> <56A28309.9080806@turmel.org> <56A2A2C3.9000801@yandex.ru> <56A2BDF7.7020101@shiftmail.org> <56A389A9.1080203@youngman.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <56A389A9.1080203@youngman.org.uk>
Sender: linux-raid-owner@vger.kernel.org
To: Wols Lists , linux-raid
List-Id: linux-raid.ids

On 23/01/2016 15:09, Wols Lists wrote:
> On 22/01/16 23:40, James J wrote:
>> The recommendation of raising the timeout to 120+ is for the opposite
>> purpose of what you want. It is for the case the sysadmin accepts to
>> wait a long time because he wants to prevent the kicking of the drive at
>> the first read-error (normally drives are kicked for a write error).
>> This might be wanted in order to a) defer the replacement of the drive,
>> either to perform the replacement at a more opportune time and/or in a
>> better manner such as a no-degrade replace operation, or b) because he
>> does not want to replace the drive at all: maybe he believes that the
>> error might be spurious and will not happen again and the drive is still
>> of acceptable fitness for the purpose, e.g. in a low-cost file server.
> Except, aiui, even in your scenario! drives are kicked for a *write* error.
>
> What happens (should be) is the kernel times out, the raid handles the
> read error by trying a rewrite, the drive is still hung on the read
> error so it doesn't respond to the write request, and the drive gets
> kicked for a write failure.

Oh yes, you are correct, so the drive would be kicked after 60 seconds and not after 30 seconds, contrary to what I said.

So the sequence would be:

drive stuck on read
--> scsi read failure due to timeout at the 30th second
--> MD receives failure and attempts rewrite
--> scsi write failure due to timeout at the 60th second
--> drive kicked by MD at the 60th second

I think this is what should have happened, but it didn't happen like this anyway, so I think there is probably a kernel bug somewhere.
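
To make the arithmetic of the expected sequence explicit, here is a toy model in Python. It is only a sketch and not kernel code: the function names are made up for illustration, and it assumes that the per-command SCSI timeout is 30 seconds (the value discussed in this thread), that MD issues the rewrite immediately after the read failure, and that the drive stays hung for the whole period.

```python
# Toy model of the expected sequence; NOT kernel code.
# Assumption: 30 s per-command SCSI timeout, as discussed in this thread.
SCSI_TIMEOUT_S = 30


def expected_timeline(scsi_timeout_s=SCSI_TIMEOUT_S):
    """Return (elapsed_seconds, event) pairs for a drive that hangs on a read.

    Assumes the rewrite is issued with no delay after the read failure
    and that the drive never answers either command.
    """
    events = [(0, "drive stuck on read")]

    # The read command is abandoned after one full timeout period.
    t = scsi_timeout_s
    events.append((t, "scsi read failure due to timeout"))
    events.append((t, "MD receives failure and attempts rewrite"))

    # The rewrite hits the same hung drive and waits a second full period.
    t += scsi_timeout_s
    events.append((t, "scsi write failure due to timeout"))
    events.append((t, "drive kicked by MD"))
    return events


def kick_time(scsi_timeout_s=SCSI_TIMEOUT_S):
    """Elapsed seconds until the drive is kicked in this model."""
    return expected_timeline(scsi_timeout_s)[-1][0]


if __name__ == "__main__":
    for t, event in expected_timeline():
        print(f"{t:>4}s  {event}")
```

Under these assumptions the kick happens after two timeout periods, so raising the timeout in this model simply doubles through to the kick time (for example, `kick_time(120)` gives 240).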