From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from sender163-mail.zoho.com ([74.201.84.163]:24597 "EHLO sender163-mail.zoho.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751806AbcFFCko (ORCPT ); Sun, 5 Jun 2016 22:40:44 -0400
From: "James Johnston"
To: "'Chris Murphy'" , "'Mladen Milinkovic'"
Cc: "'Austin S. Hemmelgarn'" , "'Martin'" , "'Btrfs BTRFS'"
References: <73123a36-6502-d735-c813-fce43b620e5a@smoothware.net>
In-Reply-To:
Subject: RE: Recommended why to use btrfs for production?
Date: Mon, 6 Jun 2016 02:40:36 -0000
Message-ID: <0b4e01d1bf9c$cf89c110$6e9d4330$@codenest.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On 06/06/2016 at 01:47, Chris Murphy wrote:
> On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovic wrote:
> > On 06/03/2016 04:05 PM, Chris Murphy wrote:
> >> Make certain the kernel command timer value is greater than the driver
> >> error recovery timeout. The former is found in sysfs, per block
> >> device, the latter can be get and set with smartctl. Wrong
> >> configuration is common (it's actually the default) when using
> >> consumer drives, and inevitably leads to problems, even the loss of
> >> the entire array. It really is a terrible default.
> >
> > Since it's first time i've heard of this I did some googling.
> >
> > Here's some nice article about these timeouts:
> > http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/comment-page-1/
> >
> > And some udev rules that should apply this automatically:
> > http://comments.gmane.org/gmane.linux.raid/48193
>
> Yes it's a constant problem that pops up on the linux-raid list.
> Sometimes the list is quiet on this issue but it really seems like
> it's once a week. From last week...
>
> http://www.spinics.net/lists/raid/msg52447.html

It seems like it would be useful if the distributions or the kernel could
automatically set the kernel timeout to an appropriate value. If the TLER
setting can indeed be queried via smartctl, then it would be easy to read it
automatically and calculate a suitable timeout from it.

A RAID-oriented drive would end up keeping the current 30 seconds. If TLER
can't be queried successfully, or the drive just doesn't support it, then
assume a consumer drive and set the timeout to 180 seconds. That way, zero
user configuration would be needed in the common case.

Or is it not that simple?

James
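
P.S. A rough sketch of the decision I have in mind, in Python. The assumptions
are mine and untested on real hardware: that the text comes from
`smartctl -l scterc <device>`, that an enabled drive reports lines shaped like
"Read:     70 (7.0 seconds)" with the value in tenths of a second, and that a
disabled or unsupported drive prints no such numeric lines. The name
pick_timeout is made up for illustration.

```python
import re

DEFAULT_TIMEOUT = 30    # seconds; the current kernel default
CONSUMER_TIMEOUT = 180  # seconds; fallback when TLER is absent or disabled

# Assumed line shape in `smartctl -l scterc` output: "Read:     70 (7.0 seconds)"
_ERC_LINE = re.compile(r"^\s*(Read|Write):\s+(\d+)\s+\(", re.MULTILINE)


def pick_timeout(scterc_output: str) -> int:
    """Return a kernel command timeout in seconds for one drive."""
    # Values are reported in tenths of a second; convert to seconds.
    limits = {kind: int(value) / 10
              for kind, value in _ERC_LINE.findall(scterc_output)}

    # "Disabled" or "not supported" yields no numeric Read/Write pair:
    # treat the drive as a consumer drive.
    if set(limits) != {"Read", "Write"}:
        return CONSUMER_TIMEOUT

    # The kernel timer must exceed the drive's own recovery limit, so a
    # drive configured to retry for 30 seconds or more also gets the long one.
    if max(limits.values()) >= DEFAULT_TIMEOUT:
        return CONSUMER_TIMEOUT

    return DEFAULT_TIMEOUT
```

A udev rule or boot script would then write the result to the per-device
timeout in sysfs (on my systems that is /sys/block/<dev>/device/timeout).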