From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ric Wheeler Subject: Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.! Date: Thu, 29 Apr 2010 16:13:03 -0400 Message-ID: <4BD9E84F.8000707@redhat.com> References: <1C9608B8A4CD534FB19C7C7543CBB249021C7718FA@inbmail02.lsi.com> <4BCC82DB.1070402@cs.wisc.edu> <201004192014.51019.bernd.schubert@fastmail.fm> <1271702740.2849.67.camel@mulgrave.site> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from mx1.redhat.com ([209.132.183.28]:45305 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933157Ab0D3RV6 (ORCPT ); Fri, 30 Apr 2010 13:21:58 -0400 In-Reply-To: <1271702740.2849.67.camel@mulgrave.site> Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: James Bottomley Cc: Bernd Schubert , Mike Christie , "Desai, Kashyap" , "linux-scsi@vger.kernel.org" On 04/19/2010 02:45 PM, James Bottomley wrote: > On Mon, 2010-04-19 at 20:14 +0200, Bernd Schubert wrote: > >> On Monday 19 April 2010, Mike Christie wrote: >> >>> On 04/19/2010 06:32 AM, Desai, Kashyap wrote: >>> >>>> I am facing one issue with scsi stack. >>>> Here is a background of my test. >>>> >>>> Mount ext3 file system with journaling support with barrier=1, commit=5 >>>> Now, with this setup file system will do submit_bh with WRITE_BARRIER >>>> flag set for interval of 5 seconds. (This is a part of journaling.) >>>> Eventually it will call queue_flush() which will generate SCSI command of >>>> CDB: SYNCHRONIZE_CAHCE and insert it into the request queue. I observed >>>> that creation of SYNCHRONIZE_CACHE is a part of sd_prepare_flush(). Here >>>> we have timeout set to SD_TIMEOUT but retries are not set. Because of >>>> retries of the request is not set, there is no retries allowed for >>>> SYNCHRONIZE_CACHE at mid layer. >>>> >>>> Because of zero retries for SYNCHRONIZE_CACHE command at mid-layer, it is >>>> creating trouble for file system. In current situation, Even though LLD >>>> send back commands with DID_RESET, SYNCHRONIZE_CACHE will fail >>>> immediately without going for any retries, when HBA is in recovery state. >>>> Eventually this information goes to File system and it sees >>>> SYNCHRONIZE_CAHCE is failed and file system goes to Read only mode. >>>> >>>> My question is "Can we add in sd_prepare_flush(), rq->retries = X" some >>>> reasonable retries value ? >>>> >>> I am not sure where we want it, but I think we want to be able to set >>> both the retries and timeout. I have seen where a sync cache can take >>> longer than the default 30 secs. >>> >>> Do you think we want to the block layer to manage retries/timeouts for >>> all block device flushes or is this more device specific? I was thinking >>> that we may want to create a sysfs interface under the block dirs and >>> have blk-sysfs.c and blk-barrier.c handle this. queue_flush could set >>> the timeout and retries that is set by some new files under >>> /sys/block/sdX/queue/ ? >>> >> >> Good that now also other people run into it. 30s is far too small for any >> hardware raid unit with SATA disks. >> > It's far too short for just about any HW RAID since they all tend to > have multi-megabytes to gigabytes of cache (some of the high end have > terrabytes). It has to be said that most arrays with battery backed > caches lie when asked to flush the cache, but we probably need to get > users into the habit of not using flush barriers with external Arrays. > > As you say, a lot of arrays don't do anything with these requests. We should be using barriers (and sending down the synch commands) when the device advertises a write back cache. For a simple hardware RAID card, it is not sufficient simply to flush the write cache on the HBA. It must also either disable the write cache on the downstream devices (S-ATA/SAS disks, etc) or send cache flush commands to each device. In those cases, this could be a quite lengthy process... >> http://markmail.org/message/ewicheafcvgwm4p7 >> >> I wrote this patch while having trouble with Infortrend Raids, but it also >> comes up with DDN storage if the write back cache is enabled. >> Shall I update the patch, add retries and then resend the entire series? >> > rq->timeout is the timeout of the request triggering the flush ... it's > likely the wrong value since it's for a fast completing r/w operation, > whereas this is a slow drain operation. > > James > >