From: James Bottomley
Subject: Summary of the Multi-Path BOF at OLS and future directions
Date: 04 Aug 2003 20:54:55 -0700
Message-ID: <1060042082.1985.53.camel@fuzzy>
List-Id: linux-scsi@vger.kernel.org
To: SCSI Mailing List

Hi All,

For those of you who couldn't attend OLS, I thought a short summary of what went on might be useful.

Multi-path was a hot topic throughout both the Kernel Summit and OLS. Things began with a requirements-input panel at which vendors identified multi-path as one of their primary problems, followed by an invited discussion with Lars Marowski-Brée and Mike Anderson on multi-path. At OLS there was a paper presentation by Mike and Patrick Mansfield on the IBM SCSI layer multi-pathing solution, and finally there was the BOF session, which tried to pick a way forward for us in 2.6/2.7.

What I'd like to summarise are the conclusions I think we reached:

1. Multi-path is relevant to more layers of the I/O stack than just SCSI. Thus, it makes sense to do it at the layer just above bio. This would be either md/multipath or the Device Mapper multi-path module.

2. Doing multi-path at that level is not easy without fast failure indications.

2a.
On discussion of this, it was decided that on each bio/request the upper layers would like to indicate which failures they wish to see fast and which they wish not to know about. The two principal ones were transport errors (relevant to multi-path) and medium errors (relevant to software RAID).

2b. Upwards, on fast failure, we would send back the raw sense data (probably encoded in the sense request) plus a translated indication of what the problem was. The translation would probably be a combination of (fatal|retryable) and (driver error (card out of resources/failure)|transport error|medium error).

3. It was noted that symmetric active multi-path in this scheme is not possible without the ability to place a proper elevator above the multi-pathing driver (and have a simple queue-only noop elevator below). This should help alleviate the current fragmentation issues, where symmetric active multi-path produces I/O in decidedly non-optimal page-sized chunks.

4. Configuration of this solution would be extremely important. The idea here is to rely on the udev solution currently making its way into the kernel and essentially have vendor-specific multi-path configuration as a udev plug-in.

5. Vendor value-add for specific devices could be encoded both as configuration (udev) pieces and as plug-ins to the upper layer multi-path driver, to activate any proprietary vendor-specific configuration options that may be needed for specific solutions.

6. Ownership. This wasn't exactly discussed, but in light of the problems with even SCSI-3 reservations, it is becoming clear that storage ownership in a multi-path configuration is getting impossible to maintain from user level. Therefore I, at least, will be giving thought to an ownership API that could be used to manage storage ownership from the kernel in the face of path failovers.

As far as the beginnings of implementation go, we already have md/multipath.
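To make points 2a/2b a bit more concrete, here is a small userspace model of what the up/down indications might look like. This is purely an illustrative sketch, not kernel code: all the names (fail_class, bio_model, fail_fast(), etc.) are invented for this example, and the real per-bio flags are still to be designed.

```c
#include <assert.h>

/* Hypothetical failure classes from point 2b (names invented) */
enum fail_class {
	FAIL_DRIVER    = 1 << 0,  /* card out of resources / failure */
	FAIL_TRANSPORT = 1 << 1,  /* relevant to multi-path */
	FAIL_MEDIUM    = 1 << 2,  /* relevant to software RAID */
};

/* Per-bio mask: which classes the submitter wants failed fast
 * (point 2a) rather than retried silently below. */
struct bio_model {
	unsigned int fast_fail_mask;
};

/* Completion status passed upwards (point 2b): a translated
 * indication plus the raw sense data. */
struct fail_status {
	int retryable;              /* 1 = retryable, 0 = fatal */
	enum fail_class klass;
	const unsigned char *sense; /* raw sense, probably encoded
	                             * in the sense request */
};

/* Should the lower layer fail this bio straight back up instead of
 * retrying internally?  Fatal errors always come straight back;
 * retryable ones only if the submitter asked for that class. */
static int fail_fast(const struct bio_model *bio,
		     const struct fail_status *st)
{
	if (!st->retryable)
		return 1;
	return (bio->fast_fail_mask & st->klass) != 0;
}
```

Under this model a multi-path driver would set FAIL_TRANSPORT in its mask while md/raid would set FAIL_MEDIUM, which is exactly the split identified in 2a.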
Joe Thornber of Sistina will shortly be releasing the code to do multi-path over the device mapper interface, and our trusty block layer maintainer, Jens Axboe, has done the skeleton of a fast-fail infrastructure for us (in 2.6.0-test2). The attached patch should add the fast-fail capability to SCSI (although without the upwards/downwards failure indications), and we should be able to build the rest of the infrastructure on this framework.

As far as errors and omissions go, I found KS/OLS to go rather fast and be a bit blurry, so hopefully those who were also present can chime in on this thread to amplify/correct the points I actually managed to grasp and summarise the ones I missed.

Thanks,

James

[Attachment: tmp.diff]

===== scsi_error.c 1.60 vs edited =====
--- 1.60/drivers/scsi/scsi_error.c	Thu Jul 31 07:32:18 2003
+++ edited/scsi_error.c	Mon Aug  4 14:20:24 2003
@@ -1285,7 +1285,12 @@
 
 maybe_retry:
 
-	if ((++scmd->retries) < scmd->allowed) {
+	/* we requeue for retry because the error was retryable, and
+	 * the request was not marked fast fail.  Note that above,
+	 * even if the request is marked fast fail, we still requeue
+	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
+	if ((++scmd->retries) < scmd->allowed
+	    && !blk_noretry_request(scmd->request)) {
 		return NEEDS_RETRY;
 	} else {
 		/*
===== scsi_lib.c 1.108 vs edited =====
--- 1.108/drivers/scsi/scsi_lib.c	Sat Aug  2 10:18:20 2003
+++ edited/scsi_lib.c	Mon Aug  4 14:26:46 2003
@@ -497,6 +497,13 @@
 	struct request *req = cmd->request;
 	unsigned long flags;
 
+	/* If failfast is enabled, override the number of completed
+	 * sectors to make sure the entire request is finished right
+	 * now */
+	if(blk_noretry_request(req)) {
+		sectors = req->hard_nr_sectors;
+	}
+
 	/*
 	 * If there are blocks left over at the end, set up the command
 	 * to queue the remainder of them.
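For anyone skimming rather than applying the patch, the effect of its two hunks can be modelled in a few lines of plain C. This is a userspace sketch, not the kernel code: blk_noretry_request() is modelled as a simple flag test, and REQ_FAILFAST, request_model, needs_retry() and sectors_to_complete() are names invented here for illustration.

```c
#include <assert.h>

#define REQ_FAILFAST (1 << 0)   /* modelled request flag */

struct request_model {
	unsigned int flags;
	unsigned long hard_nr_sectors; /* total sectors in request */
};

/* Models blk_noretry_request(): is this request marked fast-fail? */
static int noretry(const struct request_model *req)
{
	return (req->flags & REQ_FAILFAST) != 0;
}

/* First hunk: the error handler retries only while attempts remain
 * AND the request is not marked fast-fail. */
static int needs_retry(int *retries, int allowed,
		       const struct request_model *req)
{
	return (++(*retries) < allowed) && !noretry(req);
}

/* Second hunk: on fast-fail, override the completed-sector count so
 * the entire request finishes now and no remainder is requeued. */
static unsigned long sectors_to_complete(const struct request_model *req,
					 unsigned long sectors)
{
	if (noretry(req))
		return req->hard_nr_sectors;
	return sectors;
}
```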