From mboxrd@z Thu Jan  1 00:00:00 1970
From: Philip Pokorny
Subject: Re: SRPT and SCST
Date: Thu, 05 Nov 2009 03:51:13 -0500
Message-ID: <4AF29201.6000606@penguincomputing.com>
References: <3142CEFB1403044F9954E2DF6C85660FBB34BD@orca.penguincomputing.com>
 <3142CEFB1403044F9954E2DF6C85660FBB34BF@orca.penguincomputing.com>
 <654FA770A883FB43BAF3CB0B1E1DAC8C01C8C4DD@orca.penguincomputing.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <654FA770A883FB43BAF3CB0B1E1DAC8C01C8C4DD-/U8SqUwOx9/OOpeOfUw7maQk6oIRg43YAL8bYrjMMd8@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Philip Pokorny, scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: Arend Dittmer
List-Id: linux-rdma@vger.kernel.org

Chris Worley asked that we post this to the scst and linux-rdma lists
for discussion.

We're trying to get IB SRPT working and can't seem to get a stable
configuration using any of the various SCST, IB_SRPT, and kernel/distro
versions out there.  In most cases we're able to crash the connection,
and typically the target, within minutes of pounding by 4 initiators
doing "mkfs.ext3", "tar xf" and "fsck" against the SRP block device.

Our target is a Penguin Computing Altus 2704 with a disk expansion
chassis.  That's a 4-socket AMD hex-core (24 cores total) with 128GB of
memory and 24 1TB drives attached to two LSI 1068 SAS controllers
(aka 3801E).  The drives are configured as 12 RAID-1 mirrors with
3-wide LVM stripes over those mirrors.  There are an additional 6 SSDs
in the server in a "fast" VG, also RAID-1 mirrored and LVM striped.
Read-ahead is disabled on the LVM volumes.

LVM volumes are exported via SCST as FILEIO block devices to the
initiators.  50 groups are defined with two LVM volumes/block devices
per group.  One initiator per group.
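[For concreteness, a layout like the one described above could be
reproduced roughly as follows.  This is an illustrative sketch only:
the device names, sizes, VG/LV names, group name, and GUID are made up,
and the SCST 1.x procfs paths are quoted from memory of that era's
README rather than from this exact setup.]

```shell
# One of the 12 RAID-1 pairs (repeat for md1..md11):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

# Volume group over three of the mirrors; each LV striped 3 wide:
pvcreate /dev/md0 /dev/md1 /dev/md2
vgcreate vg_data /dev/md0 /dev/md1 /dev/md2
lvcreate -n lun0 -i 3 -L 100G vg_data   # -i 3 => 3-wide stripe

# Disable read-ahead on the LV:
blockdev --setra 0 /dev/vg_data/lun0

# Export through SCST's vdisk handler and assign it to a
# per-initiator group (SCST 1.x proc interface; names hypothetical):
echo "open lun0 /dev/vg_data/lun0" > /proc/scsi_tgt/vdisk/vdisk
echo "add_group Host1" > /proc/scsi_tgt/scsi_tgt
echo "add lun0 0" > /proc/scsi_tgt/groups/Host1/devices
echo "add 0x0002c90200000000" > /proc/scsi_tgt/groups/Host1/names
```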
(NODE GUID added to "names" in the group.)

With only 4 initiators, almost 100% of I/O is to RAM and no disk I/O is
seen on the target.  Performance (when it's working) is generally good
at 800MB/sec aggregate, but we'd like to see better; it appeared we
were getting 1.3GB/s at one point.

On Wed, Nov 4, 2009 at 5:34 PM, Philip Pokorny wrote:
> We got a serial console attached and ran a test using the SCST and IB_SRPT
> versions that you recommended (Arend set it up, so I'll defer to him on the
> exact SVN checkout that he used).
>
>> What sort of crashes are you seeing?  I also have a customer
>> experiencing a crash, but I can't get details out of them.
>
> The client gets SCSI I/O errors and aborts the filesystem (putting it in
> read-only mode).
>
> After about 400 seconds of testing, the server side logs the following:
>
> [ 8418.697830] <6>[12426]: scst_check_sense:2444:Clearing dbl_ua_possible
> flag (dev ffff811816136000, cmd ffff81081017c1c8)
> [ 8418.697836] <6>[12426]: scst_dec_on_dev_cmd:577:cmd ffff81081017c1c8 (tag
> 17): unblocking dev ffff811816136000
> [ 8418.697843] <6>[0]: scst_unblock_dev:4653:Device UNBLOCK(new 0), dev
> ffff811816136000
> [ 8864.258468] ib_mthca 0000:81:00.0: SQ 000405 full (999320 head, 997272
> tail, 2048 max, 0 nreq)
> [ 8864.294450] ***ERROR***: srpt_xfer_data[2374] ret=-12
> [ 8864.326702] <6>[0]: scst_queue_retry_cmd:1099:TGT QUEUE FULL:
> incrementing retry_cmds 1
> [ 8864.326709] <6>[0]: scst_queue_retry_cmd:1106:Some command(s) finished,
> direct retry (finished_cmds=2023031, tgt->finished_cmds=2023137,
> retry_cmds=0)
> [ 8878.447081] ib_mthca 0000:81:00.0: SQ 000406 full (1080498 head, 1078450
> tail, 2048 max, 0 nreq)
> [ 8878.484452] ***ERROR***: srpt_xfer_data[2374] ret=-12
> [ 8878.517595] <6>[0]: scst_queue_retry_cmd:1099:TGT QUEUE FULL:
> incrementing retry_cmds 1
> [ 8878.517608] <6>[0]: scst_queue_retry_cmd:1106:Some command(s) finished,
> direct retry (finished_cmds=2256307, tgt->finished_cmds=2256504,
> retry_cmds=0)
> [ 8882.694684] ib_mthca 0000:81:00.0: SQ 000404 full (1087484 head, 1085436
> tail, 2048 max, 0 nreq)
> [ 8882.732542] ***ERROR***: srpt_xfer_data[2374] ret=-12
> [ 8882.766396] <6>[0]: scst_queue_retry_cmd:1099:TGT QUEUE FULL:
> incrementing retry_cmds 1
> [ 8882.766403] <6>[0]: scst_queue_retry_cmd:1106:Some command(s) finished,
> direct retry (finished_cmds=2310445, tgt->finished_cmds=2310539,
> retry_cmds=0)
> [ 8891.650890] ib_mthca 0000:81:00.0: SQ 000407 full (1155377 head, 1153329
> tail, 2048 max, 0 nreq)
> [ 8891.689016] ***ERROR***: srpt_xfer_data[2374] ret=-12
> [ 8891.723548] <6>[0]: scst_queue_retry_cmd:1099:TGT QUEUE FULL:
> incrementing retry_cmds 1
> [ 8891.723556] <6>[0]: scst_queue_retry_cmd:1106:Some command(s) finished,
> direct retry (finished_cmds=2381910, tgt->finished_cmds=2382001,
> retry_cmds=0)
> [ 8891.723573] ib_mthca 0000:81:00.0: too many gathers
> [ 8891.758000] ***ERROR***: srpt_xfer_data[2374] ret=-22
> [ 8891.792888] <6>[0]: scst: scst_rdy_to_xfer:985:***ERROR***: Target driver
> ib_srpt rdy_to_xfer() returned fatal error
>
> I hope that helps.
>
> I've seen that same "rdy_to_xfer() returned fatal error" several times in
> different configurations.  The screen shots we sent earlier had the same
> "ib_mthca ... SQ ... full (xx head..." message at the start, so that seems
> to be related as well.
>
> Thanks for the help,
> Phil P.
>
> --
> Philip Pokorny, RHCE
> Chief Hardware Architect - Penguin Computing
> Voice: 415-370-0835  Toll free: 888-PENGUIN
> www.penguincomputing.com
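[One reading of the errors above, as an editorial note rather than
anything confirmed against the driver source: ret=-12 is -ENOMEM and
ret=-22 is -EINVAL.  In every "SQ full" line, head minus tail is
exactly 2048, matching "2048 max", so the send queue really is at its
configured depth; "too many gathers" presumably means a work request
carried more scatter/gather entries than the QP was created with.  The
HCA's advertised limits can be checked with ibv_devinfo from
libibverbs:]

```shell
# srpt_xfer_data ret=-12 is -ENOMEM (send queue full: in each "SQ full"
# line above, head - tail == 2048 == max); ret=-22 is -EINVAL, and
# mthca logs "too many gathers" when a work request has more s/g
# entries than the QP's send queue supports.  Inspect the HCA limits:
ibv_devinfo -v | egrep 'max_qp_wr|max_sge'
```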