From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CF686C001DE for ; Fri, 4 Aug 2023 08:07:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232246AbjHDIHK (ORCPT ); Fri, 4 Aug 2023 04:07:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50336 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230481AbjHDIHJ (ORCPT ); Fri, 4 Aug 2023 04:07:09 -0400 Received: from mail-pf1-x42c.google.com (mail-pf1-x42c.google.com [IPv6:2607:f8b0:4864:20::42c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4CBE44684 for ; Fri, 4 Aug 2023 01:07:08 -0700 (PDT) Received: by mail-pf1-x42c.google.com with SMTP id d2e1a72fcca58-686c06b806cso1298506b3a.2 for ; Fri, 04 Aug 2023 01:07:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1691136428; x=1691741228; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=QV8xiLzpyRdogmF9Xy586z9SW9lMLBPYfXECh6abF2A=; b=HriQsqUmhwrVfoFj1dguYvQFMdfc/Coj113W77etdiPh+3/+6YozADM38/2AzwpxlT wFNFw1j/FIujvTINGTFneYG2xtzPH9Sy9C6e3cUflcTnK0pGaGd205OTGNWJG7t4fRPh XoW16i+B0y/6TCr3wp1TWzC4JIVfL6NIqoR4ZsofD3hFWKkBZ7LutuXiL1r8lKU5R4mb sLkvUNXXi3hAVv8KnQZ+NnFU3718r6wD2yJe1/bVgs1UNy9kLvwcdLAjiAskSwkjHrKe fItQI5wi00jEJyuLmLjcrT98h/LlwJx+UPBIpMw94W2Kmj1mlhg5bgeNbOmd0yIOZcT6 e+qg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691136428; x=1691741228; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=QV8xiLzpyRdogmF9Xy586z9SW9lMLBPYfXECh6abF2A=; b=YGi8K0B5GwIrl7lFDZjKwRypBJ8z5CGM9BmQx54KNPqQXkJXMxzpwYmkBMLpnW2jGP V64iydHc+yHb5MAgewNYZJmNc3DK+srd3HEBXGSIgY5ngwgBNuPU5zeMM33EpjrfZoZP vYPBevngn+i26xwZuzs1QM6Zu5FIb3PUz05o/PhNRFR66osf5GZacPEN+oNjCmLdWRlP A7Pi7shzvc5fRa66GPHqJiKrRmhOTuwvyAIL4ljNWOXRMpY8w2R37ZePY0nRqEhDliEm 81lr3jsSkOS6iGBD4GcsEoQFOi5IgvHzQNBLB0OGSImmkOU3+L8LJdH+I1FQCPsvIAIH AtsQ== X-Gm-Message-State: AOJu0YwFE9eFmTw2SAvNxTBUr+Gk2wxLfp/LuKG3R5T5zuP7hU8R6HGK OQUG+Q/fbXRxyjjOoqZc7eN0uw== X-Google-Smtp-Source: AGHT+IHZid3ZRzcpfrNJYMglE6GRQYrSTBUn/9yoLr0M67bxKXiocAkJLBHSTvZ1y2K3xRyQb6WJ2g== X-Received: by 2002:a05:6a00:2495:b0:66c:9e97:aece with SMTP id c21-20020a056a00249500b0066c9e97aecemr900710pfv.10.1691136427646; Fri, 04 Aug 2023 01:07:07 -0700 (PDT) Received: from dread.disaster.area (pa49-180-166-213.pa.nsw.optusnet.com.au. [49.180.166.213]) by smtp.gmail.com with ESMTPSA id e8-20020aa78248000000b00682a0184742sm1055376pfn.148.2023.08.04.01.07.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 04 Aug 2023 01:07:07 -0700 (PDT) Received: from dave by dread.disaster.area with local (Exim 4.96) (envelope-from ) id 1qRppv-000tCq-2G; Fri, 04 Aug 2023 18:07:03 +1000 Date: Fri, 4 Aug 2023 18:07:03 +1000 From: Dave Chinner To: Corey Hickey Cc: linux-xfs@vger.kernel.org Subject: Re: read-modify-write occurring for direct I/O on RAID-5 Message-ID: References: <55225218-b866-d3db-d62b-7c075dd712de@fatooh.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <55225218-b866-d3db-d62b-7c075dd712de@fatooh.org> Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Thu, Aug 03, 2023 at 10:44:31PM -0700, Corey Hickey wrote: > Hello, > > I am having a problem with write performance via direct I/O. My setup is: > * Debian Sid > * Linux 6.3.0-2 (Debian Kernel) > * 3-disk MD RAID-5 of hard disks > * XFS > > When I do large sequential writes via direct I/O, sometimes the writes are > fast, but sometimes the RAID ends up doing RMW and performance gets slow. > > If I use regular buffered I/O, then performance is better, presumably due to > the MD stripe cache. I could just use buffered writes, of course, but I am > really trying to make sure I get the alignment correct to start with. > > > I can reproduce the problem on a fresh RAID. > ----------------------------------------------------------------------- > $ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi] > mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1% > Continue creating array? y > mdadm: Defaulting to version 1.2 metadata > mdadm: array /dev/md10 started. > ----------------------------------------------------------------------- > For testing, I'm using "-z 30G" to limit the duration of the initial RAID > resync. > > > For XFS I can use default options: > ----------------------------------------------------------------------- > $ sudo mkfs.xfs /dev/md10 > log stripe unit (524288 bytes) is too large (maximum is 256KiB) > log stripe unit adjusted to 32KiB > meta-data=/dev/md10 isize=512 agcount=16, agsize=983040 blks So an AG size of just under 2GB. > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=1, rmapbt=0 > = reflink=1 bigtime=1 inobtcount=1 > nrext64=0 > data = bsize=4096 blocks=15728640, imaxpct=25 > = sunit=128 swidth=68352 blks ^^^^^^^^^^^^^^^^^ Something is badly broken in MD land. ..... > The default chunk size is 512K > ----------------------------------------------------------------------- > $ sudo mdadm --detail /dev/md10 | grep Chunk > Chunk Size : 512K > $ sudo blkid -i /dev/md10 > /dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792" ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Yup, that's definitely broken. > PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512" > ----------------------------------------------------------------------- > I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I would > expect this to be 1024K (due to two data disks in a three-disk RAID-5). Yup, it's broken. :/ > Translating into 512-byte sectors, I think the topology should be: > chunk size (sunit): 1024 sectors > stripe size (swidth): 2048 sectors Yup, or as it reports from mkfs, sunit=128 fsbs, swidth=256 fsbs. > ----------------------------------------------------------------------- > $ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q ' > 9,10 15 1 0.000000000 186548 Q WS 3829760 + 2048 [dd] > 9,10 15 3 0.021087119 186548 Q WS 3831808 + 2048 [dd] > 9,10 15 5 0.023605705 186548 Q WS 3833856 + 2048 [dd] > 9,10 15 7 0.026093572 186548 Q WS 3835904 + 2048 [dd] > 9,10 15 9 0.028595887 186548 Q WS 3837952 + 2048 [dd] > 9,10 15 11 0.031171221 186548 Q WS 3840000 + 2048 [dd] > [...] > 9,10 5 441 14.601942400 186608 Q WS 8082432 + 2048 [dd] > 9,10 5 443 14.620316654 186608 Q WS 8084480 + 2048 [dd] > 9,10 5 445 14.646707430 186608 Q WS 8086528 + 2048 [dd] > 9,10 5 447 14.654519976 186608 Q WS 8088576 + 2048 [dd] > 9,10 5 449 14.680901605 186608 Q WS 8090624 + 2048 [dd] > 9,10 5 451 14.689156421 186608 Q WS 8092672 + 2048 [dd] > 9,10 5 453 14.706529362 186608 Q WS 8094720 + 2048 [dd] > 9,10 5 455 14.732451407 186608 Q WS 8096768 + 2048 [dd] > ----------------------------------------------------------------------- > In the beginning, writes queued are stripe-aligned. For example: > 3829760 / 2048 == 1870 > > Later on, writes end up getting misaligned by half a stripe. For example: > 8082432 / 2048 == 3946.5 So it's aligned to sunit, not swidth. That will match up with a discontiguity in the file layout. i.e. an extent boundary. And given this is at just under 4GB written, and the AG size is just under 2GB, this discontiguity is going to occur as writing fills AG 1 and allocation switches to AG 2. > I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, but > that had pretty much the same behavior when writing (the RMW starts later, > but it still starts). It won't change anything, actually. The first allocation in an AG will determine which stripe unit the new extent starts on, and then for the entire AG the write will be aligned to that choice. If you do IOs much larger than the stripe width (e.g. 16MB at a time) the impact of the head/tail RMW will largely go away. The problem is that you are doing exactly stripe width sized IOs and so is the worse case for any allocation misalignment that might occur. > Am I doing something wrong, or is there a bug, or are my expectations > incorrect? I had expected that large sequential writes would be aligned with > swidth. Expectations are wrong. Large allocations are aligned to stripe unit in XFS by default. THis is because XFS was tuned for *large* multi-layer RAID setups like RAID-50 that had hardware RAID 5 luns stripe together via RAID-0 in the volume manager. In these setups, the stripe unit is the hardware RAID-5 lun stripe width (the minimum size that avoids RMW) and the stripe width is the RAID-0 width. Hence for performance, it didn't matter which sunit allocation aligned to as long as writes spanned the entire stripe width. That way they would hit every lun. In general, we don't want stripe width aligned allocation, because that hot-spots the first stripe unit in the stripe as all file data first writes to that unit. A raid stripe is only as fast as it's slowest disk, and so having a hot stripe unit slows everything down. Hence by default we move the initial allocation around the stripe units, and that largely removes the hotspots in the RAID luns... So, yeah, there are good reasons for stripe unit aligned allocation rather than stripe width aligned. The problem is that MD has never behaved this way - it has always exposed it's individual disk chunk size as the minimum IO size (i.e. the stripe unit) and the stripe width as the optimal IO size to avoid RMW cycles. If you want to force XFS to do stripe width aligned allocation for large files to match with how MD exposes it's topology to filesytsems, use the 'swalloc' mount option. The down side is that you'll hotspot the first disk in the MD array.... -Dave. -- Dave Chinner david@fromorbit.com