From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 15 Dec 2022 11:19:44 +1100
From: Dave Chinner <david@fromorbit.com>
To: Suyash Mahar
Cc: "Darrick J. Wong", linux-xfs@vger.kernel.org, tpkelly@eecs.umich.edu,
	Suyash Mahar
Wong" , linux-xfs@vger.kernel.org, tpkelly@eecs.umich.edu, Suyash Mahar Subject: Re: XFS reflink overhead, ioctl(FICLONE) Message-ID: <20221215001944.GC1971568@dread.disaster.area> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Tue, Dec 13, 2022 at 08:47:03PM -0800, Suyash Mahar wrote: > Hi Darrick, > > Thank you for the response. I have replied inline. > > -Suyash > > Le mar. 13 déc. 2022 à 09:18, Darrick J. Wong a écrit : > > > > [ugh, your email never made it to the list. I bet the email security > > standards have been tightened again. > silent failures here>] :( > > > > On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote: > > > Hi all! > > > > > > While using XFS's ioctl(FICLONE), we found that XFS seems to have > > > poor performance (ioctl takes milliseconds for sparse files) and the > > > overhead > > > increases with every call. > > > > > > For the demo, we are using an Optane DC-PMM configured as a > > > block device (fsdax) and running XFS (Linux v5.18.13). > > > > How are you using fsdax and reflink on a 5.18 kernel? That combination > > of features wasn't supported until 6.0, and the data corruption problems > > won't get fixed until a pull request that's about to happen for 6.2. > > We did not enable the dax option. The optane DIMMs are configured to > appear as a block device. > > $ mount | grep xfs > /dev/pmem0p4 on /mnt/pmem0p4 type xfs > (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota) > > Regardless of the block device (the plot includes results for optane > and RamFS), it seems like the ioctl(FICLONE) call is slow. Please define "slow" - is it actually slower than it should be (i.e. a bug) or does it simply not perform according to your expectations? A few things that you can quantify to answer these questions. 1. What is the actual rate it is cloning extents at? i.e. extent count / clone time? Is this rate consistent/sustained, or is it dropping substantially over time and/or increase in extent count? 3. How does clone speed of a given file compare to the actual data copy speed of that file (please include fsync time in the data copy results)? Is cloning faster or slower than copying the data? What is the extent count of the file at the cross-over point where cloning goes from being faster to slower than copying the data? 3. How does it compare with btrfs running the same write/clone workload? Does btrfs run faster? Does it perform better with high extent counts than XFS? What about with high sharing counts (e.g. after 500 or 1000 clones of the source file)? Basically, I'm trying to understand what "slow" means in teh context of the operations you are performing. I haven't seen any recent performance regressions in clone speed on XFS, so I'm trying to understand what you are seeing and why you think it is slower than it should be. > > > We create a 1 GiB dense file, then repeatedly modify a tiny random > > > fraction of it and make a clone via ioctl(FICLONE). > > > > Yay, random cow writes, that will slowly increase the number of space > > mapping records in the file metadata. Yup, the scripts I use do exactly this - 10,000 random 4kB writes to a 8GB file between reflink clones. I then iterate a few thousand times and measure the reflink time. 
> > > The time required for the ioctl() calls increases from large to insane
> > > over the course of ~250 iterations: from roughly a millisecond for the
> > > first iteration or two (which seems high, given that this is on
> > > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20
> > > milliseconds (which seems crazy).
> >
> > Does the system call runtime increase with O(number_extents)? You might
> > record the number of extents in the file you're cloning by running this
> > periodically:
> >
> > xfs_io -c stat $path | grep fsxattr.nextents
>
> The extent count does increase linearly (just like the ioctl() call latency).

As expected. Changing the sharing state of a single extent has a roughly
constant overhead regardless of the number of extents in the file. Hence
clone time should scale linearly with the number of extents that need to
have their shared state modified.

> I used the xfs_bmap tool, let me know if this is not the right way. If
> it is not, I'll update the microbenchmark to run xfs_io.

xfs_bmap is the slow way - it has to iterate every extent and format
them all out to userspace. The mechanism above just does a single
syscall to query the count of extents from the inode. Using the fsxattr
extent count query is much faster, especially when you have files with
tens of millions of extents in them....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
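
P.S. If you'd rather pull the extent count from inside the
microbenchmark than shell out to xfs_io, the fsxattr query is a single
ioctl. A rough, untested sketch (FS_IOC_FSGETXATTR and struct fsxattr
as defined in linux/fs.h):

#include <fcntl.h>
#include <linux/fs.h>		/* FS_IOC_FSGETXATTR, struct fsxattr */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct fsxattr fsx;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* single syscall - no extent iteration, unlike xfs_bmap */
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}
	printf("fsxattr.nextents = %u\n", fsx.fsx_nextents);
	close(fd);
	return 0;
}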