From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 403ECC433E0 for ; Wed, 20 May 2020 20:44:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0CEF320823 for ; Wed, 20 May 2020 20:44:42 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="kl+Wps9y" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727973AbgETUol (ORCPT ); Wed, 20 May 2020 16:44:41 -0400 Received: from userp2120.oracle.com ([156.151.31.85]:36116 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726827AbgETUol (ORCPT ); Wed, 20 May 2020 16:44:41 -0400 Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 04KKg3Hf038808; Wed, 20 May 2020 20:44:36 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=date : from : to : cc : subject : message-id : references : mime-version : content-type : in-reply-to; s=corp-2020-01-29; bh=/POAZxKldOJGPEideVc9QbpLFBV5vkppruxdJXBuCFg=; b=kl+Wps9ycU3TUNcjznzpH5ZIyOtkE5S+WCBlDdvt1XB6FMK4GPGQ+qJeaOkZLL7/E8Uq w0pw6J3trL8P7+6b9wMMN26ZQe9k83o/XcggA2KtzrTFg78BeKWrYWhFu0Xx7VUyrVr9 6d8Qb9x63v1+RJMB5IoWqAqp8z475vc/c41jKmMGUIBMUJUr3W8jxPyzAunm+hLjaCLS LMAcjc/UQBx9KNod7JXHcpINzoVtrZEISMX17tw8xOkE0jDFHmtbrpvtrd63/TjbocNw p9Wer01IC/WVvY0NkdNhHTc/ShtLhONXVEe7hj9dI9QbKwYFGhqjzcpVIrhTrfux3eMv Mw== Received: from aserp3020.oracle.com (aserp3020.oracle.com [141.146.126.70]) by userp2120.oracle.com with ESMTP id 31501rbutu-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Wed, 20 May 2020 20:44:36 +0000 Received: from pps.filterd (aserp3020.oracle.com [127.0.0.1]) by aserp3020.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 04KKhCZ2138044; Wed, 20 May 2020 20:44:36 GMT Received: from aserv0122.oracle.com (aserv0122.oracle.com [141.146.126.236]) by aserp3020.oracle.com with ESMTP id 312t38rrxx-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 20 May 2020 20:44:35 +0000 Received: from abhmp0009.oracle.com (abhmp0009.oracle.com [141.146.116.15]) by aserv0122.oracle.com (8.14.4/8.14.4) with ESMTP id 04KKiZ54001481; Wed, 20 May 2020 20:44:35 GMT Received: from localhost (/67.169.218.210) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 20 May 2020 13:44:35 -0700 Date: Wed, 20 May 2020 13:44:34 -0700 From: "Darrick J. Wong" To: Dave Chinner Cc: linux-xfs@vger.kernel.org Subject: Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Message-ID: <20200520204434.GE17627@magnolia> References: <20200519222310.2576434-1-david@fromorbit.com> <20200519222310.2576434-2-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200519222310.2576434-2-david@fromorbit.com> X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9627 signatures=668686 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 bulkscore=0 spamscore=0 mlxlogscore=999 phishscore=0 mlxscore=0 malwarescore=0 suspectscore=5 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2004280000 definitions=main-2005200164 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9627 signatures=668686 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 lowpriorityscore=0 spamscore=0 mlxlogscore=999 clxscore=1015 priorityscore=1501 cotscore=-2147483648 impostorscore=0 bulkscore=0 adultscore=0 malwarescore=0 phishscore=0 mlxscore=0 suspectscore=5 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2004280000 definitions=main-2005200164 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Wed, May 20, 2020 at 08:23:09AM +1000, Dave Chinner wrote: > From: Dave Chinner > > Seeing massive cpu usage from xfs_agino_range() on one machine; > instruction level profiles look similar to another machine running > the same workload, only one machien is consuming 10x as much CPU as > the other and going much slower. The only real difference between > the two machines is core count per socket. Both are running > identical 16p/16GB virtual machine configurations > > Machine A: > > 25.83% [k] xfs_agino_range > 12.68% [k] __xfs_dir3_data_check > 6.95% [k] xfs_verify_ino > 6.78% [k] xfs_dir2_data_entry_tag_p > 3.56% [k] xfs_buf_find > 2.31% [k] xfs_verify_dir_ino > 2.02% [k] xfs_dabuf_map.constprop.0 > 1.65% [k] xfs_ag_block_count > > And takes around 13 minutes to remove 50 million inodes. > > Machine B: > > 13.90% [k] __pv_queued_spin_lock_slowpath > 3.76% [k] do_raw_spin_lock > 2.83% [k] xfs_dir3_leaf_check_int > 2.75% [k] xfs_agino_range > 2.51% [k] __raw_callee_save___pv_queued_spin_unlock > 2.18% [k] __xfs_dir3_data_check > 2.02% [k] xfs_log_commit_cil > > And takes around 5m30s to remove 50 million inodes. > > Suspect is cacheline contention on m_sectbb_log which is used in one > of the macros in xfs_agino_range. This is a read-only variable but > shares a cacheline with m_active_trans which is a global atomic that > gets bounced all around the machine. > > The workload is trying to run hundreds of thousands of transactions > per second and hence cacheline contention will be occuring on this > atomic counter. Hence xfs_agino_range() is likely just be an > innocent bystander as the cache coherency protocol fights over the > cacheline between CPU cores and sockets. > > On machine A, this rearrangement of the struct xfs_mount > results in the profile changing to: > > 9.77% [kernel] [k] xfs_agino_range > 6.27% [kernel] [k] __xfs_dir3_data_check > 5.31% [kernel] [k] __pv_queued_spin_lock_slowpath > 4.54% [kernel] [k] xfs_buf_find > 3.79% [kernel] [k] do_raw_spin_lock > 3.39% [kernel] [k] xfs_verify_ino > 2.73% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock > > Vastly less CPU usage in xfs_agino_range(), but still 3x the amount > of machine B and still runs substantially slower than it should. > > Current rm -rf of 50 million files: > > vanilla patched > machine A 13m20s 6m42s > machine B 5m30s 5m02s > > It's an improvement, hence indicating that separation and further > optimisation of read-only global filesystem data is worthwhile, but > it clearly isn't the underlying issue causing this specific > performance degradation. > > Signed-off-by: Dave Chinner Looks ok, Reviewed-by: Darrick J. Wong --D > --- > fs/xfs/xfs_mount.h | 148 +++++++++++++++++++++++++-------------------- > 1 file changed, 82 insertions(+), 66 deletions(-) > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index 4835581f3eb00..c1f92c1847bb2 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -55,61 +55,25 @@ struct xfs_error_cfg { > long retry_timeout; /* in jiffies, -1 = infinite */ > }; > > +/* > + * The struct xfsmount layout is optimised to separate read-mostly variables > + * from variables that are frequently modified. We put the read-mostly variables > + * first, then place all the other variables at the end. > + * > + * Typically, read-mostly variables are those that are set at mount time and > + * never changed again, or only change rarely as a result of things like sysfs > + * knobs being tweaked. > + */ > typedef struct xfs_mount { > + struct xfs_sb m_sb; /* copy of fs superblock */ > struct super_block *m_super; > - > - /* > - * Bitsets of per-fs metadata that have been checked and/or are sick. > - * Callers must hold m_sb_lock to access these two fields. > - */ > - uint8_t m_fs_checked; > - uint8_t m_fs_sick; > - /* > - * Bitsets of rt metadata that have been checked and/or are sick. > - * Callers must hold m_sb_lock to access this field. > - */ > - uint8_t m_rt_checked; > - uint8_t m_rt_sick; > - > struct xfs_ail *m_ail; /* fs active log item list */ > - > - struct xfs_sb m_sb; /* copy of fs superblock */ > - spinlock_t m_sb_lock; /* sb counter lock */ > - struct percpu_counter m_icount; /* allocated inodes counter */ > - struct percpu_counter m_ifree; /* free inodes counter */ > - struct percpu_counter m_fdblocks; /* free block counter */ > - /* > - * Count of data device blocks reserved for delayed allocations, > - * including indlen blocks. Does not include allocated CoW staging > - * extents or anything related to the rt device. > - */ > - struct percpu_counter m_delalloc_blks; > - > struct xfs_buf *m_sb_bp; /* buffer for superblock */ > char *m_rtname; /* realtime device name */ > char *m_logname; /* external log device name */ > - int m_bsize; /* fs logical block size */ > - xfs_agnumber_t m_agfrotor; /* last ag where space found */ > - xfs_agnumber_t m_agirotor; /* last ag dir inode alloced */ > - spinlock_t m_agirotor_lock;/* .. and lock protecting it */ > - xfs_agnumber_t m_maxagi; /* highest inode alloc group */ > - uint m_allocsize_log;/* min write size log bytes */ > - uint m_allocsize_blocks; /* min write size blocks */ > struct xfs_da_geometry *m_dir_geo; /* directory block geometry */ > struct xfs_da_geometry *m_attr_geo; /* attribute block geometry */ > struct xlog *m_log; /* log specific stuff */ > - struct xfs_ino_geometry m_ino_geo; /* inode geometry */ > - int m_logbufs; /* number of log buffers */ > - int m_logbsize; /* size of each log buffer */ > - uint m_rsumlevels; /* rt summary levels */ > - uint m_rsumsize; /* size of rt summary, bytes */ > - /* > - * Optional cache of rt summary level per bitmap block with the > - * invariant that m_rsum_cache[bbno] <= the minimum i for which > - * rsum[i][bbno] != 0. Reads and writes are serialized by the rsumip > - * inode lock. > - */ > - uint8_t *m_rsum_cache; > struct xfs_inode *m_rbmip; /* pointer to bitmap inode */ > struct xfs_inode *m_rsumip; /* pointer to summary inode */ > struct xfs_inode *m_rootip; /* pointer to root directory */ > @@ -117,9 +81,26 @@ typedef struct xfs_mount { > xfs_buftarg_t *m_ddev_targp; /* saves taking the address */ > xfs_buftarg_t *m_logdev_targp;/* ptr to log device */ > xfs_buftarg_t *m_rtdev_targp; /* ptr to rt device */ > + /* > + * Optional cache of rt summary level per bitmap block with the > + * invariant that m_rsum_cache[bbno] <= the minimum i for which > + * rsum[i][bbno] != 0. Reads and writes are serialized by the rsumip > + * inode lock. > + */ > + uint8_t *m_rsum_cache; > + struct xfs_mru_cache *m_filestream; /* per-mount filestream data */ > + struct workqueue_struct *m_buf_workqueue; > + struct workqueue_struct *m_unwritten_workqueue; > + struct workqueue_struct *m_cil_workqueue; > + struct workqueue_struct *m_reclaim_workqueue; > + struct workqueue_struct *m_eofblocks_workqueue; > + struct workqueue_struct *m_sync_workqueue; > + > + int m_bsize; /* fs logical block size */ > uint8_t m_blkbit_log; /* blocklog + NBBY */ > uint8_t m_blkbb_log; /* blocklog - BBSHIFT */ > uint8_t m_agno_log; /* log #ag's */ > + uint8_t m_sectbb_log; /* sectlog - BBSHIFT */ > uint m_blockmask; /* sb_blocksize-1 */ > uint m_blockwsize; /* sb_blocksize in words */ > uint m_blockwmask; /* blockwsize-1 */ > @@ -138,47 +119,83 @@ typedef struct xfs_mount { > xfs_extlen_t m_ag_prealloc_blocks; /* reserved ag blocks */ > uint m_alloc_set_aside; /* space we can't use */ > uint m_ag_max_usable; /* max space per AG */ > - struct radix_tree_root m_perag_tree; /* per-ag accounting info */ > - spinlock_t m_perag_lock; /* lock for m_perag_tree */ > - struct mutex m_growlock; /* growfs mutex */ > + int m_dalign; /* stripe unit */ > + int m_swidth; /* stripe width */ > + xfs_agnumber_t m_maxagi; /* highest inode alloc group */ > + uint m_allocsize_log;/* min write size log bytes */ > + uint m_allocsize_blocks; /* min write size blocks */ > + int m_logbufs; /* number of log buffers */ > + int m_logbsize; /* size of each log buffer */ > + uint m_rsumlevels; /* rt summary levels */ > + uint m_rsumsize; /* size of rt summary, bytes */ > int m_fixedfsid[2]; /* unchanged for life of FS */ > - uint64_t m_flags; /* global mount flags */ > - bool m_finobt_nores; /* no per-AG finobt resv. */ > uint m_qflags; /* quota status flags */ > + uint64_t m_flags; /* global mount flags */ > + int64_t m_low_space[XFS_LOWSP_MAX]; > + struct xfs_ino_geometry m_ino_geo; /* inode geometry */ > struct xfs_trans_resv m_resv; /* precomputed res values */ > + /* low free space thresholds */ > + bool m_always_cow; > + bool m_fail_unmount; > + bool m_finobt_nores; /* no per-AG finobt resv. */ > + bool m_update_sb; /* sb needs update in mount */ > + > + /* > + * Bitsets of per-fs metadata that have been checked and/or are sick. > + * Callers must hold m_sb_lock to access these two fields. > + */ > + uint8_t m_fs_checked; > + uint8_t m_fs_sick; > + /* > + * Bitsets of rt metadata that have been checked and/or are sick. > + * Callers must hold m_sb_lock to access this field. > + */ > + uint8_t m_rt_checked; > + uint8_t m_rt_sick; > + > + /* > + * End of read-mostly variables. Frequently written variables and locks > + * should be placed below this comment from now on. The first variable > + * here is marked as cacheline aligned so they it is separated from > + * the read-mostly variables. > + */ > + > + spinlock_t ____cacheline_aligned m_sb_lock; /* sb counter lock */ > + struct percpu_counter m_icount; /* allocated inodes counter */ > + struct percpu_counter m_ifree; /* free inodes counter */ > + struct percpu_counter m_fdblocks; /* free block counter */ > + /* > + * Count of data device blocks reserved for delayed allocations, > + * including indlen blocks. Does not include allocated CoW staging > + * extents or anything related to the rt device. > + */ > + struct percpu_counter m_delalloc_blks; > + > + struct radix_tree_root m_perag_tree; /* per-ag accounting info */ > + spinlock_t m_perag_lock; /* lock for m_perag_tree */ > uint64_t m_resblks; /* total reserved blocks */ > uint64_t m_resblks_avail;/* available reserved blocks */ > uint64_t m_resblks_save; /* reserved blks @ remount,ro */ > - int m_dalign; /* stripe unit */ > - int m_swidth; /* stripe width */ > - uint8_t m_sectbb_log; /* sectlog - BBSHIFT */ > atomic_t m_active_trans; /* number trans frozen */ > - struct xfs_mru_cache *m_filestream; /* per-mount filestream data */ > struct delayed_work m_reclaim_work; /* background inode reclaim */ > struct delayed_work m_eofblocks_work; /* background eof blocks > trimming */ > struct delayed_work m_cowblocks_work; /* background cow blocks > trimming */ > - bool m_update_sb; /* sb needs update in mount */ > - int64_t m_low_space[XFS_LOWSP_MAX]; > - /* low free space thresholds */ > struct xfs_kobj m_kobj; > struct xfs_kobj m_error_kobj; > struct xfs_kobj m_error_meta_kobj; > struct xfs_error_cfg m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX]; > struct xstats m_stats; /* per-fs stats */ > + xfs_agnumber_t m_agfrotor; /* last ag where space found */ > + xfs_agnumber_t m_agirotor; /* last ag dir inode alloced */ > + spinlock_t m_agirotor_lock;/* .. and lock protecting it */ > > /* > * Workqueue item so that we can coalesce multiple inode flush attempts > * into a single flush. > */ > struct work_struct m_flush_inodes_work; > - struct workqueue_struct *m_buf_workqueue; > - struct workqueue_struct *m_unwritten_workqueue; > - struct workqueue_struct *m_cil_workqueue; > - struct workqueue_struct *m_reclaim_workqueue; > - struct workqueue_struct *m_eofblocks_workqueue; > - struct workqueue_struct *m_sync_workqueue; > > /* > * Generation of the filesysyem layout. This is incremented by each > @@ -190,9 +207,8 @@ typedef struct xfs_mount { > * to various other kinds of pain inflicted on the pNFS server. > */ > uint32_t m_generation; > + struct mutex m_growlock; /* growfs mutex */ > > - bool m_always_cow; > - bool m_fail_unmount; > #ifdef DEBUG > /* > * Frequency with which errors are injected. Replaces xfs_etest; the > -- > 2.26.2.761.g0e0b3e54be >