linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Omar Sandoval <osandov@osandov.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] xfs: cache minimum realtime summary level
Date: Tue, 6 Nov 2018 09:20:37 -0800	[thread overview]
Message-ID: <20181106172037.GA26112@vader> (raw)
In-Reply-To: <20181105234948.GL19305@dastard>

On Tue, Nov 06, 2018 at 10:49:48AM +1100, Dave Chinner wrote:
> On Fri, Nov 02, 2018 at 12:38:00PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > The realtime summary is a two-dimensional array on disk, effectively:
> > 
> > u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
> > 
> > rsum[log][bbno] is the number of extents of size 2**log which start in
> > bitmap block bbno.
> > 
> > xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
> > rsum[log][bbno] != 0 for any log level. However, the summary array is
> > stored in row-major order (i.e., like an array in C), so all of these
> > entries are not adjacent, but rather spread across the entire summary
> > file. In the worst case (a full bitmap block), xfs_rtany_summary() has
> > to check every level.
> > 
> > This means that on a moderately-used realtime device, an allocation will
> > waste a lot of time finding, reading, and releasing buffers for the
> > realtime summary. In particular, one of our storage services (which runs
> > on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
> > spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
> > xfs_trans_brelse() called from xfs_rtany_summary().
> 
> Yup, the RT allocator is showing that it was never intended for
> general purpose data storage workloads... :P

Indeed. What they really want is the AG allocator with metadata on a
separate device, but that sounds like a much bigger project.

> So how much memory would it require to keep an in-memory copy of
> the summary information? i.e. do an in-memory copy search, then once
> the block is found, pull in the buffer that needs modifying and log
> it? That gets rid of the buffer management overhead, and potentially
> allows for more efficient search algorithms to be used.

Yeah, I considered that. We use 256kB realtime extents for the 8 TB
filesystem, so it's 100kB. If we were using 4kB realtime extents, it'd
be about 7MB. So, it's doable but not the best.

> > One solution would be to swap the dimensions of the summary array so
> > that different sizes on the same bitmap block are adjacent. However,
> > this would require a disk format change to a very old component of XFS,
> > and it would slow down xfs_rtallocate_extent_size().
> 
> Or maybe we should put a second summary on disk, ordered the other
> way similar to  how the btree allocator freespace indexes are
> organised.

That's another thing I considered, but I wanted to avoid a disk format
change if possible, and this small cache is pretty much just as good.

> > Instead, we can cache the minimum size which contains any extents. We do
> > so lazily; rather than guaranteeing that the cache contains the precise
> > minimum, it always contains a loose lower bound which we tighten when we
> > read or update a summary block. This only uses a few kilobytes of memory
> > and is already serialized via the realtime bitmap and summary inode
> > locks, so the cost is minimal. With this change, the same workload only
> > spends 0.2% of its CPU cycles in the realtime allocator.
> 
> Good win.
> 
> > @@ -1187,8 +1195,8 @@ xfs_rtmount_init(
> >  }
> >  
> >  /*
> > - * Get the bitmap and summary inodes into the mount structure
> > - * at mount time.
> > + * Get the bitmap and summary inodes and the summary cache into the mount
> > + * structure at mount time.
> >   */
> >  int					/* error */
> >  xfs_rtmount_inodes(
> > @@ -1211,6 +1219,16 @@ xfs_rtmount_inodes(
> >  		return error;
> >  	}
> >  	ASSERT(mp->m_rsumip != NULL);
> > +	/*
> > +	 * The rsum cache is initialized to all zeroes, which trivially
> > +	 * satisfies the invariant.
> 
> Satisfies what invariant? Please explain why this is OK and how it
> interacts with the allocation code so we're not left guessing what
> this means in the future.
> 
> Remember: comments are not for now - they are for 10 years time when
> no-one remembers what this code does or is actually used for.

The invariant I specified in the definition in xfs_mount_t, but I'll
clarify this.

> > +	 */
> > +	mp->m_rsum_cache = kvzalloc(sbp->sb_rbmblocks, GFP_KERNEL);
> 
> kmem_zalloc_large().

Will change.

> Which makes me ask, just how large is this allocation if you're
> using vmalloc() straight out of the box?

For our 256kB extent size, it's only ~1000 bytes. However, for the same
size filesystem with 4kB extents, it'd be ~60kB.

> > +	if (!mp->m_rsum_cache) {
> > +		xfs_irele(mp->m_rbmip);
> > +		xfs_irele(mp->m_rsumip);
> > +		return -ENOMEM;
> > +	}
> >  	return 0;
> >  }
> >  
> > @@ -1218,6 +1236,7 @@ void
> >  xfs_rtunmount_inodes(
> >  	struct xfs_mount	*mp)
> >  {
> > +	kvfree(mp->m_rsum_cache);
> 
> kmem_free().

Will change.

Thanks for the review!

  reply	other threads:[~2018-11-07  2:46 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-02 19:38 [PATCH] xfs: cache minimum realtime summary level Omar Sandoval
2018-11-05 23:49 ` Dave Chinner
2018-11-06 17:20   ` Omar Sandoval [this message]
2018-11-06 20:58     ` Dave Chinner
2018-11-11  3:59       ` Omar Sandoval

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181106172037.GA26112@vader \
    --to=osandov@osandov.com \
    --cc=david@fromorbit.com \
    --cc=kernel-team@fb.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).