linux-fsdevel.vger.kernel.org archive mirror
From: Jeff Moyer <jmoyer@redhat.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	hch@infradead.org, Andi Kleen <ak@linux.intel.com>
Subject: Re: [PATCH 11/11] DIO: optimize cache misses in the submission path
Date: Mon, 08 Aug 2011 14:43:38 -0400
Message-ID: <x49zkjjg2w5.fsf@segfault.boston.devel.redhat.com>
In-Reply-To: <1312259893-4548-12-git-send-email-andi@firstfloor.org> (Andi Kleen's message of "Mon, 1 Aug 2011 21:38:13 -0700")

Andi Kleen <andi@firstfloor.org> writes:

> From: Andi Kleen <ak@linux.intel.com>
>
> Some investigation of a transaction processing workload showed that
> a major consumer of cycles in __blockdev_direct_IO is the cache miss
> while accessing the block size. This is because it has to walk
> the chain from block_dev to gendisk to queue.
>
> The block size is needed early on to check alignment and sizes.
> It's only done if the check for the inode block size fails.
> But the costly block device state is unconditionally fetched.
>
> - Reorganize the code to only fetch block dev state when actually
> needed.
>
> Then do a prefetch on the block dev early on in the direct IO
> path. This is worth it, because there is substantial code run
> before we actually touch the block dev now.
>
> - I also added some unlikelies to make it clear to the compiler
> that the block device fetch code is not normally executed.
>
> This gave a small, but measurable improvement on a large database
> benchmark (about 0.3%)
>
> BTW the check code looks somewhat dubious to me: why is the blk size
> only checked when the inode size check fails? Can someone explain
> the difference between all these different block sizes? Are they
> cheaper by the dozen?

There are two block sizes: the block size of the file system (typically
PAGE_SIZE, i.e. i_blkbits == PAGE_SHIFT), and the logical block size of
the underlying storage.  The dio blkfactor records, as a shift count, the
number of dio blocks in a single fs block.  Alignment to the fs block
means that you don't have to do any sub-block zeroing.  It also means you
don't have to do as much math in converting between dio blocks and fs
blocks (big deal, right?).
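
To make the two-level check concrete, here is a minimal userspace sketch
of the fallback, not the kernel code itself.  The helper names
dio_choose_blkbits() and dio_blkfactor() are made up for illustration,
only the file offset is checked (the real path also checks the user
buffer addresses and lengths), and it assumes a block device is always
present:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the alignment fallback.
 * fs_blkbits   - log2 of the fs block size (inode->i_blkbits in the patch)
 * bdev_blkbits - log2 of the device's logical block size
 * Returns the block bits dio would work in, or -1 where the real code
 * would fail with -EINVAL.
 */
static int dio_choose_blkbits(uint64_t offset, unsigned fs_blkbits,
			      unsigned bdev_blkbits)
{
	assert(bdev_blkbits <= fs_blkbits && fs_blkbits < 32);

	/* Common case: aligned to the fs block, device state not needed. */
	if ((offset & ((UINT64_C(1) << fs_blkbits) - 1)) == 0)
		return (int)fs_blkbits;

	/* Fallback: retry with the (smaller) logical block size. */
	if ((offset & ((UINT64_C(1) << bdev_blkbits) - 1)) == 0)
		return (int)bdev_blkbits;

	return -1;
}

/* blkfactor is a shift: log2 of the number of dio blocks per fs block. */
static unsigned dio_blkfactor(unsigned fs_blkbits, unsigned dio_blkbits)
{
	return fs_blkbits - dio_blkbits;
}
```

With a 4096-byte fs block and 512-byte sectors, an offset of 4608 fails
the first check, passes the second, and yields a blkfactor of 3, i.e.
eight dio blocks per fs block.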

I bet we could default to using the smaller block size all the time, and
still be able to detect when we don't have to do the sub-block zeroing.
Maybe that would be a good follow-on patch.

> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  fs/direct-io.c |   47 +++++++++++++++++++++++++++++++++++++----------
>  1 files changed, 37 insertions(+), 10 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 03bcc6f..c424b88 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1086,8 +1086,8 @@ static inline int drop_refcount(struct dio *dio)
>   * individual fields and will generate much worse code. 
>   * This is important for the whole file.
>   */
> -ssize_t
> -__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> +static inline ssize_t
> +do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
>  	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
>  	dio_submit_t submit_io,	int flags)
> @@ -1096,7 +1096,6 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	size_t size;
>  	unsigned long addr;
>  	unsigned blkbits = inode->i_blkbits;
> -	unsigned bdev_blkbits = 0;
>  	unsigned blocksize_mask = (1 << blkbits) - 1;
>  	ssize_t retval = -EINVAL;
>  	loff_t end = offset;
> @@ -1109,12 +1108,14 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	if (rw & WRITE)
>  		rw = WRITE_ODIRECT;
>  
> -	if (bdev)
> -		bdev_blkbits = blksize_bits(bdev_logical_block_size(bdev));
> +	/* 
> +	 * Avoid references to bdev if not absolutely needed to give
> +	 * the early prefetch in the caller enough time.
> +	 */
>  
> -	if (offset & blocksize_mask) {
> +	if (unlikely(offset & blocksize_mask)) {

You can't make this assumption.  Userspace controls what size/alignment
of blocks to send in.
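
For anyone following along, a small sketch of what the annotation
amounts to, assuming GCC or Clang.  The helper offset_misaligned() is
hypothetical and only there to show the shape of the check:

```c
/* Branch hint built on a GCC/Clang builtin; the kernel's unlikely()
 * is defined along these lines. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: 1 if offset is not a multiple of 1 << blkbits. */
static int offset_misaligned(unsigned long long offset, unsigned blkbits)
{
	unsigned long long mask = (1ULL << blkbits) - 1;

	/* The hint only affects code layout, never the result. */
	if (unlikely(offset & mask))
		return 1;
	return 0;
}
```

The hint tells the compiler which side of the branch to treat as cold.
It is only a win if misaligned offsets really are rare, and that is up
to the application issuing the I/O, not the kernel.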

> @@ -1312,6 +1315,30 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  out:
>  	return retval;
>  }
> +
> +ssize_t
> +__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> +	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
> +	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
> +	dio_submit_t submit_io,	int flags)
> +{
> +	/* 
> +	 * The block device state is needed in the end to finally
> +	 * submit everything.  Since it's likely to be cache cold
> +	 * prefetch it here as first thing to hide some of the
> +	 * latency.
> +	 * 
> +	 * Attempt to prefetch the pieces we likely need later.
> +	 */
> +	prefetch(&bdev->bd_disk->part_tbl);
> +	prefetch(bdev->bd_queue);
> +	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
> +
> +	return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> +				     nr_segs, get_block, end_io,
> +				     submit_io, flags);
> +}
> +
>  EXPORT_SYMBOL(__blockdev_direct_IO);

Heh... you broke direct_io_worker out again (kind of).  ;-)

Cheers,
Jeff

