From: Jens Axboe
Date: Wed, 15 Apr 2015 12:27:33 -0600
To: "Elliott, Robert (Server Storage)", Christoph Hellwig,
 "viro@zeniv.linux.org.uk", Mike Snitzer
CC: "linux-nvdimm@ml01.01.org", "linux-fsdevel@vger.kernel.org",
 "linux-kernel@vger.kernel.org", "x86@kernel.org",
 "ross.zwisler@linux.intel.com", "boaz@plexistor.com",
 "Kani, Toshimitsu", "Knippers, Linda", Andrew Morton
Subject: Re: pmem and i_dio_count overhead

On 04/03/2015 03:35 PM, Elliott, Robert (Server Storage) wrote:
> Jens, one of your patches from October 2013 never made it
> into the kernel, but would be beneficial for pmem. It improves
> IOPS by about 15%.
>
> Original patch: https://lkml.org/lkml/2013/10/24/130
>
>> From Jens Axboe
>> Subject [PATCH 05/11] direct-io: only inc/dec inode->i_dio_count for file systems
>> Date Thu, 24 Oct 2013 10:25:58 +0100
>>
>> We don't need truncate protection for block devices, so add a flag
>> bypassing this cache line dirtying twice for every IO. This easily
>> contributes to 5-10% of the CPU time on high IOPS O_DIRECT testing.
>
> Here are perf top results while running fio to pmem devices
> using memcpy with non-temporal load and store instructions:
>
>   20.54%  [pmem]    [k] pmem_do_bvec.isra.6
>   10.13%  [kernel]  [k] do_blockdev_direct_IO
>    5.93%  [kernel]  [k] inode_dio_done
>    4.46%  [kernel]  [k] bio_endio
>    3.07%  fio       [.] get_io_u
>    2.08%  fio       [.] do_io
>
> Inside do_blockdev_direct_IO (10%), 60% of the time is spent
> atomically incrementing i_dio_count (perf annotate output):
>
>                  static inline void atomic_inc(atomic_t *v)
>                  {
>                          asm volatile(LOCK_PREFIX "incl %0"
>    0.06   225:    lock   incl 0x134(%r14)
>                  atomic_inc(&inode->i_dio_count);
>
>                  retval = 0;
>                  sdio.blkbits = blkbits;
>                  sdio.blkfactor = i_blkbits - blkbits;
>                  sdio.block_in_file = offset >> blkbits;
>   60.31          mov    -0x1d0(%rbp),%rdx
>    0.16          mov    %r12d,%ecx
>                   */
>                  atomic_inc(&inode->i_dio_count);
>
>                  retval = 0;
>                  sdio.blkbits = blkbits;
>                  sdio.blkfactor = i_blkbits - blkbits;
>    0.00          sub    %r12d,%ebx
>                   * Will be decremented at I/O completion time.
>                   */
>                  atomic_inc(&inode->i_dio_count);
>
> inode_dio_done spends all of its 5.8% doing the corresponding
> atomic_dec.
>
> So, they're combining for 11.8% of the overall CPU time.
> The problem is more atomic contention than cache line dirtying.
>
> Applying your patch (changing the bitmask from 0x04 to
> 0x08, since 0x04 is taken now) eliminates those
> instructions from perf top and improves the high IOPS
> results by 5 to 15%:
>
>   Attr  Copy               Read IOPS    Write IOPS
>   ====  ====               =========    ==========
>   UC    NT rd,wr               513 K         326 K
>         with the patch:        510 K         325 K
>
>   WB    NT rd,wr               3.3 M         3.5 M
>         with the patch:        3.8 M         3.9 M
>
>   WC    NT rd,wr               3.0 M         3.9 M
>         with the patch:        3.1 M         4.1 M
>
>   WT    NT rd,wr               3.3 M         2.1 M
>         with the patch:        3.7 M         3.7 M
>
> (there is some other test environment inconsistency
> with WT writes - I don't think this change really
> helped by 76%)
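
For reference, the whole change boils down to a flag test around the two
atomics in fs/direct-io.c, roughly like the sketch below. The
DIO_SKIP_DIO_COUNT spelling and the surrounding context are from memory
of the re-posted variant rather than copied from it, so treat the
details as approximate; 0x08 is just the next free dio flag bit, as
Robert notes above:

    /* include/linux/fs.h: new dio flag, 0x08 since 0x04 is now taken */
    DIO_SKIP_DIO_COUNT = 0x08,  /* inode needs no truncate protection */

    /* fs/direct-io.c, do_blockdev_direct_IO(): skip the increment when
     * the caller sets the flag (block devices cannot be truncated) */
    if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            atomic_inc(&inode->i_dio_count);

    /* ...and skip the matching decrement in the completion path, which
     * is what removes inode_dio_done from the profile above */
    if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            inode_dio_done(dio->inode);

The block device ->direct_IO path then opts in by passing that flag,
while regular file systems keep the i_dio_count protection they rely on
for truncate.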

Just re-posted a cleaned-up variant, forgot to CC you... You've got it
in private email as well. Yes, let's finally get this in!

Andrew, we ended up bikeshedding on this patch a lot last time, which
is ultimately why it got dropped on the floor. I CC'ed you on the new
submission as well.

-- 
Jens Axboe