From: Jens Axboe
Date: Wed, 15 Apr 2015 12:27:33 -0600
To: "Elliott, Robert (Server Storage)", Christoph Hellwig,
 "viro@zeniv.linux.org.uk", Mike Snitzer
CC: "linux-nvdimm@ml01.01.org", "linux-fsdevel@vger.kernel.org",
 "linux-kernel@vger.kernel.org", "x86@kernel.org",
 "ross.zwisler@linux.intel.com", "boaz@plexistor.com",
 "Kani, Toshimitsu", "Knippers, Linda", Andrew Morton
Subject: Re: pmem and i_dio_count overhead

On 04/03/2015 03:35 PM, Elliott, Robert (Server Storage) wrote:
> Jens, one of your patches from October 2013 never made it
> into the kernel, but would be beneficial for pmem. It improves
> IOPS by about 15%.
>
> Original patch: https://lkml.org/lkml/2013/10/24/130
>
>> From Jens Axboe
>> Subject [PATCH 05/11] direct-io: only inc/dec inode->i_dio_count for file systems
>> Date Thu, 24 Oct 2013 10:25:58 +0100
>>
>> We don't need truncate protection for block devices, so add a flag
>> bypassing this cache line dirtying twice for every IO. This easily
>> contributes to 5-10% of the CPU time on high IOPS O_DIRECT testing.
>
> Here are perf top results while running fio to pmem devices
> using memcpy with non-temporal load and store instructions:
>
>   20.54%  [pmem]    [k] pmem_do_bvec.isra.6
>   10.13%  [kernel]  [k] do_blockdev_direct_IO
>    5.93%  [kernel]  [k] inode_dio_done
>    4.46%  [kernel]  [k] bio_endio
>    3.07%  fio       [.] get_io_u
>    2.08%  fio       [.] do_io
>
> Inside do_blockdev_direct_IO (10%), 60% of the time is spent
> atomically incrementing i_dio_count (perf annotate output):
>
>                  static inline void atomic_inc(atomic_t *v)
>                  {
>                          asm volatile(LOCK_PREFIX "incl %0"
>    0.06   225:    lock   incl 0x134(%r14)
>                  atomic_inc(&inode->i_dio_count);
>
>                  retval = 0;
>                  sdio.blkbits = blkbits;
>                  sdio.blkfactor = i_blkbits - blkbits;
>                  sdio.block_in_file = offset >> blkbits;
>   60.31          mov    -0x1d0(%rbp),%rdx
>    0.16          mov    %r12d,%ecx
>                   */
>                  atomic_inc(&inode->i_dio_count);
>
>                  retval = 0;
>                  sdio.blkbits = blkbits;
>                  sdio.blkfactor = i_blkbits - blkbits;
>    0.00          sub    %r12d,%ebx
>                   * Will be decremented at I/O completion time.
>                   */
>                  atomic_inc(&inode->i_dio_count);
>
> inode_dio_done spends all of its 5.8% doing the corresponding
> atomic_dec.
>
> So, they're combining for 11.8% of the overall CPU time.
> The problem is more atomic contention than cache line dirtying.
>
> Applying your patch (changing the bitmask from 0x04 to
> 0x08, since 0x04 is taken now) eliminates those
> instructions from perf top and improves the high IOPS
> results by 5 to 15%:
>
>   Attr  Copy               Read IOPS    Write IOPS
>   ====  ====               =========    ==========
>   UC    NT rd,wr               513 K         326 K
>         with the patch:        510 K         325 K
>
>   WB    NT rd,wr               3.3 M         3.5 M
>         with the patch:        3.8 M         3.9 M
>
>   WC    NT rd,wr               3.0 M         3.9 M
>         with the patch:        3.1 M         4.1 M
>
>   WT    NT rd,wr               3.3 M         2.1 M
>         with the patch:        3.7 M         3.7 M
>
> (there is some other test environment inconsistency
> with WT writes - I don't think this change really
> helped by 76%)
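
For reference, the whole change boils down to a flag test around the two
atomics in fs/direct-io.c, roughly like the sketch below. The
DIO_SKIP_DIO_COUNT spelling and the surrounding context are from memory
of the re-posted variant rather than copied from it, so treat the
details as approximate; 0x08 is just the next free dio flag bit, as
Robert notes above:

    /* include/linux/fs.h: new dio flag, 0x08 since 0x04 is now taken */
    DIO_SKIP_DIO_COUNT = 0x08,  /* inode needs no truncate protection */

    /* fs/direct-io.c, do_blockdev_direct_IO(): skip the increment when
     * the caller sets the flag (block devices cannot be truncated) */
    if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            atomic_inc(&inode->i_dio_count);

    /* ...and skip the matching decrement in the completion path, which
     * is what removes inode_dio_done from the profile above */
    if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            inode_dio_done(dio->inode);

The block device ->direct_IO path then opts in by passing that flag,
while regular file systems keep the i_dio_count protection they rely on
for truncate.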

Just re-posted a cleaned-up variant, forgot to CC you... You've got it
in private email as well. Yes, let's finally get this in!

Andrew, we ended up bikeshedding on this patch a lot last time, which
is ultimately why it got dropped on the floor. I CC'ed you on the new
submission as well.

-- 
Jens Axboe