Date: Fri, 22 Feb 2019 10:45:25 -0800
From: "Darrick J. Wong"
Wong" To: Dan Williams Cc: linux-nvdimm , Ross Zwisler , Vishal L Verma , xfs , linux-fsdevel Subject: Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax Message-ID: <20190222184525.GA21626@magnolia> References: <20190222182008.GT6503@magnolia> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.9.4 (2018-02-28) X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9175 signatures=668685 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1902220128 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote: > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong > wrote: > > > > Hi all! > > > > Uh, we have an internal customer who's been trying out MAP_SYNC > > on pmem, and they've observed that one has to do a fair amount of > > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up > > 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem, > > so the PMD mappings are much more efficient. > > > > I started poking around w.r.t. what mkfs.xfs was doing and realized that > > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will > > set up all the parameters automatically. Below is my ham-handed attempt > > to teach the kernel to do this. > > > > Comments, flames, "WTF is this guy smoking?" are all welcome. :) > > > > --D > > > > --- > > Configure pmem devices to advertise the default page alignment when said > > block device supports fsdax. Certain filesystems use these iomin/ioopt > > hints to try to create aligned file extents, which makes it much easier > > for mmaps to take advantage of huge page table entries. > > > > Signed-off-by: Darrick J. Wong > > --- > > drivers/nvdimm/pmem.c | 5 ++++- > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > > index bc2f700feef8..3eeb9dd117d5 100644 > > --- a/drivers/nvdimm/pmem.c > > +++ b/drivers/nvdimm/pmem.c > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev, > > blk_queue_logical_block_size(q, pmem_sector_size(ndns)); > > blk_queue_max_hw_sectors(q, UINT_MAX); > > blk_queue_flag_set(QUEUE_FLAG_NONROT, q); > > - if (pmem->pfn_flags & PFN_MAP) > > + if (pmem->pfn_flags & PFN_MAP) { > > blk_queue_flag_set(QUEUE_FLAG_DAX, q); > > + blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT); > > + blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT); > > The device alignment might sometimes be bigger than this default. > Would there be any detrimental effects for filesystems if io_min and > io_opt were set to 1GB? Hmmm, that's going to be a struggle on ext4 and the xfs data device because we'd be preferentially skipping the 1023.8MB immediately after each allocation group's metadata. It already does this now with a 2MB io hint, but losing 1.8MB here and there isn't so bad. We'd have to study it further, though; filesystems historically have interpreted the iomin/ioopt hints as RAID striping geometry, and I don't think very many people set up 1GB raid stripe units. 

> I'm thinking an xfs-realtime configuration might be able to support
> 1GB mappings in the future.

The xfs realtime device ought to be able to support 1g alignment pretty
easily though. :)

--D
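
As an aside (not part of the patch): a minimal userspace sketch to read back
the io_min/io_opt hints that a block device actually advertises, i.e. what
mkfs.xfs would pick up from it. "pmem0" below is just an assumed device name:

#include <stdio.h>

/* Read a numeric queue limit from sysfs, e.g. minimum_io_size. */
static long read_queue_limit(const char *dev, const char *attr)
{
	char path[256];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	/* "pmem0" is an assumed device name for this example. */
	printf("minimum_io_size: %ld\n",
	       read_queue_limit("pmem0", "minimum_io_size"));
	printf("optimal_io_size: %ld\n",
	       read_queue_limit("pmem0", "optimal_io_size"));
	return 0;
}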