From mboxrd@z Thu Jan 1 00:00:00 1970
User-Agent: Mutt/1.9.4 (2018-02-28)
MIME-Version: 1.0
Message-ID: <20200227155145.GM8045@magnolia>
Date: Thu, 27 Feb 2020 07:51:45 -0800 (PST)
From: "Darrick J. Wong"
To: Brian Foster
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/3] xfs: implement block reservation accounting for btrees we're staging
References: <157063978750.2914891.14339604572380248276.stgit@magnolia>
 <157063979983.2914891.13811468205423934367.stgit@magnolia>
 <20191025142227.GB11837@bfoster>
 <20191025163517.GK913374@magnolia>
 <20191025173554.GH11837@bfoster>
 <20191025205241.GR913374@magnolia>
In-Reply-To: <20191025205241.GR913374@magnolia>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailing-List: linux-xfs@vger.kernel.org

On Fri, Oct 25, 2019 at 01:52:41PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 25, 2019 at 01:35:54PM -0400, Brian Foster wrote:
> > On Fri, Oct 25, 2019 at 09:35:17AM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 25, 2019 at 10:22:27AM -0400, Brian Foster wrote:
> > > > On Wed, Oct 09, 2019 at 09:49:59AM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong
> > > > > 
> > > > > Create a new xrep_newbt structure to encapsulate a fake root for
> > > > > creating a staged btree cursor as well as to track all the blocks that
> > > > > we need to reserve in order to build that btree.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong
> > > > > ---
> > > > >  fs/xfs/scrub/repair.c |  260 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/scrub/repair.h |   61 +++++++++++
> > > > >  fs/xfs/scrub/trace.h  |   58 +++++++++++
> > > > >  3 files changed, 379 insertions(+)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > > > > index 588bc054db5c..beebd484c5f3 100644
> > > > > --- a/fs/xfs/scrub/repair.c
> > > > > +++ b/fs/xfs/scrub/repair.c
> > > > > @@ -359,6 +359,266 @@ xrep_init_btblock(
> > > > >  	return 0;
> > > > >  }
> > > > > 
> > > > ...
> > > > > +/* Initialize accounting resources for staging a new inode fork btree. */
> > > > > +void
> > > > > +xrep_newbt_init_inode(
> > > > > +        struct xrep_newbt                *xnr,
> > > > > +        struct xfs_scrub                 *sc,
> > > > > +        int                              whichfork,
> > > > > +        const struct xfs_owner_info      *oinfo)
> > > > > +{
> > > > > +        memset(xnr, 0, sizeof(struct xrep_newbt));
> > > > > +        xnr->sc = sc;
> > > > > +        xnr->oinfo = *oinfo; /* structure copy */
> > > > > +        xnr->alloc_hint = XFS_INO_TO_FSB(sc->mp, sc->ip->i_ino);
> > > > > +        xnr->resv = XFS_AG_RESV_NONE;
> > > > > +        xnr->ifake.if_fork = kmem_zone_zalloc(xfs_ifork_zone, 0);
> > > > > +        xnr->ifake.if_fork_size = XFS_IFORK_SIZE(sc->ip, whichfork);
> > > > > +        INIT_LIST_HEAD(&xnr->reservations);
> > > > 
> > > > Looks like this could reuse the above function for everything outside of
> > > > the fake root bits.
> > > 
> > > Ok.
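
[Editorial sketch, not part of the patch under review: one way the reuse
Brian suggests might look, assuming xrep_newbt_init_ag() performs the
memset() and the common field/list setup shown above, so the inode-fork
variant only layers the fake fork on top.]

void
xrep_newbt_init_inode(
        struct xrep_newbt                *xnr,
        struct xfs_scrub                 *sc,
        int                              whichfork,
        const struct xfs_owner_info      *oinfo)
{
        /* Common setup: zero the structure, save sc/oinfo, init the list. */
        xrep_newbt_init_ag(xnr, sc, oinfo,
                        XFS_INO_TO_FSB(sc->mp, sc->ip->i_ino),
                        XFS_AG_RESV_NONE);

        /* Only the fake inode fork is specific to the inode case. */
        xnr->ifake.if_fork = kmem_zone_zalloc(xfs_ifork_zone, 0);
        xnr->ifake.if_fork_size = XFS_IFORK_SIZE(sc->ip, whichfork);
}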
> > > > > > > > +} > > > > > + > > > > > +/* > > > > > + * Initialize accounting resources for staging a new btree. Callers are > > > > > + * expected to add their own reservations (and clean them up) manually. > > > > > + */ > > > > > +void > > > > > +xrep_newbt_init_bare( > > > > > + struct xrep_newbt *xnr, > > > > > + struct xfs_scrub *sc) > > > > > +{ > > > > > + xrep_newbt_init_ag(xnr, sc, &XFS_RMAP_OINFO_ANY_OWNER, NULLFSBLOCK, > > > > > + XFS_AG_RESV_NONE); > > > > > +} > > > > > + > > > > > +/* Add a space reservation manually. */ > > > > > +int > > > > > +xrep_newbt_add_reservation( > > > > > + struct xrep_newbt *xnr, > > > > > + xfs_fsblock_t fsbno, > > > > > + xfs_extlen_t len) > > > > > +{ > > > > > > > > FWIW the "reservation" terminology sounds a bit funny to me. Perhaps > > > > it's just because I've had log reservation on my mind :P, but something > > > > that "reserves blocks" as opposed to "adds reservation" might be a bit > > > > more clear from a naming perspective. > > > > > > xrep_newbt_reserve_space() ? > > > > > > I feel that's a little awkward since it's not actually reserving > > > anything; all it's doing is some accounting work for some space that the > > > caller already allocated. But it's probably no worse than the current > > > name. :) > > > > > > > Maybe _add_blocks() and _alloc_blocks() for these two and something like > > _[get|use]_block() in the later helper that populates the btree..? That > > seems more descriptive to me than "reservation" and "space," but that's > > just my .02. > > Yeah, that works. Fixed. > > > Brian > > > > > > > + struct xrep_newbt_resv *resv; > > > > > + > > > > > + resv = kmem_alloc(sizeof(struct xrep_newbt_resv), KM_MAYFAIL); > > > > > + if (!resv) > > > > > + return -ENOMEM; > > > > > + > > > > > + INIT_LIST_HEAD(&resv->list); > > > > > + resv->fsbno = fsbno; > > > > > + resv->len = len; > > > > > + resv->used = 0; > > > > > > > > Is ->used purely a count or does it also serve as a pointer to the next > > > > "unused" block? > > > > > > It's a counter, as documented in the struct declaration. > > > > > > > > + list_add_tail(&resv->list, &xnr->reservations); > > > > > + return 0; > > > > > +} > > > > > + > > > > > +/* Reserve disk space for our new btree. */ > > > > > +int > > > > > +xrep_newbt_reserve_space( > > > > > + struct xrep_newbt *xnr, > > > > > + uint64_t nr_blocks) > > > > > +{ > > > > > + struct xfs_scrub *sc = xnr->sc; > > > > > + xfs_alloctype_t type; > > > > > + xfs_fsblock_t alloc_hint = xnr->alloc_hint; > > > > > + int error = 0; > > > > > + > > > > > + type = sc->ip ? XFS_ALLOCTYPE_START_BNO : XFS_ALLOCTYPE_NEAR_BNO; > > > > > + > > > > > > > > So I take it this distinguishes between reconstruction of a bmapbt > > > > where we can allocate across AGs vs an AG tree..? If so, a one liner > > > > comment wouldn't hurt here. > > > > > > Ok. > > > > > > > > + while (nr_blocks > 0 && !error) { > > > > > + struct xfs_alloc_arg args = { > > > > > + .tp = sc->tp, > > > > > + .mp = sc->mp, > > > > > + .type = type, > > > > > + .fsbno = alloc_hint, > > > > > + .oinfo = xnr->oinfo, > > > > > + .minlen = 1, > > > > > + .maxlen = nr_blocks, > > > > > + .prod = nr_blocks, > > > > > > > > Why .prod? Is this relevant if .mod isn't set? > > > > > > Not sure why that's even in there. :/ Oh, dumb copy pasta error. That ought to be .prod = 1 since we don't care about alignment. 
> > > > > +                        .resv    = xnr->resv,
> > > > > +                };
> > > > > +
> > > > > +                error = xfs_alloc_vextent(&args);
> > > > > +                if (error)
> > > > > +                        return error;
> > > > > +                if (args.fsbno == NULLFSBLOCK)
> > > > > +                        return -ENOSPC;
> > > > > +
> > > > > +                trace_xrep_newbt_reserve_space(sc->mp,
> > > > > +                                XFS_FSB_TO_AGNO(sc->mp, args.fsbno),
> > > > > +                                XFS_FSB_TO_AGBNO(sc->mp, args.fsbno),
> > > > > +                                args.len, xnr->oinfo.oi_owner);
> > > > > +
> > > > > +                error = xrep_newbt_add_reservation(xnr, args.fsbno, args.len);
> > > > > +                if (error)
> > > > > +                        break;
> > > > > +
> > > > > +                nr_blocks -= args.len;
> > > > > +                alloc_hint = args.fsbno + args.len - 1;
> > > > > +
> > > > > +                if (sc->ip)
> > > > > +                        error = xfs_trans_roll_inode(&sc->tp, sc->ip);
> > > > > +                else
> > > > > +                        error = xrep_roll_ag_trans(sc);
> > > > > +        }
> > > > > +
> > > > > +        return error;
> > > > > +}
> > > > > +
> > > > > +/* Free all the accounting info and disk space we reserved for a new btree. */
> > > > > +void
> > > > > +xrep_newbt_destroy(
> > > > > +        struct xrep_newbt        *xnr,
> > > > > +        int                      error)
> > > > > +{
> > > > > +        struct xfs_scrub         *sc = xnr->sc;
> > > > > +        struct xrep_newbt_resv   *resv, *n;
> > > > > +
> > > > > +        if (error)
> > > > > +                goto junkit;
> > > > > +
> > > > > +        list_for_each_entry_safe(resv, n, &xnr->reservations, list) {
> > > > > +                /* Free every block we didn't use. */
> > > > > +                resv->fsbno += resv->used;
> > > > > +                resv->len -= resv->used;
> > > > > +                resv->used = 0;
> > > > 
> > > > That answers my count/pointer question. :)
> > > > 
> > > > > +
> > > > > +                if (resv->len > 0) {
> > > > > +                        trace_xrep_newbt_unreserve_space(sc->mp,
> > > > > +                                        XFS_FSB_TO_AGNO(sc->mp, resv->fsbno),
> > > > > +                                        XFS_FSB_TO_AGBNO(sc->mp, resv->fsbno),
> > > > > +                                        resv->len, xnr->oinfo.oi_owner);
> > > > > +
> > > > > +                        __xfs_bmap_add_free(sc->tp, resv->fsbno, resv->len,
> > > > > +                                        &xnr->oinfo, true);
> > > > > +                }
> > > > > +
> > > > > +                list_del(&resv->list);
> > > > > +                kmem_free(resv);
> > > > > +        }
> > > > > +
> > > > > +junkit:
> > > > > +        list_for_each_entry_safe(resv, n, &xnr->reservations, list) {
> > > > > +                list_del(&resv->list);
> > > > > +                kmem_free(resv);
> > > > > +        }
> > > > 
> > > > Seems like this could be folded into the above loop by just checking
> > > > error and skipping the free logic appropriately.
> > > > 
> > > > > +
> > > > > +        if (sc->ip) {
> > > > > +                kmem_zone_free(xfs_ifork_zone, xnr->ifake.if_fork);
> > > > > +                xnr->ifake.if_fork = NULL;
> > > > > +        }
> > > > > +}
> > > > > +
> > > > > +/* Feed one of the reserved btree blocks to the bulk loader. */
> > > > > +int
> > > > > +xrep_newbt_alloc_block(
> > > > > +        struct xfs_btree_cur     *cur,
> > > > > +        struct xrep_newbt        *xnr,
> > > > > +        union xfs_btree_ptr      *ptr)
> > > > > +{
> > > > > +        struct xrep_newbt_resv   *resv;
> > > > > +        xfs_fsblock_t            fsb;
> > > > > +
> > > > > +        /*
> > > > > +         * If last_resv doesn't have a block for us, move forward until we find
> > > > > +         * one that does (or run out of reservations).
> > > > > +         */
> > > > > +        if (xnr->last_resv == NULL) {
> > > > > +                list_for_each_entry(resv, &xnr->reservations, list) {
> > > > > +                        if (resv->used < resv->len) {
> > > > > +                                xnr->last_resv = resv;
> > > > > +                                break;
> > > > > +                        }
> > > > > +                }
> > > > 
> > > > Not a big deal right now, but it might be worth eventually considering
> > > > something more efficient.
> > > > For example, perhaps we could rotate depleted entries to the end of the
> > > > list and if we rotate and find nothing in the next entry at the head, we
> > > > know we've run out of space.
> > > 
> > > Hm, yeah, this part would be much simpler if all we had to do was latch
> > > on to the head element and rotate them to the tail when we're done.
> > > 
> > > > > +                if (xnr->last_resv == NULL)
> > > > > +                        return -ENOSPC;
> > > > > +        } else if (xnr->last_resv->used == xnr->last_resv->len) {
> > > > > +                if (xnr->last_resv->list.next == &xnr->reservations)
> > > > > +                        return -ENOSPC;
> > > > > +                xnr->last_resv = list_entry(xnr->last_resv->list.next,
> > > > > +                                struct xrep_newbt_resv, list);
> > > > > +        }
> > > > > +
> > > > > +        /* Nab the block. */
> > > > > +        fsb = xnr->last_resv->fsbno + xnr->last_resv->used;
> > > > > +        xnr->last_resv->used++;
> > > > > +
> > > > > +        trace_xrep_newbt_alloc_block(cur->bc_mp,
> > > > > +                        XFS_FSB_TO_AGNO(cur->bc_mp, fsb),
> > > > > +                        XFS_FSB_TO_AGBNO(cur->bc_mp, fsb),
> > > > > +                        xnr->oinfo.oi_owner);
> > > > > +
> > > > > +        if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
> > > > > +                ptr->l = cpu_to_be64(fsb);
> > > > > +        else
> > > > > +                ptr->s = cpu_to_be32(XFS_FSB_TO_AGBNO(cur->bc_mp, fsb));
> > > > > +        return 0;
> > > > > +}
> > > > > +
> > > > ...
> > > > > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > > > > index 479cfe38065e..ab6c1199ecc0 100644
> > > > > --- a/fs/xfs/scrub/repair.h
> > > > > +++ b/fs/xfs/scrub/repair.h
> > > > ...
> > > > > @@ -59,6 +63,63 @@ int xrep_agf(struct xfs_scrub *sc);
> > > > >  int xrep_agfl(struct xfs_scrub *sc);
> > > > >  int xrep_agi(struct xfs_scrub *sc);
> > > > > 
> > > > ...
> > > > > +
> > > > > +#define for_each_xrep_newbt_reservation(xnr, resv, n)        \
> > > > > +        list_for_each_entry_safe((resv), (n), &(xnr)->reservations, list)
> > > > 
> > > > This is unused (and seems unnecessary for a simple list).
> > > 
> > > It's used by the free space rebuilder in the next patch; I suppose I
> > > could move it down.  That said, I've been trying to keep the common code
> > > out of that particular patch so that the repair patches can be merged in
> > > any order.
> > > 
> > > > ...
> > > > > diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> > > > > index 3362bae28b46..deb177abf652 100644
> > > > > --- a/fs/xfs/scrub/trace.h
> > > > > +++ b/fs/xfs/scrub/trace.h
> > > > > @@ -904,6 +904,64 @@ TRACE_EVENT(xrep_ialloc_insert,
> > > > >  		__entry->freemask)
> > > > >  )
> > > > > 
> > > > ...
> > > > > +#define DEFINE_NEWBT_EXTENT_EVENT(name) \
> > > > > +DEFINE_EVENT(xrep_newbt_extent_class, name, \
> > > > > +        TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
> > > > > +                 xfs_agblock_t agbno, xfs_extlen_t len, \
> > > > > +                 int64_t owner), \
> > > > > +        TP_ARGS(mp, agno, agbno, len, owner))
> > > > > +DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_reserve_space);
> > > > > +DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_unreserve_space);
> > > > > +
> > > > > +TRACE_EVENT(xrep_newbt_alloc_block,
> > > > > +        TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > > > > +                 xfs_agblock_t agbno, int64_t owner),
> > > > 
> > > > This could be folded into the above class if we just passed 1 for the
> > > > length, eh?
> > > 
> > > Er, yes.  Fixed.
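
[Editorial sketch of the fold agreed to above, assuming the call site
simply passes a length of 1 through the existing extent event class; the
dedicated TRACE_EVENT() definition then goes away. Not the merged code.]

/* trace.h: reuse the extent event class for single-block grabs. */
DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_alloc_block);

/* repair.c: the caller reports the one block it just handed out. */
        trace_xrep_newbt_alloc_block(cur->bc_mp,
                        XFS_FSB_TO_AGNO(cur->bc_mp, fsb),
                        XFS_FSB_TO_AGBNO(cur->bc_mp, fsb),
                        1, xnr->oinfo.oi_owner);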
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > +        TP_ARGS(mp, agno, agbno, owner),
> > > > > +        TP_STRUCT__entry(
> > > > > +                __field(dev_t, dev)
> > > > > +                __field(xfs_agnumber_t, agno)
> > > > > +                __field(xfs_agblock_t, agbno)
> > > > > +                __field(int64_t, owner)
> > > > > +        ),
> > > > > +        TP_fast_assign(
> > > > > +                __entry->dev = mp->m_super->s_dev;
> > > > > +                __entry->agno = agno;
> > > > > +                __entry->agbno = agbno;
> > > > > +                __entry->owner = owner;
> > > > > +        ),
> > > > > +        TP_printk("dev %d:%d agno %u agbno %u owner %lld",
> > > > > +                  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > > +                  __entry->agno,
> > > > > +                  __entry->agbno,
> > > > > +                  __entry->owner)
> > > > > +);
> > > > > +
> > > > >  #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
> > > > > 
> > > > >  #endif /* _TRACE_XFS_SCRUB_TRACE_H */
> > > > > 
> > > > 
> > > 
> > 
> 
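
[Editorial sketch of the rotate-to-tail idea Brian raises above (and that
Darrick says would simplify the code): a depleted reservation is moved to
the back of the list as soon as its last block is handed out, so the head
entry is always the one to allocate from and the last_resv pointer is no
longer needed. Illustration only, not the code that was merged; the
tracepoint call is elided for brevity.]

/* Feed one of the reserved btree blocks to the bulk loader. */
int
xrep_newbt_alloc_block(
        struct xfs_btree_cur     *cur,
        struct xrep_newbt        *xnr,
        union xfs_btree_ptr      *ptr)
{
        struct xrep_newbt_resv   *resv;
        xfs_fsblock_t            fsb;

        /*
         * Depleted reservations get rotated to the tail below, so a depleted
         * (or missing) head entry means every reservation has been used up.
         */
        resv = list_first_entry_or_null(&xnr->reservations,
                        struct xrep_newbt_resv, list);
        if (!resv || resv->used == resv->len)
                return -ENOSPC;

        /* Nab the block. */
        fsb = resv->fsbno + resv->used;
        resv->used++;

        /* Out of blocks in this reservation?  Send it to the back. */
        if (resv->used == resv->len)
                list_move_tail(&resv->list, &xnr->reservations);

        /* (tracepoint call elided) */

        if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
                ptr->l = cpu_to_be64(fsb);
        else
                ptr->s = cpu_to_be32(XFS_FSB_TO_AGBNO(cur->bc_mp, fsb));
        return 0;
}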