From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9A3522D46B6 for ; Wed, 13 Aug 2025 08:05:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.169 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755072308; cv=none; b=YO8orKE8N4NS5yBaXtmqXsaw28qyR0bjd6NtDfOpftYyqjS6qU87EfwjMJ8QOOoWQz9vQO9zIvbSx6DqOn14BpAZ4v73g3tkgcEL+cvfL1YFv3HgFjkySrpr08V2MyCnOLXyAE8JugWAKXyq8XCpOKyE9kUye8BPuoaNsV4kto4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755072308; c=relaxed/simple; bh=yFwqGkNPjA0B735kIzF6s/wl6ummDEcV/1bwlDg7jH4=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=HvCFdrKgFCHPnWRBfqXUW0uyuJ3mgd4p5bdi2d0lCVU+N+7J4lLzWd1noYtSM2UnBbUVdTcKt30l4HRK6PPWADMPLOKZ2ZMWE0mnavs/W0N3K1MIJuVQ/BhxIYaGC8oRFSlhtDXgCTtHiHJiwZBr2ggnnMqXJijOzoTSCCEThiI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=GyKNW2/0; arc=none smtp.client-ip=209.85.210.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="GyKNW2/0" Received: by mail-pf1-f169.google.com with SMTP id d2e1a72fcca58-76c3607d960so6813670b3a.1 for ; Wed, 13 Aug 2025 01:05:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1755072305; x=1755677105; darn=vger.kernel.org; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=lRrC2C+y4CmWUZ1CDgAewNL6QjBeswr1jTwcGwfnMU8=; b=GyKNW2/0ceuo1TdGCEanlXxPl2PX8HhfksTybEkGrobLuL1MuU/FAeVc2u/t4Ed93R VCvXgWSRVBc2qT+p4OGeh0lO5A2VOGTRZHxnVdXkprKJDQGzWscOGdxdxmwmNJ+weqZu pOAuW0asB1ZBEQqyuwCcrXdMwsnxuW5VGDKqa/HEDs+eUodugVpkLQVLM9rKENyoZzWo zPi5b3XoszkPJWkksc2+cbIAoKyjSY0HvugLj0lANOh55nSmPO4Ok5qyhwix+FMrxgqf /oSxq0lnMjEqpePEY+pqfcxjh7FnRSG84fEwHKqDSEpa4m4j+zlpKzMyA+9EQIdszod/ jLlQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1755072305; x=1755677105; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=lRrC2C+y4CmWUZ1CDgAewNL6QjBeswr1jTwcGwfnMU8=; b=kfX/Eq6qXPCdeTzuJOSaptrZ/TOJ5RwnXm67aJgM/pcpSBo/dqzC1uxHe/J2RTXf/Y cs1Cztz5DCpncChDdXjNCIf9rYslCrM0XcV8p/wThE2/TnlFM7xQbrV5ntfs7sFkqhcc JMOuNWrfy9n+uiXAKCj7nJY//t1zrHeABzH796HqPPKKimiZtPnzfxgrNIqcdg84/tX0 Pi04QkPY0xazmwDV3Act3DVMhg8au2G8CuNwOQNw73vYfCPJBTFFAD9tP099+ZND/hyE UInTsfCzN+APjgFcgnQkVQ4zlNAbQ4A3fvpRDaC0CFXMQ6+t3WoQ7wfHtpWzLww3zL0h DjrQ== X-Gm-Message-State: AOJu0YyMyrOLYfEHheLJbCjojJDWiSjnxDmZHxqHqVn/HU7JjAFBPGb/ kX81ITIpMH/WoMMGFEKl7QnNJqUEM88wKLq2cuRmNL59fhu9TrSx+/zXjEpf7A== X-Gm-Gg: ASbGnctoOrbJhxrrP4TEs4JB7B6/v7mzcmnuYPy4irYdfP/rPZj879xQ/aHqvI7UnjR 3jLhJ39I1AACiPX7m5ONJXyQBOKV3O3H42zBw8AjeoZjqZSi11KQdTJJw+6K3JzTG0qaMzN6d9T H+hoGSIo5sROPXsvyaKwS5FzymLWmiPBUzcK6Y6yE/dpMaeJ2v+mrxvkhT2/mHxUrqwfw4jsjQw d6B72R3qM3SvF49v/L/eZ07vNdUHfGBTaod9+PxkapKR3HO4uWVT9eiyK+9qEzc5mCD5x0Qnv0S hsBoUSw6A6ZerWKoDRKI6TVEIg4iPYv0a1O150+YZDrnIre9gYz5VMdfTwfd/m9Lr9sBQbPOhBv trg7tPnnen8dGz0gEjpg/U0g0oFZJ X-Google-Smtp-Source: AGHT+IH00atXFBnHUWwOKQzqCXxAbHQXf++EDIKgbt+m0bsixXRMedG7BCJ09zVUc0BvT0ZonjSD0A== X-Received: by 2002:a05:6a20:7347:b0:240:4ba:8dcc with SMTP id adf61e73a8af0-240a8a42164mr3547006637.16.1755072304091; Wed, 13 Aug 2025 01:05:04 -0700 (PDT) Received: from [9.109.246.38] ([129.41.58.6]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-76bd733580fsm29946203b3a.7.2025.08.13.01.05.01 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 13 Aug 2025 01:05:03 -0700 (PDT) Message-ID: <878dcefe-944a-44ad-8975-890a3973c959@gmail.com> Date: Wed, 13 Aug 2025 13:34:59 +0530 Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC 3/3] xfs: Add support to shrink multiple empty AGs Content-Language: en-US To: "Darrick J. Wong" Cc: linux-xfs@vger.kernel.org, ritesh.list@gmail.com, ojaswin@linux.ibm.com, bfoster@redhat.com, david@fromorbit.com, hsiangkao@linux.alibaba.com References: <20250729210550.GG2672049@frogsfrogsfrogs> <8dc0f2e8-843b-4e55-9625-974e92933ee2@gmail.com> <20250812190740.GM7965@frogsfrogsfrogs> From: "Nirjhar Roy (IBM)" In-Reply-To: <20250812190740.GM7965@frogsfrogsfrogs> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 8/13/25 00:37, Darrick J. Wong wrote: > On Thu, Jul 31, 2025 at 12:06:11AM +0530, Nirjhar Roy (IBM) wrote: >> On 7/30/25 02:35, Darrick J. Wong wrote: >>> On Thu, Jul 17, 2025 at 04:00:45PM +0530, Nirjhar Roy (IBM) wrote: >>>> This patch is based on a previous RFC[1] by Gao Xiang and various >>>> ideas proposed by Dave Chinner in the RFC[1]. >>>> >>>> This patch adds the functionality to shrink the filesystem beyond >>>> 1 AG. We can remove only empty AGs to prevent loss of data. >>>> Before I summarize the overall steps of the shrink process, I >>>> would like to introduce some of the terminologies: >>>> >>>> 1. Empty AG - An AG that is completely used, and no block >>>> is being used/allocated for data or metadata and no >>>> log blocks are allocated here. This ensures that >>>> removal of this AG doesn't result in data loss. >>>> >>>> 2. Active/Online AG - Online AG and active AG will be used >>>> interchangebly. An AG is active or online when all the regular >>>> operations can be done on it. When we mount a filesystem, all >>>> the AGs are by default online/active. In terms of implementation, >>>> an online AG will have number of active references greater than 1 >>>> (default is 1 i.e, an AG by default is online/active). >>>> >>>> 3. AG offlining/deactivation - AG Offlining and AG deactivation will >>>> be used interchangebly. An AG is said to be offline/deactivated >>>> when no new high level operation can be started on the AG. This is >>>> implemented with the help of active references. When the active >>>> reference count of an AG is 0, the AG is said to be deactivated. >>>> No new active reference can be taken if the present active reference >>>> count is 0. This way a barrier is formed from preventing new high >>>> level operations to get started on an already offlined AG. >>>> >>>> 4. Reactivating an AG - If we try to remove an offline AG but for some >>>> reason, we can't, then we reactivate the AG i.e, the AG will once >>>> more be in an usable state i.e, the active reference count will be >>>> set to 1. All the high level operations can now be performed on this >>>> AG. In terms of implementation, in order to activate an AG, we >>>> atomically set the active reference count to 1. >>>> >>>> 5. AG removal - This means that AG no longer exists in the filesystem. >>>> It will be reflected in the usable/total size of the device too >>>> (using tools like df). >>>> >>>> 6. New tail AG - This refers to the last AG that will be formed after >>>> removal of 1 or more AGs. For example if there 4 AGs and each >>>> with 32 blocks, so there are total of 4 * 32 = 128 blocks. Now, >>>> if we remove 40 blocks, AG 3(indexed at 0 will be completely >>>> removed (32 blocks) and from AG 2, we will remove 8 blocks. >>>> So AG 2, is the new tail AG. >>>> 7. Old tail AG - This is the last AG before the start of the shrink >>>> process. >>>> >>>> 8. AG stabilization - This simply means that the in-memory contents >>>> are synched to the disk. >>>> >>>> The overall steps for shrinking AGs is as follows: >>>> PHASE 1: Preparing the AGs for removal >>>> 1. Deactivate the AGs to be removed completely - This is done >>>> by the function xfs_shrinkfs_deactivate_ags(). This step involves >>>> waiting for the active references for target AGs to go come down >>>> to 0. >>>> This is done so that no other entity is racing while the removal >>>> is in progress i.e, no new high level operation can start on that >>>> AG while we are trying to remove the AG. >>>> 2. Once we have waited for the active references to come down to 0, >>>> we make sure that all the pending operations on that AG are completed >>>> and the in-core and on-disk structures are in synch i.e, the AG is >>>> stablized on to the disk. >>> stabilized >> Noted. >>>> The steps to stablize the AG onto the disk are as follows: >>>> 2.a Wait for all the busy extents for the target AGs to be resolved >>>> (done by the function xfs_extent_busy_wait_range()) >>>> 2.b Flush the xfs_discard_wq workqueue >>>> 2.c We need to flush and empty the logs and wait for all the pending >>>> I/Os to complete - for this perform a log quiesce by calling >>>> xfs_log_quiesce(). This also ensures that none of the future >>>> logged transactions will refer to these AGs during log >>>> recovery in case if sudden shutdown/crash happens while we >>>> are trying to remove these AGs. >>> Doesn't flushing the log put a bunch of busy extents on xfs_discard_wq? >>> So doesn't 2c have to happen before 2b? >> Can you please tell me how flushing of logs will add more busy extents? My >> understanding is that busy extents get created during extent free >> operations(like file delete/truncate etc). Are you saying that once the logs >> are flushed, the log infrastructure will free the extents it was >> using(during reservation) and that would add to the list of busy extents? > IIRC, forcing the log to disk writes any open iclogs to disk, and the IO > completion from the iclog kicks off bdev discards of everything on the > busy extent list. When the discard bios complete, they get put on > xfs_discard_wq for ioend processing, which removes the busy extent list > entry. > > In short, I think you need to wait for the log cover to hit the disk and > then for all the discard bios to complete, so that there really is no > more IO going to the blocks in the doomed AG(s) before we get to step 3. Okay. Thank you for the explanation. I will dig deeper into this based on your explanation to have a better understanding. > >>>> 3. Once the AG is deactivated and stabilized on to the disk, we check if >>>> all the target AGs are empty, and if not, we fail the shrink process. >>>> We are not supporting partial shrink support i.e, the shrink will >>>> either completely fail or completely succeed. >>>> >>>> PHASE 2: Actual removal of the AGs >>> So in phase 2, we know that all the AGs can be freed and the new tail AG >>> shrunk? >> After phase 1 is completed, and we are in the beginning of phase 2, we know >> that all the AGs (which we are trying to remove COMPLETELY) can be >> removed/freed. The tail/new-tail AG is NOT shrunk in phase 1. It will be >> shrunk in phase 2. Phase 2 does the complete AG removal + tail/new-tail AG >> shrink. Does that answer your question? >> >> Maybe I can rephrase it to "PHASE 2: Actual removal of the AGs + >> tail/new-tail AG shrink" - what do you think? > "PHASE 2: Shrink new tail group, punch out totally empty groups" ? Yeah, this sounds good. > >>>> 4. Once the preparation phase is over, we start the actual removal >>>> process. This is done in the function xfs_shrink_start(). Here we >>>> first remove the blocks, then update the metadata of new last tail >>>> AG and then remove the AGs (and their associated data structures) >>>> one by one (in function xfs_shrink_remove_empty_ag()). >>>> 5. In the end we log the changes and commit the transaction. >>>> >>>> Removal of each AG is done by the function xfs_shrink_remove_empty_ag(). >>>> The steps can be outlined as follows: >>>> 1. Free the per AG reservation - this will result in correct free >>>> space/used space information. >>>> 2. Freeing the intents drain queue. >>>> 3. Freeing busy extents list. >>>> 4. Remove the perag cached buffers and then the buffer cache. >>>> 5. Freeing the struct xfs_group pointer - Before this is done, we >>>> assert that all the active and passive references are down to 0. >>>> We remove all the cached buffers associated with the offlined AGs >>>> to be removed - this releases the passive references of the AGs >>>> consumed by the cached buffers. >>> Migrating data out of the AG is left as an exercise to userspace, I >>> assume? >> That is correct. For now, I am targeting to remove only empty regular >> (non-realtime) AGs. >>>> [1] https://lore.kernel.org/all/20210414195240.1802221-1-hsiangkao@redhat.com/ >>>> >>>> Signed-off-by: Nirjhar Roy (IBM) >>>> Inspired-by: Gao Xiang >>>> Suggested-by: Dave Chinner >>>> --- >>>> fs/xfs/libxfs/xfs_ag.c | 135 ++++++++++++++++++- >>>> fs/xfs/libxfs/xfs_ag.h | 10 ++ >>>> fs/xfs/libxfs/xfs_alloc.c | 9 +- >>>> fs/xfs/xfs_buf.c | 76 +++++++++++ >>>> fs/xfs/xfs_buf.h | 1 + >>>> fs/xfs/xfs_buf_item_recover.c | 37 ++++-- >>>> fs/xfs/xfs_extent_busy.c | 28 ++++ >>>> fs/xfs/xfs_extent_busy.h | 2 + >>>> fs/xfs/xfs_fsops.c | 241 ++++++++++++++++++++++++++++++++-- >>>> fs/xfs/xfs_trans.c | 1 - >>>> 10 files changed, 502 insertions(+), 38 deletions(-) >>>> >>>> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c >>>> index dcaf5683028e..f7fa7f53f2ec 100644 >>>> --- a/fs/xfs/libxfs/xfs_ag.c >>>> +++ b/fs/xfs/libxfs/xfs_ag.c >>>> @@ -193,21 +193,32 @@ xfs_agino_range( >>>> } >>>> /* >>>> - * Update the perag of the previous tail AG if it has been changed during >>>> - * recovery (i.e. recovery of a growfs). >>>> + * This function does the following: >>>> + * - Updates the previous perag tail if prev_agcount < current agcount i.e, the >>>> + * filesystem has grown OR >>>> + * - Updates the current tail AG when prev_agcount > current agcount i.e, the >>>> + * filesystem has shrunk beyond 1 AG OR >>>> + * - Updates the current tail AG when only the last AG was shrunk or grown i.e, >>>> + * prev_agcount == mp->m_sb.sb_agcount. >>>> */ >>>> int >>>> xfs_update_last_ag_size( >>>> struct xfs_mount *mp, >>>> xfs_agnumber_t prev_agcount) >>>> { >>>> - struct xfs_perag *pag = xfs_perag_grab(mp, prev_agcount - 1); >>>> + xfs_agnumber_t agno; >>>> + >>>> + if (prev_agcount >= mp->m_sb.sb_agcount) >>>> + agno = mp->m_sb.sb_agcount - 1; >>>> + else >>>> + agno = prev_agcount - 1; >>>> + >>>> + struct xfs_perag *pag = xfs_perag_grab(mp, agno); >>> Can we please keep the variable declarations at the top? >>> >>> I know I'm an old C89 dinosaur. >> Noted. >>>> if (!pag) >>>> return -EFSCORRUPTED; >>>> - pag_group(pag)->xg_block_count = __xfs_ag_block_count(mp, >>>> - prev_agcount - 1, mp->m_sb.sb_agcount, >>>> - mp->m_sb.sb_dblocks); >>>> + pag_group(pag)->xg_block_count = __xfs_ag_block_count(mp, agno, >>>> + mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks); >>> and please keep the two level indent for the continuation >> Noted. >>>> __xfs_agino_range(mp, pag_group(pag)->xg_block_count, &pag->agino_min, >>>> &pag->agino_max); >>>> xfs_perag_rele(pag); >>>> @@ -290,6 +301,22 @@ xfs_initialize_perag( >>>> return error; >>>> } >>>> +void >>>> +xfs_activate_ag(struct xfs_perag *pag) >>> xfs_perag_activate() ? >>> >>> Functions that act upon an xfs_perag object tend to have "xfs_perag" in >>> their name, not xfs_ag. >> Okay. Noted. >>>> +{ >>>> + ASSERT(!xfs_ag_is_active(pag)); >>>> + init_waitqueue_head(&pag_group(pag)->xg_active_wq); >>>> + atomic_set(&pag_group(pag)->xg_active_ref, 1); >>>> +} >>>> + >>>> +void >>>> +xfs_deactivate_ag(struct xfs_perag *pag) >>>> +{ >>>> + ASSERT(xfs_ag_is_active(pag)); >>>> + xfs_perag_rele(pag); >>>> + wait_event(pag_group(pag)->xg_active_wq, !xfs_ag_is_active(pag)); >>>> +} >>>> + >>>> static int >>>> xfs_get_aghdr_buf( >>>> struct xfs_mount *mp, >>>> @@ -758,7 +785,6 @@ xfs_ag_shrink_space( >>>> xfs_agblock_t aglen; >>>> int error, err2; >>>> - ASSERT(pag_agno(pag) == mp->m_sb.sb_agcount - 1); >>>> error = xfs_ialloc_read_agi(pag, *tpp, 0, &agibp); >>>> if (error) >>>> return error; >>>> @@ -872,6 +898,101 @@ xfs_ag_shrink_space( >>>> return err2; >>>> } >>>> +/* >>>> + * This function checks whether an AG is empty. An AG is eligbible to be >>> eligible >>> 1 >> Noted. >>>> + * removed if it empty. >>>> + */ >>>> +bool >>>> +xfs_ag_is_empty(struct xfs_perag *pag) >>>> +{ >>>> + struct xfs_buf *agfbp = NULL; >>>> + struct xfs_mount *mp = pag_mount(pag); >>>> + bool is_empty = false; >>>> + int error = 0; >>> (More indenting/style problems) >> Noted. I will fix all the style/typos in the next revision. >>>> + >>>> + /* >>>> + * Read the on-disk data structures to get the correct length of the AG. >>>> + * All the AGs have the same length except the last AG. >>>> + */ >>>> + error = xfs_alloc_read_agf(pag, NULL, 0, &agfbp); >>>> + if (!error) { >>>> + struct xfs_agf *agf = agfbp->b_addr; >>>> + /* >>>> + * We don't need to check if log blocks belong here since the >>>> + * log blocks are taken from the number of free blocks, and if >>>> + * the given AG has log blocks, then those many number of >>>> + * blocks will be consumed from the number of free blocks and >>>> + * the AG empty condition will not hold true. >>>> + */ >>>> + if (pag->pagf_freeblks + pag->pagf_flcount + >>>> + mp->m_ag_prealloc_blocks == >>>> + be32_to_cpu(agf->agf_length)) { >>>> + ASSERT(!xfs_ag_contains_log(mp, pag_agno(pag))); >>> Why not just check for this at the top? >>> >>> if (xfs_ag_contains_log(mp, pag_agno(pag))) >>> return false; >> "if (pag->pagf_freeblks + pag->pagf_flcount + mp->m_ag_prealloc_blocks == >> be32_to_cpu(agf->agf_length))" already covers this case. Or are you saying >> that by checking for the logs first saves us some time from calling >> xfs_alloc_read_agf() - if the AG indeed contains log blocks? My motive for >> the ASSERTion is mostly out of paranoia. Do you feel this ASSERTion is >> redundant? > I think I was reading too fast. But maybe the assertion isn't necessary > at all? The free space btrees will not have an entry for the log space, > so we should never be able to delete the log's space anyway. Okay. > > What happens if you want to shrink the fs so that the last block of the > fs is the last block of the log? If we're shrinking such that the last fsblock of the filesystem = last fsblock of the log, then the shrink fails with -ENOSPC. The code that fails is the following (in xfs_ag_shrink_space()):     /* internal log shouldn't also show up in the free space btrees */     error = xfs_alloc_vextent_exact_bno(&args,             xfs_agbno_to_fsb(pag, aglen - delta));     if (!error && args.agbno == NULLAGBLOCK) {         error = -ENOSPC;     } For example, the empty fs I tried has 16 AGs of size 128m (with 4k block size). So total of blocks=524288 and  agcount=16, agsize=32768. From xfs_db, I found out the logstart and logblocks logstart = 262151 and logblocks=16384. So, the log blocks are from 262151 to 278534. Now I tried root@AMARPC-vm: /home/ubuntu$ xfs_growfs -D $(( 262151 + 16384 - 1 )) /mnt1/scratch ... [EXPERIMENTAL] try to shrink unused space 278534, old size is 524288 xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: No space left on device One interesting thing that I noticed is that, the shrink fails for all the fssizes from the last log fsblock to last log fsblock+18 i.e, xfs_growfs -D {278534 to 278551} fails with -ENOSPC (same code in xfs_ag_shrink_space() mentioned above). I am not sure what these extra blocks are allocated for (in an empty fs), maybe some reservation for btree growth or something? Let me figure that out. > >>> error = xfs_alloc_read_agf() >>> >>>> + is_empty = true; >>>> + } >>>> + xfs_buf_relse(agfbp); >>>> + } >>>> + return is_empty; >>>> +} >>>> + >>>> +/* >>>> + * This function removes an entire empty AG. Before removing the struct >>>> + * xfs_perag reference, it removes the associated data structures. Before >>>> + * removing an AG, the caller must ensure that the AG has been deactivated with >>>> + * no active references and it has been fully stabilized on the disk. >>>> + */ >>>> +void >>>> +xfs_shrinkfs_remove_ag(struct xfs_mount *mp, xfs_agnumber_t agno) >>>> +{ >>>> + /* >>>> + * Number of AGs can't be less than 2 >>>> + */ >>>> + ASSERT(agno >= 2); >>>> + struct xfs_group *xg = xa_erase(&mp->m_groups[XG_TYPE_AG].xa, agno); >>>> + struct xfs_perag *cur_pag = to_perag(xg); >>>> + >>>> + ASSERT(!xfs_ag_is_active(cur_pag)); >>>> + /* >>>> + * Since we are freeing the AG, we should clear the perag reservations >>>> + * for the corresponding AGs. >>>> + */ >>>> + xfs_ag_resv_free(cur_pag); >>>> + /* >>>> + * We have already ensured in the AG preparation phase that all intents >>>> + * for the offlined AGs have been resolved. So it safe to free it here. >>>> + */ >>>> + xfs_defer_drain_free(&xg->xg_intents_drain); >>>> + /* >>>> + * We have already ensured in the AG preparation phase that all busy >>>> + * extents for the offlined AGs have been resolved. So it safe to free >>>> + * it here. >>>> + */ >>>> + kfree(xg->xg_busy_extents); >>>> + cancel_delayed_work_sync(&cur_pag->pag_blockgc_work); >>>> + >>>> + /* >>>> + * Remove all the cached buffers for the given AG. >>>> + */ >>>> + xfs_buf_offline_perag_rele_cached(cur_pag); >>>> + /* >>>> + * Now that the cached buffers have been released, remove the >>>> + * cache/hashtable itself. We should not change the order of the buffer >>>> + * removal and cache removal. >>>> + */ >>>> + xfs_buf_cache_destroy(&cur_pag->pag_bcache); >>>> + /* >>>> + * One final assert, before we remove the xg. Since the cached buffers >>>> + * for the offlined AGs are already removed, their passive references >>>> + * should be 0. Also, the active references are 0 too, so no new >>>> + * operation can start and race and get new references. >>>> + */ >>>> + XFS_IS_CORRUPT(mp, atomic_read(&pag_group(cur_pag)->xg_ref) != 0); >>>> + /* >>>> + * Finally free the struct xfs_perag of the AG. >>>> + */ >>>> + kfree_rcu_mightsleep(xg); >>>> +} >>>> + >>>> void >>>> xfs_growfs_get_delta(struct xfs_mount *mp, xfs_rfsblock_t nb, >>>> int64_t *deltap, xfs_agnumber_t *nagcountp) >>>> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h >>>> index 190af11f6941..15886e2b40aa 100644 >>>> --- a/fs/xfs/libxfs/xfs_ag.h >>>> +++ b/fs/xfs/libxfs/xfs_ag.h >>>> @@ -112,6 +112,11 @@ static inline xfs_agnumber_t pag_agno(const struct xfs_perag *pag) >>>> return pag->pag_group.xg_gno; >>>> } >>>> +static inline bool xfs_ag_is_active(struct xfs_perag *pag) >>>> +{ >>>> + return atomic_read(&pag_group(pag)->xg_active_ref) > 0; >>>> +} >>>> + >>>> /* >>>> * Per-AG operational state. These are atomic flag bits. >>>> */ >>>> @@ -140,6 +145,7 @@ void xfs_free_perag_range(struct xfs_mount *mp, xfs_agnumber_t first_agno, >>>> xfs_agnumber_t end_agno); >>>> int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno); >>>> int xfs_update_last_ag_size(struct xfs_mount *mp, xfs_agnumber_t prev_agcount); >>>> +bool xfs_ag_is_empty(struct xfs_perag *pag); >>>> /* Passive AG references */ >>>> static inline struct xfs_perag * >>>> @@ -263,6 +269,9 @@ xfs_ag_contains_log(struct xfs_mount *mp, xfs_agnumber_t agno) >>>> agno == XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart); >>>> } >>>> +void xfs_activate_ag(struct xfs_perag *pag); >>>> +void xfs_deactivate_ag(struct xfs_perag *pag); >>>> + >>>> static inline struct xfs_perag * >>>> xfs_perag_next_wrap( >>>> struct xfs_perag *pag, >>>> @@ -331,6 +340,7 @@ struct aghdr_init_data { >>>> int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id); >>>> int xfs_ag_shrink_space(struct xfs_perag *pag, struct xfs_trans **tpp, >>>> xfs_extlen_t delta); >>>> +void xfs_shrinkfs_remove_ag(struct xfs_mount *mp, xfs_agnumber_t agno); >>>> void >>>> xfs_growfs_get_delta(struct xfs_mount *mp, xfs_rfsblock_t nb, >>>> int64_t *deltap, xfs_agnumber_t *nagcountp); >>>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c >>>> index 000cc7f4a3ce..e16803214223 100644 >>>> --- a/fs/xfs/libxfs/xfs_alloc.c >>>> +++ b/fs/xfs/libxfs/xfs_alloc.c >>>> @@ -3209,11 +3209,12 @@ xfs_validate_ag_length( >>>> if (length != mp->m_sb.sb_agblocks) { >>>> /* >>>> * During growfs, the new last AG can get here before we >>>> - * have updated the superblock. Give it a pass on the seqno >>>> - * check. >>>> + * have updated the superblock. During shrink, the new last AG >>>> + * will be updated and the AGs from newag to old AG will be >>>> + * removed. So seqno here maybe not be equal to >>>> + * mp->m_sb.sb_agcount - 1 since the super block is not yet >>>> + * updated globally. >>>> */ >>>> - if (bp->b_pag && seqno != mp->m_sb.sb_agcount - 1) >>>> - return __this_address; >>>> if (length < XFS_MIN_AG_BLOCKS) >>>> return __this_address; >>>> if (length > mp->m_sb.sb_agblocks) >>>> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c >>>> index ba5bd6031ece..d372f65068a2 100644 >>>> --- a/fs/xfs/xfs_buf.c >>>> +++ b/fs/xfs/xfs_buf.c >>>> @@ -951,6 +951,82 @@ xfs_buf_rele( >>>> xfs_buf_rele_cached(bp); >>>> } >>>> +/* >>>> + * This function populates a list of all the cached buffers of the given AG >>>> + * in the to_be_free list head. >>>> + */ >>>> +static void >>>> +xfs_pag_populate_cached_bufs(struct xfs_perag *pag, >>> Do both of these functions implement a buf_cache invalidation? >> Yes. Invalidation and freeing/removal of the xfs_buf_cache along with the >> cached buffers inside the xfs_buf_cache. >>> This one I think is xfs_buf_cache_grab_all()... >> Okay, I can rename this. >>>> + struct list_head *to_be_freed) >>>> +{ >>>> + struct xfs_buf *bp; >>>> + struct rhashtable_iter iter; >>>> + >>>> + INIT_LIST_HEAD(to_be_freed); >>> Initialize the list where you declare it, please. >> Noted. >>>> + rhashtable_walk_enter(&pag->pag_bcache.bc_hash, &iter); >>>> + do { >>>> + rhashtable_walk_start(&iter); >>>> + while ((bp = rhashtable_walk_next(&iter)) && !IS_ERR(bp)) { >>>> + ASSERT(list_empty(&bp->b_list)); >>>> + ASSERT(list_empty(&bp->b_li_list)); >>>> + list_add_tail(&bp->b_list, to_be_freed); >>>> + } >>>> + rhashtable_walk_stop(&iter); >>>> + } while (cond_resched(), bp == ERR_PTR(-EAGAIN)); >>>> + rhashtable_walk_exit(&iter); >>>> +} >>>> + >>>> +/* >>>> + * This function frees all the cached buffers (struct xfs_buf) associated with >>>> + * the given offline AG. The caller must ensure that the AG which is passed >>>> + * is offline and completely stabilized on the disk. Also, the caller should >>>> + * ensure that all the cached buffers are not queued for any pending i/o >>>> + * i.e, the b_list for all the cached buffers are empty - since we will be using >>>> + * b_list to get list of all the bufs that need to be freed. >>>> + */ >>>> +void >>>> +xfs_buf_offline_perag_rele_cached(struct xfs_perag *pag) >>> ...and this is xfs_buf_cache_invalidate(). >> Okay, I can rename this. I was thinking to somehow have the word "offline" >> in the function name so that it is obvious that we are removing the cached >> bufs ONLY from the offline/deactivated AGs (at least right now). > "offline" is one of those annoying words that's an adjective and a verb. > :) Okay. Sure, I will rename the functions based on the above suggestions. > >>>> +{ >>>> + ASSERT(!xfs_ag_is_active(pag)); >>>> + /* >>>> + * First get the list of buffers we want to free. >>>> + * We need to populate to_be_freed list and cannot directly free >>>> + * the buffers during the hashtable walk. rhashtable_walk_start() takes >>>> + * an RCU and xfs_buf_rele eventually calls xfs_buf_free (for >>>> + * cached buffers). xfs_buf_free() might sleep (depending on the >>>> + * whether the buffer was allocated using vmalloc or kmalloc) and >>>> + * cannot be called within an RCU context. Hence we first populate >>>> + * the buffers within an RCU context and free them outside it. >>>> + */ >>>> + struct list_head to_be_freed; >>>> + struct xfs_buf *bp, *tmp; >>>> + >>>> + xfs_pag_populate_cached_bufs(pag, &to_be_freed); >>>> + list_for_each_entry_safe(bp, tmp, &to_be_freed, b_list) { >>>> + list_del(&bp->b_list); >>>> + spin_lock(&bp->b_lock); >>>> + ASSERT(bp->b_pag == pag); >>>> + ASSERT(!xfs_buf_is_uncached(bp)); >>>> + /* >>>> + * Since we have made sure that this is being called on an >>>> + * AG with active refcount = 0, the b_hold value of any cached >>>> + * buffer should not exceed 1 (i.e, the default value) and hence >>>> + * can be safely removed. Hence, it should also be in an >>>> + * unlocked state. >>>> + */ >>>> + ASSERT(bp->b_hold == 1); >>>> + ASSERT(!xfs_buf_islocked(bp)); >>>> + /* >>>> + * We should set b_lru_ref to 0 so that it gets deleted from >>>> + * the lru during the call to xfs_buf_rele. >>>> + */ >>>> + atomic_set(&bp->b_lru_ref, 0); >>>> + spin_unlock(&bp->b_lock); >>>> + xfs_buf_rele(bp); >>>> + } >>>> +} >>>> + >>>> /* >>>> * Lock a buffer object, if it is not already locked. >>>> * >>>> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h >>>> index 15fc56948346..4c7023a5afb2 100644 >>>> --- a/fs/xfs/xfs_buf.h >>>> +++ b/fs/xfs/xfs_buf.h >>>> @@ -282,6 +282,7 @@ void xfs_buf_hold(struct xfs_buf *bp); >>>> /* Releasing Buffers */ >>>> extern void xfs_buf_rele(struct xfs_buf *); >>>> +void xfs_buf_offline_perag_rele_cached(struct xfs_perag *pag); >>>> /* Locking and Unlocking Buffers */ >>>> extern int xfs_buf_trylock(struct xfs_buf *); >>>> diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c >>>> index d4c5cef5bc43..088a9b046af1 100644 >>>> --- a/fs/xfs/xfs_buf_item_recover.c >>>> +++ b/fs/xfs/xfs_buf_item_recover.c >>>> @@ -737,8 +737,7 @@ xlog_recover_do_primary_sb_buffer( >>>> xfs_sb_from_disk(&mp->m_sb, dsb); >>>> if (mp->m_sb.sb_agcount < orig_agcount) { >>>> - xfs_alert(mp, "Shrinking AG count in log recovery not supported"); >>>> - return -EFSCORRUPTED; >>>> + xfs_warn_experimental(mp, XFS_EXPERIMENTAL_SHRINK); >>>> } >>>> if (mp->m_sb.sb_rgcount < orig_rgcount) { >>>> xfs_warn(mp, >>>> @@ -764,18 +763,28 @@ xlog_recover_do_primary_sb_buffer( >>>> if (error) >>>> return error; >>>> } >>>> - >>>> - /* >>>> - * Initialize the new perags, and also update various block and inode >>>> - * allocator setting based off the number of AGs or total blocks. >>>> - * Because of the latter this also needs to happen if the agcount did >>>> - * not change. >>>> - */ >>>> - error = xfs_initialize_perag(mp, orig_agcount, mp->m_sb.sb_agcount, >>>> - mp->m_sb.sb_dblocks, &mp->m_maxagi); >>>> - if (error) { >>>> - xfs_warn(mp, "Failed recovery per-ag init: %d", error); >>>> - return error; >>>> + if (orig_agcount > mp->m_sb.sb_agcount) { >>>> + /* >>>> + * Remove the old AGs that were removed previously by a growfs >>>> + */ >>>> + xfs_free_perag_range(mp, mp->m_sb.sb_agcount, orig_agcount); >>>> + mp->m_maxagi = xfs_set_inode_alloc(mp, mp->m_sb.sb_agcount); >>>> + mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp); >>>> + } else { >>>> + /* >>>> + * Initialize the new perags, and also the update various block >>>> + * and inode allocator setting based off the number of AGs or >>>> + * total blocks. >>>> + * Because of the latter, this also needs to happen if the >>>> + * agcount did not change. >>>> + */ >>>> + error = xfs_initialize_perag(mp, orig_agcount, >>>> + mp->m_sb.sb_agcount, >>>> + mp->m_sb.sb_dblocks, &mp->m_maxagi); >>>> + if (error) { >>>> + xfs_warn(mp, "Failed recovery per-ag init: %d", error); >>>> + return error; >>>> + } >>>> } >>>> mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); >>>> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c >>>> index da3161572735..1055681648ba 100644 >>>> --- a/fs/xfs/xfs_extent_busy.c >>>> +++ b/fs/xfs/xfs_extent_busy.c >>>> @@ -676,6 +676,34 @@ xfs_extent_busy_wait_all( >>>> xfs_extent_busy_wait_group(rtg_group(rtg)); >>>> } >>>> +/* >>>> + * Similar to xfs_extent_busy_wait_all() - It waits for all the busy extents to >>>> + * get resolved for the range of AGs provided. For now, this function is >>>> + * introduced to be used in online shrink process. Unlike >>>> + * xfs_extent_busy_wait_all(), this takes a passive reference, because this >>>> + * function is expected to be called for the AGs whose active reference has >>>> + * been reduced to 0 i.e, offline AGs. >>>> + * >>>> + * @mp - The xfs mount point >>>> + * @first_agno - The 0 based AG index of the range of AGs from which we will >>>> + * start. >>>> + * @end_agno - The 0 based AG index of the range of AGs from till which we will >>>> + * traverse. >>>> + */ >>>> +void >>>> +xfs_extent_busy_wait_range(struct xfs_mount *mp, xfs_agnumber_t first_agno, >>> Range of what? Blocks? >> Range of AGs, i.e, waiting for the busy extents to get resolved for a range >> of AGs (AGs that we are trying to remove completely). >>>> + xfs_agnumber_t end_agno) >>>> +{ >>> Perhaps this should be called xfs_extent_busy_wait_ags(). >> Okay. >>>> + xfs_agnumber_t agno; >>>> + struct xfs_perag *pag = NULL; >>>> + >>>> + for (agno = end_agno; agno >= first_agno; agno--) { >>>> + pag = xfs_perag_get(mp, agno); >>>> + xfs_extent_busy_wait_group(pag_group(pag)); >>>> + xfs_perag_put(pag); >>>> + } >>>> +} >>>> + >>>> /* >>>> * Callback for list_sort to sort busy extents by the group they reside in. >>>> */ >>>> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h >>>> index f069b04e8ea1..69ae78964828 100644 >>>> --- a/fs/xfs/xfs_extent_busy.h >>>> +++ b/fs/xfs/xfs_extent_busy.h >>>> @@ -57,6 +57,8 @@ bool xfs_extent_busy_trim(struct xfs_group *xg, xfs_extlen_t minlen, >>>> unsigned *busy_gen); >>>> int xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_group *xg, >>>> unsigned busy_gen, uint32_t alloc_flags); >>>> +void xfs_extent_busy_wait_range(struct xfs_mount *mp, xfs_agnumber_t first_agno, >>>> + xfs_agnumber_t end_agno); >>>> void xfs_extent_busy_wait_all(struct xfs_mount *mp); >>>> bool xfs_extent_busy_list_empty(struct xfs_group *xg, unsigned int *busy_gen); >>>> struct xfs_extent_busy_tree *xfs_extent_busy_alloc(void); >>>> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c >>>> index 91da9f733659..d9d33de4e679 100644 >>>> --- a/fs/xfs/xfs_fsops.c >>>> +++ b/fs/xfs/xfs_fsops.c >>>> @@ -83,6 +83,219 @@ xfs_resizefs_init_new_ags( >>>> return error; >>>> } >>>> +/* >>>> + * Get new active references for all the AGs. This might be called when >>>> + * shrinkage process encounters a failure at an intermediate stage after the >>>> + * active references of all the target AGs have become 0. >>>> + */ >>>> +static void >>>> +xfs_shrinkfs_reactivate_ags(struct xfs_mount *mp, xfs_agnumber_t oagcount, >>>> + xfs_agnumber_t nagcount) >>>> +{ >>>> + struct xfs_perag *pag = NULL; >>>> + >>>> + if (nagcount >= oagcount) >>>> + return; >>> When would anyone call xfs_shrinkfs_reactivate_ags with n >= o? >> n >= o is invalid, that is why I am simply returning with a no-op. Or should >> I just have an ASSERTion here ASSERT(nagcount < oagcount);? Which one do you >> prefer? > I think an assert is fine. Okay. > >>>> + for (xfs_agnumber_t agno = oagcount - 1; agno > nagcount - 1; agno--) { >>> (Plz follow XFS coding conventions for consistency with the rest of the >>> codebase.) >> Yes. I will fix all the style related issues in the next version. >>>> + pag = xfs_perag_get(mp, agno); >>>> + xfs_activate_ag(pag); >>>> + xfs_perag_put(pag); >>>> + } >>>> +} >>>> + >>>> +/* >>>> + * The function deactivates or puts the AGs to an offline mode. AG deactivation >>>> + * or AG offlining means that no new operation can be started on that AG. The AG >>>> + * still exists, however no new high level operation (like extent allocation) >>>> + * can be started. In terms of implementation, an AG is taken offline or is >>>> + * deactivated when xg_active_ref of the struct xfs_perag is 0 i.e, the number >>>> + * of active references becomes 0. >>>> + * Since active references act as a form of barrier, so once the active >>>> + * reference of an AG is 0, no new entity can get an active reference and in >>>> + * this way we ensure that once an AG is offline (i.e, active reference count is >>>> + * 0), no one will be able to start a new operation in it unless the active >>>> + * reference count is explicitly set to 1 i.e, the AG is made online/activated. >>>> + */ >>>> +static int >>>> +xfs_shrinkfs_deactivate_ags(struct xfs_mount *mp, xfs_agnumber_t oagcount, >>>> + xfs_agnumber_t nagcount) >>>> +{ >>>> + int error = 0; >>>> + struct xfs_perag *pag = NULL; >>>> + >>>> + if (oagcount <= nagcount) >>>> + return 0; >>>> + /* >>>> + * If we are removing 1 or more entire AGs, we only need to take those >>>> + * AGs offline which we are planning to remove completely. The new tail >>>> + * AG which will be partially shrunk need not be taken offline - since >>> should not be taken offline >> Okay. >>>> + * we will be doing an online operation on them, just like any other >>>> + * high level operation. For complete AG removal, we need to take them >>>> + * offline since we cannot start any new operation on them as they will >>>> + * be removed eventually. >>>> + * >>>> + * However, if the number of blocks that we are trying to remove is >>>> + * an exact multiple of the AG size (in blocks), then the new tail AG >>>> + * will not be shrunk at all. >>>> + */ >>>> + for (xfs_agnumber_t agno = oagcount - 1; agno > nagcount - 1; agno--) { >>> maybe this ought to be a for_each_perag_range_reverse() iterator macro? >> Okay, I can introduce this macro and use that instead in place where I have >> this loop. There are a couple of places I have used this loop. Thank you for >> the suggestion. > > >>>> + pag = xfs_perag_get(mp, agno); >>>> + xfs_deactivate_ag(pag); >>>> + xfs_perag_put(pag); >>>> + } >>>> + /* >>>> + * Now that we have deactivated/offlined the AGs, we need to make sure >>>> + * that all the pending operations are completed and the in-core and >>>> + * the on disk contents are completely in synch. >>>> + */ >>>> + >>>> + /* >>>> + * Wait for all the busy extents to get resolved along with pending trim >>>> + * ops for all the offlined AGs. >>>> + */ >>>> + xfs_extent_busy_wait_range(mp, nagcount, oagcount - 1); >>>> + flush_workqueue(xfs_discard_wq); >>>> + /* >>>> + * We should wait for the log to be empty and all the pending I/Os to >>>> + * be completed so that the AGs are compeletly stabilized before we >>> completely >> Noted. >>>> + * start tearing them down. xfs_log_quiesce() call here ensures that >>>> + * none of the future logged transactions will refer to these AGs >>>> + * during log recovery in case if sudden shutdown/crash happens while >>>> + * we are trying to remove these AGs. >>>> + */ >>>> + error = xfs_log_quiesce(mp); >>> So I guess we're quiescing the log here so that it flushes the log to >>> disk, pushes the AIL so everything in the log is written back to the >>> filesystem, and then covers the log so that after this point, recovery >>> will only see log contents from this point forward -- the log won't >>> updates involving the perags that we just deactivated 30 lines ago? >> Yes, after this point, the recovery code will not see any log/update >> involving the deactivated AGs. >>>> + if (error) >>>> + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount); >>>> + /* >>>> + * Reactivate the log work queue which was deactivated in >>>> + * xfs_log_quiesce >>>> + */ >>>> + xfs_log_work_queue(mp); >>>> + return error; >>>> +} >>>> + >>>> +/* >>>> + * This function does 3 things: >>>> + * 1. Deactivate the AGs i.e, wait for all the active references to come to 0. >>>> + * 2. Checks whether all the AGs that shrink process needs to remove are empty. >>>> + * If at least one of the target AGs is non-empty, shrink fails and >>>> + * xfs_shrinkfs_reactivate_ags() is called. >>>> + * 3. Calculates the total number of fdblocks (free data blocks) that will be >>>> + * removed and stores in id->nfree. >>>> + * Please look into the individual functions for more details and the definition >>>> + * of the terminologies. >>>> + */ >>>> +static int >>>> +xfs_shrinkfs_prepare_ags(struct xfs_mount *mp, xfs_agnumber_t oagcount, >>>> + xfs_agnumber_t nagcount, struct aghdr_init_data *id) >>>> +{ >>>> + ASSERT(nagcount < oagcount); >>>> + struct xfs_perag *pag = NULL; >>>> + xfs_agnumber_t agno; >>>> + int error = 0; >>>> + /* >>>> + * Deactivating/offlining the AGs i.e waiting for the active references >>>> + * to come down to 0. >>>> + */ >>>> + error = xfs_shrinkfs_deactivate_ags(mp, oagcount, nagcount); >>>> + if (error) >>>> + return error; >>>> + /* >>>> + * At this point the AGs have been deactivated/offlined and the in-core >>>> + * and the on-disk are synch. So now we need to check whether all the >>>> + * AGs that we are trying to remove/delete are empty. Since we are not >>>> + * supporting partial shrink success (i.e, the entire requested size >>>> + * will be removed or none), we will bail out with a failure code even >>>> + * if 1 AG is non-empty. >>>> + */ >>>> + for (agno = oagcount - 1; agno > nagcount - 1; agno--) { >>>> + pag = xfs_perag_get(mp, agno); >>>> + if (!xfs_ag_is_empty(pag)) { >>>> + /* Error out even if one AG is non-empty */ >>>> + error = -ENOTEMPTY; >>>> + xfs_perag_put(pag); >>>> + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount); >>>> + return error; >>>> + } >>>> + /* >>>> + * Since these are removed, these free blocks should also be >>>> + * subtracted from the total list of free blocks. >>>> + */ >>>> + id->nfree += (pag->pagf_freeblks + pag->pagf_flcount); >>>> + xfs_perag_put(pag); >>>> + >>>> + } >>>> + return 0; >>>> +} >>>> + >>>> +/* >>>> + * This function does the job of fully removing the blocks and empty AGs ( >>>> + * depending of the values of oagcount and nagcount). By removal it means, >>>> + * removal of all the perag data structures, other data structures associated >>>> + * with it and all the perag cached buffers (when AGs are removed). Once this >>>> + * function succeeds, the AGs/blocks will no longer exist. >>>> + * The overall steps are as follows (details are in the function): >>>> + * - calculate the number of blocks that will be removed from the new tail AG >>>> + * i.e, the AG that will be shrunk partially. >>>> + * - call xfs_shrinkfs_remove_ag() that removes the perag cached buffers, >>>> + * then frees the perag reservation, other associated datastructures and >>>> + * finally the in-memory perag group instance. >>>> + */ >>>> +static int >>>> +xfs_shrinkfs_remove_ags(struct xfs_mount *mp, struct xfs_trans **tp, >>>> + xfs_agnumber_t oagcount, xfs_agnumber_t nagcount, >>>> + int64_t delta_rem, xfs_agnumber_t *nagmax) >>>> +{ >>>> + xfs_agnumber_t agno; >>>> + int error = 0; >>>> + struct xfs_perag *cur_pag = NULL; >>>> + >>>> + /* >>>> + * This loop is calculating the number of blocks that needs to be >>>> + * removed from the new tail AG. If delta_rem is 0 after the loop exits, >>>> + * then it means that the number of blocks we want to remove is a >>>> + * multiple of AG size (in blocks). >>>> + */ >>>> + for (agno = oagcount - 1; agno > nagcount - 1; agno--) { >>>> + cur_pag = xfs_perag_get(mp, agno); >>>> + delta_rem -= xfs_ag_block_count(mp, agno); >>>> + xfs_perag_put(cur_pag); >>>> + } >>>> + /* >>>> + * We are first removing blocks from the AG that will form the new tail >>>> + * AG. The reason is that, if we encounter an error here, we can simply >>>> + * reactivate the AGs (by calling xfs_shrinkfs_reactivate_ags()). >>>> + * Removal of complete empty AGs always succeed anyway. However if we >>>> + * remove the empty AGs first (which will succeed) and then the new >>>> + * last AG shrink fails, then we will again have to re-initialize the >>>> + * removed AGs. Hence the former approach seems more efficient to me. >>>> + */ >>>> + if (delta_rem) { >>>> + /* >>>> + * Remove delta_rem blocks from the AG that will form the new >>>> + * tail AG after the AGs are removed. If the number of blocks to >>>> + * be removed is a multiple of AG size, then nothing is done >>>> + * here. >>>> + */ >>>> + cur_pag = xfs_perag_get(mp, nagcount - 1); >>>> + error = xfs_ag_shrink_space(cur_pag, tp, delta_rem); >>>> + xfs_perag_put(cur_pag); >>>> + if (error) { >>>> + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount); >>>> + return error; >>>> + } >>> How do we know that the last @delta_rem fsblocks in the new tail AG are >>> free? >> The last delta_rem blocks in the tail/new-tail are removed by >> xfs_ag_shrink_space(). The xfs_ag_shrink_space() takes care of this. If I >> understand it correctly, the following check in xfs_ag_shrink_space() makes >> sure that the last delta_rem blocks in the tail/new-tail AG are >> free/unallocated: >> >>     error = xfs_alloc_vextent_exact_bno(&args, >>             xfs_agbno_to_fsb(pag, aglen - delta)); >>     if (!error && args.agbno == NULLAGBLOCK) >>         error = -ENOSPC; >> if we are not able to allocate "aglen - delta"th block(AG relative fsblock >> number) in the pag passed to it, then this function fails and the overall >> shrink fails. Do you agree with this? > Ah yes. Thanks for doing the research. :) Thank you. -- > > --D > >>>> + } >>>> + /* >>>> + * Now, in this final step we remove the perag instance and the >>>> + * associated datastructures and cached buffers. This fully removes the >>>> + * AG. >>>> + */ >>>> + for (agno = oagcount - 1; agno > nagcount - 1; agno--) >>>> + xfs_shrinkfs_remove_ag(mp, agno); >>>> + *nagmax = xfs_set_inode_alloc(mp, nagcount); >>>> + return error; >>>> +} >>>> + >>>> /* >>>> * growfs operations >>>> */ >>>> @@ -101,7 +314,6 @@ xfs_growfs_data_private( >>>> bool lastag_extended = false; >>>> struct xfs_trans *tp; >>>> struct aghdr_init_data id = {}; >>>> - struct xfs_perag *last_pag; >>>> error = xfs_sb_validate_fsb_count(&mp->m_sb, nb); >>>> if (error) >>>> @@ -122,7 +334,6 @@ xfs_growfs_data_private( >>>> if (error) >>>> return error; >>>> xfs_growfs_get_delta(mp, nb, &delta, &nagcount); >>>> - >>>> /* >>>> * Reject filesystems with a single AG because they are not >>>> * supported, and reject a shrink operation that would cause a >>>> @@ -135,9 +346,11 @@ xfs_growfs_data_private( >>>> if (delta == 0) >>>> return 0; >>>> - /* TODO: shrinking the entire AGs hasn't yet completed */ >>>> - if (nagcount < oagcount) >>>> - return -EINVAL; >>>> + if (nagcount < oagcount) { >>>> + error = xfs_shrinkfs_prepare_ags(mp, oagcount, nagcount, &id); >>>> + if (error) >>>> + return error; >>>> + } >>>> /* allocate the new per-ag structures */ >>>> error = xfs_initialize_perag(mp, oagcount, nagcount, nb, &nagimax); >>>> @@ -154,15 +367,16 @@ xfs_growfs_data_private( >>>> if (error) >>>> goto out_free_unused_perag; >>>> - last_pag = xfs_perag_get(mp, oagcount - 1); >>>> if (delta > 0) { >>>> + struct xfs_perag *last_pag = xfs_perag_get(mp, oagcount - 1); >>> Blank line between declarations and code please. >> Noted. >> >> --NR >> >>> --D >>> >>>> error = xfs_resizefs_init_new_ags(tp, &id, oagcount, nagcount, >>>> delta, last_pag, &lastag_extended); >>>> + xfs_perag_put(last_pag); >>>> } else { >>>> xfs_warn_experimental(mp, XFS_EXPERIMENTAL_SHRINK); >>>> - error = xfs_ag_shrink_space(last_pag, &tp, -delta); >>>> + error = xfs_shrinkfs_remove_ags(mp, &tp, oagcount, nagcount, >>>> + -delta, &nagimax); >>>> } >>>> - xfs_perag_put(last_pag); >>>> if (error) >>>> goto out_trans_cancel; >>>> @@ -171,12 +385,14 @@ xfs_growfs_data_private( >>>> * seen by the rest of the world until the transaction commit applies >>>> * them atomically to the superblock. >>>> */ >>>> - if (nagcount > oagcount) >>>> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount); >>>> + if (nagcount != oagcount) >>>> + xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, >>>> + (int64_t)nagcount - (int64_t)oagcount); >>>> if (delta) >>>> xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta); >>>> if (id.nfree) >>>> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree); >>>> + xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, >>>> + delta > 0 ? id.nfree : (int64_t)-id.nfree); >>>> /* >>>> * Sync sb counters now to reflect the updated values. This is >>>> @@ -190,10 +406,11 @@ xfs_growfs_data_private( >>>> error = xfs_trans_commit(tp); >>>> if (error) >>>> return error; >>>> - >>>> /* New allocation groups fully initialized, so update mount struct */ >>>> if (nagimax) >>>> mp->m_maxagi = nagimax; >>>> + if (nagcount < oagcount) >>>> + mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp); >>>> xfs_set_low_space_thresholds(mp); >>>> mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); >>>> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c >>>> index b4a07af513ba..e94d96db5383 100644 >>>> --- a/fs/xfs/xfs_trans.c >>>> +++ b/fs/xfs/xfs_trans.c >>>> @@ -438,7 +438,6 @@ xfs_trans_mod_sb( >>>> tp->t_dblocks_delta += delta; >>>> break; >>>> case XFS_TRANS_SB_AGCOUNT: >>>> - ASSERT(delta > 0); >>>> tp->t_agcount_delta += delta; >>>> break; >>>> case XFS_TRANS_SB_IMAXPCT: >>>> -- >>>> 2.43.5 >>>> >>>> >> -- >> Nirjhar Roy >> Linux Kernel Developer >> IBM, Bangalore >> >> -- Nirjhar Roy Linux Kernel Developer IBM, Bangalore