From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-14.0 required=3.0 tests=BAYES_00,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6D544C433ED for ; Mon, 5 Apr 2021 23:02:52 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id E27F1613BC for ; Mon, 5 Apr 2021 23:02:51 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E27F1613BC Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 6F2AA6B0082; Mon, 5 Apr 2021 19:02:51 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6C9136B0083; Mon, 5 Apr 2021 19:02:51 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 56D7A6B0085; Mon, 5 Apr 2021 19:02:51 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0008.hostedemail.com [216.40.44.8]) by kanga.kvack.org (Postfix) with ESMTP id 3B7C66B0082 for ; Mon, 5 Apr 2021 19:02:51 -0400 (EDT) Received: from smtpin37.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id F2C5D180ACEE6 for ; Mon, 5 Apr 2021 23:02:50 +0000 (UTC) X-FDA: 77999840100.37.25B6A1B Received: from mail-io1-f45.google.com (mail-io1-f45.google.com [209.85.166.45]) by imf10.hostedemail.com (Postfix) with ESMTP id 1A35C40002CE for ; Mon, 5 Apr 2021 23:02:47 +0000 (UTC) Received: by mail-io1-f45.google.com with SMTP id e8so13552387iok.5 for ; Mon, 05 Apr 2021 16:02:50 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=R7FAxrpr5NVcCsUfbE21LFT03ax27o1Pts1X3zv3PHI=; b=j3v1yi3CTeewo38OW0SHWrBfGFCFYEbz0QqGpfq7HgX0HKbCqGXsTACMaejSvZTl6Y AgEzDl0C+O6ZuI9wL1LhMr0ZfCl8KBKUBT8mnLodu6r2TKd7MoQKHKHeG5j/9XvLCRNQ 8KupsJiCOIpz28qZe4dKWYrqes52asJfkqdu4WElyjc37yfBYCrhJsBsSgZZ3APM14Fw q+yyBTjmZ59thuefXUv3enZtvdEUD8j++3ik88P3jnECDAbn3stw3Ov0zaZdXlaFVZzd SpPjkCntUP9NI/dUMF4oOxdx/vu8o5wOziXiNjX0LEKJGpAurW9yJLBBDr3mK5SZl6Sr 5F9w== X-Gm-Message-State: AOAM530ZTbUD3y3yo4AugrNoF81rPUxAjzpHwOc8VcwHQFkzatAtmF8y HBHFdpqALYrWqSqQEXgK5EE= X-Google-Smtp-Source: ABdhPJyoW1Hmg2MUGhFbDAFEmuSzWZ+H6IOY5NLkfVnQQpGZX4LCUM4RTkrRuD++NqTWVXlR38d7sA== X-Received: by 2002:a05:6602:228f:: with SMTP id d15mr20384037iod.141.1617663769961; Mon, 05 Apr 2021 16:02:49 -0700 (PDT) Received: from google.com (243.199.238.35.bc.googleusercontent.com. [35.238.199.243]) by smtp.gmail.com with ESMTPSA id x8sm12457180iov.7.2021.04.05.16.02.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 05 Apr 2021 16:02:49 -0700 (PDT) Date: Mon, 5 Apr 2021 23:02:48 +0000 From: Dennis Zhou To: Roman Gushchin Cc: Tejun Heo , Christoph Lameter , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v1 5/5] percpu: implement partial chunk depopulation Message-ID: References: <20210401214301.1689099-1-guro@fb.com> <20210401214301.1689099-6-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210401214301.1689099-6-guro@fb.com> X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 1A35C40002CE X-Stat-Signature: nhn8mmcdsoagiee9354snnr615hqex49 Received-SPF: none (gmail.com>: No applicable sender policy available) receiver=imf10; identity=mailfrom; envelope-from=""; helo=mail-io1-f45.google.com; client-ip=209.85.166.45 X-HE-DKIM-Result: none/none X-HE-Tag: 1617663767-532464 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Apr 01, 2021 at 02:43:01PM -0700, Roman Gushchin wrote: > This patch implements partial depopulation of percpu chunks. > > As now, a chunk can be depopulated only as a part of the final > destruction, if there are no more outstanding allocations. However > to minimize a memory waste it might be useful to depopulate a > partially filed chunk, if a small number of outstanding allocations > prevents the chunk from being fully reclaimed. > > This patch implements the following depopulation process: it scans > over the chunk pages, looks for a range of empty and populated pages > and performs the depopulation. To avoid races with new allocations, > the chunk is previously isolated. After the depopulation the chunk is > returned to the original slot (but is appended to the tail of the list > to minimize the chances of population). > > Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(), > the chunk can be concurrently moved to a different slot. To prevent > this, bool chunk->isolated flag is introduced. If set, the chunk can't > be moved to a different slot. > > The depopulation is scheduled on the free path. Is the chunk: > 1) has more than 1/8 of total pages free and populated > 2) the system has enough free percpu pages aside of this chunk > 3) isn't the reserved chunk > 4) isn't the first chunk > 5) isn't entirely free > it's a good target for depopulation. > > If so, the chunk is moved to a special pcpu_depopulate_list, > chunk->isolate flag is set and the async balancing is scheduled. > > The async balancing moves pcpu_depopulate_list to a local list > (because pcpu_depopulate_list can be changed when pcpu_lock is > releases), and then tries to depopulate each chunk. Successfully > or not, at the end all chunks are returned to appropriate slots > and their isolated flags are cleared. > > Many thanks to Dennis Zhou for his great ideas and a very constructive > discussion which led to many improvements in this patchset! > > Signed-off-by: Roman Gushchin > --- > mm/percpu-internal.h | 1 + > mm/percpu.c | 101 ++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 100 insertions(+), 2 deletions(-) > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h > index 095d7eaa0db4..ff318752915d 100644 > --- a/mm/percpu-internal.h > +++ b/mm/percpu-internal.h > @@ -67,6 +67,7 @@ struct pcpu_chunk { > > void *data; /* chunk data */ > bool immutable; /* no [de]population allowed */ > + bool isolated; /* isolated from chunk slot lists */ > int start_offset; /* the overlap with the previous > region to have a page aligned > base_addr */ > diff --git a/mm/percpu.c b/mm/percpu.c > index e20119668c42..dae0b870e10a 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -181,6 +181,12 @@ static LIST_HEAD(pcpu_map_extend_chunks); > */ > int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES]; > > +/* > + * List of chunks with a lot of free pages. Used to depopulate them > + * asynchronously. > + */ > +static LIST_HEAD(pcpu_depopulate_list); > + Now that pcpu_nr_empty_pop_pages is per chunk_type I think the depopulate_list should be per chunk_type. > /* > * The number of populated pages in use by the allocator, protected by > * pcpu_lock. This number is kept per a unit per chunk (i.e. when a page gets > @@ -542,7 +548,7 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) > { > int nslot = pcpu_chunk_slot(chunk); > > - if (oslot != nslot) > + if (!chunk->isolated && oslot != nslot) > __pcpu_chunk_move(chunk, nslot, oslot < nslot); > } > > @@ -2048,6 +2054,82 @@ static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop) > } > } > > +/** > + * pcpu_shrink_populated - scan chunks and release unused pages to the system > + * @type: chunk type > + * > + * Scan over all chunks, find those marked with the depopulate flag and > + * try to release unused pages to the system. On every attempt clear the > + * chunk's depopulate flag to avoid wasting CPU by scanning the same > + * chunk again and again. > + */ There no longer is a depopulate flag. > +static void pcpu_shrink_populated(enum pcpu_chunk_type type) > +{ > + struct pcpu_block_md *block; > + struct pcpu_chunk *chunk, *tmp; > + LIST_HEAD(to_depopulate); > + int i, start; > + > + spin_lock_irq(&pcpu_lock); > + > + list_splice_init(&pcpu_depopulate_list, &to_depopulate); > + > + list_for_each_entry_safe(chunk, tmp, &to_depopulate, list) { > + WARN_ON(chunk->immutable); > + > + for (i = 0, start = -1; i < chunk->nr_pages; i++) { > + /* > + * If the chunk has no empty pages or > + * we're short on empty pages in general, > + * just put the chunk back into the original slot. > + */ > + if (!chunk->nr_empty_pop_pages || > + pcpu_nr_empty_pop_pages[type] < > + PCPU_EMPTY_POP_PAGES_HIGH) > + break; This isn't ideal because if we do drop below PCPU_EMPTY_POP_PAGES_HIGH because of the next deallocation range, then we're leaving the region that's going to get allocated next unpopulated and the populated pages stranded later on. See below for more discussion. > + > + /* > + * If the page is empty and populated, start or > + * extend the [start, i) range. > + */ > + block = chunk->md_blocks + i; > + if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS && > + test_bit(i, chunk->populated)) { > + if (start == -1) > + start = i; > + continue; > + } > + > + /* > + * Otherwise check if there is an active range, > + * and if yes, depopulate it. > + */ > + if (start == -1) > + continue; > + > + spin_unlock_irq(&pcpu_lock); > + pcpu_depopulate_chunk(chunk, start, i); > + cond_resched(); > + spin_lock_irq(&pcpu_lock); > + > + pcpu_chunk_depopulated(chunk, start, i); > + > + /* > + * Reset the range and continue. > + */ > + start = -1; > + } > + > + /* > + * Return the chunk to the corresponding slot. > + */ > + chunk->isolated = false; > + pcpu_chunk_relocate(chunk, -1); > + } > + > + spin_unlock_irq(&pcpu_lock); > +} > + > /** > * pcpu_balance_populated - manage the amount of populated pages > * @type: chunk type > @@ -2078,6 +2160,8 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type) > } else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) { > nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type]; > pcpu_grow_populated(type, nr_to_pop); > + } else if (!list_empty(&pcpu_depopulate_list)) { > + pcpu_shrink_populated(type); > } > } > > @@ -2135,7 +2219,12 @@ void free_percpu(void __percpu *ptr) > > pcpu_memcg_free_hook(chunk, off, size); > > - /* if there are more than one fully free chunks, wake up grim reaper */ > + /* > + * If there are more than one fully free chunks, wake up grim reaper. > + * Otherwise if at least 1/8 of its pages are empty and there is no > + * system-wide shortage of empty pages aside from this chunk, isolate > + * the chunk and schedule an async depopulation. > + */ > if (chunk->free_bytes == pcpu_unit_size) { > struct pcpu_chunk *pos; > > @@ -2144,6 +2233,14 @@ void free_percpu(void __percpu *ptr) > need_balance = true; > break; > } > + } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk && > + !chunk->isolated && > + chunk->nr_empty_pop_pages >= chunk->nr_pages / 8 && > + pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] > > + PCPU_EMPTY_POP_PAGES_HIGH + chunk->nr_empty_pop_pages) { > + list_move(&chunk->list, &pcpu_depopulate_list); I'm missing something here quite possibly. Right now, when we free, we're not doing anything to place the floating empty pages in a particular chunk (such as the first chunk in slot[PAGE_SIZE]). If the ordering is bad, we can end up leaving the float pages in some random chunk and because we scan forwards, they'll be populated pages at the end of a chunk. Not exactly the most useful. I wonder if we should free everything in the depopulate list and then scan up and repopulate a # of pages scanning up from slot(PAGE_SIZE)? This would add additional churn though. If not we need to switch the direction of freeing. So questions: 1. How do we keep the float pages in a way that they're most likely to be found next by the allocator? 2. Do we need to change the direction of freeing or change the accounting above? > + chunk->isolated = true; > + need_balance = true; > } > > trace_percpu_free_percpu(chunk->base_addr, off, ptr); > -- > 2.30.2 >