From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 79B39C28CC3 for ; Tue, 4 Jun 2019 13:16:46 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 24CC32484F for ; Tue, 4 Jun 2019 13:16:46 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 24CC32484F Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id B63F36B0010; Tue, 4 Jun 2019 09:16:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B137D6B026B; Tue, 4 Jun 2019 09:16:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9B5646B026C; Tue, 4 Jun 2019 09:16:45 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from mail-oi1-f199.google.com (mail-oi1-f199.google.com [209.85.167.199]) by kanga.kvack.org (Postfix) with ESMTP id 630806B0010 for ; Tue, 4 Jun 2019 09:16:45 -0400 (EDT) Received: by mail-oi1-f199.google.com with SMTP id v4so1892499oia.13 for ; Tue, 04 Jun 2019 06:16:45 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-original-authentication-results:x-gm-message-state:from:to:cc :subject:references:organization:message-id:date:user-agent :mime-version:in-reply-to; bh=7TGVeDBtvYHOwGzNLDOD60JXnfc15cSfy7orKJmBVsQ=; b=pST+Y9dmVOvSuzNYQOCzbHsGHD/Hw+4RTwEMyeCXfsIHi4OO+RMXJ/AYgpUTey0Eq2 mjvR1MgUDYHurOeTwz+TNAv9gzR4/4/jN00PhjuhnSIMWoRG0nNP7IbGcvXnRIuABoyp 2xn3D3ltZmTHGv6HR2ZujBS5EoJReNrFBw8tO7+iC5i+FtUBYndDeeGz9KJRuK32gVHQ whbLiyOPBCBJ6n0NJBPBy9EoXrTU3olEBtOX2b3I2HnjXq+nDdI/wv3be4OmoINAZMgJ hxnSBSM3jgIk5bauHdwYHjvfD+DJiKMq/1V6+pIn3qZGsP/j7GA/9A+XtjhO3YgYibN7 v5Zg== X-Original-Authentication-Results: mx.google.com; spf=pass (google.com: domain of nitesh@redhat.com designates 209.132.183.28 as permitted sender) smtp.mailfrom=nitesh@redhat.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com X-Gm-Message-State: APjAAAUzBw9TYG84BpWRIVRJ9meq5to97TOIfl6OzrSIMlltXJJsH0yR HOuGujNmpapAepYez/TAsZDbELonLXbcK2ukx9Kt9IOoaua9253wCTdi7ncKHD9vUaUzoTHzNgP UUhWDTn7EBBcjSG7n8Zb8FbGPKgCmk9YAKL4sfZGRLcfiUtcyItzXwyfwqEVsOE1+4A== X-Received: by 2002:a9d:3c5:: with SMTP id f63mr3748139otf.210.1559654205068; Tue, 04 Jun 2019 06:16:45 -0700 (PDT) X-Google-Smtp-Source: APXvYqyhUX4SXsheFFK6yELbHij4UmtiGpjwoJi1fKQDkf16IflGsS7nlhCAaH3qtenG+9J23OE5 X-Received: by 2002:a9d:3c5:: with SMTP id f63mr3748046otf.210.1559654203405; Tue, 04 Jun 2019 06:16:43 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1559654203; cv=none; d=google.com; s=arc-20160816; b=Pu3Qoik6lCpQU9l+pD8l8+H+JUkvHBxzwNcE1PYutDHwiwqaLfVsqxsIL3Tn0fnHTY /1FGWYulbqrzFRHcDZPyh+EUJcEBlgeKGkSbQTmWPW+bJmKrSdk29st1mzyskfnmV4Xj +Q/MGTlPkZXPcfMhjVBqSyHXnbG+HtOtkpbJ66afWYf7Yztnm1ICLwfs+AIoyXHzKFRE GSZORUH7GQFD5chmRRRpunyt+aCFGfDaV5mLWqpDe6vwezDerjDLdZJ2nE8B5Ol2vjxf QNlG2wKSDMNYRhjxocb/zBJDzq2eAfkIDqEH5XNvmuMqaxQmsh6FlT7RGNrMnU91fZ26 raJw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:mime-version:user-agent:date:message-id:organization :references:subject:cc:to:from; bh=7TGVeDBtvYHOwGzNLDOD60JXnfc15cSfy7orKJmBVsQ=; b=dogFZJcd3iUOQfbmKkyi9xN+dpYTfsAhcxzu0KtxeO3p2LMcIHFJ3m1zEVZ6/vsDcV 66MuW0ARUqN10HXP1mLs10z/dcqpLUui5Dle8p1aig2XEuhiyNlox/I80iaGxnYHLFsl Shxv72bkgZVW0HLi9Foc4yWlPMTgkDl9XmOt2l6gNoYatC9SSWgoPCNZZBWkWr4B1j+B j0bypaf/zRHpqGqwBWvH2KgymvIdCE3tyZGSX4/mAz3dmh5+2caitwDE6vZaFW0wK9BK 8NvSFSpUVBdN+XCHXpvSVcs9/jkEoKtVKDDx/+nGUV07pX2pZDuzD3MYZDnftTq3zLzV 9rsA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of nitesh@redhat.com designates 209.132.183.28 as permitted sender) smtp.mailfrom=nitesh@redhat.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id m6si2321162otc.76.2019.06.04.06.16.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 04 Jun 2019 06:16:43 -0700 (PDT) Received-SPF: pass (google.com: domain of nitesh@redhat.com designates 209.132.183.28 as permitted sender) client-ip=209.132.183.28; Authentication-Results: mx.google.com; spf=pass (google.com: domain of nitesh@redhat.com designates 209.132.183.28 as permitted sender) smtp.mailfrom=nitesh@redhat.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id B656543AA3; Tue, 4 Jun 2019 13:16:42 +0000 (UTC) Received: from [10.40.205.182] (unknown [10.40.205.182]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 651675D9D2; Tue, 4 Jun 2019 13:16:19 +0000 (UTC) From: Nitesh Narayan Lal To: David Hildenbrand Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, pbonzini@redhat.com, lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, yang.zhang.wz@gmail.com, riel@surriel.com, mst@redhat.com, dodgen@google.com, konrad.wilk@oracle.com, dhildenb@redhat.com, aarcange@redhat.com, alexander.duyck@gmail.com Subject: Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure References: <20190603170306.49099-1-nitesh@redhat.com> <20190603170306.49099-2-nitesh@redhat.com> Organization: Red Hat Inc, Message-ID: Date: Tue, 4 Jun 2019 09:16:15 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1 MIME-Version: 1.0 In-Reply-To: Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="hun85KKfyjW7Qo3FnHVFXjLAU0BGPerX8" X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.27]); Tue, 04 Jun 2019 13:16:42 +0000 (UTC) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --hun85KKfyjW7Qo3FnHVFXjLAU0BGPerX8 Content-Type: multipart/mixed; boundary="vnK5kVg8UEY0Uit0GfVEPGNN08aRZ9XVd"; protected-headers="v1" From: Nitesh Narayan Lal To: David Hildenbrand Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, pbonzini@redhat.com, lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, yang.zhang.wz@gmail.com, riel@surriel.com, mst@redhat.com, dodgen@google.com, konrad.wilk@oracle.com, dhildenb@redhat.com, aarcange@redhat.com, alexander.duyck@gmail.com Message-ID: Subject: Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure --vnK5kVg8UEY0Uit0GfVEPGNN08aRZ9XVd Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 6/3/19 3:57 PM, David Hildenbrand wrote: > On 03.06.19 19:03, Nitesh Narayan Lal wrote: >> This patch introduces the core infrastructure for free page hinting in= >> virtual environments. It enables the kernel to track the free pages wh= ich >> can be reported to its hypervisor so that the hypervisor could >> free and reuse that memory as per its requirement. >> >> While the pages are getting processed in the hypervisor (e.g., >> via MADV_FREE), the guest must not use them, otherwise, data loss >> would be possible. To avoid such a situation, these pages are >> temporarily removed from the buddy. The amount of pages removed >> temporarily from the buddy is governed by the backend(virtio-balloon >> in our case). >> >> To efficiently identify free pages that can to be hinted to the >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big >> chunks are reported to the hypervisor - especially, to not break up TH= P >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bit= s >> in the bitmap are an indication whether a page *might* be free, not a >> guarantee. A new hook after buddy merging sets the bits. >> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue >> asynchronously processes the bitmaps, trying to isolate and report pag= es >> that are still free. The backend (virtio-balloon) is responsible for >> reporting these batched pages to the host synchronously. Once reportin= g/ >> freeing is complete, isolated pages are returned back to the buddy. >> >> There are still various things to look into (e.g., memory hotplug, mor= e >> efficient locking, possible races when disabling). >> >> Signed-off-by: Nitesh Narayan Lal >> --- >> drivers/virtio/Kconfig | 1 + >> include/linux/page_hinting.h | 46 +++++++ >> mm/Kconfig | 6 + >> mm/Makefile | 2 + >> mm/page_alloc.c | 17 +-- >> mm/page_hinting.c | 236 ++++++++++++++++++++++++++++++++++= + >> 6 files changed, 301 insertions(+), 7 deletions(-) >> create mode 100644 include/linux/page_hinting.h >> create mode 100644 mm/page_hinting.c >> >> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig >> index 35897649c24f..5a96b7a2ed1e 100644 >> --- a/drivers/virtio/Kconfig >> +++ b/drivers/virtio/Kconfig >> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON >> tristate "Virtio balloon driver" >> depends on VIRTIO >> select MEMORY_BALLOON >> + select PAGE_HINTING >> ---help--- >> This driver supports increasing and decreasing the amount >> of memory within a KVM guest. >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting= =2Eh >> new file mode 100644 >> index 000000000000..e65188fe1e6b >> --- /dev/null >> +++ b/include/linux/page_hinting.h >> @@ -0,0 +1,46 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#ifndef _LINUX_PAGE_HINTING_H >> +#define _LINUX_PAGE_HINTING_H >> + >> +/* >> + * Minimum page order required for a page to be hinted to the host. >> + */ >> +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) >> + >> +/* >> + * struct page_hinting_cb: holds the callbacks to store, report and c= leanup >> + * isolated pages. >> + * @prepare: Callback responsible for allocating an array to hold >> + * the isolated pages. >> + * @hint_pages: Callback which reports the isolated pages synchornou= sly >> + * to the host. >> + * @cleanup: Callback to free the the array used for reporting the >> + * isolated pages. >> + * @max_pages: Maxmimum pages that are going to be hinted to the hos= t >> + * at a time of granularity >=3D PAGE_HINTING_MIN_ORDER. >> + */ >> +struct page_hinting_cb { >> + int (*prepare)(void); >> + void (*hint_pages)(struct list_head *list); >> + void (*cleanup)(void); >> + int max_pages; > If we allocate the array in virtio-balloon differently (e.g. similar to= > bulk inflation/deflation of pfn's right now), we can most probably get > rid of prepare() and cleanup(), simplifying the code further. Yeap, as Alexander suggested. I will go for static allocation and remove this prepare() and cleanup(). >> +}; >> + >> +#ifdef CONFIG_PAGE_HINTING >> +void page_hinting_enqueue(struct page *page, int order); >> +void page_hinting_enable(const struct page_hinting_cb *cb); >> +void page_hinting_disable(void); >> +#else >> +static inline void page_hinting_enqueue(struct page *page, int order)= >> +{ >> +} >> + >> +static inline void page_hinting_enable(struct page_hinting_cb *cb) >> +{ >> +} >> + >> +static inline void page_hinting_disable(void) >> +{ >> +} >> +#endif >> +#endif /* _LINUX_PAGE_HINTING_H */ >> diff --git a/mm/Kconfig b/mm/Kconfig >> index ee8d1f311858..177d858de758 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -764,4 +764,10 @@ config GUP_BENCHMARK >> config ARCH_HAS_PTE_SPECIAL >> bool >> =20 >> +# PAGE_HINTING will allow the guest to report the free pages to the >> +# host in regular interval of time. >> +config PAGE_HINTING >> + bool >> + def_bool n >> + depends on X86_64 >> endmenu >> diff --git a/mm/Makefile b/mm/Makefile >> index ac5e5ba78874..bec456dfee34 100644 >> --- a/mm/Makefile >> +++ b/mm/Makefile >> @@ -41,6 +41,7 @@ obj-y :=3D filemap.o mempool.o oom_kill.o fadvise.= o \ >> interval_tree.o list_lru.o workingset.o \ >> debug.o $(mmu-y) >> =20 >> + >> # Give 'page_alloc' its own module-parameter namespace >> page-alloc-y :=3D page_alloc.o >> page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) +=3D shuffle.o >> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD) +=3D z3fold.o >> obj-$(CONFIG_GENERIC_EARLY_IOREMAP) +=3D early_ioremap.o >> obj-$(CONFIG_CMA) +=3D cma.o >> obj-$(CONFIG_MEMORY_BALLOON) +=3D balloon_compaction.o >> +obj-$(CONFIG_PAGE_HINTING) +=3D page_hinting.o >> obj-$(CONFIG_PAGE_EXTENSION) +=3D page_ext.o >> obj-$(CONFIG_CMA_DEBUGFS) +=3D cma_debug.o >> obj-$(CONFIG_USERFAULTFD) +=3D userfaultfd.o >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 3b13d3914176..d12f69e0e402 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -68,6 +68,7 @@ >> #include >> #include >> #include >> +#include >> =20 >> #include >> #include >> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc,= struct page *page, >> * -- nyc >> */ >> =20 >> -static inline void __free_one_page(struct page *page, >> +inline void __free_one_page(struct page *page, >> unsigned long pfn, >> struct zone *zone, unsigned int order, >> - int migratetype) >> + int migratetype, bool hint) >> { >> unsigned long combined_pfn; >> unsigned long uninitialized_var(buddy_pfn); >> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *pa= ge, >> done_merging: >> set_page_order(page, order); >> =20 >> + if (hint) >> + page_hinting_enqueue(page, order); >> /* >> * If this is not the largest possible page, check if the buddy >> * of the next-highest order is free. If it is, it's possible >> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone= , int count, >> if (unlikely(isolated_pageblocks)) >> mt =3D get_pageblock_migratetype(page); >> =20 >> - __free_one_page(page, page_to_pfn(page), zone, 0, mt); >> + __free_one_page(page, page_to_pfn(page), zone, 0, mt, true); >> trace_mm_page_pcpu_drain(page, 0, mt); >> } >> spin_unlock(&zone->lock); >> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zo= ne, int count, >> static void free_one_page(struct zone *zone, >> struct page *page, unsigned long pfn, >> unsigned int order, >> - int migratetype) >> + int migratetype, bool hint) >> { >> spin_lock(&zone->lock); >> if (unlikely(has_isolate_pageblock(zone) || >> is_migrate_isolate(migratetype))) { >> migratetype =3D get_pfnblock_migratetype(page, pfn); >> } >> - __free_one_page(page, pfn, zone, order, migratetype); >> + __free_one_page(page, pfn, zone, order, migratetype, hint); >> spin_unlock(&zone->lock); >> } >> =20 >> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, u= nsigned int order) >> migratetype =3D get_pfnblock_migratetype(page, pfn); >> local_irq_save(flags); >> __count_vm_events(PGFREE, 1 << order); >> - free_one_page(page_zone(page), page, pfn, order, migratetype); >> + free_one_page(page_zone(page), page, pfn, order, migratetype, true);= >> local_irq_restore(flags); >> } >> =20 >> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *= page, unsigned long pfn) >> */ >> if (migratetype >=3D MIGRATE_PCPTYPES) { >> if (unlikely(is_migrate_isolate(migratetype))) { >> - free_one_page(zone, page, pfn, 0, migratetype); >> + free_one_page(zone, page, pfn, 0, migratetype, true); >> return; >> } >> migratetype =3D MIGRATE_MOVABLE; >> diff --git a/mm/page_hinting.c b/mm/page_hinting.c >> new file mode 100644 >> index 000000000000..7341c6462de2 >> --- /dev/null >> +++ b/mm/page_hinting.c >> @@ -0,0 +1,236 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Page hinting support to enable a VM to report the freed pages back= >> + * to the host. >> + * >> + * Copyright Red Hat, Inc. 2019 >> + * >> + * Author(s): Nitesh Narayan Lal >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> + >> +/* >> + * struct hinting_bitmap: holds the bitmap pointer which tracks the f= reed PFNs >> + * and other required parameters which could help in retrieving the o= riginal >> + * PFN value using the bitmap. >> + * @bitmap: Pointer to the bitmap of free PFN. >> + * @base_pfn: Starting PFN value for the zone whose bitmap is stored= =2E >> + * @free_pages: Tracks the number of free pages of granularity >> + * PAGE_HINTING_MIN_ORDER. >> + * @nbits: Indicates the total size of the bitmap in bits allocated >> + * at the time of initialization. >> + */ >> +struct hinting_bitmap { >> + unsigned long *bitmap; >> + unsigned long base_pfn; >> + atomic_t free_pages; >> + unsigned long nbits; >> +} bm_zone[MAX_NR_ZONES]; >> + >> +static void init_hinting_wq(struct work_struct *work); >> +extern int __isolate_free_page(struct page *page, unsigned int order)= ; >> +extern void __free_one_page(struct page *page, unsigned long pfn, >> + struct zone *zone, unsigned int order, >> + int migratetype, bool hint); >> +const struct page_hinting_cb *hcb; >> +struct work_struct hinting_work; >> + >> +static unsigned long find_bitmap_size(struct zone *zone) >> +{ >> + unsigned long nbits =3D ALIGN(zone->spanned_pages, >> + PAGE_HINTING_MIN_ORDER); >> + >> + nbits =3D nbits >> PAGE_HINTING_MIN_ORDER; >> + return nbits; > I think we can simplify this to > > return (zone->spanned_pages >> PAGE_HINTING_MIN_ORDER) + 1; > I will check this. >> +} >> + >> +void page_hinting_enable(const struct page_hinting_cb *callback) >> +{ >> + struct zone *zone; >> + int idx =3D 0; >> + unsigned long bitmap_size =3D 0; > You should probably protect enabling via a mutex and return -EINVAL or > similar if we already have a callback set (if we ever have different > drivers). But this has very little priority :) I will have to look into this. >> + >> + for_each_populated_zone(zone) { >> + spin_lock(&zone->lock); >> + bitmap_size =3D find_bitmap_size(zone); >> + bm_zone[idx].bitmap =3D bitmap_zalloc(bitmap_size, GFP_KERNEL); >> + if (!bm_zone[idx].bitmap) >> + return; >> + bm_zone[idx].nbits =3D bitmap_size; >> + bm_zone[idx].base_pfn =3D zone->zone_start_pfn; >> + spin_unlock(&zone->lock); >> + idx++; >> + } >> + hcb =3D callback; >> + INIT_WORK(&hinting_work, init_hinting_wq); > There are also possible races when enabling, you will have to take care= > of at one point. No page will be enqueued until hcb is not set. I can probably move it below INIT_WORK. >> +} >> +EXPORT_SYMBOL_GPL(page_hinting_enable); >> + >> +void page_hinting_disable(void) >> +{ >> + struct zone *zone; >> + int idx =3D 0; >> + >> + cancel_work_sync(&hinting_work); >> + hcb =3D NULL; >> + for_each_populated_zone(zone) { >> + spin_lock(&zone->lock); >> + bitmap_free(bm_zone[idx].bitmap); >> + bm_zone[idx].base_pfn =3D 0; >> + bm_zone[idx].nbits =3D 0; >> + atomic_set(&bm_zone[idx].free_pages, 0); >> + spin_unlock(&zone->lock); >> + idx++; >> + } >> +} >> +EXPORT_SYMBOL_GPL(page_hinting_disable); >> + >> +static unsigned long pfn_to_bit(struct page *page, int zonenum) >> +{ >> + unsigned long bitnr; >> + >> + bitnr =3D (page_to_pfn(page) - bm_zone[zonenum].base_pfn) >> + >> PAGE_HINTING_MIN_ORDER; >> + return bitnr; >> +} >> + >> +static void release_buddy_pages(struct list_head *pages) > maybe "release_isolated_pages", not sure. > >> +{ >> + int mt =3D 0, zonenum, order; >> + struct page *page, *next; >> + struct zone *zone; >> + unsigned long bitnr; >> + >> + list_for_each_entry_safe(page, next, pages, lru) { >> + zonenum =3D page_zonenum(page); >> + zone =3D page_zone(page); >> + bitnr =3D pfn_to_bit(page, zonenum); >> + spin_lock(&zone->lock); >> + list_del(&page->lru); >> + order =3D page_private(page); >> + set_page_private(page, 0); >> + mt =3D get_pageblock_migratetype(page); >> + __free_one_page(page, page_to_pfn(page), zone, >> + order, mt, false); >> + spin_unlock(&zone->lock); >> + } >> +} >> + >> +static void bm_set_pfn(struct page *page) >> +{ >> + unsigned long bitnr =3D 0; >> + int zonenum =3D page_zonenum(page); >> + struct zone *zone =3D page_zone(page); >> + >> + lockdep_assert_held(&zone->lock); >> + bitnr =3D pfn_to_bit(page, zonenum); >> + if (bm_zone[zonenum].bitmap && >> + bitnr < bm_zone[zonenum].nbits && >> + !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap)) >> + atomic_inc(&bm_zone[zonenum].free_pages); >> +} >> + >> +static void scan_hinting_bitmap(int zonenum, int free_pages) >> +{ >> + unsigned long set_bit, start =3D 0; >> + struct page *page; >> + struct zone *zone; >> + int scanned_pages =3D 0, ret =3D 0, order, isolated_cnt =3D 0; >> + LIST_HEAD(isolated_pages); >> + >> + ret =3D hcb->prepare(); >> + if (ret < 0) >> + return; >> + for (;;) { >> + ret =3D 0; >> + set_bit =3D find_next_bit(bm_zone[zonenum].bitmap, >> + bm_zone[zonenum].nbits, start); >> + if (set_bit >=3D bm_zone[zonenum].nbits) >> + break; >> + page =3D pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) + >> + bm_zone[zonenum].base_pfn); >> + if (!page) >> + continue; > You are not clearing the bit / decrementing the counter. > >> + zone =3D page_zone(page); >> + spin_lock(&zone->lock); >> + >> + if (PageBuddy(page) && page_private(page) >=3D >> + PAGE_HINTING_MIN_ORDER) { >> + order =3D page_private(page); >> + ret =3D __isolate_free_page(page, order); >> + } >> + clear_bit(set_bit, bm_zone[zonenum].bitmap); >> + spin_unlock(&zone->lock); >> + if (ret) { >> + /* >> + * restoring page order to use it while releasing >> + * the pages back to the buddy. >> + */ >> + set_page_private(page, order); >> + list_add_tail(&page->lru, &isolated_pages); >> + isolated_cnt++; >> + if (isolated_cnt =3D=3D hcb->max_pages) { >> + hcb->hint_pages(&isolated_pages); >> + release_buddy_pages(&isolated_pages); >> + isolated_cnt =3D 0; >> + } >> + } >> + start =3D set_bit + 1; >> + scanned_pages++; >> + } >> + if (isolated_cnt) { >> + hcb->hint_pages(&isolated_pages); >> + release_buddy_pages(&isolated_pages); >> + } >> + hcb->cleanup(); >> + if (scanned_pages > free_pages) >> + atomic_sub((scanned_pages - free_pages), >> + &bm_zone[zonenum].free_pages); > This looks overly complicated. Can't we somehow simply decrement when > clearing a bit? > >> +} >> + >> +static bool check_hinting_threshold(void) >> +{ >> + int zonenum =3D 0; >> + >> + for (; zonenum < MAX_NR_ZONES; zonenum++) { >> + if (atomic_read(&bm_zone[zonenum].free_pages) >=3D >> + hcb->max_pages) >> + return true; >> + } >> + return false; >> +} >> + >> +static void init_hinting_wq(struct work_struct *work) >> +{ >> + int zonenum =3D 0, free_pages =3D 0; >> + >> + for (; zonenum < MAX_NR_ZONES; zonenum++) { >> + free_pages =3D atomic_read(&bm_zone[zonenum].free_pages); >> + if (free_pages >=3D hcb->max_pages) { >> + /* Find a better way to synchronize per zone >> + * free_pages. >> + */ >> + atomic_sub(free_pages, >> + &bm_zone[zonenum].free_pages); > I can't follow yet why we need that information.=20 We don't want to enqueue multiple jobs just because we are delaying the decrementing of free_pages. I agree, even I am not convinced with the approach which I have right now= =2E I will try to come up with a better way. > Wouldn't it be enough > to just track the number of set bits in the bitmap and start hinting > depending on that count? (there are false positives, but do we really c= are?) In an attempt to minimize the false positives, I might have overly complicated this part. I will try to simplify this in the next posting. > >> + scan_hinting_bitmap(zonenum, free_pages); >> + } >> + } >> +} >> + >> +void page_hinting_enqueue(struct page *page, int order) >> +{ >> + if (hcb && order >=3D PAGE_HINTING_MIN_ORDER) >> + bm_set_pfn(page); >> + else >> + return; >> + >> + if (check_hinting_threshold()) { >> + int cpu =3D smp_processor_id(); >> + >> + queue_work_on(cpu, system_wq, &hinting_work); >> + } >> +} >> > --=20 Regards Nitesh --vnK5kVg8UEY0Uit0GfVEPGNN08aRZ9XVd-- --hun85KKfyjW7Qo3FnHVFXjLAU0BGPerX8 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEkXcoRVGaqvbHPuAGo4ZA3AYyozkFAlz2bx8ACgkQo4ZA3AYy ozk+5Q//UpfP9IuIrdtP35s53eyDylrQjhQhU0BYDcW2QGkdGVoTDfnLeMCMZr1y CIiWtliyZ0EjsI1Z1C+Pib0cYsxbpKIhWaftfJXe2sCidPxE4RfmkJ2InXLxK09V mrCqWInVNFzk7WB8HrY8iXlxKP+OrOtVBZiIjFpOwp30tPrNwkAy1vmPBgogBdQP 0UDED2iIUF2t5n1JifDTUsBemelfnYUp1JhnlhYyEzNYmZrlZbXFRfxoPpPsATfx hhK3l4WTkRyyf4Yj2SJ2RST66Bkj7cCu+c7VT0kURaxrvYe/XbJRMKA+guFwIJW3 QLpIz2BHkV8mNfrFTM+LYTax9oeOE3gi0qFPtOlYXb/yK3jma3r53+4XfsIS9Sbo Xosnh7K0JiZ38e9al29RzWqWRgO4UeHs78uew6dCp2gEy01WdrDAN7fkDl7nBoJP cJgEI+TCa1fKfUY/pSShw9fxO5mxoszv2wp1TFbyb1hPiQtwaBEMWHEU5I2pGkyn kDfphuracrSlRn66nS8OHmQkLy1sT54YK34DaPWXLvtScgPztDZuaUA2wnqT8wR9 xnQJZ2gDiEYFSur+lENHLnAaBU/2UxUsziN46szMBA5nyu1iiED9FZSOKyQow8yI HAqqwykS6hVw910E4RjJmoRKQL93cO/3MIsajcV7SafpaYX4MRc= =23/B -----END PGP SIGNATURE----- --hun85KKfyjW7Qo3FnHVFXjLAU0BGPerX8--