[PATCH v2 0/4] mm: clarify nofail memory allocation

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/4] mm: clarify nofail memory allocation
@ 2024-07-31  0:01 Barry Song
  2024-07-31  0:01 ` [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL Barry Song
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Barry Song @ 2024-07-31  0:01 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	vbabka, virtualization

From: Barry Song <v-songbaohua@oppo.com>

-v2:
 * adjust vpda fix according to Jason and Michal's feedback, I would
   expect Jason to test it, thanks!
 * split BUG_ON of unavoidable failure and the case GFP_ATOMIC |
   __GFP_NOFAIL into two patches according to Vlastimil and Michal.
 * collect Michal's acked-by for patch 2 - the doc;
 * remove the new GFP_NOFAIL from this series, that one would be a
   separate enhancement patchset later on.

-v1:
 https://lore.kernel.org/linux-mm/20240724085544.299090-1-21cnbao@gmail.com/

__GFP_NOFAIL carries the semantics of never failing, so its callers
do not check the return value:
  %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  cannot handle allocation failures. The allocation could block
  indefinitely but will never return with failure. Testing for
  failure is pointless.

However, __GFP_NOFAIL can sometimes fail if it exceeds size limits
or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context.
This can expose security vulnerabilities due to potential NULL
dereferences.

Since __GFP_NOFAIL does not support non-blocking allocation, we introduce
GFP_NOFAIL with inclusive blocking semantics and encourage using GFP_NOFAIL
as a replacement for __GFP_NOFAIL in non-mm.

If we must still fail a nofail allocation, we should trigger a BUG rather
than exposing NULL dereferences to callers who do not check the return
value.

* The discussion started from this topic:
 [PATCH RFC] mm: warn potential return NULL for kmalloc_array and
             kvmalloc_array with __GFP_NOFAIL

 https://lore.kernel.org/linux-mm/20240717230025.77361-1-21cnbao@gmail.com/

Thank you to Michal, Christoph, Vlastimil, and Hailong for all the
comments.

Barry Song (4):
  vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  mm: Document __GFP_NOFAIL must be blockable
  mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  mm: prohibit NULL deference exposed for unsupported non-blockable 
    __GFP_NOFAIL

 drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
 drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
 drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
 include/linux/gfp_types.h            |  5 ++++-
 include/linux/slab.h                 |  4 +++-
 mm/page_alloc.c                      | 14 +++++++------
 mm/util.c                            |  1 +
 7 files changed, 49 insertions(+), 15 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  0:01 [PATCH v2 0/4] mm: clarify nofail memory allocation Barry Song
@ 2024-07-31  0:01 ` Barry Song
  2024-07-31  3:09   ` Jason Wang
  2024-07-31  0:01 ` [PATCH v2 2/4] mm: Document __GFP_NOFAIL must be blockable Barry Song
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-07-31  0:01 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	vbabka, virtualization, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Maxime Coquelin

From: Barry Song <v-songbaohua@oppo.com>

mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
__GFP_NOFAIL without direct reclamation may just result in a busy
loop within non-sleepable contexts.

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                                struct alloc_context *ac)
{
        ...
        /*
         * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
         * we always retry
         */
        if (gfp_mask & __GFP_NOFAIL) {
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
                        goto fail;
                ...
        }
        ...
fail:
        warn_alloc(gfp_mask, ac->nodemask,
                        "page allocation failure: order:%u", order);
got_pg:
        return page;
}

Let's move the memory allocation out of the atomic context and use
the normal sleepable context to get pages.

[RFT]: This has only been compile-tested; I'd prefer if the VDPA maintainers
handles it.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
 drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
 drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 791d38d6284c..9318f059a8b5 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -283,7 +283,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
 	return ret;
 }
 
-void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
+struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain)
+{
+	struct page **pages;
+	unsigned long count, i;
+
+	if (!domain->user_bounce_pages)
+		return NULL;
+
+	count = domain->bounce_size >> PAGE_SHIFT;
+	pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
+	for (i = 0; i < count; i++)
+		pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
+
+	return pages;
+}
+
+void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages)
 {
 	struct vduse_bounce_map *map;
 	unsigned long i, count;
@@ -294,15 +310,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
 
 	count = domain->bounce_size >> PAGE_SHIFT;
 	for (i = 0; i < count; i++) {
-		struct page *page = NULL;
+		struct page *page = pages[i];
 
 		map = &domain->bounce_maps[i];
-		if (WARN_ON(!map->bounce_page))
+		if (WARN_ON(!map->bounce_page)) {
+			put_page(page);
 			continue;
+		}
 
 		/* Copy user page to kernel page if it's in use */
 		if (map->orig_phys != INVALID_PHYS_ADDR) {
-			page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
 			memcpy_from_page(page_address(page),
 					 map->bounce_page, 0, PAGE_SIZE);
 		}
@@ -310,6 +327,7 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
 		map->bounce_page = page;
 	}
 	domain->user_bounce_pages = false;
+	kfree(pages);
 out:
 	write_unlock(&domain->bounce_lock);
 }
@@ -543,10 +561,13 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
 static int vduse_domain_release(struct inode *inode, struct file *file)
 {
 	struct vduse_iova_domain *domain = file->private_data;
+	struct page **pages;
+
+	pages = vduse_domain_alloc_pages_to_remove_bounce(domain);
 
 	spin_lock(&domain->iotlb_lock);
 	vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
-	vduse_domain_remove_user_bounce_pages(domain);
+	vduse_domain_remove_user_bounce_pages(domain, pages);
 	vduse_domain_free_kernel_bounce_pages(domain);
 	spin_unlock(&domain->iotlb_lock);
 	put_iova_domain(&domain->stream_iovad);
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index f92f22a7267d..17efa5555b3f 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -74,7 +74,10 @@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
 int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
 				       struct page **pages, int count);
 
-void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
+void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain,
+					   struct page **pages);
+
+struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain);
 
 void vduse_domain_destroy(struct vduse_iova_domain *domain);
 
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 7ae99691efdf..5d8d5810df57 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1030,6 +1030,7 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
 static int vduse_dev_dereg_umem(struct vduse_dev *dev,
 				u64 iova, u64 size)
 {
+	struct page **pages;
 	int ret;
 
 	mutex_lock(&dev->mem_lock);
@@ -1044,7 +1045,8 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
 	if (dev->umem->iova != iova || size != dev->domain->bounce_size)
 		goto unlock;
 
-	vduse_domain_remove_user_bounce_pages(dev->domain);
+	pages = vduse_domain_alloc_pages_to_remove_bounce(dev->domain);
+	vduse_domain_remove_user_bounce_pages(dev->domain, pages);
 	unpin_user_pages_dirty_lock(dev->umem->pages,
 				    dev->umem->npages, true);
 	atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  0:01 ` [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL Barry Song
@ 2024-07-31  3:09   ` Jason Wang
  2024-07-31  3:15     ` Barry Song
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2024-07-31  3:09 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Michael S. Tsirkin,
	Xuan Zhuo, Eugenio Pérez, Maxime Coquelin

On Wed, Jul 31, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> __GFP_NOFAIL without direct reclamation may just result in a busy
> loop within non-sleepable contexts.
>
> static inline struct page *
> __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                                                 struct alloc_context *ac)
> {
>         ...
>         /*
>          * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
>          * we always retry
>          */
>         if (gfp_mask & __GFP_NOFAIL) {
>                 /*
>                  * All existing users of the __GFP_NOFAIL are blockable, so warn
>                  * of any new users that actually require GFP_NOWAIT
>                  */
>                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
>                         goto fail;
>                 ...
>         }
>         ...
> fail:
>         warn_alloc(gfp_mask, ac->nodemask,
>                         "page allocation failure: order:%u", order);
> got_pg:
>         return page;
> }
>
> Let's move the memory allocation out of the atomic context and use
> the normal sleepable context to get pages.
>
> [RFT]: This has only been compile-tested; I'd prefer if the VDPA maintainers
> handles it.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
>  drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
>  drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
>  3 files changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> index 791d38d6284c..9318f059a8b5 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.c
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -283,7 +283,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
>         return ret;
>  }
>
> -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain)
> +{
> +       struct page **pages;
> +       unsigned long count, i;
> +
> +       if (!domain->user_bounce_pages)
> +               return NULL;
> +
> +       count = domain->bounce_size >> PAGE_SHIFT;
> +       pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> +       for (i = 0; i < count; i++)
> +               pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> +
> +       return pages;
> +}
> +
> +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages)
>  {
>         struct vduse_bounce_map *map;
>         unsigned long i, count;
> @@ -294,15 +310,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
>
>         count = domain->bounce_size >> PAGE_SHIFT;
>         for (i = 0; i < count; i++) {
> -               struct page *page = NULL;
> +               struct page *page = pages[i];
>
>                 map = &domain->bounce_maps[i];
> -               if (WARN_ON(!map->bounce_page))
> +               if (WARN_ON(!map->bounce_page)) {
> +                       put_page(page);
>                         continue;
> +               }
>
>                 /* Copy user page to kernel page if it's in use */
>                 if (map->orig_phys != INVALID_PHYS_ADDR) {
> -                       page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
>                         memcpy_from_page(page_address(page),
>                                          map->bounce_page, 0, PAGE_SIZE);
>                 }
> @@ -310,6 +327,7 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
>                 map->bounce_page = page;
>         }
>         domain->user_bounce_pages = false;
> +       kfree(pages);
>  out:
>         write_unlock(&domain->bounce_lock);
>  }
> @@ -543,10 +561,13 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
>  static int vduse_domain_release(struct inode *inode, struct file *file)
>  {
>         struct vduse_iova_domain *domain = file->private_data;
> +       struct page **pages;
> +
> +       pages = vduse_domain_alloc_pages_to_remove_bounce(domain);
>
>         spin_lock(&domain->iotlb_lock);
>         vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
> -       vduse_domain_remove_user_bounce_pages(domain);
> +       vduse_domain_remove_user_bounce_pages(domain, pages);
>         vduse_domain_free_kernel_bounce_pages(domain);
>         spin_unlock(&domain->iotlb_lock);
>         put_iova_domain(&domain->stream_iovad);
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> index f92f22a7267d..17efa5555b3f 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.h
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -74,7 +74,10 @@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
>  int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
>                                        struct page **pages, int count);
>
> -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
> +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain,
> +                                          struct page **pages);
> +
> +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain);
>
>  void vduse_domain_destroy(struct vduse_iova_domain *domain);
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 7ae99691efdf..5d8d5810df57 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1030,6 +1030,7 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
>  static int vduse_dev_dereg_umem(struct vduse_dev *dev,
>                                 u64 iova, u64 size)
>  {
> +       struct page **pages;
>         int ret;
>
>         mutex_lock(&dev->mem_lock);
> @@ -1044,7 +1045,8 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
>         if (dev->umem->iova != iova || size != dev->domain->bounce_size)
>                 goto unlock;
>
> -       vduse_domain_remove_user_bounce_pages(dev->domain);
> +       pages = vduse_domain_alloc_pages_to_remove_bounce(dev->domain);
> +       vduse_domain_remove_user_bounce_pages(dev->domain, pages);
>         unpin_user_pages_dirty_lock(dev->umem->pages,
>                                     dev->umem->npages, true);
>         atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);

We miss a kfree(pages); here?

Thanks

> --
> 2.34.1
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  3:09   ` Jason Wang
@ 2024-07-31  3:15     ` Barry Song
  2024-07-31  3:58       ` Jason Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-07-31  3:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Michael S. Tsirkin,
	Xuan Zhuo, Eugenio Pérez, Maxime Coquelin

On Wed, Jul 31, 2024 at 11:10 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jul 31, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > __GFP_NOFAIL without direct reclamation may just result in a busy
> > loop within non-sleepable contexts.
> >
> > static inline struct page *
> > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >                                                 struct alloc_context *ac)
> > {
> >         ...
> >         /*
> >          * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> >          * we always retry
> >          */
> >         if (gfp_mask & __GFP_NOFAIL) {
> >                 /*
> >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> >                  * of any new users that actually require GFP_NOWAIT
> >                  */
> >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> >                         goto fail;
> >                 ...
> >         }
> >         ...
> > fail:
> >         warn_alloc(gfp_mask, ac->nodemask,
> >                         "page allocation failure: order:%u", order);
> > got_pg:
> >         return page;
> > }
> >
> > Let's move the memory allocation out of the atomic context and use
> > the normal sleepable context to get pages.
> >
> > [RFT]: This has only been compile-tested; I'd prefer if the VDPA maintainers
> > handles it.
> >
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
> >  drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
> >  drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
> >  3 files changed, 33 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > index 791d38d6284c..9318f059a8b5 100644
> > --- a/drivers/vdpa/vdpa_user/iova_domain.c
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > @@ -283,7 +283,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> >         return ret;
> >  }
> >
> > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain)
> > +{
> > +       struct page **pages;
> > +       unsigned long count, i;
> > +
> > +       if (!domain->user_bounce_pages)
> > +               return NULL;
> > +
> > +       count = domain->bounce_size >> PAGE_SHIFT;
> > +       pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> > +       for (i = 0; i < count; i++)
> > +               pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> > +
> > +       return pages;
> > +}
> > +
> > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages)
> >  {
> >         struct vduse_bounce_map *map;
> >         unsigned long i, count;
> > @@ -294,15 +310,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> >
> >         count = domain->bounce_size >> PAGE_SHIFT;
> >         for (i = 0; i < count; i++) {
> > -               struct page *page = NULL;
> > +               struct page *page = pages[i];
> >
> >                 map = &domain->bounce_maps[i];
> > -               if (WARN_ON(!map->bounce_page))
> > +               if (WARN_ON(!map->bounce_page)) {
> > +                       put_page(page);
> >                         continue;
> > +               }
> >
> >                 /* Copy user page to kernel page if it's in use */
> >                 if (map->orig_phys != INVALID_PHYS_ADDR) {
> > -                       page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> >                         memcpy_from_page(page_address(page),
> >                                          map->bounce_page, 0, PAGE_SIZE);
> >                 }
> > @@ -310,6 +327,7 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> >                 map->bounce_page = page;
> >         }
> >         domain->user_bounce_pages = false;
> > +       kfree(pages);
> >  out:
> >         write_unlock(&domain->bounce_lock);
> >  }
> > @@ -543,10 +561,13 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> >  static int vduse_domain_release(struct inode *inode, struct file *file)
> >  {
> >         struct vduse_iova_domain *domain = file->private_data;
> > +       struct page **pages;
> > +
> > +       pages = vduse_domain_alloc_pages_to_remove_bounce(domain);
> >
> >         spin_lock(&domain->iotlb_lock);
> >         vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
> > -       vduse_domain_remove_user_bounce_pages(domain);
> > +       vduse_domain_remove_user_bounce_pages(domain, pages);
> >         vduse_domain_free_kernel_bounce_pages(domain);
> >         spin_unlock(&domain->iotlb_lock);
> >         put_iova_domain(&domain->stream_iovad);
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > index f92f22a7267d..17efa5555b3f 100644
> > --- a/drivers/vdpa/vdpa_user/iova_domain.h
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > @@ -74,7 +74,10 @@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
> >  int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> >                                        struct page **pages, int count);
> >
> > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
> > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain,
> > +                                          struct page **pages);
> > +
> > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain);
> >
> >  void vduse_domain_destroy(struct vduse_iova_domain *domain);
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index 7ae99691efdf..5d8d5810df57 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -1030,6 +1030,7 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> >  static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> >                                 u64 iova, u64 size)
> >  {
> > +       struct page **pages;
> >         int ret;
> >
> >         mutex_lock(&dev->mem_lock);
> > @@ -1044,7 +1045,8 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> >         if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> >                 goto unlock;
> >
> > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > +       pages = vduse_domain_alloc_pages_to_remove_bounce(dev->domain);
> > +       vduse_domain_remove_user_bounce_pages(dev->domain, pages);
> >         unpin_user_pages_dirty_lock(dev->umem->pages,
> >                                     dev->umem->npages, true);
> >         atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
>
> We miss a kfree(pages); here?
no.
i've moved it into vduse_domain_remove_user_bounce_pages.

>
> Thanks
>
> > --
> > 2.34.1
> >
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  3:15     ` Barry Song
@ 2024-07-31  3:58       ` Jason Wang
  2024-07-31  4:11         ` Barry Song
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2024-07-31  3:58 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Michael S. Tsirkin,
	Xuan Zhuo, Eugenio Pérez, Maxime Coquelin

On Wed, Jul 31, 2024 at 11:15 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 11:10 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jul 31, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > loop within non-sleepable contexts.
> > >
> > > static inline struct page *
> > > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >                                                 struct alloc_context *ac)
> > > {
> > >         ...
> > >         /*
> > >          * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> > >          * we always retry
> > >          */
> > >         if (gfp_mask & __GFP_NOFAIL) {
> > >                 /*
> > >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> > >                  * of any new users that actually require GFP_NOWAIT
> > >                  */
> > >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > >                         goto fail;
> > >                 ...
> > >         }
> > >         ...
> > > fail:
> > >         warn_alloc(gfp_mask, ac->nodemask,
> > >                         "page allocation failure: order:%u", order);
> > > got_pg:
> > >         return page;
> > > }
> > >
> > > Let's move the memory allocation out of the atomic context and use
> > > the normal sleepable context to get pages.
> > >
> > > [RFT]: This has only been compile-tested; I'd prefer if the VDPA maintainers
> > > handles it.
> > >
> > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > Cc: Jason Wang <jasowang@redhat.com>
> > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
> > >  drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
> > >  drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
> > >  3 files changed, 33 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > > index 791d38d6284c..9318f059a8b5 100644
> > > --- a/drivers/vdpa/vdpa_user/iova_domain.c
> > > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > > @@ -283,7 +283,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > >         return ret;
> > >  }
> > >
> > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain)
> > > +{
> > > +       struct page **pages;
> > > +       unsigned long count, i;
> > > +
> > > +       if (!domain->user_bounce_pages)
> > > +               return NULL;
> > > +
> > > +       count = domain->bounce_size >> PAGE_SHIFT;
> > > +       pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> > > +       for (i = 0; i < count; i++)
> > > +               pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> > > +
> > > +       return pages;
> > > +}
> > > +
> > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages)
> > >  {
> > >         struct vduse_bounce_map *map;
> > >         unsigned long i, count;
> > > @@ -294,15 +310,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > >
> > >         count = domain->bounce_size >> PAGE_SHIFT;
> > >         for (i = 0; i < count; i++) {
> > > -               struct page *page = NULL;
> > > +               struct page *page = pages[i];
> > >
> > >                 map = &domain->bounce_maps[i];
> > > -               if (WARN_ON(!map->bounce_page))
> > > +               if (WARN_ON(!map->bounce_page)) {
> > > +                       put_page(page);
> > >                         continue;
> > > +               }
> > >
> > >                 /* Copy user page to kernel page if it's in use */
> > >                 if (map->orig_phys != INVALID_PHYS_ADDR) {
> > > -                       page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> > >                         memcpy_from_page(page_address(page),
> > >                                          map->bounce_page, 0, PAGE_SIZE);
> > >                 }
> > > @@ -310,6 +327,7 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > >                 map->bounce_page = page;
> > >         }
> > >         domain->user_bounce_pages = false;
> > > +       kfree(pages);
> > >  out:
> > >         write_unlock(&domain->bounce_lock);
> > >  }
> > > @@ -543,10 +561,13 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> > >  static int vduse_domain_release(struct inode *inode, struct file *file)
> > >  {
> > >         struct vduse_iova_domain *domain = file->private_data;
> > > +       struct page **pages;
> > > +
> > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(domain);
> > >
> > >         spin_lock(&domain->iotlb_lock);
> > >         vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
> > > -       vduse_domain_remove_user_bounce_pages(domain);
> > > +       vduse_domain_remove_user_bounce_pages(domain, pages);
> > >         vduse_domain_free_kernel_bounce_pages(domain);
> > >         spin_unlock(&domain->iotlb_lock);
> > >         put_iova_domain(&domain->stream_iovad);
> > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > > index f92f22a7267d..17efa5555b3f 100644
> > > --- a/drivers/vdpa/vdpa_user/iova_domain.h
> > > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > > @@ -74,7 +74,10 @@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
> > >  int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > >                                        struct page **pages, int count);
> > >
> > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
> > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain,
> > > +                                          struct page **pages);
> > > +
> > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain);
> > >
> > >  void vduse_domain_destroy(struct vduse_iova_domain *domain);
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index 7ae99691efdf..5d8d5810df57 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -1030,6 +1030,7 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > >  static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > >                                 u64 iova, u64 size)
> > >  {
> > > +       struct page **pages;
> > >         int ret;
> > >
> > >         mutex_lock(&dev->mem_lock);
> > > @@ -1044,7 +1045,8 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > >         if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > >                 goto unlock;
> > >
> > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(dev->domain);
> > > +       vduse_domain_remove_user_bounce_pages(dev->domain, pages);
> > >         unpin_user_pages_dirty_lock(dev->umem->pages,
> > >                                     dev->umem->npages, true);
> > >         atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> >
> > We miss a kfree(pages); here?
> no.
> i've moved it into vduse_domain_remove_user_bounce_pages.

Ok, but it seems tricky e.g allocated by the caller but freed in
callee. And I think I missed some important issues in the previous
review: The check of user_bounce_pages must be done under the
bounce_lock, otherwise it might race with umem_reg.

So in the case of release(), we know the device is gone, so there's no
need to allocate pages that will be released soon. So we can pass NULL
as a hint and just assign bounce_page to NULL in
vduse_domain_remove_user_bounce_pages().

And in the case of vduse_dev_dereg_umem(), we need to allocate the
pages without checking user_bounce_pages. So in
vduse_domain_remove_user_bounce_pages() if we can free the allocated
pages as well as the pages in the following check

        if (!domain->user_bounce_pages)
                goto out;

What do you think?

Thanks

>
> >
> > Thanks
> >
> > > --
> > > 2.34.1
> > >
> >
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  3:58       ` Jason Wang
@ 2024-07-31  4:11         ` Barry Song
  2024-07-31  4:13           ` Jason Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-07-31  4:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Michael S. Tsirkin,
	Xuan Zhuo, Eugenio Pérez, Maxime Coquelin

On Wed, Jul 31, 2024 at 11:58 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jul 31, 2024 at 11:15 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Jul 31, 2024 at 11:10 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Jul 31, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > loop within non-sleepable contexts.
> > > >
> > > > static inline struct page *
> > > > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > >                                                 struct alloc_context *ac)
> > > > {
> > > >         ...
> > > >         /*
> > > >          * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> > > >          * we always retry
> > > >          */
> > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > >                 /*
> > > >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > >                  * of any new users that actually require GFP_NOWAIT
> > > >                  */
> > > >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > >                         goto fail;
> > > >                 ...
> > > >         }
> > > >         ...
> > > > fail:
> > > >         warn_alloc(gfp_mask, ac->nodemask,
> > > >                         "page allocation failure: order:%u", order);
> > > > got_pg:
> > > >         return page;
> > > > }
> > > >
> > > > Let's move the memory allocation out of the atomic context and use
> > > > the normal sleepable context to get pages.
> > > >
> > > > [RFT]: This has only been compile-tested; I'd prefer if the VDPA maintainers
> > > > handles it.
> > > >
> > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > ---
> > > >  drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
> > > >  drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
> > > >  drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
> > > >  3 files changed, 33 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > > > index 791d38d6284c..9318f059a8b5 100644
> > > > --- a/drivers/vdpa/vdpa_user/iova_domain.c
> > > > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > > > @@ -283,7 +283,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > > >         return ret;
> > > >  }
> > > >
> > > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain)
> > > > +{
> > > > +       struct page **pages;
> > > > +       unsigned long count, i;
> > > > +
> > > > +       if (!domain->user_bounce_pages)
> > > > +               return NULL;
> > > > +
> > > > +       count = domain->bounce_size >> PAGE_SHIFT;
> > > > +       pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> > > > +       for (i = 0; i < count; i++)
> > > > +               pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> > > > +
> > > > +       return pages;
> > > > +}
> > > > +
> > > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages)
> > > >  {
> > > >         struct vduse_bounce_map *map;
> > > >         unsigned long i, count;
> > > > @@ -294,15 +310,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > >
> > > >         count = domain->bounce_size >> PAGE_SHIFT;
> > > >         for (i = 0; i < count; i++) {
> > > > -               struct page *page = NULL;
> > > > +               struct page *page = pages[i];
> > > >
> > > >                 map = &domain->bounce_maps[i];
> > > > -               if (WARN_ON(!map->bounce_page))
> > > > +               if (WARN_ON(!map->bounce_page)) {
> > > > +                       put_page(page);
> > > >                         continue;
> > > > +               }
> > > >
> > > >                 /* Copy user page to kernel page if it's in use */
> > > >                 if (map->orig_phys != INVALID_PHYS_ADDR) {
> > > > -                       page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> > > >                         memcpy_from_page(page_address(page),
> > > >                                          map->bounce_page, 0, PAGE_SIZE);
> > > >                 }
> > > > @@ -310,6 +327,7 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > >                 map->bounce_page = page;
> > > >         }
> > > >         domain->user_bounce_pages = false;
> > > > +       kfree(pages);
> > > >  out:
> > > >         write_unlock(&domain->bounce_lock);
> > > >  }
> > > > @@ -543,10 +561,13 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> > > >  static int vduse_domain_release(struct inode *inode, struct file *file)
> > > >  {
> > > >         struct vduse_iova_domain *domain = file->private_data;
> > > > +       struct page **pages;
> > > > +
> > > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(domain);
> > > >
> > > >         spin_lock(&domain->iotlb_lock);
> > > >         vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
> > > > -       vduse_domain_remove_user_bounce_pages(domain);
> > > > +       vduse_domain_remove_user_bounce_pages(domain, pages);
> > > >         vduse_domain_free_kernel_bounce_pages(domain);
> > > >         spin_unlock(&domain->iotlb_lock);
> > > >         put_iova_domain(&domain->stream_iovad);
> > > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > > > index f92f22a7267d..17efa5555b3f 100644
> > > > --- a/drivers/vdpa/vdpa_user/iova_domain.h
> > > > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > > > @@ -74,7 +74,10 @@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
> > > >  int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > > >                                        struct page **pages, int count);
> > > >
> > > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
> > > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > +                                          struct page **pages);
> > > > +
> > > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain);
> > > >
> > > >  void vduse_domain_destroy(struct vduse_iova_domain *domain);
> > > >
> > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > index 7ae99691efdf..5d8d5810df57 100644
> > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > @@ -1030,6 +1030,7 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > >  static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > >                                 u64 iova, u64 size)
> > > >  {
> > > > +       struct page **pages;
> > > >         int ret;
> > > >
> > > >         mutex_lock(&dev->mem_lock);
> > > > @@ -1044,7 +1045,8 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > >         if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > >                 goto unlock;
> > > >
> > > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(dev->domain);
> > > > +       vduse_domain_remove_user_bounce_pages(dev->domain, pages);
> > > >         unpin_user_pages_dirty_lock(dev->umem->pages,
> > > >                                     dev->umem->npages, true);
> > > >         atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > >
> > > We miss a kfree(pages); here?
> > no.
> > i've moved it into vduse_domain_remove_user_bounce_pages.
>
> Ok, but it seems tricky e.g allocated by the caller but freed in
> callee. And I think I missed some important issues in the previous
> review: The check of user_bounce_pages must be done under the
> bounce_lock, otherwise it might race with umem_reg.
>
> So in the case of release(), we know the device is gone, so there's no
> need to allocate pages that will be released soon. So we can pass NULL
> as a hint and just assign bounce_page to NULL in
> vduse_domain_remove_user_bounce_pages().
>
> And in the case of vduse_dev_dereg_umem(), we need to allocate the
> pages without checking user_bounce_pages. So in
> vduse_domain_remove_user_bounce_pages() if we can free the allocated
> pages as well as the pages in the following check
>
>         if (!domain->user_bounce_pages)
>                 goto out;
>
> What do you think?

I am not a vdpa guy, but  changing the current logic is another patch.
From mm perspective, I can only address the __GFP_NOFAIL issue.

I actually prefer you guys handle it  directly:-) I'd rather report a BUG
instead. TBH, I know nothing about vpda.

>
> Thanks
>
> >
> > >
> > > Thanks
> > >
> > > > --
> > > > 2.34.1
> > > >
> > >
Thanks
Barry


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  4:11         ` Barry Song
@ 2024-07-31  4:13           ` Jason Wang
  2024-07-31  5:05             ` Barry Song
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2024-07-31  4:13 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Michael S. Tsirkin,
	Xuan Zhuo, Eugenio Pérez, Maxime Coquelin

On Wed, Jul 31, 2024 at 12:12 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 11:58 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jul 31, 2024 at 11:15 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, Jul 31, 2024 at 11:10 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Jul 31, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > > loop within non-sleepable contexts.
> > > > >
> > > > > static inline struct page *
> > > > > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > >                                                 struct alloc_context *ac)
> > > > > {
> > > > >         ...
> > > > >         /*
> > > > >          * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> > > > >          * we always retry
> > > > >          */
> > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > >                 /*
> > > > >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > >                  * of any new users that actually require GFP_NOWAIT
> > > > >                  */
> > > > >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > >                         goto fail;
> > > > >                 ...
> > > > >         }
> > > > >         ...
> > > > > fail:
> > > > >         warn_alloc(gfp_mask, ac->nodemask,
> > > > >                         "page allocation failure: order:%u", order);
> > > > > got_pg:
> > > > >         return page;
> > > > > }
> > > > >
> > > > > Let's move the memory allocation out of the atomic context and use
> > > > > the normal sleepable context to get pages.
> > > > >
> > > > > [RFT]: This has only been compile-tested; I'd prefer if the VDPA maintainers
> > > > > handles it.
> > > > >
> > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > ---
> > > > >  drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
> > > > >  drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
> > > > >  drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
> > > > >  3 files changed, 33 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > index 791d38d6284c..9318f059a8b5 100644
> > > > > --- a/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > @@ -283,7 +283,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > >         return ret;
> > > > >  }
> > > > >
> > > > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain)
> > > > > +{
> > > > > +       struct page **pages;
> > > > > +       unsigned long count, i;
> > > > > +
> > > > > +       if (!domain->user_bounce_pages)
> > > > > +               return NULL;
> > > > > +
> > > > > +       count = domain->bounce_size >> PAGE_SHIFT;
> > > > > +       pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> > > > > +       for (i = 0; i < count; i++)
> > > > > +               pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> > > > > +
> > > > > +       return pages;
> > > > > +}
> > > > > +
> > > > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages)
> > > > >  {
> > > > >         struct vduse_bounce_map *map;
> > > > >         unsigned long i, count;
> > > > > @@ -294,15 +310,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > >
> > > > >         count = domain->bounce_size >> PAGE_SHIFT;
> > > > >         for (i = 0; i < count; i++) {
> > > > > -               struct page *page = NULL;
> > > > > +               struct page *page = pages[i];
> > > > >
> > > > >                 map = &domain->bounce_maps[i];
> > > > > -               if (WARN_ON(!map->bounce_page))
> > > > > +               if (WARN_ON(!map->bounce_page)) {
> > > > > +                       put_page(page);
> > > > >                         continue;
> > > > > +               }
> > > > >
> > > > >                 /* Copy user page to kernel page if it's in use */
> > > > >                 if (map->orig_phys != INVALID_PHYS_ADDR) {
> > > > > -                       page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> > > > >                         memcpy_from_page(page_address(page),
> > > > >                                          map->bounce_page, 0, PAGE_SIZE);
> > > > >                 }
> > > > > @@ -310,6 +327,7 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > >                 map->bounce_page = page;
> > > > >         }
> > > > >         domain->user_bounce_pages = false;
> > > > > +       kfree(pages);
> > > > >  out:
> > > > >         write_unlock(&domain->bounce_lock);
> > > > >  }
> > > > > @@ -543,10 +561,13 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> > > > >  static int vduse_domain_release(struct inode *inode, struct file *file)
> > > > >  {
> > > > >         struct vduse_iova_domain *domain = file->private_data;
> > > > > +       struct page **pages;
> > > > > +
> > > > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(domain);
> > > > >
> > > > >         spin_lock(&domain->iotlb_lock);
> > > > >         vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
> > > > > -       vduse_domain_remove_user_bounce_pages(domain);
> > > > > +       vduse_domain_remove_user_bounce_pages(domain, pages);
> > > > >         vduse_domain_free_kernel_bounce_pages(domain);
> > > > >         spin_unlock(&domain->iotlb_lock);
> > > > >         put_iova_domain(&domain->stream_iovad);
> > > > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > index f92f22a7267d..17efa5555b3f 100644
> > > > > --- a/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > @@ -74,7 +74,10 @@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
> > > > >  int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > >                                        struct page **pages, int count);
> > > > >
> > > > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
> > > > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > > +                                          struct page **pages);
> > > > > +
> > > > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain);
> > > > >
> > > > >  void vduse_domain_destroy(struct vduse_iova_domain *domain);
> > > > >
> > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > index 7ae99691efdf..5d8d5810df57 100644
> > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > @@ -1030,6 +1030,7 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > > >  static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > >                                 u64 iova, u64 size)
> > > > >  {
> > > > > +       struct page **pages;
> > > > >         int ret;
> > > > >
> > > > >         mutex_lock(&dev->mem_lock);
> > > > > @@ -1044,7 +1045,8 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > >         if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > > >                 goto unlock;
> > > > >
> > > > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(dev->domain);
> > > > > +       vduse_domain_remove_user_bounce_pages(dev->domain, pages);
> > > > >         unpin_user_pages_dirty_lock(dev->umem->pages,
> > > > >                                     dev->umem->npages, true);
> > > > >         atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > >
> > > > We miss a kfree(pages); here?
> > > no.
> > > i've moved it into vduse_domain_remove_user_bounce_pages.
> >
> > Ok, but it seems tricky e.g allocated by the caller but freed in
> > callee. And I think I missed some important issues in the previous
> > review: The check of user_bounce_pages must be done under the
> > bounce_lock, otherwise it might race with umem_reg.
> >
> > So in the case of release(), we know the device is gone, so there's no
> > need to allocate pages that will be released soon. So we can pass NULL
> > as a hint and just assign bounce_page to NULL in
> > vduse_domain_remove_user_bounce_pages().
> >
> > And in the case of vduse_dev_dereg_umem(), we need to allocate the
> > pages without checking user_bounce_pages. So in
> > vduse_domain_remove_user_bounce_pages() if we can free the allocated
> > pages as well as the pages in the following check
> >
> >         if (!domain->user_bounce_pages)
> >                 goto out;
> >
> > What do you think?
>
> I am not a vdpa guy, but  changing the current logic is another patch.
> From mm perspective, I can only address the __GFP_NOFAIL issue.
>
> I actually prefer you guys handle it  directly:-) I'd rather report a BUG
> instead. TBH, I know nothing about vpda.

Fine, let me post a patch for this (no later than the end of this week).

Thanks

>
> >
> > Thanks
> >
> > >
> > > >
> > > > Thanks
> > > >
> > > > > --
> > > > > 2.34.1
> > > > >
> > > >
> Thanks
> Barry
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  4:13           ` Jason Wang
@ 2024-07-31  5:05             ` Barry Song
  2024-07-31 10:20               ` Tetsuo Handa
  2024-08-01  2:30               ` Jason Wang
  0 siblings, 2 replies; 28+ messages in thread
From: Barry Song @ 2024-07-31  5:05 UTC (permalink / raw)
  To: Jason Wang, penguin-kernel
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Michael S. Tsirkin,
	Xuan Zhuo, Eugenio Pérez, Maxime Coquelin

On Wed, Jul 31, 2024 at 12:13 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jul 31, 2024 at 12:12 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Jul 31, 2024 at 11:58 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Jul 31, 2024 at 11:15 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Wed, Jul 31, 2024 at 11:10 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jul 31, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > > > loop within non-sleepable contexts.
> > > > > >
> > > > > > static inline struct page *
> > > > > > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > > >                                                 struct alloc_context *ac)
> > > > > > {
> > > > > >         ...
> > > > > >         /*
> > > > > >          * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> > > > > >          * we always retry
> > > > > >          */
> > > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > > >                 /*
> > > > > >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > >                  * of any new users that actually require GFP_NOWAIT
> > > > > >                  */
> > > > > >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > >                         goto fail;
> > > > > >                 ...
> > > > > >         }
> > > > > >         ...
> > > > > > fail:
> > > > > >         warn_alloc(gfp_mask, ac->nodemask,
> > > > > >                         "page allocation failure: order:%u", order);
> > > > > > got_pg:
> > > > > >         return page;
> > > > > > }
> > > > > >
> > > > > > Let's move the memory allocation out of the atomic context and use
> > > > > > the normal sleepable context to get pages.
> > > > > >
> > > > > > [RFT]: This has only been compile-tested; I'd prefer if the VDPA maintainers
> > > > > > handles it.
> > > > > >
> > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > > ---
> > > > > >  drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
> > > > > >  drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
> > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
> > > > > >  3 files changed, 33 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > > index 791d38d6284c..9318f059a8b5 100644
> > > > > > --- a/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > > @@ -283,7 +283,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > > >         return ret;
> > > > > >  }
> > > > > >
> > > > > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain)
> > > > > > +{
> > > > > > +       struct page **pages;
> > > > > > +       unsigned long count, i;
> > > > > > +
> > > > > > +       if (!domain->user_bounce_pages)
> > > > > > +               return NULL;
> > > > > > +
> > > > > > +       count = domain->bounce_size >> PAGE_SHIFT;
> > > > > > +       pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> > > > > > +       for (i = 0; i < count; i++)
> > > > > > +               pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> > > > > > +
> > > > > > +       return pages;
> > > > > > +}
> > > > > > +
> > > > > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages)
> > > > > >  {
> > > > > >         struct vduse_bounce_map *map;
> > > > > >         unsigned long i, count;
> > > > > > @@ -294,15 +310,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > > >
> > > > > >         count = domain->bounce_size >> PAGE_SHIFT;
> > > > > >         for (i = 0; i < count; i++) {
> > > > > > -               struct page *page = NULL;
> > > > > > +               struct page *page = pages[i];
> > > > > >
> > > > > >                 map = &domain->bounce_maps[i];
> > > > > > -               if (WARN_ON(!map->bounce_page))
> > > > > > +               if (WARN_ON(!map->bounce_page)) {
> > > > > > +                       put_page(page);
> > > > > >                         continue;
> > > > > > +               }
> > > > > >
> > > > > >                 /* Copy user page to kernel page if it's in use */
> > > > > >                 if (map->orig_phys != INVALID_PHYS_ADDR) {
> > > > > > -                       page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> > > > > >                         memcpy_from_page(page_address(page),
> > > > > >                                          map->bounce_page, 0, PAGE_SIZE);
> > > > > >                 }
> > > > > > @@ -310,6 +327,7 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > > >                 map->bounce_page = page;
> > > > > >         }
> > > > > >         domain->user_bounce_pages = false;
> > > > > > +       kfree(pages);
> > > > > >  out:
> > > > > >         write_unlock(&domain->bounce_lock);
> > > > > >  }
> > > > > > @@ -543,10 +561,13 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > >  static int vduse_domain_release(struct inode *inode, struct file *file)
> > > > > >  {
> > > > > >         struct vduse_iova_domain *domain = file->private_data;
> > > > > > +       struct page **pages;
> > > > > > +
> > > > > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(domain);
> > > > > >
> > > > > >         spin_lock(&domain->iotlb_lock);
> > > > > >         vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
> > > > > > -       vduse_domain_remove_user_bounce_pages(domain);
> > > > > > +       vduse_domain_remove_user_bounce_pages(domain, pages);
> > > > > >         vduse_domain_free_kernel_bounce_pages(domain);
> > > > > >         spin_unlock(&domain->iotlb_lock);
> > > > > >         put_iova_domain(&domain->stream_iovad);
> > > > > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > > index f92f22a7267d..17efa5555b3f 100644
> > > > > > --- a/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > > @@ -74,7 +74,10 @@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
> > > > > >  int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > > >                                        struct page **pages, int count);
> > > > > >
> > > > > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
> > > > > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > > > +                                          struct page **pages);
> > > > > > +
> > > > > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain);
> > > > > >
> > > > > >  void vduse_domain_destroy(struct vduse_iova_domain *domain);
> > > > > >
> > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > index 7ae99691efdf..5d8d5810df57 100644
> > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > @@ -1030,6 +1030,7 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > > > >  static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > > >                                 u64 iova, u64 size)
> > > > > >  {
> > > > > > +       struct page **pages;
> > > > > >         int ret;
> > > > > >
> > > > > >         mutex_lock(&dev->mem_lock);
> > > > > > @@ -1044,7 +1045,8 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > > >         if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > > > >                 goto unlock;
> > > > > >
> > > > > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(dev->domain);
> > > > > > +       vduse_domain_remove_user_bounce_pages(dev->domain, pages);
> > > > > >         unpin_user_pages_dirty_lock(dev->umem->pages,
> > > > > >                                     dev->umem->npages, true);
> > > > > >         atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > > >
> > > > > We miss a kfree(pages); here?
> > > > no.
> > > > i've moved it into vduse_domain_remove_user_bounce_pages.
> > >
> > > Ok, but it seems tricky e.g allocated by the caller but freed in
> > > callee. And I think I missed some important issues in the previous
> > > review: The check of user_bounce_pages must be done under the
> > > bounce_lock, otherwise it might race with umem_reg.
> > >
> > > So in the case of release(), we know the device is gone, so there's no
> > > need to allocate pages that will be released soon. So we can pass NULL
> > > as a hint and just assign bounce_page to NULL in
> > > vduse_domain_remove_user_bounce_pages().
> > >
> > > And in the case of vduse_dev_dereg_umem(), we need to allocate the
> > > pages without checking user_bounce_pages. So in
> > > vduse_domain_remove_user_bounce_pages() if we can free the allocated
> > > pages as well as the pages in the following check
> > >
> > >         if (!domain->user_bounce_pages)
> > >                 goto out;
> > >
> > > What do you think?
> >
> > I am not a vdpa guy, but  changing the current logic is another patch.
> > From mm perspective, I can only address the __GFP_NOFAIL issue.
> >
> > I actually prefer you guys handle it  directly:-) I'd rather report a BUG
> > instead. TBH, I know nothing about vpda.
>
> Fine, let me post a patch for this (no later than the end of this week).
>
Jason,
Thank you very much. Also, Tetsuo reminded me that kmalloc_array() might be
problematic if the count is too large:
 pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);

You might want to consider using vmalloc_array() or kvmalloc_array() instead
when you send a new version.

> Thanks
>
> >
> > >
> > > Thanks
> > >
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > > --
> > > > > > 2.34.1
> > > > > >
> > > > >

Thanks
Barry


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  5:05             ` Barry Song
@ 2024-07-31 10:20               ` Tetsuo Handa
  2024-08-01  2:37                 ` Jason Wang
  2024-08-01  2:30               ` Jason Wang
  1 sibling, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2024-07-31 10:20 UTC (permalink / raw)
  To: Barry Song, Jason Wang
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Michael S. Tsirkin,
	Xuan Zhuo, Eugenio Pérez, Maxime Coquelin

On 2024/07/31 14:05, Barry Song wrote:
> Jason,
> Thank you very much. Also, Tetsuo reminded me that kmalloc_array() might be
> problematic if the count is too large:
>  pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);

If "count" is guaranteed to be count <= 16, this might be tolerable.

Consider a situation where current thread was chosen as an global OOM victim.
Trying to allocate "count" pages using

	for (i = 0; i < count; i++)
		pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);

is not good.

> 
> You might want to consider using vmalloc_array() or kvmalloc_array() instead
> when you send a new version.

There is a limitation at https://elixir.bootlin.com/linux/v6.11-rc1/source/mm/page_alloc.c#L3033
that you must satisfy count <= PAGE_SIZE * 2 / sizeof(*pages) if you use __GFP_NOFAIL.

But as already explained above, allocating 1024 pages (assuming PAGE_SIZE is 4096 and
pointer size is 8) when current thread was chosen as an OOM victim is not recommended.
You should implement proper error handling instead of using __GFP_NOFAIL if count can
become large.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31 10:20               ` Tetsuo Handa
@ 2024-08-01  2:37                 ` Jason Wang
  2024-08-05  1:32                   ` Barry Song
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2024-08-01  2:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, lstoakes, mhocko, penberg, rientjes,
	roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez,
	Maxime Coquelin

On Wed, Jul 31, 2024 at 6:21 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2024/07/31 14:05, Barry Song wrote:
> > Jason,
> > Thank you very much. Also, Tetsuo reminded me that kmalloc_array() might be
> > problematic if the count is too large:
> >  pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
>
> If "count" is guaranteed to be count <= 16, this might be tolerable.

It's not unfortunately, the maximum bounce buffer size is:

#define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)

>
> Consider a situation where current thread was chosen as an global OOM victim.
> Trying to allocate "count" pages using
>
>         for (i = 0; i < count; i++)
>                 pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
>
> is not good.

Right, I wonder if we need to add a shrink to reclaim the pages that
belong to VDUSE bounce pages.

>
> >
> > You might want to consider using vmalloc_array() or kvmalloc_array() instead
> > when you send a new version.
>
> There is a limitation at https://elixir.bootlin.com/linux/v6.11-rc1/source/mm/page_alloc.c#L3033
> that you must satisfy count <= PAGE_SIZE * 2 / sizeof(*pages) if you use __GFP_NOFAIL.
>
> But as already explained above, allocating 1024 pages (assuming PAGE_SIZE is 4096 and
> pointer size is 8) when current thread was chosen as an OOM victim is not recommended.
> You should implement proper error handling instead of using __GFP_NOFAIL if count can
> become large.

I think I need to consider a way to avoid __GFP_NOFAIL. A easy way is
not to free the kernel bounce pages, then we don't need to allocate
them again.

Thanks

>
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-08-01  2:37                 ` Jason Wang
@ 2024-08-05  1:32                   ` Barry Song
  2024-08-05  8:19                     ` Jason Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-08-05  1:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tetsuo Handa, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, lstoakes, mhocko, penberg, rientjes,
	roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez,
	Maxime Coquelin

On Thu, Aug 1, 2024 at 2:37 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jul 31, 2024 at 6:21 PM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > On 2024/07/31 14:05, Barry Song wrote:
> > > Jason,
> > > Thank you very much. Also, Tetsuo reminded me that kmalloc_array() might be
> > > problematic if the count is too large:
> > >  pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> >
> > If "count" is guaranteed to be count <= 16, this might be tolerable.
>
> It's not unfortunately, the maximum bounce buffer size is:
>
> #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
>
> >
> > Consider a situation where current thread was chosen as an global OOM victim.
> > Trying to allocate "count" pages using
> >
> >         for (i = 0; i < count; i++)
> >                 pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> >
> > is not good.
>
> Right, I wonder if we need to add a shrink to reclaim the pages that
> belong to VDUSE bounce pages.
>
> >
> > >
> > > You might want to consider using vmalloc_array() or kvmalloc_array() instead
> > > when you send a new version.
> >
> > There is a limitation at https://elixir.bootlin.com/linux/v6.11-rc1/source/mm/page_alloc.c#L3033
> > that you must satisfy count <= PAGE_SIZE * 2 / sizeof(*pages) if you use __GFP_NOFAIL.
> >
> > But as already explained above, allocating 1024 pages (assuming PAGE_SIZE is 4096 and
> > pointer size is 8) when current thread was chosen as an OOM victim is not recommended.
> > You should implement proper error handling instead of using __GFP_NOFAIL if count can
> > become large.
>
> I think I need to consider a way to avoid __GFP_NOFAIL. A easy way is
> not to free the kernel bounce pages, then we don't need to allocate
> them again.

Let's try to do a fix for this patch as we are waiting for your official patch
in mm.

I guess, further optimization can be a separate patch later in the driver's
tree :-)

>
> Thanks
>
> >
> >
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-08-05  1:32                   ` Barry Song
@ 2024-08-05  8:19                     ` Jason Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jason Wang @ 2024-08-05  8:19 UTC (permalink / raw)
  To: Barry Song
  Cc: Tetsuo Handa, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, lstoakes, mhocko, penberg, rientjes,
	roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez,
	Maxime Coquelin, Yongji Xie

On Mon, Aug 5, 2024 at 9:32 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 2:37 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jul 31, 2024 at 6:21 PM Tetsuo Handa
> > <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > >
> > > On 2024/07/31 14:05, Barry Song wrote:
> > > > Jason,
> > > > Thank you very much. Also, Tetsuo reminded me that kmalloc_array() might be
> > > > problematic if the count is too large:
> > > >  pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> > >
> > > If "count" is guaranteed to be count <= 16, this might be tolerable.
> >
> > It's not unfortunately, the maximum bounce buffer size is:
> >
> > #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> >
> > >
> > > Consider a situation where current thread was chosen as an global OOM victim.
> > > Trying to allocate "count" pages using
> > >
> > >         for (i = 0; i < count; i++)
> > >                 pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> > >
> > > is not good.
> >
> > Right, I wonder if we need to add a shrink to reclaim the pages that
> > belong to VDUSE bounce pages.
> >
> > >
> > > >
> > > > You might want to consider using vmalloc_array() or kvmalloc_array() instead
> > > > when you send a new version.
> > >
> > > There is a limitation at https://elixir.bootlin.com/linux/v6.11-rc1/source/mm/page_alloc.c#L3033
> > > that you must satisfy count <= PAGE_SIZE * 2 / sizeof(*pages) if you use __GFP_NOFAIL.
> > >
> > > But as already explained above, allocating 1024 pages (assuming PAGE_SIZE is 4096 and
> > > pointer size is 8) when current thread was chosen as an OOM victim is not recommended.
> > > You should implement proper error handling instead of using __GFP_NOFAIL if count can
> > > become large.
> >
> > I think I need to consider a way to avoid __GFP_NOFAIL. A easy way is
> > not to free the kernel bounce pages, then we don't need to allocate
> > them again.
>
> Let's try to do a fix for this patch as we are waiting for your official patch
> in mm.

Will post this soon. One note here is that I don't have handy
usersapce that uses userspace bounce pages (neither libvduse nor DPDK
did that).

I hope YongJi can review and give a test on that.

Thanks

>
> I guess, further optimization can be a separate patch later in the driver's
> tree :-)
>
> >
> > Thanks
> >
> > >
> > >
> >
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL
  2024-07-31  5:05             ` Barry Song
  2024-07-31 10:20               ` Tetsuo Handa
@ 2024-08-01  2:30               ` Jason Wang
  1 sibling, 0 replies; 28+ messages in thread
From: Jason Wang @ 2024-08-01  2:30 UTC (permalink / raw)
  To: Barry Song
  Cc: penguin-kernel, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, lstoakes, mhocko, penberg, rientjes,
	roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez,
	Maxime Coquelin

On Wed, Jul 31, 2024 at 1:05 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 12:13 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jul 31, 2024 at 12:12 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, Jul 31, 2024 at 11:58 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Jul 31, 2024 at 11:15 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jul 31, 2024 at 11:10 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jul 31, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > >
> > > > > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > > > > loop within non-sleepable contexts.
> > > > > > >
> > > > > > > static inline struct page *
> > > > > > > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > > > >                                                 struct alloc_context *ac)
> > > > > > > {
> > > > > > >         ...
> > > > > > >         /*
> > > > > > >          * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> > > > > > >          * we always retry
> > > > > > >          */
> > > > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > > > >                 /*
> > > > > > >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > > >                  * of any new users that actually require GFP_NOWAIT
> > > > > > >                  */
> > > > > > >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > > >                         goto fail;
> > > > > > >                 ...
> > > > > > >         }
> > > > > > >         ...
> > > > > > > fail:
> > > > > > >         warn_alloc(gfp_mask, ac->nodemask,
> > > > > > >                         "page allocation failure: order:%u", order);
> > > > > > > got_pg:
> > > > > > >         return page;
> > > > > > > }
> > > > > > >
> > > > > > > Let's move the memory allocation out of the atomic context and use
> > > > > > > the normal sleepable context to get pages.
> > > > > > >
> > > > > > > [RFT]: This has only been compile-tested; I'd prefer if the VDPA maintainers
> > > > > > > handles it.
> > > > > > >
> > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > > > ---
> > > > > > >  drivers/vdpa/vdpa_user/iova_domain.c | 31 +++++++++++++++++++++++-----
> > > > > > >  drivers/vdpa/vdpa_user/iova_domain.h |  5 ++++-
> > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c   |  4 +++-
> > > > > > >  3 files changed, 33 insertions(+), 7 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > > > index 791d38d6284c..9318f059a8b5 100644
> > > > > > > --- a/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > > > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > > > > > > @@ -283,7 +283,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > > > >         return ret;
> > > > > > >  }
> > > > > > >
> > > > > > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > > > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain)
> > > > > > > +{
> > > > > > > +       struct page **pages;
> > > > > > > +       unsigned long count, i;
> > > > > > > +
> > > > > > > +       if (!domain->user_bounce_pages)
> > > > > > > +               return NULL;
> > > > > > > +
> > > > > > > +       count = domain->bounce_size >> PAGE_SHIFT;
> > > > > > > +       pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
> > > > > > > +       for (i = 0; i < count; i++)
> > > > > > > +               pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
> > > > > > > +
> > > > > > > +       return pages;
> > > > > > > +}
> > > > > > > +
> > > > > > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages)
> > > > > > >  {
> > > > > > >         struct vduse_bounce_map *map;
> > > > > > >         unsigned long i, count;
> > > > > > > @@ -294,15 +310,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > > > >
> > > > > > >         count = domain->bounce_size >> PAGE_SHIFT;
> > > > > > >         for (i = 0; i < count; i++) {
> > > > > > > -               struct page *page = NULL;
> > > > > > > +               struct page *page = pages[i];
> > > > > > >
> > > > > > >                 map = &domain->bounce_maps[i];
> > > > > > > -               if (WARN_ON(!map->bounce_page))
> > > > > > > +               if (WARN_ON(!map->bounce_page)) {
> > > > > > > +                       put_page(page);
> > > > > > >                         continue;
> > > > > > > +               }
> > > > > > >
> > > > > > >                 /* Copy user page to kernel page if it's in use */
> > > > > > >                 if (map->orig_phys != INVALID_PHYS_ADDR) {
> > > > > > > -                       page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> > > > > > >                         memcpy_from_page(page_address(page),
> > > > > > >                                          map->bounce_page, 0, PAGE_SIZE);
> > > > > > >                 }
> > > > > > > @@ -310,6 +327,7 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > > > > > >                 map->bounce_page = page;
> > > > > > >         }
> > > > > > >         domain->user_bounce_pages = false;
> > > > > > > +       kfree(pages);
> > > > > > >  out:
> > > > > > >         write_unlock(&domain->bounce_lock);
> > > > > > >  }
> > > > > > > @@ -543,10 +561,13 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > > >  static int vduse_domain_release(struct inode *inode, struct file *file)
> > > > > > >  {
> > > > > > >         struct vduse_iova_domain *domain = file->private_data;
> > > > > > > +       struct page **pages;
> > > > > > > +
> > > > > > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(domain);
> > > > > > >
> > > > > > >         spin_lock(&domain->iotlb_lock);
> > > > > > >         vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
> > > > > > > -       vduse_domain_remove_user_bounce_pages(domain);
> > > > > > > +       vduse_domain_remove_user_bounce_pages(domain, pages);
> > > > > > >         vduse_domain_free_kernel_bounce_pages(domain);
> > > > > > >         spin_unlock(&domain->iotlb_lock);
> > > > > > >         put_iova_domain(&domain->stream_iovad);
> > > > > > > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > > > index f92f22a7267d..17efa5555b3f 100644
> > > > > > > --- a/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > > > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > > > > > > @@ -74,7 +74,10 @@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
> > > > > > >  int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > > > >                                        struct page **pages, int count);
> > > > > > >
> > > > > > > -void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
> > > > > > > +void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain,
> > > > > > > +                                          struct page **pages);
> > > > > > > +
> > > > > > > +struct page **vduse_domain_alloc_pages_to_remove_bounce(struct vduse_iova_domain *domain);
> > > > > > >
> > > > > > >  void vduse_domain_destroy(struct vduse_iova_domain *domain);
> > > > > > >
> > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > index 7ae99691efdf..5d8d5810df57 100644
> > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > @@ -1030,6 +1030,7 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > > > > >  static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > > > >                                 u64 iova, u64 size)
> > > > > > >  {
> > > > > > > +       struct page **pages;
> > > > > > >         int ret;
> > > > > > >
> > > > > > >         mutex_lock(&dev->mem_lock);
> > > > > > > @@ -1044,7 +1045,8 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > > > >         if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > > > > >                 goto unlock;
> > > > > > >
> > > > > > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > > > > +       pages = vduse_domain_alloc_pages_to_remove_bounce(dev->domain);
> > > > > > > +       vduse_domain_remove_user_bounce_pages(dev->domain, pages);
> > > > > > >         unpin_user_pages_dirty_lock(dev->umem->pages,
> > > > > > >                                     dev->umem->npages, true);
> > > > > > >         atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > > > >
> > > > > > We miss a kfree(pages); here?
> > > > > no.
> > > > > i've moved it into vduse_domain_remove_user_bounce_pages.
> > > >
> > > > Ok, but it seems tricky e.g allocated by the caller but freed in
> > > > callee. And I think I missed some important issues in the previous
> > > > review: The check of user_bounce_pages must be done under the
> > > > bounce_lock, otherwise it might race with umem_reg.
> > > >
> > > > So in the case of release(), we know the device is gone, so there's no
> > > > need to allocate pages that will be released soon. So we can pass NULL
> > > > as a hint and just assign bounce_page to NULL in
> > > > vduse_domain_remove_user_bounce_pages().
> > > >
> > > > And in the case of vduse_dev_dereg_umem(), we need to allocate the
> > > > pages without checking user_bounce_pages. So in
> > > > vduse_domain_remove_user_bounce_pages() if we can free the allocated
> > > > pages as well as the pages in the following check
> > > >
> > > >         if (!domain->user_bounce_pages)
> > > >                 goto out;
> > > >
> > > > What do you think?
> > >
> > > I am not a vdpa guy, but  changing the current logic is another patch.
> > > From mm perspective, I can only address the __GFP_NOFAIL issue.
> > >
> > > I actually prefer you guys handle it  directly:-) I'd rather report a BUG
> > > instead. TBH, I know nothing about vpda.
> >
> > Fine, let me post a patch for this (no later than the end of this week).
> >
> Jason,
> Thank you very much. Also, Tetsuo reminded me that kmalloc_array() might be
> problematic if the count is too large:
>  pages = kmalloc_array(count, sizeof(*pages), GFP_KERNEL | __GFP_NOFAIL);
>

I think it's a side effect of __GFP_NOFAIL?

> You might want to consider using vmalloc_array() or kvmalloc_array() instead
> when you send a new version.

Will consider that.

Thanks


>
> > Thanks
> >
> > >
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > > --
> > > > > > > 2.34.1
> > > > > > >
> > > > > >
>
> Thanks
> Barry
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 2/4] mm: Document __GFP_NOFAIL must be blockable
  2024-07-31  0:01 [PATCH v2 0/4] mm: clarify nofail memory allocation Barry Song
  2024-07-31  0:01 ` [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL Barry Song
@ 2024-07-31  0:01 ` Barry Song
  2024-07-31 10:18   ` Vlastimil Babka
  2024-07-31 16:26   ` Christoph Hellwig
  2024-07-31  0:01 ` [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
  2024-07-31  0:01 ` [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
  3 siblings, 2 replies; 28+ messages in thread
From: Barry Song @ 2024-07-31  0:01 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	vbabka, virtualization

From: Barry Song <v-songbaohua@oppo.com>

Non-blocking allocation with __GFP_NOFAIL is not supported and may
still result in NULL pointers (if we don't return NULL, we result
in busy-loop within non-sleepable contexts):

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
						struct alloc_context *ac)
{
	...
	/*
	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
	 * we always retry
	 */
	if (gfp_mask & __GFP_NOFAIL) {
		/*
		 * All existing users of the __GFP_NOFAIL are blockable, so warn
		 * of any new users that actually require GFP_NOWAIT
		 */
		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
			goto fail;
		...
	}
	...
fail:
	warn_alloc(gfp_mask, ac->nodemask,
			"page allocation failure: order:%u", order);
got_pg:
	return page;
}

Highlight this in the documentation of __GFP_NOFAIL so that non-mm
subsystems can reject any illegal usage of __GFP_NOFAIL with
GFP_ATOMIC, GFP_NOWAIT, etc.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/gfp_types.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 313be4ad79fd..4a1fa7706b0c 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -215,7 +215,8 @@ enum {
  * the caller still has to check for failures) while costly requests try to be
  * not disruptive and back off even without invoking the OOM killer.
  * The following three modifiers might be used to override some of these
- * implicit rules.
+ * implicit rules. Please note that all of them must be used along with
+ * %__GFP_DIRECT_RECLAIM flag.
  *
  * %__GFP_NORETRY: The VM implementation will try only very lightweight
  * memory direct reclaim to get some memory under memory pressure (thus
@@ -246,6 +247,8 @@ enum {
  * cannot handle allocation failures. The allocation could block
  * indefinitely but will never return with failure. Testing for
  * failure is pointless.
+ * It _must_ be blockable and used together with __GFP_DIRECT_RECLAIM.
+ * It should _never_ be used in non-sleepable contexts.
  * New users should be evaluated carefully (and the flag should be
  * used only when there is no reasonable failure policy) but it is
  * definitely preferable to use the flag rather than opencode endless
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: Document __GFP_NOFAIL must be blockable
  2024-07-31  0:01 ` [PATCH v2 2/4] mm: Document __GFP_NOFAIL must be blockable Barry Song
@ 2024-07-31 10:18   ` Vlastimil Babka
  2024-07-31 16:26   ` Christoph Hellwig
  1 sibling, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2024-07-31 10:18 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	virtualization

On 7/31/24 2:01 AM, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Non-blocking allocation with __GFP_NOFAIL is not supported and may
> still result in NULL pointers (if we don't return NULL, we result
> in busy-loop within non-sleepable contexts):
> 
> static inline struct page *
> __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> 						struct alloc_context *ac)
> {
> 	...
> 	/*
> 	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> 	 * we always retry
> 	 */
> 	if (gfp_mask & __GFP_NOFAIL) {
> 		/*
> 		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> 		 * of any new users that actually require GFP_NOWAIT
> 		 */
> 		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> 			goto fail;
> 		...
> 	}
> 	...
> fail:
> 	warn_alloc(gfp_mask, ac->nodemask,
> 			"page allocation failure: order:%u", order);
> got_pg:
> 	return page;
> }
> 
> Highlight this in the documentation of __GFP_NOFAIL so that non-mm
> subsystems can reject any illegal usage of __GFP_NOFAIL with
> GFP_ATOMIC, GFP_NOWAIT, etc.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/gfp_types.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> index 313be4ad79fd..4a1fa7706b0c 100644
> --- a/include/linux/gfp_types.h
> +++ b/include/linux/gfp_types.h
> @@ -215,7 +215,8 @@ enum {
>   * the caller still has to check for failures) while costly requests try to be
>   * not disruptive and back off even without invoking the OOM killer.
>   * The following three modifiers might be used to override some of these
> - * implicit rules.
> + * implicit rules. Please note that all of them must be used along with
> + * %__GFP_DIRECT_RECLAIM flag.
>   *
>   * %__GFP_NORETRY: The VM implementation will try only very lightweight
>   * memory direct reclaim to get some memory under memory pressure (thus
> @@ -246,6 +247,8 @@ enum {
>   * cannot handle allocation failures. The allocation could block
>   * indefinitely but will never return with failure. Testing for
>   * failure is pointless.
> + * It _must_ be blockable and used together with __GFP_DIRECT_RECLAIM.
> + * It should _never_ be used in non-sleepable contexts.

So this second line is a bit redundant as you can't use __GFP_DIRECT_RECLAIM
in non-sleepable contexts. But can't hurt so not a objection.

>   * New users should be evaluated carefully (and the flag should be
>   * used only when there is no reasonable failure policy) but it is
>   * definitely preferable to use the flag rather than opencode endless



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: Document __GFP_NOFAIL must be blockable
  2024-07-31  0:01 ` [PATCH v2 2/4] mm: Document __GFP_NOFAIL must be blockable Barry Song
  2024-07-31 10:18   ` Vlastimil Babka
@ 2024-07-31 16:26   ` Christoph Hellwig
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-07-31 16:26 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-07-31  0:01 [PATCH v2 0/4] mm: clarify nofail memory allocation Barry Song
  2024-07-31  0:01 ` [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL Barry Song
  2024-07-31  0:01 ` [PATCH v2 2/4] mm: Document __GFP_NOFAIL must be blockable Barry Song
@ 2024-07-31  0:01 ` Barry Song
  2024-07-31  7:11   ` Michal Hocko
                     ` (2 more replies)
  2024-07-31  0:01 ` [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
  3 siblings, 3 replies; 28+ messages in thread
From: Barry Song @ 2024-07-31  0:01 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	vbabka, virtualization, Kees Cook

From: Barry Song <v-songbaohua@oppo.com>

We have cases we still fail though callers might have __GFP_NOFAIL.
Since they don't check the return, we are exposed to the security
risks for NULL deference.

Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
situation.

Christoph Hellwig:
The whole freaking point of __GFP_NOFAIL is that callers don't handle
allocation failures.  So in fact a straight BUG is the right thing
here.

Vlastimil Babka:
It's just not a recoverable situation (WARN_ON is for recoverable
situations). The caller cannot handle allocation failure and at the same
time asked for an impossible allocation. BUG_ON() is a guaranteed oops
with stracktrace etc. We don't need to hope for the later NULL pointer
dereference (which might if really unlucky happen from a different
context where it's no longer obvious what lead to the allocation failing).

Michal Hocko:
Linus tends to be against adding new BUG() calls unless the failure is
absolutely unrecoverable (e.g. corrupted data structures etc.). I am
not sure how he would look at simply incorrect memory allocator usage to
blow up the kernel. Now the argument could be made that those failures
could cause subtle memory corruptions or even be exploitable which might
be a sufficient reason to stop them early.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/slab.h | 4 +++-
 mm/page_alloc.c      | 4 +++-
 mm/util.c            | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index c9cb42203183..4a4d1fdc2afe 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
 {
 	size_t bytes;
 
-	if (unlikely(check_mul_overflow(n, size, &bytes)))
+	if (unlikely(check_mul_overflow(n, size, &bytes))) {
+		BUG_ON(flags & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	return kvmalloc_node_noprof(bytes, flags, node);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c700d2598a26..cc179c3e68df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,8 +4708,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
 	 * There are several places where we assume that the order value is sane
 	 * so bail out early if the request is out of bound.
 	 */
-	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
+	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
+		BUG_ON(gfp & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	gfp &= gfp_allowed_mask;
 	/*
diff --git a/mm/util.c b/mm/util.c
index 0ff5898cc6de..bad3258523b6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
 
 	/* Don't even allow crazy sizes */
 	if (unlikely(size > INT_MAX)) {
+		BUG_ON(flags & __GFP_NOFAIL);
 		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
 		return NULL;
 	}
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-07-31  0:01 ` [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
@ 2024-07-31  7:11   ` Michal Hocko
  2024-07-31 10:29   ` Vlastimil Babka
  2024-07-31 16:28   ` Christoph Hellwig
  2 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2024-07-31  7:11 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Kees Cook

On Wed 31-07-24 12:01:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> We have cases we still fail though callers might have __GFP_NOFAIL.
> Since they don't check the return, we are exposed to the security
> risks for NULL deference.
> 
> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> situation.
> 
> Christoph Hellwig:
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures.  So in fact a straight BUG is the right thing
> here.
> 
> Vlastimil Babka:
> It's just not a recoverable situation (WARN_ON is for recoverable
> situations). The caller cannot handle allocation failure and at the same
> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> with stracktrace etc. We don't need to hope for the later NULL pointer
> dereference (which might if really unlucky happen from a different
> context where it's no longer obvious what lead to the allocation failing).
> 
> Michal Hocko:
> Linus tends to be against adding new BUG() calls unless the failure is
> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> not sure how he would look at simply incorrect memory allocator usage to
> blow up the kernel. Now the argument could be made that those failures
> could cause subtle memory corruptions or even be exploitable which might
> be a sufficient reason to stop them early.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Thanks for separating overflow/out of bounds checks.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/slab.h | 4 +++-
>  mm/page_alloc.c      | 4 +++-
>  mm/util.c            | 1 +
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index c9cb42203183..4a4d1fdc2afe 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>  {
>  	size_t bytes;
>  
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		BUG_ON(flags & __GFP_NOFAIL);
>  		return NULL;
> +	}
>  
>  	return kvmalloc_node_noprof(bytes, flags, node);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c700d2598a26..cc179c3e68df 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,8 +4708,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>  	 * There are several places where we assume that the order value is sane
>  	 * so bail out early if the request is out of bound.
>  	 */
> -	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> +	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> +		BUG_ON(gfp & __GFP_NOFAIL);
>  		return NULL;
> +	}
>  
>  	gfp &= gfp_allowed_mask;
>  	/*
> diff --git a/mm/util.c b/mm/util.c
> index 0ff5898cc6de..bad3258523b6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>  
>  	/* Don't even allow crazy sizes */
>  	if (unlikely(size > INT_MAX)) {
> +		BUG_ON(flags & __GFP_NOFAIL);
>  		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
>  		return NULL;
>  	}
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-07-31  0:01 ` [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
  2024-07-31  7:11   ` Michal Hocko
@ 2024-07-31 10:29   ` Vlastimil Babka
  2024-07-31 10:44     ` Tetsuo Handa
  2024-07-31 16:28   ` Christoph Hellwig
  2 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2024-07-31 10:29 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	virtualization, Kees Cook

On 7/31/24 2:01 AM, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> We have cases we still fail though callers might have __GFP_NOFAIL.
> Since they don't check the return, we are exposed to the security
> risks for NULL deference.
> 
> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> situation.
> 
> Christoph Hellwig:
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures.  So in fact a straight BUG is the right thing
> here.
> 
> Vlastimil Babka:
> It's just not a recoverable situation (WARN_ON is for recoverable
> situations). The caller cannot handle allocation failure and at the same
> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> with stracktrace etc. We don't need to hope for the later NULL pointer
> dereference (which might if really unlucky happen from a different
> context where it's no longer obvious what lead to the allocation failing).
> 
> Michal Hocko:
> Linus tends to be against adding new BUG() calls unless the failure is
> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> not sure how he would look at simply incorrect memory allocator usage to
> blow up the kernel. Now the argument could be made that those failures
> could cause subtle memory corruptions or even be exploitable which might
> be a sufficient reason to stop them early.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/slab.h | 4 +++-
>  mm/page_alloc.c      | 4 +++-
>  mm/util.c            | 1 +
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index c9cb42203183..4a4d1fdc2afe 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>  {
>  	size_t bytes;
>  
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		BUG_ON(flags & __GFP_NOFAIL);

Shouldn't we produce some kind of warning also in the no-NOFAIL case?
Returning a NULL completely silently feels wrong.

>  		return NULL;
> +	}
>  
>  	return kvmalloc_node_noprof(bytes, flags, node);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c700d2598a26..cc179c3e68df 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,8 +4708,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>  	 * There are several places where we assume that the order value is sane
>  	 * so bail out early if the request is out of bound.
>  	 */
> -	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> +	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {

Because here we do warn (although I'd argue even __GFP_NOWARN shouldn't
suppress a warning for such a reason).

> +		BUG_ON(gfp & __GFP_NOFAIL);
>  		return NULL;
> +	}
>  
>  	gfp &= gfp_allowed_mask;
>  	/*
> diff --git a/mm/util.c b/mm/util.c
> index 0ff5898cc6de..bad3258523b6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>  
>  	/* Don't even allow crazy sizes */
>  	if (unlikely(size > INT_MAX)) {
> +		BUG_ON(flags & __GFP_NOFAIL);
>  		WARN_ON_ONCE(!(flags & __GFP_NOWARN));

And here, would argue the same. But maybe there's code that relies on this
so dunno. In that case we could at least convert to WARN_ON_ONCE_GFP.


>  		return NULL;
>  	}



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-07-31 10:29   ` Vlastimil Babka
@ 2024-07-31 10:44     ` Tetsuo Handa
  2024-07-31 10:48       ` Vlastimil Babka
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2024-07-31 10:44 UTC (permalink / raw)
  To: Vlastimil Babka, Barry Song, akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	virtualization, Kees Cook

On 2024/07/31 19:29, Vlastimil Babka wrote:
>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>>  {
>>  	size_t bytes;
>>  
>> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
>> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
>> +		BUG_ON(flags & __GFP_NOFAIL);
> 
> Shouldn't we produce some kind of warning also in the no-NOFAIL case?
> Returning a NULL completely silently feels wrong.

Doesn't embedding macros like BUG_ON() into inline functions needlessly bloat
the kernel size? I think we can call a non-inlined function (that is marked
as EXPORT_SYMBOL()) for emitting warning.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-07-31 10:44     ` Tetsuo Handa
@ 2024-07-31 10:48       ` Vlastimil Babka
  2024-07-31 10:57         ` Barry Song
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2024-07-31 10:48 UTC (permalink / raw)
  To: Tetsuo Handa, Barry Song, akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	virtualization, Kees Cook

On 7/31/24 12:44 PM, Tetsuo Handa wrote:
> On 2024/07/31 19:29, Vlastimil Babka wrote:
>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>>>  {
>>>  	size_t bytes;
>>>  
>>> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
>>> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
>>> +		BUG_ON(flags & __GFP_NOFAIL);
>> 
>> Shouldn't we produce some kind of warning also in the no-NOFAIL case?
>> Returning a NULL completely silently feels wrong.
> 
> Doesn't embedding macros like BUG_ON() into inline functions needlessly bloat
> the kernel size? I think we can call a non-inlined function (that is marked
> as EXPORT_SYMBOL()) for emitting warning.

Hm yeah we could probably make the whole function non-inline as it results
in a call anyway.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-07-31 10:48       ` Vlastimil Babka
@ 2024-07-31 10:57         ` Barry Song
  0 siblings, 0 replies; 28+ messages in thread
From: Barry Song @ 2024-07-31 10:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, lstoakes, mhocko, penberg, rientjes,
	roman.gushchin, torvalds, urezki, v-songbaohua, virtualization,
	Kees Cook

On Wed, Jul 31, 2024 at 6:48 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/31/24 12:44 PM, Tetsuo Handa wrote:
> > On 2024/07/31 19:29, Vlastimil Babka wrote:
> >>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >>>  {
> >>>     size_t bytes;
> >>>
> >>> -   if (unlikely(check_mul_overflow(n, size, &bytes)))
> >>> +   if (unlikely(check_mul_overflow(n, size, &bytes))) {
> >>> +           BUG_ON(flags & __GFP_NOFAIL);
> >>
> >> Shouldn't we produce some kind of warning also in the no-NOFAIL case?
> >> Returning a NULL completely silently feels wrong.
> >
> > Doesn't embedding macros like BUG_ON() into inline functions needlessly bloat
> > the kernel size? I think we can call a non-inlined function (that is marked
> > as EXPORT_SYMBOL()) for emitting warning.
>
> Hm yeah we could probably make the whole function non-inline as it results
> in a call anyway.

Although this might save some code footprint,  we will result in two
calls for users of
kvmalloc_array_node_noprof() with both Tetsuo's and your proposal?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-07-31  0:01 ` [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
  2024-07-31  7:11   ` Michal Hocko
  2024-07-31 10:29   ` Vlastimil Babka
@ 2024-07-31 16:28   ` Christoph Hellwig
  2 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-07-31 16:28 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, mhocko, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Kees Cook

Looks good (inline or not):

Reviewed-by: Christoph Hellwig <hch@lst.de>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable  __GFP_NOFAIL
  2024-07-31  0:01 [PATCH v2 0/4] mm: clarify nofail memory allocation Barry Song
                   ` (2 preceding siblings ...)
  2024-07-31  0:01 ` [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
@ 2024-07-31  0:01 ` Barry Song
  2024-07-31  7:15   ` Michal Hocko
  2024-07-31 10:55   ` Vlastimil Babka
  3 siblings, 2 replies; 28+ messages in thread
From: Barry Song @ 2024-07-31  0:01 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	vbabka, virtualization, Kees Cook

From: Barry Song <v-songbaohua@oppo.com>

When users allocate memory with the __GFP_NOFAIL flag, they might
incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc. This kind
of non-blockable __GFP_NOFAIL is not supported and is pointless. If
we attempt and still fail to allocate memory for these users, we have
two choices:

    1. We could busy-loop and hope that some other direct reclamation or
    kswapd rescues the current process. However, this is unreliable
    and could ultimately lead to hard or soft lockups, which might not
    be well supported by some architectures.

    2. We could use BUG_ON to trigger a reliable system crash, avoiding
    exposing NULL dereference.

This patch chooses the second option because the first is unreliable. Even
if the process incorrectly using __GFP_NOFAIL is sometimes rescued, the
long latency might be unacceptable, especially considering that misusing
GFP_ATOMIC and __GFP_NOFAIL is likely to occur in atomic contexts with
strict timing requirements.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/page_alloc.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cc179c3e68df..ed1bd8f595bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4439,11 +4439,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (gfp_mask & __GFP_NOFAIL) {
 		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
+		 * All existing users of the __GFP_NOFAIL are blockable
+		 * otherwise we introduce a busy loop with inside the page
+		 * allocator from non-sleepable contexts
 		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-			goto fail;
+		BUG_ON(!can_direct_reclaim);
 
 		/*
 		 * PF_MEMALLOC request from this context is rather bizarre
@@ -4474,7 +4474,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		cond_resched();
 		goto retry;
 	}
-fail:
+
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
 got_pg:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable  __GFP_NOFAIL
  2024-07-31  0:01 ` [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
@ 2024-07-31  7:15   ` Michal Hocko
  2024-07-31 10:55   ` Vlastimil Babka
  1 sibling, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2024-07-31  7:15 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	lstoakes, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Kees Cook

On Wed 31-07-24 12:01:55, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> When users allocate memory with the __GFP_NOFAIL flag, they might
> incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc. This kind
> of non-blockable __GFP_NOFAIL is not supported and is pointless. If
> we attempt and still fail to allocate memory for these users, we have
> two choices:
> 
>     1. We could busy-loop and hope that some other direct reclamation or
>     kswapd rescues the current process. However, this is unreliable
>     and could ultimately lead to hard or soft lockups, which might not
>     be well supported by some architectures.
> 
>     2. We could use BUG_ON to trigger a reliable system crash, avoiding
>     exposing NULL dereference.
> 
> This patch chooses the second option because the first is unreliable. Even
> if the process incorrectly using __GFP_NOFAIL is sometimes rescued, the
> long latency might be unacceptable, especially considering that misusing
> GFP_ATOMIC and __GFP_NOFAIL is likely to occur in atomic contexts with
> strict timing requirements.

Well, any latency arguments are out of table with BUG_ON crashing the
system. So this is not about reliability but rather making those
incorrect uses more obvious.

With your GFP_NOFAIL follow up this should be simply impossible to
trigger though. I am still not sure which of the bad solutions is more
appropriate so I am not giving this an ack. Either of them is better
than allow to fail though.

> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/page_alloc.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cc179c3e68df..ed1bd8f595bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4439,11 +4439,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	if (gfp_mask & __GFP_NOFAIL) {
>  		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually require GFP_NOWAIT
> +		 * All existing users of the __GFP_NOFAIL are blockable
> +		 * otherwise we introduce a busy loop with inside the page
> +		 * allocator from non-sleepable contexts
>  		 */
> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -			goto fail;
> +		BUG_ON(!can_direct_reclaim);
>  
>  		/*
>  		 * PF_MEMALLOC request from this context is rather bizarre
> @@ -4474,7 +4474,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		cond_resched();
>  		goto retry;
>  	}
> -fail:
> +
>  	warn_alloc(gfp_mask, ac->nodemask,
>  			"page allocation failure: order:%u", order);
>  got_pg:
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-07-31  0:01 ` [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
  2024-07-31  7:15   ` Michal Hocko
@ 2024-07-31 10:55   ` Vlastimil Babka
  2024-07-31 11:08     ` Barry Song
  1 sibling, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2024-07-31 10:55 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, lstoakes, mhocko,
	penberg, rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	virtualization, Kees Cook

On 7/31/24 2:01 AM, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> When users allocate memory with the __GFP_NOFAIL flag, they might
> incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc. This kind
> of non-blockable __GFP_NOFAIL is not supported and is pointless. If
> we attempt and still fail to allocate memory for these users, we have
> two choices:
> 
>     1. We could busy-loop and hope that some other direct reclamation or
>     kswapd rescues the current process. However, this is unreliable
>     and could ultimately lead to hard or soft lockups, which might not
>     be well supported by some architectures.
> 
>     2. We could use BUG_ON to trigger a reliable system crash, avoiding
>     exposing NULL dereference.
> 
> This patch chooses the second option because the first is unreliable. Even
> if the process incorrectly using __GFP_NOFAIL is sometimes rescued, the
> long latency might be unacceptable, especially considering that misusing
> GFP_ATOMIC and __GFP_NOFAIL is likely to occur in atomic contexts with
> strict timing requirements.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/page_alloc.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cc179c3e68df..ed1bd8f595bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4439,11 +4439,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	if (gfp_mask & __GFP_NOFAIL) {
>  		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually require GFP_NOWAIT
> +		 * All existing users of the __GFP_NOFAIL are blockable
> +		 * otherwise we introduce a busy loop with inside the page
> +		 * allocator from non-sleepable contexts
>  		 */
> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -			goto fail;
> +		BUG_ON(!can_direct_reclaim);

We might get more useful output if here we did just "if
(!can_direct_reclaim) goto fail; and let warn_alloc() print it, and then
there would be a BUG_ON(gfp_mask & __GFP_NOFAIL)?
Additionally we could mask out __GFP_NOWARN from gfp_mask before the goto,
as a __GFP_NOWARN would suppress the output in a non-recoverable situation
so it would be wrong.

>  
>  		/*
>  		 * PF_MEMALLOC request from this context is rather bizarre
> @@ -4474,7 +4474,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		cond_resched();
>  		goto retry;
>  	}
> -fail:
> +
>  	warn_alloc(gfp_mask, ac->nodemask,
>  			"page allocation failure: order:%u", order);
>  got_pg:



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-07-31 10:55   ` Vlastimil Babka
@ 2024-07-31 11:08     ` Barry Song
  2024-07-31 11:31       ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-07-31 11:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, virtualization, Kees Cook, lorenzo.stoakes

On Wed, Jul 31, 2024 at 6:55 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/31/24 2:01 AM, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > When users allocate memory with the __GFP_NOFAIL flag, they might
> > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc. This kind
> > of non-blockable __GFP_NOFAIL is not supported and is pointless. If
> > we attempt and still fail to allocate memory for these users, we have
> > two choices:
> >
> >     1. We could busy-loop and hope that some other direct reclamation or
> >     kswapd rescues the current process. However, this is unreliable
> >     and could ultimately lead to hard or soft lockups, which might not
> >     be well supported by some architectures.
> >
> >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> >     exposing NULL dereference.
> >
> > This patch chooses the second option because the first is unreliable. Even
> > if the process incorrectly using __GFP_NOFAIL is sometimes rescued, the
> > long latency might be unacceptable, especially considering that misusing
> > GFP_ATOMIC and __GFP_NOFAIL is likely to occur in atomic contexts with
> > strict timing requirements.
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Kees Cook <kees@kernel.org>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/page_alloc.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index cc179c3e68df..ed1bd8f595bd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4439,11 +4439,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >        */
> >       if (gfp_mask & __GFP_NOFAIL) {
> >               /*
> > -              * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -              * of any new users that actually require GFP_NOWAIT
> > +              * All existing users of the __GFP_NOFAIL are blockable
> > +              * otherwise we introduce a busy loop with inside the page
> > +              * allocator from non-sleepable contexts
> >                */
> > -             if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -                     goto fail;
> > +             BUG_ON(!can_direct_reclaim);
>
> We might get more useful output if here we did just "if
> (!can_direct_reclaim) goto fail; and let warn_alloc() print it, and then
> there would be a BUG_ON(gfp_mask & __GFP_NOFAIL)?
> Additionally we could mask out __GFP_NOWARN from gfp_mask before the goto,
> as a __GFP_NOWARN would suppress the output in a non-recoverable situation
> so it would be wrong.

If we use BUG_ON, it seems like we don't need to do anything else, as the BUG_ON
report gives developers all the information they need. If we go with
approach 1—doing
a busy loop until rescued or a lockup occurs—I agree it might be
better to add more
warnings.

>
> >
> >               /*
> >                * PF_MEMALLOC request from this context is rather bizarre
> > @@ -4474,7 +4474,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >               cond_resched();
> >               goto retry;
> >       }
> > -fail:
> > +
> >       warn_alloc(gfp_mask, ac->nodemask,
> >                       "page allocation failure: order:%u", order);
> >  got_pg:
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-07-31 11:08     ` Barry Song
@ 2024-07-31 11:31       ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2024-07-31 11:31 UTC (permalink / raw)
  To: Barry Song
  Cc: Vlastimil Babka, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, virtualization, Kees Cook, lorenzo.stoakes

On Wed 31-07-24 19:08:44, Barry Song wrote:
> On Wed, Jul 31, 2024 at 6:55 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 7/31/24 2:01 AM, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc. This kind
> > > of non-blockable __GFP_NOFAIL is not supported and is pointless. If
> > > we attempt and still fail to allocate memory for these users, we have
> > > two choices:
> > >
> > >     1. We could busy-loop and hope that some other direct reclamation or
> > >     kswapd rescues the current process. However, this is unreliable
> > >     and could ultimately lead to hard or soft lockups, which might not
> > >     be well supported by some architectures.
> > >
> > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > >     exposing NULL dereference.
> > >
> > > This patch chooses the second option because the first is unreliable. Even
> > > if the process incorrectly using __GFP_NOFAIL is sometimes rescued, the
> > > long latency might be unacceptable, especially considering that misusing
> > > GFP_ATOMIC and __GFP_NOFAIL is likely to occur in atomic contexts with
> > > strict timing requirements.
> > >
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Cc: Christoph Hellwig <hch@infradead.org>
> > > Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: Pekka Enberg <penberg@kernel.org>
> > > Cc: David Rientjes <rientjes@google.com>
> > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: Kees Cook <kees@kernel.org>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  mm/page_alloc.c | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index cc179c3e68df..ed1bd8f595bd 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4439,11 +4439,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >        */
> > >       if (gfp_mask & __GFP_NOFAIL) {
> > >               /*
> > > -              * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > -              * of any new users that actually require GFP_NOWAIT
> > > +              * All existing users of the __GFP_NOFAIL are blockable
> > > +              * otherwise we introduce a busy loop with inside the page
> > > +              * allocator from non-sleepable contexts
> > >                */
> > > -             if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > -                     goto fail;
> > > +             BUG_ON(!can_direct_reclaim);
> >
> > We might get more useful output if here we did just "if
> > (!can_direct_reclaim) goto fail; and let warn_alloc() print it, and then
> > there would be a BUG_ON(gfp_mask & __GFP_NOFAIL)?
> > Additionally we could mask out __GFP_NOWARN from gfp_mask before the goto,
> > as a __GFP_NOWARN would suppress the output in a non-recoverable situation
> > so it would be wrong.
> 
> If we use BUG_ON, it seems like we don't need to do anything else, as the BUG_ON
> report gives developers all the information they need.

It will not give warn_alloc - aka state of the page allocator at the
time of failure. Is this really necessary? I don't know because it is
"shouldn't ever happen" rather than "how come this allocation has
failed" case.

So IMHO a simple BUG_ON should be sufficient to scream out loud that
impossible has happened and need fixing.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-08-05  8:19 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-31  0:01 [PATCH v2 0/4] mm: clarify nofail memory allocation Barry Song
2024-07-31  0:01 ` [PATCH RFT v2 1/4] vpda: try to fix the potential crash due to misusing __GFP_NOFAIL Barry Song
2024-07-31  3:09   ` Jason Wang
2024-07-31  3:15     ` Barry Song
2024-07-31  3:58       ` Jason Wang
2024-07-31  4:11         ` Barry Song
2024-07-31  4:13           ` Jason Wang
2024-07-31  5:05             ` Barry Song
2024-07-31 10:20               ` Tetsuo Handa
2024-08-01  2:37                 ` Jason Wang
2024-08-05  1:32                   ` Barry Song
2024-08-05  8:19                     ` Jason Wang
2024-08-01  2:30               ` Jason Wang
2024-07-31  0:01 ` [PATCH v2 2/4] mm: Document __GFP_NOFAIL must be blockable Barry Song
2024-07-31 10:18   ` Vlastimil Babka
2024-07-31 16:26   ` Christoph Hellwig
2024-07-31  0:01 ` [PATCH v2 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
2024-07-31  7:11   ` Michal Hocko
2024-07-31 10:29   ` Vlastimil Babka
2024-07-31 10:44     ` Tetsuo Handa
2024-07-31 10:48       ` Vlastimil Babka
2024-07-31 10:57         ` Barry Song
2024-07-31 16:28   ` Christoph Hellwig
2024-07-31  0:01 ` [PATCH v2 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
2024-07-31  7:15   ` Michal Hocko
2024-07-31 10:55   ` Vlastimil Babka
2024-07-31 11:08     ` Barry Song
2024-07-31 11:31       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).