linux-mm.kvack.org archive mirror
* HugeTLB mapping for drivers (sample driver)
@ 2009-07-14  2:07 Alexey Korolev
  2009-07-14 10:27 ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Korolev @ 2009-07-14  2:07 UTC (permalink / raw)
  To: mel, linux-mm

Hi,

This is a sample driver that maps huge pages into user space.
It may help in understanding the proposed approach.

The driver defines its own file operations for the device.

We must call the hugetlbfs get_unmapped_area and hugetlbfs_file_mmap
functions to do some HTLB mapping preparation. (If the proposed approach
is more or less OK, it would be cleaner to avoid the hugetlbfs calls
altogether and substitute them with htlb functions.)
Allocated pages get associated with the mapping via the
add_to_page_cache call in file->open.
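
For orientation, a user-space client of such a driver might look roughly
like the sketch below. It is only an illustration under assumptions: the
2 MiB huge page size is assumed (typical for x86-64, not stated in this
mail), and map_hpage_dev() is a hypothetical helper name, not part of the
patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: 2 MiB huge pages; the real size depends on the platform. */
#define ASSUMED_HPAGE_SIZE (2UL * 1024 * 1024)

/*
 * Hypothetical helper: open the device node at "path" and map
 * nr_hpages huge pages with MAP_SHARED.
 * Returns the mapping address, or NULL on any failure.
 */
static void *map_hpage_dev(const char *path, size_t nr_hpages)
{
	size_t len = nr_hpages * ASSUMED_HPAGE_SIZE;
	void *addr;
	int fd;

	assert(nr_hpages > 0);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	/* The mapping holds its own reference to the file. */
	close(fd);

	return addr == MAP_FAILED ? NULL : addr;
}
```

A caller would pass "/dev/hpage_map" and a count of at most 10, since the
sample driver preallocates ten pages in open(), and would munmap() the
region when done.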

---
diff -Naurp empty/hpage_map.c hpage_map/hpage_map.c
--- empty/hpage_map.c	1970-01-01 12:00:00.000000000 +1200
+++ hpage_map/hpage_map.c	2009-07-13 18:40:28.000000000 +1200
@@ -0,0 +1,137 @@
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
+#include <linux/pagevec.h>
+#include <linux/miscdevice.h>
+
+static void make_file_empty(struct file *file)
+{
+    struct address_space *mapping = file->f_mapping;
+    struct pagevec pvec;
+    pgoff_t next = 0;
+    int i;
+
+    pagevec_init(&pvec, 0);
+    while (1) {
+	if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	    if (!next)
+		break;
+	    next = 0;
+	    continue;
+	}
+
+	for (i = 0; i < pagevec_count(&pvec); ++i) {
+	    struct page *page = pvec.pages[i];
+
+	    lock_page(page);
+	    if (page->index > next)
+		next = page->index;
+	    ++next;
+	    remove_from_page_cache(page);
+	    unlock_page(page);
+	    hugetlb_free_pages(page);
+	}
+    }
+    BUG_ON(mapping->nrpages);
+}
+
+
+static int hpage_map_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/*
+	 * Let hugetlbfs do the VMA preparation. The huge pages themselves
+	 * were already added to the page cache in hpage_map_open().
+	 */
+	int ret;
+
+	ret = hugetlbfs_file_mmap(file, vma);
+
+	return ret;
+}
+
+
+static unsigned long hpage_map_get_unmapped_area(struct file *file,
+	unsigned long addr, unsigned long len, unsigned long pgoff,
+	unsigned long flags)
+{
+	return hugetlb_get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
+static int hpage_map_open(struct inode * inode, struct file * file)
+{
+    struct page *page;
+    int num_hpages = 10, cnt = 0;
+    int ret = 0;
+    
+    /* Announce  hugetlb file mapping */
+    mapping_set_hugetlb(file->f_mapping);
+    
+    for (cnt = 0; cnt < num_hpages; cnt++) {
+	page = hugetlb_alloc_pages_node(0, GFP_KERNEL);
+	if (IS_ERR(page)) {
+	    ret = PTR_ERR(page);	/* PTR_ERR() is already a negative errno */
+	    goto out_err;
+	}
+	ret = add_to_page_cache(page, file->f_mapping, cnt, GFP_KERNEL);
+	if (ret) {
+	    hugetlb_free_pages(page);
+	    goto out_err;
+	}
+	SetPageUptodate(page);
+	unlock_page(page);
+    }
+    return 0;
+out_err:
+    printk(KERN_ERR "%s: error %d\n", __func__, ret);
+    make_file_empty(file);
+    return ret;
+}
+
+
+static int hpage_map_release(struct inode * inode, struct file * file)
+{
+    make_file_empty(file);
+    return 0;
+}
+/*
+ * The file operations for /dev/hpage_map
+ */
+static const struct file_operations hpage_map_fops = {
+	.owner		= THIS_MODULE,
+	.mmap		= hpage_map_mmap,
+	.open 		= hpage_map_open,
+	.release	= hpage_map_release,
+	.get_unmapped_area	= hpage_map_get_unmapped_area,
+};
+
+static struct miscdevice hpage_map_dev = {
+	MISC_DYNAMIC_MINOR,
+	"hpage_map",
+	&hpage_map_fops
+};
+
+static int __init
+hpage_map_init(void)
+{
+	/* Create the device in the /sys/class/misc directory. */
+	if (misc_register(&hpage_map_dev))
+		return -EIO;
+	return 0;
+}
+
+module_init(hpage_map_init);
+
+static void __exit
+hpage_map_exit(void)
+{
+	misc_deregister(&hpage_map_dev);
+}
+
+module_exit(hpage_map_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Alexey Korolev");
+MODULE_DESCRIPTION("Example of driver with hugetlb mapping");
+MODULE_VERSION("1.0");
diff -Naurp empty/Makefile hpage_map/Makefile
--- empty/Makefile	1970-01-01 12:00:00.000000000 +1200
+++ hpage_map/Makefile	2009-07-13 18:31:27.000000000 +1200
@@ -0,0 +1,7 @@
+obj-m := hpage_map.o 
+
+KDIR  := /lib/modules/$(shell uname -r)/build
+PWD   := $(shell pwd)
+
+default:
+	$(MAKE) -C $(KDIR) M=$(PWD) modules

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: HugeTLB mapping for drivers (sample driver)
  2009-07-14  2:07 HugeTLB mapping for drivers (sample driver) Alexey Korolev
@ 2009-07-14 10:27 ` Mel Gorman
  2009-07-15  0:08   ` Alexey Korolev
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2009-07-14 10:27 UTC (permalink / raw)
  To: Alexey Korolev; +Cc: linux-mm

On Tue, Jul 14, 2009 at 03:07:47AM +0100, Alexey Korolev wrote:
> Hi,
> 
> This is a sample driver which provides huge page mapping to user space.
> It might be useful for understanding purposes.
> 
> Here we defined file operations for device driver.
> 
> We must call htlbfs get_unmapped_area and hugetlbfs_file_mmap functions to
>  done some HTLB mapping preparations. (If proposed approach is more 
> or less Ok, it will be more accurate to avoid hugetlbfs calls at all - and 
> substitute them with htlb functions). 
> Allocated page get assiciated with mapping via add_to_page_cache call in
> file->open.
> 

I ran out of time to review this properly, but glancing through I would be
concerned with what happens on fork() and COW. At a short read, it would
appear that pages get allocated from alloc_buddy_huge_page() instead of your
normal function altering the counters for hstate_nores.

> ---
> diff -Naurp empty/hpage_map.c hpage_map/hpage_map.c
> --- empty/hpage_map.c	1970-01-01 12:00:00.000000000 +1200
> +++ hpage_map/hpage_map.c	2009-07-13 18:40:28.000000000 +1200
> @@ -0,0 +1,137 @@
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/file.h>
> +#include <linux/pagemap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/pagevec.h>
> +#include <linux/miscdevice.h>
> +
> +static void make_file_empty(struct file *file)
> +{
> +    struct address_space *mapping = file->f_mapping;
> +    struct pagevec pvec;
> +    pgoff_t next = 0;
> +    int i;
> +
> +    pagevec_init(&pvec, 0);
> +    while (1) {
> +	if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> +	    if (!next)
> +		break;
> +	    next = 0;
> +	    continue;
> +	}
> +
> +	for (i = 0; i < pagevec_count(&pvec); ++i) {
> +	    struct page *page = pvec.pages[i];
> +
> +	    lock_page(page);
> +	    if (page->index > next)
> +		next = page->index;
> +	    ++next;
> +	    remove_from_page_cache(page);
> +	    unlock_page(page);
> +	    hugetlb_free_pages(page);
> +	}
> +    }
> +    BUG_ON(mapping->nrpages);
> +}
> +
> +
> +static int hpage_map_mmap(struct file *file, struct vm_area_struct
> *vma)
> +{
> +	unsigned long idx;
> +	struct address_space *mapping;
> +	int ret = VM_FAULT_SIGBUS;
> +
> +	idx = vma->vm_pgoff >> huge_page_order(h);
> +	mapping = file->f_mapping;
> +	ret = hugetlbfs_file_mmap(file, vma);
> +
> +	return ret;
> +}
> +
> +
> +static unsigned long hpage_map_get_unmapped_area(struct file *file,
> +	unsigned long addr, unsigned long len, unsigned long pgoff,
> +	unsigned long flags)
> +{
> +	return hugetlb_get_unmapped_area(file, addr, len, pgoff, flags);
> +}
> +
> +static int hpage_map_open(struct inode * inode, struct file * file)
> +{
> +    struct page *page;
> +    int num_hpages = 10, cnt = 0;

What happens if the mmap() call is more than 10 pages? What if the process
fork()s, the mapping is MAP_PRIVATE and the child is long lived causing
a COW fault on the parent process when it next writes the mapping and the
subsequent allocation fails?

Again, I'm worried that by avoiding hugetlbfs, your drivers end up
trying to solve all the same problems.
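
One way to picture the first question is a bounds check of the requested
range against the number of preallocated pages. The sketch below is a
user-space model only: the shift values (4 KiB base pages, 2 MiB huge
pages) and the name toy_range_ok() are assumptions, and it is not code
from the patch or from hugetlbfs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumed 4 KiB base pages */
#define TOY_HPAGE_SHIFT	21	/* assumed 2 MiB huge pages */

/*
 * pgoff is counted in base pages (like vm_pgoff), len in bytes.
 * Returns true if the range is huge page aligned and lies entirely
 * within the first nr_hpages huge pages of the file.
 */
static bool toy_range_ok(uint64_t pgoff, uint64_t len, uint64_t nr_hpages)
{
	const unsigned int order = TOY_HPAGE_SHIFT - TOY_PAGE_SHIFT;
	const uint64_t hpage_size = UINT64_C(1) << TOY_HPAGE_SHIFT;
	uint64_t first, count;

	if (len == 0 || (len & (hpage_size - 1)))
		return false;	/* length must be whole huge pages */
	if (pgoff & ((UINT64_C(1) << order) - 1))
		return false;	/* offset must be huge page aligned */

	first = pgoff >> order;		/* index of the first huge page */
	count = len >> TOY_HPAGE_SHIFT;	/* number of huge pages wanted */

	return first <= nr_hpages && count <= nr_hpages - first;
}
```

With ten pages preallocated in open(), a request for eleven huge pages
would be rejected by such a check instead of faulting on a missing page
later.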

> +    int ret = 0;
> +    
> +    /* Announce  hugetlb file mapping */
> +    mapping_set_hugetlb(file->f_mapping);
> +    
> +    for (cnt = 0; cnt < num_hpages; cnt++ ) {
> +	page = hugetlb_alloc_pages_node(0,GFP_KERNEL);
> +	if (IS_ERR(page)) {
> +	    ret = -PTR_ERR(page);
> +	    goto out_err;	
> +	}	
> +	ret = add_to_page_cache(page, file->f_mapping, cnt, GFP_KERNEL);
> +	if (ret) {
> +	    hugetlb_free_pages(page);
> +	    goto out_err;
> +	}
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +    }
> +    return 0;
> +out_err:
> +    printk(KERN_ERR"%s : Error %d \n",__func__, ret);
> +    make_file_empty(file);
> +    return ret;
> +}
> +
> +
> +static int hpage_map_release(struct inode * inode, struct file * file)
> +{
> +    make_file_empty(file);
> +    return 0;
> +}
> +/*
> + * The file operations for /dev/hpage_map
> + */
> +static const struct file_operations hpage_map_fops = {
> +	.owner		= THIS_MODULE,
> +	.mmap		= hpage_map_mmap,
> +	.open 		= hpage_map_open,
> +	.release	= hpage_map_release,
> +	.get_unmapped_area	= hpage_map_get_unmapped_area,
> +};
> +
> +static struct miscdevice hpage_map_dev = {
> +	MISC_DYNAMIC_MINOR,
> +	"hpage_map",
> +	&hpage_map_fops
> +};
> +
> +static int __init
> +hpage_map_init(void)
> +{
> +	/* Create the device in the /sys/class/misc directory. */
> +	if (misc_register(&hpage_map_dev))
> +		return -EIO;
> +	return 0;
> +}
> +
> +module_init(hpage_map_init);
> +
> +static void __exit
> +hpage_map_exit(void)
> +{
> +	misc_deregister(&hpage_map_dev);
> +}
> +
> +module_exit(hpage_map_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Alexey Korolev");
> +MODULE_DESCRIPTION("Example of driver with hugetlb mapping");
> +MODULE_VERSION("1.0");
> diff -Naurp empty/Makefile hpage_map/Makefile
> --- empty/Makefile	1970-01-01 12:00:00.000000000 +1200
> +++ hpage_map/Makefile	2009-07-13 18:31:27.000000000 +1200
> @@ -0,0 +1,7 @@
> +obj-m := hpage_map.o 
> +
> +KDIR  := /lib/modules/$(shell uname -r)/build
> +PWD   := $(shell pwd)
> +
> +default:
> +	$(MAKE) -C $(KDIR) M=$(PWD) modules
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: HugeTLB mapping for drivers (sample driver)
  2009-07-14 10:27 ` Mel Gorman
@ 2009-07-15  0:08   ` Alexey Korolev
  2009-07-19 13:39     ` Alexey Korolev
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Korolev @ 2009-07-15  0:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alexey Korolev, linux-mm

Mel,

Thank you for the review.
I'm about to rework this sample driver so that it handles a set of
different scenarios.
Please tell me if you have additional error scenarios in mind. I'll try
to handle them as well.
After that there should be a more or less clear picture of whether we
should involve hugetlbfs or not.



* Re: HugeTLB mapping for drivers (sample driver)
  2009-07-15  0:08   ` Alexey Korolev
@ 2009-07-19 13:39     ` Alexey Korolev
  2009-07-20  8:11       ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Korolev @ 2009-07-19 13:39 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alexey Korolev, linux-mm

Mel,

>>
>> I ran out of time to review this properly, but glancing through I would be
>> concerned with what happens on fork() and COW. At a short read, it would
>> appear that pages get allocated from alloc_buddy_huge_page() instead of your
>> normal function altering the counters for hstate_nores.
>>

I've done some more investigation. You are right: private mappings have
to be tracked somehow if we are going to provide hugetlb remap for
drivers. The OOM killer kicks in on a COW caused by a private hugetlb
mapping. (With a non-hugetlb mapping the memory is simply copied.)

In fact there should be quite few cases where a private mapping makes
sense for drivers and for mapping DMA buffers. I thought about possible
solutions. The question is which one to choose.

1. Forbid private mappings for drivers in the hugetlb case. (But this
limits functionality, which is not so good.)
2. Allow private mappings and use the hugetlbfs hstates. (But this
forces the user to know how much hugetlb memory needs to be reserved
for drivers.)
3. Allow private mappings and use a special hstate for drivers; the
driver should say how much memory needs to be reserved for it. (It is
not yet clear how to behave if we run out of reserved space.)

Could you please suggest which solution is best? Maybe there are other
options?

Thanks,
Alexey


* Re: HugeTLB mapping for drivers (sample driver)
  2009-07-19 13:39     ` Alexey Korolev
@ 2009-07-20  8:11       ` Mel Gorman
  2009-07-21  9:32         ` Alexey Korolev
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2009-07-20  8:11 UTC (permalink / raw)
  To: Alexey Korolev; +Cc: Alexey Korolev, linux-mm

On Mon, Jul 20, 2009 at 01:39:30AM +1200, Alexey Korolev wrote:
> Mel,
> 
> >>
> >> I ran out of time to review this properly, but glancing through I would be
> >> concerned with what happens on fork() and COW. At a short read, it would
> >> appear that pages get allocated from alloc_buddy_huge_page() instead of your
> >> normal function altering the counters for hstate_nores.
> >>
> 
> I've done some more investigations. You are right it is necessary to
> track cases with private mappings some how if we are going to provide
> hugetlb remap for drivers. OOM killer starts to work on COW caused by
> private hugetlb mapping. (In case of non huge tlb mapping memory just
> copied)
> 

Did the OOM killer really trigger and select a process for killing or
did the process itself just get killed with an out-of-memory message? I
would have expected the latter.

> In fact there should be quite few cases when private mapping makes
> sense for drivers and mapping DMA buffers. I thought about possible
> solutions. The question is what to choose.
> 
> 1. Forbid private mappings for drivers in case of hugetlb. (But this
> limits functionality - it is not so good)

For a long time, this was the "solution" for hugetlbfs.

> 2. Allow private mapping. Use hugetlbfs hstates. (But it forces user
> to know how much hugetlb memory it is necessary to reserve for
> drivers)

You can defer working out the reservations until mmap() time,
particularly if you are using dynamic hugepage pool resizing instead of
static allocation.

> 3. Allow private mapping. Use special hstate for driver and driver
> should tell how much memory needs to be reserved for it. (Not clear
> yet how to behave if we are out of reserved space)
> 
> Could you please suggest what is the best solution? May be some other options?
> 

The only solution that springs to mind is the same one used by hugetlbfs
and that is that reservations are taken at mmap() time for the size of the
mapping. In your case, you prefault but either way, the hugepages exist.
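
The reservation idea can be modelled with two counters. The following is
a toy user-space model, assuming a pool with a fixed number of free huge
pages; struct toy_pool and its helpers are hypothetical names and do not
mirror the kernel's struct hstate or its reservation code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy huge page pool. Invariant: resv_pages <= free_pages. */
struct toy_pool {
	unsigned long free_pages;	/* pages currently free */
	unsigned long resv_pages;	/* free pages promised to mappings */
};

/* mmap() time: promise nr pages to the caller, or fail up front. */
static bool toy_reserve(struct toy_pool *pool, unsigned long nr)
{
	if (pool->free_pages - pool->resv_pages < nr)
		return false;
	pool->resv_pages += nr;
	return true;
}

/* Fault by the reservation owner: consumes one promised page. */
static void toy_fault_reserved(struct toy_pool *pool)
{
	assert(pool->resv_pages > 0 && pool->free_pages > 0);
	pool->resv_pages--;
	pool->free_pages--;
}

/* Fault without a reservation: may only take unpromised pages. */
static bool toy_fault_unreserved(struct toy_pool *pool)
{
	if (pool->free_pages == pool->resv_pages)
		return false;
	pool->free_pages--;
	return true;
}
```

In this model a mapping that cannot be fully backed fails at mmap() time,
and a later fault by the owner of the reservation cannot fail for lack of
pages.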

What then happens for hugetlbfs is that only the process that called mmap()
is guaranteed their faults will succeed. If a child process incurs a COW
and the hugepages are not available, the child process gets killed. If
the parent process performs COW and the huge pages are not available, it
unmaps the pages from the child process so that COW becomes unnecessary. If
the child process then faults, it gets killed.  This is implemented in
mm/hugetlb.c#unmap_ref_private().
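
The policy described above can be summarised as a small decision
function. This is a simplified model for illustration, not the kernel
code: the enum, the function name and the two boolean inputs are
assumptions.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_cow_outcome {
	TOY_COW_COPY,		/* a huge page was available: normal COW */
	TOY_COW_UNMAP_OTHERS,	/* owner keeps the page, others are unmapped */
	TOY_COW_KILL_FAULTER,	/* non-owner cannot be satisfied */
};

/*
 * faulter_is_owner: the faulting process is the one that called mmap()
 *                   and therefore holds the reservation.
 * page_available:   a huge page can be allocated for the copy.
 */
static enum toy_cow_outcome toy_cow_policy(bool faulter_is_owner,
					   bool page_available)
{
	if (page_available)
		return TOY_COW_COPY;
	if (faulter_is_owner)
		return TOY_COW_UNMAP_OTHERS;
	return TOY_COW_KILL_FAULTER;
}
```

After TOY_COW_UNMAP_OTHERS the child no longer has the page mapped, so a
later access by the child cannot be satisfied and the child is killed.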

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: HugeTLB mapping for drivers (sample driver)
  2009-07-20  8:11       ` Mel Gorman
@ 2009-07-21  9:32         ` Alexey Korolev
  2009-07-21  9:40           ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Korolev @ 2009-07-21  9:32 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Alexey Korolev, linux-mm

Hi,
>
> Did the OOM killer really trigger and select a process for killing or
> did the process itself just get killed with an out-of-memory message? I
> would have expected the latter.
>
The OOM killer was triggered with a private mapping, on an attempt to
access a page under that mapping. This happened because the code did not
check page availability at mmap time. It will be fixed.

>> In fact there should be quite few cases when private mapping makes
>> sense for drivers and mapping DMA buffers. I thought about possible
>> solutions. The question is what to choose.
>>
>> 1. Forbid private mappings for drivers in case of hugetlb. (But this
>> limits functionality - it is not so good)
>
> For a long time, this was the "solution" for hugetlbfs.
>
>> 2. Allow private mapping. Use hugetlbfs hstates. (But it forces user
>> to know how much hugetlb memory it is necessary to reserve for
>> drivers)
>
> You can defer working out the reservations until mmap() time,
> particularly if you are using dynamic hugepage pool resizing instead of
> static allocation.
>
>> 3. Allow private mapping. Use special hstate for driver and driver
>> should tell how much memory needs to be reserved for it. (Not clear
>> yet how to behave if we are out of reserved space)
>>
>> Could you please suggest what is the best solution? May be some other options?
>>
>
> The only solution that springs to mind is the same one used by hugetlbfs
> and that is that reservations are taken at mmap() time for the size of the
> mapping. In your case, you prefault but either way, the hugepages exist.
>
Yes, that looks sane. I'll go this way. In the particular case where a
driver does not need private mappings, mmap will return an error. Thanks
for the advice. I'm about to modify the patches. I'll try to use the
hugetlb reservation functions as much as possible and track reservations
with a special hstate for drivers.

> What then happens for hugetlbfs is that only the process that called mmap()
> is guaranteed their faults will succeed. If a child process incurs a COW
> and the hugepages are not available, the child process gets killed. If
> the parent process performs COW and the huge pages are not available, it
> unmaps the pages from the child process so that COW becomes unnecessary. If
> the child process then faults, it gets killed.  This is implemented in
> mm/hugetlb.c#unmap_ref_private().

So on an out-of-memory COW, the hugetlb code prefers the application to
be killed by SIGSEGV (SIGBUS?) rather than invoking the OOM killer. OK.

Thanks,
Alexey


* Re: HugeTLB mapping for drivers (sample driver)
  2009-07-21  9:32         ` Alexey Korolev
@ 2009-07-21  9:40           ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2009-07-21  9:40 UTC (permalink / raw)
  To: Alexey Korolev; +Cc: Alexey Korolev, linux-mm

On Tue, Jul 21, 2009 at 09:32:34PM +1200, Alexey Korolev wrote:
> Hi,
> >
> > Did the OOM killer really trigger and select a process for killing or
> > did the process itself just get killed with an out-of-memory message? I
> > would have expected the latter.
> >
>
> OMM killer triggered in case of private mapping on attempt to access a
> page under private mapping. It was because code did not check the pages
> availability at mmap time. Will be fixed.
> 

That's a surprise. I should check out why the OOM killer fired instead
of just killing the application that failed to fault the page.

> >> In fact there should be quite few cases when private mapping makes
> >> sense for drivers and mapping DMA buffers. I thought about possible
> >> solutions. The question is what to choose.
> >>
> >> 1. Forbid private mappings for drivers in case of hugetlb. (But this
> >> limits functionality - it is not so good)
> >
> > For a long time, this was the "solution" for hugetlbfs.
> >
> >> 2. Allow private mapping. Use hugetlbfs hstates. (But it forces user
> >> to know how much hugetlb memory it is necessary to reserve for
> >> drivers)
> >
> > You can defer working out the reservations until mmap() time,
> > particularly if you are using dynamic hugepage pool resizing instead of
> > static allocation.
> >
> >> 3. Allow private mapping. Use special hstate for driver and driver
> >> should tell how much memory needs to be reserved for it. (Not clear
> >> yet how to behave if we are out of reserved space)
> >>
> >> Could you please suggest what is the best solution? May be some other options?
> >>
> >
> > The only solution that springs to mind is the same one used by hugetlbfs
> > and that is that reservations are taken at mmap() time for the size of the
> > mapping. In your case, you prefault but either way, the hugepages exist.
> >
> Yes, that looks sane. I'll follow this way. In a particular case if
> driver do not
> need a private mapping mmap will return error. Thanks for the advice.
> I'm about
> to modify the patches. I'll try to involve  hugetlb reservation
> functions as much  as
> possible and track reservations by special hstate for drivers.
> 

Ok but bear in mind you are now going far down the road of
re-implementing hugetlbfs and you should re-examine why you cannot use
the hidden internal hugetlbfs mount similar to what shared memory does.

> > What then happens for hugetlbfs is that only the process that called mmap()
> > is guaranteed their faults will succeed. If a child process incurs a COW
> > and the hugepages are not available, the child process gets killed. If
> > the parent process performs COW and the huge pages are not available, it
> > unmaps the pages from the child process so that COW becomes unnecessary. If
> > the child process then faults, it gets killed.  This is implemented in
> > mm/hugetlb.c#unmap_ref_private().
> 
> So on out of memory COW hugetlb code prefer applications to be killed by
> SIGSEGV (SIGBUS?) instead of OOM. Okk.
> 

It prefers to kill the children with SIGKILL than have the parent
application randomly fail. This happens when the pool is insufficient for
any part of the application to continue. What it was intended to address
was hugepage-aware-applications-using-MAP_PRIVATE that fork() and exec()
helper applications/monitors which appears to be fairly common. There was
a sizable window between fork() and exec() where the parent process could
get killed accessing its MAP_PRIVATE area and taking a COW even though the
child would never need it. Guaranteeing that the process that called mmap()
would always fault successfully was better than it being a random choice
between parents and children.

The impact is that applications that use MAP_PRIVATE that expect
children to get a full private copy of hugetlb-backed areas are going to
have a bad time but the expectation is that these applications are very
rare and they'll be told "don't do that".

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

