* kmalloc memory slower than malloc
@ 2013-09-06 7:48 Thommy Jakobsson
2013-09-06 8:07 ` Russell King - ARM Linux
2013-09-06 9:12 ` Lucas Stach
0 siblings, 2 replies; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-06 7:48 UTC (permalink / raw)
To: linux-arm-kernel
Hi,
I'm doing a project where I use DMA and a DMA-capable buffer in a driver. This
buffer is then mmap'ed to userspace, and the driver notifies userspace
when the device has filled the buffer. Pretty standard setup, I think.
The initial problem was that I noticed that the buffer I got through
dma_alloc_coherent was very slow to step through in my userspace program.
I figured it was because the memory has to be coherent (my hw doesn't
have cache coherence for DMA), so I probably got memory with the cache
turned off. So I switched to kmalloc and dma_map_single; the plan was to
get more speed by doing cache invalidations instead.
After switching to kmalloc in the driver I still got lousy performance,
though. I ran the test driver and program below on a
Marvell Kirkwood 88F6281 (ARM9E, ARMv5TE) and an i.MX6 (Cortex-A9 MP, ARMv7)
with similar results. The test program loops through a 4k buffer
10000 times, just adding up all the bytes and measuring how long it takes.
On the Kirkwood I get the following printout:
pa_dmabuf = 0x195d8000
va_dmabuf = 0x401e4000
pa_kmbuf = 0x19418000
va_kmbuf = 0x4031c000
dma_alloc_coherent 3037365us
kmalloc 3039321us
malloc 823403us
As you can see, the kmalloc buffer is ~3-4 times slower to step through than
a normal malloc one. The addresses at the beginning are just printouts of
where the buffers end up, both physical and virtual (in userspace) addresses.
I would have expected the kmalloc buffer to be roughly as fast as a
malloc one. Any ideas what I am doing wrong? Or are the assumptions
wrong?
BR,
Thommy
relevant driver part:
------------------------------------------------------------------
static long device_ioctl(struct file *file,
			 unsigned int cmd, unsigned long arg)
{
	dma_addr_t pa = 0;

	printk("entering ioctl cmd %d\r\n", cmd);
	switch (cmd) {
	case DMAMEM:
		va_dmabuf = dma_alloc_coherent(0, BUFSIZE, &pa, GFP_KERNEL | GFP_DMA);
		pa_dmabuf = pa;
		break;
	case KMEM:
		va_kmbuf = kmalloc(BUFSIZE, GFP_KERNEL);
		//pa = dma_map_single(0, va_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
		pa = __pa(va_kmbuf);
		pa_kmbuf = pa;
		break;
	case DMAMEM_REL:
		dma_free_coherent(0, BUFSIZE, va_dmabuf, pa_dmabuf);
		break;
	case KMEM_REL:
		kfree(va_kmbuf);
		break;
	default:
		break;
	}

	printk("allocated pa = 0x%08X\r\n", pa);

	if (copy_to_user((void *)arg, &pa, sizeof(pa)))
		return -EFAULT;
	return 0;
}
static int device_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size;
	int res = 0;

	size = vma->vm_end - vma->vm_start;
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	if (remap_pfn_range(vma, vma->vm_start,
			    vma->vm_pgoff, size, vma->vm_page_prot)) {
		res = -ENOBUFS;
		goto device_mmap_exit;
	}

	vma->vm_flags &= ~VM_IO; /* using shared anonymous pages */

device_mmap_exit:
	return res;
}
relevant parts of userspace program
-----------------------------------------------------------------
/*
 * alloc memory with dma_alloc_coherent
 */
ioctl(fd, DMAMEM, &pa_dmabuf);
if (pa_dmabuf == 0) {
	printf("no dma pa returned\r\n");
	goto exito;
} else {
	printf("pa_dmabuf = %p\r\n", (void *)pa_dmabuf);
}

va_dmabuf = mmap(NULL, BUFSIZE, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, pa_dmabuf);
if (va_dmabuf == MAP_FAILED) {
	perror("no valid va for dmabuf");
	goto exito;
} else {
	printf("va_dmabuf = %p\r\n", va_dmabuf);
}

/*
 * alloc memory with kmalloc
 */
ioctl(fd, KMEM, &pa_kmbuf);
if (pa_kmbuf == 0) {
	printf("no kmalloc pa returned\r\n");
	goto exito;
} else {
	printf("pa_kmbuf = %p\r\n", (void *)pa_kmbuf);
}

va_kmbuf = mmap(NULL, BUFSIZE, PROT_READ | PROT_WRITE,
		MAP_SHARED, fd, pa_kmbuf);
if (va_kmbuf == MAP_FAILED) {
	perror("no valid va for kmbuf");
	goto exito;
} else {
	printf("va_kmbuf = %p\r\n", va_kmbuf);
}
/*
 * test speed of dma_alloc_coherent buffer
 */
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		va_dmabuf[i]++;
}
gettimeofday(&t2, NULL);
printf("dma_alloc_coherent %ldus\n",
       (t2.tv_sec-t1.tv_sec)*1000000+(t2.tv_usec-t1.tv_usec));

/*
 * test speed of kmalloc buffer
 */
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		va_kmbuf[i]++;
}
gettimeofday(&t2, NULL);
printf("kmalloc %ldus\n",
       (t2.tv_sec-t1.tv_sec)*1000000+(t2.tv_usec-t1.tv_usec));

/*
 * test speed of malloc
 */
va_mbuf = malloc(BUFSIZE);
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		va_mbuf[i]++;
}
gettimeofday(&t2, NULL);
printf("malloc %ldus\n",
       (t2.tv_sec-t1.tv_sec)*1000000+(t2.tv_usec-t1.tv_usec));
^ permalink raw reply [flat|nested] 18+ messages in thread
* kmalloc memory slower than malloc
2013-09-06 7:48 kmalloc memory slower than malloc Thommy Jakobsson
@ 2013-09-06 8:07 ` Russell King - ARM Linux
2013-09-06 9:04 ` Thommy Jakobsson
2013-09-06 9:12 ` Lucas Stach
1 sibling, 1 reply; 18+ messages in thread
From: Russell King - ARM Linux @ 2013-09-06 8:07 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Sep 06, 2013 at 09:48:02AM +0200, Thommy Jakobsson wrote:
> Hi,
>
> I'm doing a project where I use DMA and a DMA-capable buffer in a driver. This
> buffer is then mmap'ed to userspace, and the driver notifies userspace
> when the device has filled the buffer. Pretty standard setup, I think.
Your driver appears to be exposing physical addresses to userspace.
This is a no-go. This is a massive security hole - it allows userspace
to map any physical address and write into that memory. That includes
system flash and all system RAM.
This gives userspace a way to overwrite the kernel with exploits,
retrieve sensitive and/or personal data, etc.
Therefore, I will not provide any assistance with this. Please change
your approach so you do not need physical addresses in userspace.
I know that some closed source libraries, particularly GPU and video
decode libraries like to take this approach. Everyone should be aware
that such approaches bypass all system security, especially if the GPU
or video device is accessible to any userspace process.
In your case, all it takes is for your device driver's special device node
to be accessible to any userspace process.
* kmalloc memory slower than malloc
2013-09-06 8:07 ` Russell King - ARM Linux
@ 2013-09-06 9:04 ` Thommy Jakobsson
0 siblings, 0 replies; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-06 9:04 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 6 Sep 2013, Russell King - ARM Linux wrote:
> Your driver appears to be exposing physical addresses to userspace.
> This is a no-go. This is a massive security hole - it allows userspace
> to map any physical address and write into that memory. That includes
> system flash and all system RAM.
Sorry Russell, maybe I was unclear; the attached test was just a quick
hack to be able to compare kmalloc and dma_alloc_coherent with malloc. This
is not code that is part of the actual driver.
> This gives userspace a way to overwrite the kernel with exploits,
> retrieve sensitive and/or personal data, etc.
>
> Therefore, I will not provide any assistance with this. Please change
> your approach so you do not need physical addresses in userspace.
I see your point, but as I said, I do not do that in my driver. I do map
the DMA buffer into userspace though, so I expose a virtual mapping to
userspace. I understood that to be a "normal" approach for speeding
things up, or do you consider that to be wrong as well?
thanks,
Thommy
* kmalloc memory slower than malloc
2013-09-06 7:48 kmalloc memory slower than malloc Thommy Jakobsson
2013-09-06 8:07 ` Russell King - ARM Linux
@ 2013-09-06 9:12 ` Lucas Stach
2013-09-06 9:36 ` Thommy Jakobsson
2013-09-10 9:54 ` Thommy Jakobsson
1 sibling, 2 replies; 18+ messages in thread
From: Lucas Stach @ 2013-09-06 9:12 UTC (permalink / raw)
To: linux-arm-kernel
Hi Thommy,
Am Freitag, den 06.09.2013, 09:48 +0200 schrieb Thommy Jakobsson:
> Hi,
>
> I'm doing a project where I use DMA and a DMA-capable buffer in a driver. This
> buffer is then mmap'ed to userspace, and the driver notifies userspace
> when the device has filled the buffer. Pretty standard setup, I think.
>
> The initial problem was that I noticed that the buffer I got through
> dma_alloc_coherent was very slow to step through in my userspace program.
> I figured it was because the memory has to be coherent (my hw doesn't
> have cache coherence for DMA), so I probably got memory with the cache
> turned off. So I switched to kmalloc and dma_map_single; the plan was to
> get more speed by doing cache invalidations instead.
>
> After switching to kmalloc in the driver I still got lousy performance,
> though. I ran the test driver and program below on a
> Marvell Kirkwood 88F6281 (ARM9E, ARMv5TE) and an i.MX6 (Cortex-A9 MP, ARMv7)
> with similar results. The test program loops through a 4k buffer
> 10000 times, just adding up all the bytes and measuring how long it takes.
> On the Kirkwood I get the following printout:
> On the kirkwood I get the following printout:
>
> pa_dmabuf = 0x195d8000
> va_dmabuf = 0x401e4000
> pa_kmbuf = 0x19418000
> va_kmbuf = 0x4031c000
> dma_alloc_coherent 3037365us
> kmalloc 3039321us
> malloc 823403us
>
> As you can see, the kmalloc buffer is ~3-4 times slower to step through than
> a normal malloc one. The addresses at the beginning are just printouts of
> where the buffers end up, both physical and virtual (in userspace) addresses.
>
> I would have expected the kmalloc buffer to be roughly as fast as a
> malloc one. Any ideas what I am doing wrong? Or are the assumptions
> wrong?
>
>
> BR,
> Thommy
>
> relevant driver part:
> ------------------------------------------------------------------
> static long device_ioctl(struct file *file,
> 			 unsigned int cmd, unsigned long arg)
> {
> 	dma_addr_t pa = 0;
>
> 	printk("entering ioctl cmd %d\r\n", cmd);
> 	switch (cmd) {
> 	case DMAMEM:
> 		va_dmabuf = dma_alloc_coherent(0, BUFSIZE, &pa, GFP_KERNEL | GFP_DMA);
> 		pa_dmabuf = pa;
> 		break;
> 	case KMEM:
> 		va_kmbuf = kmalloc(BUFSIZE, GFP_KERNEL);
> 		//pa = dma_map_single(0, va_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
> 		pa = __pa(va_kmbuf);
> 		pa_kmbuf = pa;
> 		break;
> 	case DMAMEM_REL:
> 		dma_free_coherent(0, BUFSIZE, va_dmabuf, pa_dmabuf);
> 		break;
> 	case KMEM_REL:
> 		kfree(va_kmbuf);
> 		break;
> 	default:
> 		break;
> 	}
>
> 	printk("allocated pa = 0x%08X\r\n", pa);
>
> 	if (copy_to_user((void *)arg, &pa, sizeof(pa)))
> 		return -EFAULT;
> 	return 0;
> }
>
> static int device_mmap(struct file *filp, struct vm_area_struct *vma)
> {
> 	unsigned long size;
> 	int res = 0;
>
> 	size = vma->vm_end - vma->vm_start;
> 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>
This is the relevant part where you are mapping things uncached into
userspace, so no wonder it is slower than cached malloc memory. If you
want to use cached userspace mappings you need bracketed MMAP access,
where you tell the kernel by using an ioctl or something that userspace
is accessing the mapping so it can flush/invalidate caches at the right
points in time.
Before doing so read up on how conflicting page mappings can lead to
undefined behavior on ARMv7 systems and consider the consequences
carefully. If you aren't sure you understood the problem fully and know
how to mitigate the problems, back out and live with an uncached or
writecombined mapping.
> 	if (remap_pfn_range(vma, vma->vm_start,
> 			    vma->vm_pgoff, size, vma->vm_page_prot)) {
> 		res = -ENOBUFS;
> 		goto device_mmap_exit;
> 	}
>
> 	vma->vm_flags &= ~VM_IO; /* using shared anonymous pages */
>
> device_mmap_exit:
> 	return res;
> }
[...]
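(Illustration, not part of the original mail: a rough sketch of the "bracketed" access Lucas describes, written as extra cases for the test driver's ioctl switch. The BUF_CPU_GRAB/BUF_CPU_RELEASE command names and the `dev` device pointer are hypothetical; the dma_sync_single_* calls are the real streaming-DMA API, but treat the fragment as a sketch, not a drop-in implementation.)

```
/* Hypothetical bracketing ioctls: userspace calls BUF_CPU_GRAB before
 * reading the cached, mmap'ed kmalloc buffer and BUF_CPU_RELEASE when
 * it is done, so the kernel can maintain the caches at the right time. */
case BUF_CPU_GRAB:
	/* device finished writing: invalidate stale CPU cache lines */
	dma_sync_single_for_cpu(dev, pa_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
	break;
case BUF_CPU_RELEASE:
	/* hand the buffer back to the device */
	dma_sync_single_for_device(dev, pa_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
	break;
```

With a scheme like this the userspace mapping can stay cached; correctness then depends on userspace actually calling the bracketing ioctls around every access window.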
Regards,
Lucas
--
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* kmalloc memory slower than malloc
2013-09-06 9:12 ` Lucas Stach
@ 2013-09-06 9:36 ` Thommy Jakobsson
2013-09-10 9:54 ` Thommy Jakobsson
1 sibling, 0 replies; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-06 9:36 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 6 Sep 2013, Lucas Stach wrote:
> > static int device_mmap(struct file *filp, struct vm_area_struct *vma)
> > {
> > 	unsigned long size;
> > 	int res = 0;
> >
> > 	size = vma->vm_end - vma->vm_start;
> > 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> >
> This is the relevant part where you are mapping things uncached into
> userspace, so no wonder it is slower than cached malloc memory. If you
> want to use cached userspace mappings you need bracketed MMAP access,
> where you tell the kernel by using an ioctl or something that userspace
> is accessing the mapping so it can flush/invalidate caches at the right
> points in time.
Well, that explains it. You'd think that calling a function named
"noncached" would have been a tell, but apparently not =). Thanks, Lucas,
for spotting that. I shouldn't copy and paste so much, I guess.
> Before doing so read up on how conflicting page mappings can lead to
> undefined behavior on ARMv7 systems and consider the consequences
> carefully. If you aren't sure you understood the problem fully and know
> how to mitigate the problems, back out and live with an uncached or
> writecombined mapping.
I have read up a bit on it, but isn't it the case that I have conflicting
page mappings in my test right now, since the kernel accesses the buffer
as cached and userspace as noncached? If I used cached mappings in both
places, I assume I wouldn't have conflicting page mappings?
Thanks,
Thommy
* kmalloc memory slower than malloc
2013-09-06 9:12 ` Lucas Stach
2013-09-06 9:36 ` Thommy Jakobsson
@ 2013-09-10 9:54 ` Thommy Jakobsson
2013-09-10 10:10 ` Lucas Stach
2013-09-10 11:41 ` Russell King - ARM Linux
1 sibling, 2 replies; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-10 9:54 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 6 Sep 2013, Lucas Stach wrote:
> This is the relevant part where you are mapping things uncached into
> userspace, so no wonder it is slower than cached malloc memory. If you
> want to use cached userspace mappings you need bracketed MMAP access,
> where you tell the kernel by using an ioctl or something that userspace
> is accessing the mapping so it can flush/invalidate caches at the right
> points in time.
Removing the pgprot_noncached() seems to make things more like what I
expected. Both buffers take about the same time to traverse in userspace.
Thanks.
I changed the code in my test program and driver to do the same thing in
kernelspace as well. And now I don't understand the result: stepping
through and adding all the bytes in a page-sized buffer is about 4-5 times
faster in the kernel. These are the times for looping through the buffer
10000 times on an i.MX6:
dma_alloc_coherent in kernel 4.256s (s=0)
kmalloc in kernel 0.126s (s=86700000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.566s (s=86700000)
malloc in userspace 0.566s (s=0)
The 's' inside the parentheses is the resulting sum. See below for the
actual code. I've read that the L2 cache controller (PL310) in the i.MX6
has speculative read, so I assume it is a performance win to have the
memory physically contiguous (like kmalloc). But that should be the same
after I have mapped it to userspace as well, right? There is no other
load on the target during the test run.
I don't really understand the different pgprot flags (some are obvious,
like L_PTE_MT_UNCACHED of course), so maybe I still have some errors in my
mmap. Can someone point me in the right direction, or does anyone have
ideas why it is so much faster in the kernel?
Thanks,
Thommy
code from testdriver:
--------------------
static long device_ioctl(struct file *file,
			 unsigned int cmd, unsigned long arg)
{
	dma_addr_t pa = 0;
	int i, j;
	unsigned long s = 0;

	printk("entering ioctl cmd %d\r\n", cmd);
	switch (cmd) {
	case DMAMEM:
		va_dmabuf = dma_alloc_coherent(0, BUFSIZE, &pa, GFP_KERNEL | GFP_DMA);
		//memset(va_dmabuf, 0, BUFSIZE);
		//va_dmabuf[15] = 23;
		pa_dmabuf = pa;
		printk("kernel va_dmabuf: 0x%p, pa_dmabuf 0x%08X\r\n", va_dmabuf, pa_dmabuf);
		break;
	case DMAMEM_TEST:
		for (j = 0; j < LOOPCNT; j++) {
			for (i = 0; i < BUFSIZE; i++)
				s += va_dmabuf[i];
		}
		break;
	case KMEM:
		va_kmbuf = kmalloc(BUFSIZE, GFP_KERNEL);
		//pa = virt_to_phys(va_kmbuf);
		//pa = __pa(va_kmbuf);
		pa = dma_map_single(0, va_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
		pa_kmbuf = pa;
		dma_sync_single_for_cpu(0, pa_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
		//memset(va_kmbuf, 0, BUFSIZE);
		//va_kmbuf[10] = 11;
		printk("kernel va_kmbuf: 0x%p, pa_kmbuf 0x%08X\r\n", va_kmbuf, pa_kmbuf);
		break;
	case KMEM_TEST:
		for (j = 0; j < LOOPCNT; j++) {
			for (i = 0; i < BUFSIZE; i++)
				s += va_kmbuf[i];
		}
		break;
	case DMAMEM_REL:
		dma_free_coherent(0, BUFSIZE, va_dmabuf, pa_dmabuf);
		va_dmabuf = 0;
		break;
	case KMEM_REL:
		kfree(va_kmbuf);
		va_kmbuf = 0;
		break;
	default:
		break;
	}

	if (cmd == DMAMEM_TEST || cmd == KMEM_TEST) {
		if (copy_to_user((void *)arg, &s, sizeof(s)))
			return -EFAULT;
	} else {
		pa_currentbuf = pa;
		if (copy_to_user((void *)arg, &pa, sizeof(pa)))
			return -EFAULT;
	}
	return 0;
}

static int device_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size;
	int res = 0;

	size = vma->vm_end - vma->vm_start;
	//vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	//vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
	//vma->vm_page_prot = __pgprot_modify(vma->vm_page_prot, L_PTE_MT_MASK, L_PTE_MT_WRITEBACK);
	//vma->vm_page_prot = __pgprot_modify(vma->vm_page_prot, L_PTE_MT_MASK, L_PTE_MT_DEV_CACHED);
	//vma->vm_page_prot = __pgprot_modify(vma->vm_page_prot, L_PTE_MT_MASK, L_PTE_MT_WRITETHROUGH);

	if (remap_pfn_range(vma, vma->vm_start,
			    pa_currentbuf >> PAGE_SHIFT, size, vma->vm_page_prot)) {
		res = -ENOBUFS;
		goto device_mmap_exit;
	}

	vma->vm_flags &= ~VM_IO; /* using shared anonymous pages */

device_mmap_exit:
	return res;
}
code from testapplication:
-------------------------
/*
 * test speed of dma_alloc_coherent buffer in kernel
 */
gettimeofday(&t1, NULL);
ioctl(fd, DMAMEM_TEST, &s);
gettimeofday(&t2, NULL);
printf("dma_alloc_coherent in kernel %.3fs (s=%lu)\n",
       ((t2.tv_sec-t1.tv_sec)*1000000+(t2.tv_usec-t1.tv_usec))/1000000.0, s);

/*
 * test speed of kmalloc buffer in kernel
 */
gettimeofday(&t1, NULL);
ioctl(fd, KMEM_TEST, &s);
gettimeofday(&t2, NULL);
printf("kmalloc in kernel %.3fs (s=%lu)\n",
       ((t2.tv_sec-t1.tv_sec)*1000000+(t2.tv_usec-t1.tv_usec))/1000000.0, s);

/*
 * test speed of dma_alloc_coherent buffer
 */
s = 0;
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		s += va_dmabuf[i];
}
gettimeofday(&t2, NULL);
printf("dma_alloc_coherent userspace %.3fs (s=%lu)\n",
       ((t2.tv_sec-t1.tv_sec)*1000000+(t2.tv_usec-t1.tv_usec))/1000000.0, s);

/*
 * test speed of kmalloc buffer
 */
s = 0;
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		s += va_kmbuf[i];
}
gettimeofday(&t2, NULL);
printf("kmalloc in userspace %.3fs (s=%lu)\n",
       ((t2.tv_sec-t1.tv_sec)*1000000+(t2.tv_usec-t1.tv_usec))/1000000.0, s);

/*
 * test speed of malloc
 */
s = 0;
va_mbuf = malloc(BUFSIZE);
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		s += va_mbuf[i];
}
gettimeofday(&t2, NULL);
printf("malloc in userspace %.3fs (s=%lu)\n",
       ((t2.tv_sec-t1.tv_sec)*1000000+(t2.tv_usec-t1.tv_usec))/1000000.0, s);
* kmalloc memory slower than malloc
2013-09-10 9:54 ` Thommy Jakobsson
@ 2013-09-10 10:10 ` Lucas Stach
2013-09-10 10:42 ` Duan Fugang-B38611
2013-09-10 11:27 ` Thommy Jakobsson
2013-09-10 11:41 ` Russell King - ARM Linux
1 sibling, 2 replies; 18+ messages in thread
From: Lucas Stach @ 2013-09-10 10:10 UTC (permalink / raw)
To: linux-arm-kernel
Am Dienstag, den 10.09.2013, 11:54 +0200 schrieb Thommy Jakobsson:
>
> On Fri, 6 Sep 2013, Lucas Stach wrote:
>
> > This is the relevant part where you are mapping things uncached into
> > userspace, so no wonder it is slower than cached malloc memory. If you
> > want to use cached userspace mappings you need bracketed MMAP access,
> > where you tell the kernel by using an ioctl or something that userspace
> > is accessing the mapping so it can flush/invalidate caches at the right
> > points in time.
> Removing the pgprot_noncached() seems to make things more like what I
> expected. Both buffers take about the same time to traverse in userspace.
> Thanks.
>
> I changed the code in my test program and driver to do the same thing in
> kernelspace as well. And now I don't understand the result: stepping
> through and adding all the bytes in a page-sized buffer is about 4-5 times
> faster in the kernel. These are the times for looping through the buffer
> 10000 times on an i.MX6:
> dma_alloc_coherent in kernel 4.256s (s=0)
> kmalloc in kernel 0.126s (s=86700000)
> dma_alloc_coherent userspace 0.566s (s=0)
> kmalloc in userspace 0.566s (s=86700000)
> malloc in userspace 0.566s (s=0)
>
How do you init the kmalloc memory? If you do a memset right before the
test loop your "kmalloc in kernel" will most likely always hit in the L1
cache, that's why it's really fast to do.
The userspace mapping of the kmalloc memory will get a different virtual
address than the kernel mapping. So if you do the memset in kernelspace,
but the test loop in userspace, you'll always miss the cache, as the
ARMv7 caches are virtually indexed. So the processor always fetches data
from memory. The performance advantage over an uncached mapping is
entirely due to the fact that you are fetching whole cache lines
(32 bytes) from memory at once, instead of doing a memory/bus transaction
per byte.
Regards,
Lucas
--
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* kmalloc memory slower than malloc
2013-09-10 10:10 ` Lucas Stach
@ 2013-09-10 10:42 ` Duan Fugang-B38611
2013-09-10 11:28 ` Thommy Jakobsson
2013-09-10 11:27 ` Thommy Jakobsson
1 sibling, 1 reply; 18+ messages in thread
From: Duan Fugang-B38611 @ 2013-09-10 10:42 UTC (permalink / raw)
To: linux-arm-kernel
From: linux-arm-kernel [mailto:linux-arm-kernel-bounces at lists.infradead.org] On Behalf Of Lucas Stach
Date: Tuesday, September 10, 2013 6:10 PM
> To: Thommy Jakobsson
> Cc: linux-arm-kernel at lists.infradead.org
> Subject: Re: kmalloc memory slower than malloc
>
> Am Dienstag, den 10.09.2013, 11:54 +0200 schrieb Thommy Jakobsson:
> >
> > On Fri, 6 Sep 2013, Lucas Stach wrote:
> >
> > > This is the relevant part where you are mapping things uncached into
> > > userspace, so no wonder it is slower than cached malloc memory. If
> > > you want to use cached userspace mappings you need bracketed MMAP
> > > access, where you tell the kernel by using an ioctl or something
> > > that userspace is accessing the mapping so it can flush/invalidate
> > > caches at the right points in time.
> > Removing the pgprot_noncached() seems to make things more like what I
> > expected.
> > Both buffers take about the same time to traverse in userspace. Thanks.
> >
> > I changed the code in my test program and driver to do the same thing
> > in kernelspace as well. And now I don't understand the result:
> > stepping through and adding all the bytes in a page-sized buffer is
> > about 4-5 times faster in the kernel. These are the times for
> > looping through the buffer 10000 times on an i.MX6:
> > dma_alloc_coherent in kernel 4.256s (s=0)
> > kmalloc in kernel 0.126s (s=86700000)
> > dma_alloc_coherent userspace 0.566s (s=0)
> > kmalloc in userspace 0.566s (s=86700000)
> > malloc in userspace 0.566s (s=0)
> >
> How do you init the kmalloc memory? If you do a memset right before the
> test loop your "kmalloc in kernel" will most likely always hit in the L1
> cache, that's why it's really fast to do.
>
> The userspace mapping of the kmalloc memory will get a different virtual
> address than the kernel mapping. So if you do a memset in kernelspace, but
> the test loop in userspace you'll always miss the cache as the ARM
> v7 caches are virtually indexed. So the processor always fetches data from
> memory. The performance advantage against an uncached mapping is entirely
> due to the fact that you are fetching whole cache lines
> (32bytes) from memory at once, instead of doing a memory/bus transaction
> per byte.
>
> Regards,
> Lucas
About the diff:
dma_alloc_coherent in kernel 4.256s (s=0)
dma_alloc_coherent userspace 0.566s (s=0)
I think remap_pfn_range() is called with a page attribute (vma->vm_page_prot) passed in from mmap() that may be cacheable.
So the performance is the same as malloc/kmalloc in userspace.
Regards,
Andy
* kmalloc memory slower than malloc
2013-09-10 10:10 ` Lucas Stach
2013-09-10 10:42 ` Duan Fugang-B38611
@ 2013-09-10 11:27 ` Thommy Jakobsson
1 sibling, 0 replies; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-10 11:27 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 10 Sep 2013, Lucas Stach wrote:
> How do you init the kmalloc memory? If you do a memset right before the
> test loop your "kmalloc in kernel" will most likely always hit in the L1
> cache, that's why it's really fast to do.
I did do a memset previously, but I removed it to see if I still got the
difference. So now I don't initialize the memory at all; the run from which
I attached the times had no initialization at all. Besides, I loop through
all the bytes 10000 times, so I would assume everything is in the cache
after the first loop.
>
> The userspace mapping of the kmalloc memory will get a different virtual
> address than the kernel mapping. So if you do the memset in kernelspace,
> but the test loop in userspace, you'll always miss the cache, as the
> ARMv7 caches are virtually indexed. So the processor always fetches data
> from memory. The performance advantage over an uncached mapping is
> entirely due to the fact that you are fetching whole cache lines
> (32 bytes) from memory at once, instead of doing a memory/bus transaction
> per byte.
I thought that the L1 data cache was physically indexed and tagged,
whereas the instruction cache uses virtual indexing. But maybe I'm
wrong. The L2 cache is physically indexed and tagged though, right?
Thanks,
Thommy
* kmalloc memory slower than malloc
2013-09-10 10:42 ` Duan Fugang-B38611
@ 2013-09-10 11:28 ` Thommy Jakobsson
2013-09-10 11:36 ` Duan Fugang-B38611
0 siblings, 1 reply; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-10 11:28 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 10 Sep 2013, Duan Fugang-B38611 wrote:
> About the diff:
> dma_alloc_coherent in kernel 4.256s (s=0)
> dma_alloc_coherent userspace 0.566s (s=0)
>
> I think remap_pfn_range() is called with a page attribute (vma->vm_page_prot) passed in from mmap() that may be cacheable.
> So the performance is the same as malloc/kmalloc in userspace.
>
That's probably true, or at least that is how I explained it to myself in
my head =)
Thanks,
Thommy
* kmalloc memory slower than malloc
2013-09-10 11:28 ` Thommy Jakobsson
@ 2013-09-10 11:36 ` Duan Fugang-B38611
2013-09-10 11:44 ` Russell King - ARM Linux
0 siblings, 1 reply; 18+ messages in thread
From: Duan Fugang-B38611 @ 2013-09-10 11:36 UTC (permalink / raw)
To: linux-arm-kernel
From: Thommy Jakobsson [mailto:thommyj at gmail.com]
Date: Tuesday, September 10, 2013 7:29 PM
> To: Duan Fugang-B38611
> Cc: Lucas Stach; Thommy Jakobsson; linux-arm-kernel at lists.infradead.org
> Subject: RE: kmalloc memory slower than malloc
>
>
>
> On Tue, 10 Sep 2013, Duan Fugang-B38611 wrote:
>
> > About the diff:
> > dma_alloc_coherent in kernel 4.256s (s=0)
> > dma_alloc_coherent userspace 0.566s (s=0)
> >
> > I think it call remap_pfn_range() with page attribute (vma->vm_page_prot)
> transferred from mmap() maybe cacheable.
> > So the performance is the same as malloc/kmalloc in userspace.
> >
> Thats probably true, or at least that is how I explained it to myself in
> my head =)
>
> Thanks,
> Thommy
Can you add the code below to your device_mmap() to test the performance for the above two cases:
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
I think the performance must be the same.
Regards,
Andy
* kmalloc memory slower than malloc
2013-09-10 9:54 ` Thommy Jakobsson
2013-09-10 10:10 ` Lucas Stach
@ 2013-09-10 11:41 ` Russell King - ARM Linux
2013-09-10 12:54 ` Thommy Jakobsson
1 sibling, 1 reply; 18+ messages in thread
From: Russell King - ARM Linux @ 2013-09-10 11:41 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Sep 10, 2013 at 11:54:03AM +0200, Thommy Jakobsson wrote:
> I changed the code in my test program and driver to do the same thing in
> kernelspace as well. And now I don't understand the result: stepping
> through and adding all the bytes in a page-sized buffer is about 4-5 times
> faster in the kernel. These are the times for looping through the buffer
> 10000 times on an i.MX6:
> dma_alloc_coherent in kernel 4.256s (s=0)
> kmalloc in kernel 0.126s (s=86700000)
> dma_alloc_coherent userspace 0.566s (s=0)
> kmalloc in userspace 0.566s (s=86700000)
> malloc in userspace 0.566s (s=0)
How many times have you verified this result?
So, the obvious question is: does this kernel have kernel preemption
enabled?
The reason for asking is that if you have kernel preemption disabled,
no other thread will get use of the CPU while you're running your
buffer sum, so you'll have all the CPU cycles (with the exception
of interrupt handling) to yourself.
You may also like to consider giving people the full source to your
tests so that it can be run on other platforms as well.
* kmalloc memory slower than malloc
2013-09-10 11:36 ` Duan Fugang-B38611
@ 2013-09-10 11:44 ` Russell King - ARM Linux
2013-09-10 12:42 ` Thommy Jakobsson
0 siblings, 1 reply; 18+ messages in thread
From: Russell King - ARM Linux @ 2013-09-10 11:44 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Sep 10, 2013 at 11:36:34AM +0000, Duan Fugang-B38611 wrote:
> From: Thommy Jakobsson [mailto:thommyj at gmail.com]
> Date: Tuesday, September 10, 2013 7:29 PM
>
> > To: Duan Fugang-B38611
> > Cc: Lucas Stach; Thommy Jakobsson; linux-arm-kernel at lists.infradead.org
> > Subject: RE: kmalloc memory slower than malloc
> >
> >
> >
> > On Tue, 10 Sep 2013, Duan Fugang-B38611 wrote:
> >
> > > About the diff:
> > > dma_alloc_coherent in kernel 4.256s (s=0)
> > > dma_alloc_coherent userspace 0.566s (s=0)
> > >
> > > I think it call remap_pfn_range() with page attribute (vma->vm_page_prot)
> > transferred from mmap() maybe cacheable.
> > > So the performance is the same as malloc/kmalloc in userspace.
> > >
> > Thats probably true, or at least that is how I explained it to myself in
> > my head =)
> >
> > Thanks,
> > Thommy
>
> Can you add below code to your device_mmap() to test the performance for above two cases:
> vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
No, that does not match the page table settings that dma_mmap_coherent()
would use. That gets you strongly-ordered memory, which will be
(a) a violation of the ARM architecture requirements, being a different
"memory type", and (b) a different mapping type compared to that used
by the virtual address returned from dma_alloc_coherent().
The appropriate modification here would be pgprot_dmacoherent().
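(Illustration, not from the original mail: instead of open-coding remap_pfn_range() with hand-picked pgprot flags, a dma_alloc_coherent() buffer can be handed to userspace with dma_mmap_coherent(), which lets the DMA layer apply page attributes matching the kernel-side coherent mapping. Variable names follow the test driver above; the `dev` pointer is an assumed struct device the buffer was allocated against.)

```
/* Sketch of an mmap handler for the coherent buffer that delegates the
 * choice of page attributes to the DMA mapping layer. */
static int device_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > BUFSIZE)
		return -EINVAL;

	/* maps va_dmabuf/pa_dmabuf into the vma with coherent attributes */
	return dma_mmap_coherent(dev, vma, va_dmabuf, pa_dmabuf, size);
}
```

This avoids the conflicting-attribute problem discussed above, since both the kernel and userspace mappings are created with the same memory type.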
* kmalloc memory slower than malloc
2013-09-10 11:44 ` Russell King - ARM Linux
@ 2013-09-10 12:42 ` Thommy Jakobsson
2013-09-10 12:50 ` Russell King - ARM Linux
0 siblings, 1 reply; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-10 12:42 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 10 Sep 2013, Russell King - ARM Linux wrote:
> On Tue, Sep 10, 2013 at 11:36:34AM +0000, Duan Fugang-B38611 wrote:
> > From: Thommy Jakobsson [mailto:thommyj at gmail.com]
> > Data: Tuesday, September 10, 2013 7:29 PM
> >
> > > To: Duan Fugang-B38611
> > > Cc: Lucas Stach; Thommy Jakobsson; linux-arm-kernel at lists.infradead.org
> > > Subject: RE: kmalloc memory slower than malloc
> > >
> > >
> > >
> > > On Tue, 10 Sep 2013, Duan Fugang-B38611 wrote:
> > >
> > > > About the diff:
> > > > dma_alloc_coherent in kernel 4.256s (s=0)
> > > > dma_alloc_coherent userspace 0.566s (s=0)
> > > >
> > > > I think it call remap_pfn_range() with page attribute (vma->vm_page_prot)
> > > transferred from mmap() maybe cacheable.
> > > > So the performance is the same as malloc/kmalloc in userspace.
> > > >
> > > Thats probably true, or at least that is how I explained it to myself in
> > > my head =)
> > >
> > > Thanks,
> > > Thommy
> >
> > Can you add below code to your device_mmap() to test the performance for above two cases:
> > vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>
> No, that does not match the page table settings that dma_mmap_coherent
> would use. That gets you strongly ordered memory, which will be
> (a) a violation of the ARM architecture requirements, being a different
> "memory type", and (b) a different mapping type compared to
> that used by the virtual address returned from dma_alloc_coherent().
>
> The appropriate modification here would be pgprot_dmacoherent().
Using pgprot_dmacoherent() in mmap they look more similar. Still
~10-15% difference, but maybe that is normal for kernel/userspace.
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=81370000)
dma_alloc_coherent userspace 4.907s (s=0)
kmalloc in userspace 1.815s (s=81370000)
malloc in userspace 0.566s (s=0)
Note that I was lazy and used the same pgprot for all mappings now, which
I guess is a violation.
//thommy
* kmalloc memory slower than malloc
2013-09-10 12:42 ` Thommy Jakobsson
@ 2013-09-10 12:50 ` Russell King - ARM Linux
2013-09-12 15:58 ` Thommy Jakobsson
0 siblings, 1 reply; 18+ messages in thread
From: Russell King - ARM Linux @ 2013-09-10 12:50 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Sep 10, 2013 at 02:42:17PM +0200, Thommy Jakobsson wrote:
> Using pgprot_dmacoherent() in mmap they look more similar. Still
> ~10-15% difference, but maybe that is normal for kernel/userspace.
>
> dma_alloc_coherent in kernel 4.257s (s=0)
> kmalloc in kernel 0.126s (s=81370000)
> dma_alloc_coherent userspace 4.907s (s=0)
> kmalloc in userspace 1.815s (s=81370000)
> malloc in userspace 0.566s (s=0)
>
> Note that I was lazy and used the same pgprot for all mappings now, which
> I guess is a violation.
What it means is that the results you end up with are documented to be
"unpredictable" which gives scope to manufacturers to come up with any
behaviour they desire in that situation - and it doesn't have to be
consistent.
What that means is that if you have an area of physical memory mapped as
"normal memory cacheable" and it's also mapped "strongly ordered" elsewhere,
it is entirely legal for an access via the strongly ordered mapping to
hit the cache if a cache line exists, whereas another implementation
may miss the cache line if it exists.
Furthermore, with such mappings (and this has been true since ARMv3 days)
if you have two such mappings - one cacheable and one non-cacheable, and
the cacheable mapping has dirty cache lines, the dirty cache lines can be
evicted at any moment, overwriting whatever you're doing via the non-
cacheable mapping.
I've recently had a hard-to-track bug doing exactly that in a non-mainline
kernel on ARMv7 because someone decided it was a good idea to bypass my
test in arch/arm/mm/ioremap.c preventing system RAM being ioremap()d. It
led to one boot in 20-ish locking up because a GPU command stream was
being overwritten by the dirty cache lines being evicted after the GPU
had started to read from that memory - or, if you typed "reboot" at the
right moment during a previous boot, you could get it to occur 100% of
the time.
I notice you turn off VM_IO - you don't want to do that...
* kmalloc memory slower than malloc
2013-09-10 11:41 ` Russell King - ARM Linux
@ 2013-09-10 12:54 ` Thommy Jakobsson
0 siblings, 0 replies; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-10 12:54 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 10 Sep 2013, Russell King - ARM Linux wrote:
> On Tue, Sep 10, 2013 at 11:54:03AM +0200, Thommy Jakobsson wrote:
> > I changed the code in my testprogram and driver to do the same thing in
> > kernelspace as well. And now I don't understand the result. The result
> > stepping through and adding all bytes in a page sized buffer is about 4-5
> > times faster to do in the kernel. This is the times for looping through
> > the buffer 10000 times on a imx6:
> > dma_alloc_coherent in kernel 4.256s (s=0)
> > kmalloc in kernel 0.126s (s=86700000)
> > dma_alloc_coherent userspace 0.566s (s=0)
> > kmalloc in userspace 0.566s (s=86700000)
> > malloc in userspace 0.566s (s=0)
>
> How many times have you verified this result?
I haven't done any scientific study, but at least 20 times with restarts in
between. I got a similar result on another HW as well, but haven't checked
what kernel or config I was running there. So it might not be the same thing.
Also, each buffer is looped 10000 times, which I assume would remove the
most severe randomness at least.
> So, the obvious question is: does this kernel have kernel preemption
> enabled?
>
> The reason for asking that is that if you have kernel preemption
> disabled, while your running your buffer sum, no other thread will get
> use of the CPU, so you'll have all the CPU cycles (with the exception
> of interrupt handling) to yourself.
>
> That won't be true in userspace.
It should be enabled:
zcat /proc/config.gz | grep PREEM
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
I wouldn't be surprised if doing things in the kernel is quicker, it's
just the amount that surprises me.
>
> You may also like to consider giving people the full source to your
> tests so that it can be run on other platforms as well.
>
Sure thing, I didn't include it in the mail to avoid cluttering it up too
much. One can find it here:
https://github.com/thommyj/buf-speedtest
Thanks,
Thommy
* kmalloc memory slower than malloc
2013-09-10 12:50 ` Russell King - ARM Linux
@ 2013-09-12 15:58 ` Thommy Jakobsson
2013-09-12 16:19 ` Russell King - ARM Linux
0 siblings, 1 reply; 18+ messages in thread
From: Thommy Jakobsson @ 2013-09-12 15:58 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 10 Sep 2013, Russell King - ARM Linux wrote:
> On Tue, Sep 10, 2013 at 02:42:17PM +0200, Thommy Jakobsson wrote:
> > Using pgprot_dmacoherent() in mmap they look more similar. Still
> > ~10-15% difference, but maybe that is normal for kernel/userspace.
> >
> > dma_alloc_coherent in kernel 4.257s (s=0)
> > kmalloc in kernel 0.126s (s=81370000)
> > dma_alloc_coherent userspace 4.907s (s=0)
> > kmalloc in userspace 1.815s (s=81370000)
> > malloc in userspace 0.566s (s=0)
> >
> > Note that I was lazy and used the same pgprot for all mappings now, which
> > I guess is a violation.
>
> What it means is that the results you end up with are documented to be
> "unpredictable" which gives scope to manufacturers to come up with any
> behaviour they desire in that situation - and it doesn't have to be
> consistent.
>
> What that means is that if you have an area of physical memory mapped as
> "normal memory cacheable" and it's also mapped "strongly ordered" elsewhere,
> it is entirely legal for an access via the strongly ordered mapping to
> hit the cache if a cache line exists, whereas another implementation
> may miss the cache line if it exists.
>
> Furthermore, with such mappings (and this has been true since ARMv3 days)
> if you have two such mappings - one cacheable and one non-cacheable, and
> the cacheable mapping has dirty cache lines, the dirty cache lines can be
> evicted at any moment, overwriting whatever you're doing via the non-
> cacheable mapping.
But isn't the memory received with dma_alloc_coherent() given a noncached
mapping? Or even strongly ordered? Will that not conflict with the normal
kernel mapping, which is cached?
Are all the mappings documented somewhere, i.e. which Linux mapping
corresponds to which mapping in the MMU? It seems the ARMv7 documentation
isn't free either, which isn't making things easier for me.
Coming back to the original issue; disassembling the code I noticed that
the userspace code looked really stupid, with a lot of unnecessary memory
accesses. The kernel code looked much better. Even after commenting out the
actual memory access in userspace, leaving just the loop itself, I got
terrible times.
Previous times:
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=68620000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.566s (s=68620000)
malloc in userspace 0.566s (s=0)
Commenting out actual memory access (loop not optimized away when checking
assembler):
dma_alloc_coherent in kernel 4.256s (s=0)
kmalloc in kernel 0.126s (s=84750000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.412s (s=0) << just looping
malloc in userspace 0.566s (s=0)
The kernel is built with -O2, so compiling the test program with -O2 as
well yields more reasonable results:
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=84560000)
dma_alloc_coherent userspace 0.124s (s=0)
kmalloc in userspace 0.124s (s=84560000)
malloc in userspace 0.113s (s=0)
As can be seen, all userspace times were cut to 1/4-1/5. malloc
is now a bit faster than kmalloc. It could be faster if the physical memory
is spread out over different banks, but on the other hand cache prefetching
should be easier if the memory is contiguous.
> I notice you turn off VM_IO - you don't want to do that...
Fixed
Thanks for all help,
Thommy
* kmalloc memory slower than malloc
2013-09-12 15:58 ` Thommy Jakobsson
@ 2013-09-12 16:19 ` Russell King - ARM Linux
0 siblings, 0 replies; 18+ messages in thread
From: Russell King - ARM Linux @ 2013-09-12 16:19 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Sep 12, 2013 at 05:58:22PM +0200, Thommy Jakobsson wrote:
>
>
> On Tue, 10 Sep 2013, Russell King - ARM Linux wrote:
> > What it means is that the results you end up with are documented to be
> > "unpredictable" which gives scope to manufacturers to come up with any
> > behaviour they desire in that situation - and it doesn't have to be
> > consistent.
> >
> > What that means is that if you have an area of physical memory mapped as
> > "normal memory cacheable" and it's also mapped "strongly ordered" elsewhere,
> > it is entirely legal for an access via the strongly ordered mapping to
> > hit the cache if a cache line exists, whereas another implementation
> > may miss the cache line if it exists.
> >
> > Furthermore, with such mappings (and this has been true since ARMv3 days)
> > if you have two such mappings - one cacheable and one non-cacheable, and
> > the cacheable mapping has dirty cache lines, the dirty cache lines can be
> > evicted at any moment, overwriting whatever you're doing via the non-
> > cacheable mapping.
>
> But isn't the memory received with dma_alloc_coherent() given a noncached
> mapping? or even strongly ordered? Will that not conflict with the normal
> kernel mapping which is cached?
dma_alloc_coherent() and dma_map_single()/dma_map_page() both know about
the issues and deal with any dirty cache lines - they also try and map
the memory as compatibly as possible with any existing mapping.
On pre-ARMv6, dma_alloc_coherent() will provide memory which is "non-cached
non-bufferable" - C = B = 0. This is also called "strongly ordered" on
ARMv6 and later. You get this with pgprot_uncached(), or
pgprot_dmacoherent() on pre-ARMv6 architectures.
On ARMv6+, it provides memory which is "memory like, uncached". This
is what you get when you use pgprot_dmacoherent() on ARMv6 or later.
On ARMv6+, there are three classes of mapping: strongly ordered, device,
and memory-like. Strongly ordered and device are both non-cacheable.
However, memory-like can be cacheable, and the cache properties can be
specified. All mappings of a physical address _should_ be of the same
"class".
dma_map_single()/dma_map_page() deal with the problem completely
differently - they don't set up a new mapping; instead they perform
manual cache maintenance to ensure that the data is appropriately
visible to either the CPU or the DMA engine after the appropriate
call(s).
> Coming back to the original issue; disassembling the code I noticed that
> the userspace code looked really stupid with a lot of unnecessary memory
> accesses. Kernel looked much better. Even after commenting the actual
> memory access out in userspace, leaving just the loop itself, I got
> terrible times.
Oh, you're not specifying any optimisation whatsoever? That'll be
the reason then - the compiler won't do _any_ optimisation unless you
ask it to. That means it'll do stuff like saving an iterator out on
the stack and then immediately reading it back in, incrementing it, and
writing it back out again.
> Kernel is with -O2 so compiling the testprogram with -O2 aswell yield more
> reasonable results:
> dma_alloc_coherent in kernel 4.257s (s=0)
> kmalloc in kernel 0.126s (s=84560000)
> dma_alloc_coherent userspace 0.124s (s=0)
> kmalloc in userspace 0.124s (s=84560000)
> malloc in userspace 0.113s (s=0)
Great, glad you solved it.
Note however that the kmalloc version is not representative of what's
required for the CPU to provide or read DMA data: between the CPU accessing
the data and the DMA engine accessing it, there needs to be a cache flush,
which will consume additional time. That's where the dma_map_*,
dma_unmap_* and dma_sync_* functions come in.
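For illustration, a sketch of the streaming-DMA pattern being described, for a device-to-memory transfer (the variables `dev`, `buf` and `len` are hypothetical, assumed to come from the surrounding driver):

```c
/* Sketch of the streaming-DMA pattern: kmalloc()ed memory stays
 * cacheable, and cache maintenance happens around each transfer. */
dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
if (dma_mapping_error(dev, dma))
	return -ENOMEM;

/* ... program the DMA engine with 'dma', wait for completion ... */

/* Invalidate stale cache lines so the CPU sees the device's data. */
dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);

/* The CPU may now read 'buf' through its normal cached mapping. */
```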
Thread overview: 18+ messages
2013-09-06 7:48 kmalloc memory slower than malloc Thommy Jakobsson
2013-09-06 8:07 ` Russell King - ARM Linux
2013-09-06 9:04 ` Thommy Jakobsson
2013-09-06 9:12 ` Lucas Stach
2013-09-06 9:36 ` Thommy Jakobsson
2013-09-10 9:54 ` Thommy Jakobsson
2013-09-10 10:10 ` Lucas Stach
2013-09-10 10:42 ` Duan Fugang-B38611
2013-09-10 11:28 ` Thommy Jakobsson
2013-09-10 11:36 ` Duan Fugang-B38611
2013-09-10 11:44 ` Russell King - ARM Linux
2013-09-10 12:42 ` Thommy Jakobsson
2013-09-10 12:50 ` Russell King - ARM Linux
2013-09-12 15:58 ` Thommy Jakobsson
2013-09-12 16:19 ` Russell King - ARM Linux
2013-09-10 11:27 ` Thommy Jakobsson
2013-09-10 11:41 ` Russell King - ARM Linux
2013-09-10 12:54 ` Thommy Jakobsson