Linux virtualization list
* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: Jason Wang @ 2016-06-16  7:08 UTC (permalink / raw)
  To: Jamal Hadi Salim, mst, netdev, linux-kernel, kvm, virtualization,
	davem
  Cc: eric.dumazet, brouer
In-Reply-To: <5761422F.3010303@mojatatu.com>



On 2016-06-15 19:55, Jamal Hadi Salim wrote:
> On 16-06-15 07:52 AM, Jamal Hadi Salim wrote:
>> On 16-06-15 04:38 AM, Jason Wang wrote:
>>> We used to queue tx packets in sk_receive_queue, this is less
>>> efficient since it requires spinlocks to synchronize between producer
>> So this is more exercising the skb array improvements. For tun
>> it would be useful to see general performance numbers on user/kernel
>> crossing (i.e tun read/write).
>> If you have the cycles can you run such tests?
>>
> Ignore my message - you are running pktgen from a VM towards the host.

Actually it was the reverse: the tests were run from an external host to the VM.

Thanks

> So the numbers you posted are what i was interested in.
> Thanks for the good work.
>
> cheers,
> jamal
>

^ permalink raw reply

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Minchan Kim @ 2016-06-16  6:47 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, Sergey Senozhatsky, Naoya Horiguchi,
	Jonathan Corbet, Chan Gyun Jeong, Rafael Aquini, Hugh Dickins,
	linux-kernel, dri-devel, virtualization, John Einar Reitan,
	linux-mm, Chulmin Kim, Gioh Kim, Konstantin Khlebnikov,
	Sangseok Lee, Andrew Morton, Kyeongdon Kim, Joonsoo Kim,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160616052209.GB516@swordfish>

On Thu, Jun 16, 2016 at 02:22:09PM +0900, Sergey Senozhatsky wrote:
> On (06/16/16 13:47), Minchan Kim wrote:
> [..]
> > > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > > 
> > > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> > 
> > Hm, it seems double free.
> > 
> > It doesn't happen if you disable zram? IOW, it seems to be related
> > to zsmalloc migration?
> 
> need to test more, can't confidently answer now.
> 
> > How easily can you reproduce it? Could you bisect it?
> 
> it takes some (um.. random) time to trigger the bug.
> I'll try to come up with more details.

Could you revert [1] and retest?

[1] mm/compaction: split freepages without holding the zone lock

> 
> 	-ss
> 
> > > kernel: flags: 0x8000000000000000()
> > > kernel: page dumped because: nonzero mapcount
> > > kernel: Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek i2c_i801 snd_hda_codec_generic r8169 mii snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich processor mfd_core sch_fq_codel sd_mod hid_generic usb
> > > kernel: CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160615-dbg-00005-gfd11984-dirty #491
> > > kernel:  0000000000000000 ffff8801124c73f8 ffffffff814d69b0 ffffea0004076e00
> > > kernel:  ffffffff81e658a0 ffff8801124c7420 ffffffff811e9b63 0000000000000000
> > > kernel:  ffffea0004076e00 ffffffff81e658a0 ffff8801124c7440 ffffffff811e9ca9
> > > kernel: Call Trace:
> > > kernel:  [<ffffffff814d69b0>] dump_stack+0x68/0x92
> > > kernel:  [<ffffffff811e9b63>] bad_page+0x158/0x1a2
> > > kernel:  [<ffffffff811e9ca9>] free_pages_check_bad+0xfc/0x101
> > > kernel:  [<ffffffff811ee516>] free_hot_cold_page+0x135/0x5de
> > > kernel:  [<ffffffff811eea26>] __free_pages+0x67/0x72
> > > kernel:  [<ffffffff81227c63>] release_freepages+0x13a/0x191
> > > kernel:  [<ffffffff8122b3c2>] compact_zone+0x845/0x1155
> > > kernel:  [<ffffffff8122ab7d>] ? compaction_suitable+0x76/0x76
> > > kernel:  [<ffffffff8122bdb2>] compact_zone_order+0xe0/0x167
> > > kernel:  [<ffffffff8122bcd2>] ? compact_zone+0x1155/0x1155
> > > kernel:  [<ffffffff8122ce88>] try_to_compact_pages+0x2f1/0x648
> > > kernel:  [<ffffffff8122ce88>] ? try_to_compact_pages+0x2f1/0x648
> > > kernel:  [<ffffffff8122cb97>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> > > kernel:  [<ffffffff811ef1ea>] ? get_page_from_freelist+0x2c0/0x133c
> > > kernel:  [<ffffffff811f0350>] __alloc_pages_direct_compact+0xea/0x30d
> > > kernel:  [<ffffffff811f0266>] ? get_page_from_freelist+0x133c/0x133c
> > > kernel:  [<ffffffff811ee3b2>] ? drain_all_pages+0x1d6/0x205
> > > kernel:  [<ffffffff811f21a8>] __alloc_pages_nodemask+0x143d/0x16b6
> > > kernel:  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
> > > kernel:  [<ffffffff811f0d6b>] ? warn_alloc_failed+0x24c/0x24c
> > > kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> > > kernel:  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
> > > kernel:  [<ffffffff81d32ed0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
> > > kernel:  [<ffffffff81d32edc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> > > kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> > > kernel:  [<ffffffff8128f73a>] khugepaged+0x1d4/0x484f
> > > kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> > > kernel:  [<ffffffff810d5bcc>] ? finish_task_switch+0x3de/0x484
> > > kernel:  [<ffffffff81d32f18>] ? _raw_spin_unlock_irq+0x27/0x45
> > > kernel:  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
> > > kernel:  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
> > > kernel:  [<ffffffff81d28bf5>] ? __schedule+0xa4d/0xd16
> > > kernel:  [<ffffffff810cd0de>] kthread+0x252/0x261
> > > kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> > > kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> > > kernel:  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
> > > kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> > > -- Reboot --

^ permalink raw reply

* Re: [PATCH v6v3 02/12] mm: migrate: support non-lru movable page migration
From: Minchan Kim @ 2016-06-16  5:37 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Jonathan Corbet,
	Hugh Dickins, linux-kernel, dri-devel, virtualization,
	John Einar Reitan, linux-mm, Gioh Kim, Mel Gorman, Andrew Morton,
	Joonsoo Kim, Vlastimil Babka
In-Reply-To: <5762200F.5040908@linux.vnet.ibm.com>

On Thu, Jun 16, 2016 at 09:12:07AM +0530, Anshuman Khandual wrote:
> On 06/16/2016 05:56 AM, Minchan Kim wrote:
> > On Wed, Jun 15, 2016 at 12:15:04PM +0530, Anshuman Khandual wrote:
> >> On 06/15/2016 08:02 AM, Minchan Kim wrote:
> >>> Hi,
> >>>
> >>> On Mon, Jun 13, 2016 at 03:08:19PM +0530, Anshuman Khandual wrote:
> >>>>> On 05/31/2016 05:31 AM, Minchan Kim wrote:
> >>>>>>> @@ -791,6 +921,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >>>>>>>  	int rc = -EAGAIN;
> >>>>>>>  	int page_was_mapped = 0;
> >>>>>>>  	struct anon_vma *anon_vma = NULL;
> >>>>>>> +	bool is_lru = !__PageMovable(page);
> >>>>>>>  
> >>>>>>>  	if (!trylock_page(page)) {
> >>>>>>>  		if (!force || mode == MIGRATE_ASYNC)
> >>>>>>> @@ -871,6 +1002,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >>>>>>>  		goto out_unlock_both;
> >>>>>>>  	}
> >>>>>>>  
> >>>>>>> +	if (unlikely(!is_lru)) {
> >>>>>>> +		rc = move_to_new_page(newpage, page, mode);
> >>>>>>> +		goto out_unlock_both;
> >>>>>>> +	}
> >>>>>>> +
> >>>>>
> >>>>> Hello Minchan,
> >>>>>
> >>>>> I might be missing something here but does this implementation support the
> >>>>> scenario where these non LRU pages owned by the driver mapped as PTE into
> >>>>> process page table ? Because the "goto out_unlock_both" statement above
> >>>>> skips all the PTE unmap, putting a migration PTE and removing the migration
> >>>>> PTE steps.
> >>> You're right. Unfortunately, it doesn't support right now but surely,
> >>> it's my TODO after landing this work.
> >>>
> >>> Could you share your usecase?
> >>
> >> Sure.
> > 
> > Thanks a lot!
> > 
> >>
> >> My driver has privately managed non LRU pages which gets mapped into user space
> >> process page table through f_ops->mmap() and vmops->fault() which then updates
> >> the file RMAP (page->mapping->i_mmap) through page_add_file_rmap(page). One thing
> > 
> > Hmm, page_add_file_rmap is not an exported function. How can your driver use it?
> 
> It's not using the function directly; I just reiterated the sequence of functions
> above. (do_set_pte -> page_add_file_rmap) gets called after we grab the page from
> driver through (__do_fault->vma->vm_ops->fault()).
> 
> > Do you use vm_insert_pfn?
> > What type is your vma? VM_PFNMAP or VM_MIXEDMAP?
> 
> I don't use vm_insert_pfn(). Here is the sequence of events by which the user space
> VMA gets the non LRU pages from the driver.
> 
> - Driver registers a character device with 'struct file_operations' binding
> - Then the 'fops->mmap()' just binds the incoming 'struct vma' with a 'struct
>   vm_operations_struct' which provides the 'vmops->fault()' routine which
>   basically traps all page faults on the VMA and provides one page at a time
>   through a driver specific allocation routine which hands over non LRU pages
> 
> The VMA is not anything special as such. It's what we get when we try to do a
> simple mmap() on a file descriptor pointing to a character device. I can
> figure out all the VM_* flags it holds after creation.
> 
> > 
> > I want to make dummy driver to simulate your case.
> 
> Sure. I hope the above mentioned steps will help you but in case you need more
> information, please do let me know.

I understand now. :)
I will test it with a dummy driver and will Cc you when I send a patch.

Thanks.

^ permalink raw reply

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Sergey Senozhatsky @ 2016-06-16  5:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, dri-devel, virtualization, linux-mm,
	Chulmin Kim, Sangseok Lee, Konstantin Khlebnikov, Rafael Aquini,
	Jonathan Corbet, Hugh Dickins, Gioh Kim, Joonsoo Kim, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Naoya Horiguchi, Chan Gyun Jeong,
	linux-kernel, John Einar Reitan, Sergey Senozhatsky,
	Andrew Morton, Kyeongdon Kim
In-Reply-To: <20160616044710.GP17127@bbox>

On (06/16/16 13:47), Minchan Kim wrote:
[..]
> > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > 
> > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> 
> Hm, it seems double free.
> 
> It doesn't happen if you disable zram? IOW, it seems to be related
> to zsmalloc migration?

need to test more, can't confidently answer now.

> How easily can you reproduce it? Could you bisect it?

it takes some (um.. random) time to trigger the bug.
I'll try to come up with more details.

	-ss

> > kernel: flags: 0x8000000000000000()
> > kernel: page dumped because: nonzero mapcount
> > kernel: Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek i2c_i801 snd_hda_codec_generic r8169 mii snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich processor mfd_core sch_fq_codel sd_mod hid_generic usb
> > kernel: CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160615-dbg-00005-gfd11984-dirty #491
> > kernel:  0000000000000000 ffff8801124c73f8 ffffffff814d69b0 ffffea0004076e00
> > kernel:  ffffffff81e658a0 ffff8801124c7420 ffffffff811e9b63 0000000000000000
> > kernel:  ffffea0004076e00 ffffffff81e658a0 ffff8801124c7440 ffffffff811e9ca9
> > kernel: Call Trace:
> > kernel:  [<ffffffff814d69b0>] dump_stack+0x68/0x92
> > kernel:  [<ffffffff811e9b63>] bad_page+0x158/0x1a2
> > kernel:  [<ffffffff811e9ca9>] free_pages_check_bad+0xfc/0x101
> > kernel:  [<ffffffff811ee516>] free_hot_cold_page+0x135/0x5de
> > kernel:  [<ffffffff811eea26>] __free_pages+0x67/0x72
> > kernel:  [<ffffffff81227c63>] release_freepages+0x13a/0x191
> > kernel:  [<ffffffff8122b3c2>] compact_zone+0x845/0x1155
> > kernel:  [<ffffffff8122ab7d>] ? compaction_suitable+0x76/0x76
> > kernel:  [<ffffffff8122bdb2>] compact_zone_order+0xe0/0x167
> > kernel:  [<ffffffff8122bcd2>] ? compact_zone+0x1155/0x1155
> > kernel:  [<ffffffff8122ce88>] try_to_compact_pages+0x2f1/0x648
> > kernel:  [<ffffffff8122ce88>] ? try_to_compact_pages+0x2f1/0x648
> > kernel:  [<ffffffff8122cb97>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> > kernel:  [<ffffffff811ef1ea>] ? get_page_from_freelist+0x2c0/0x133c
> > kernel:  [<ffffffff811f0350>] __alloc_pages_direct_compact+0xea/0x30d
> > kernel:  [<ffffffff811f0266>] ? get_page_from_freelist+0x133c/0x133c
> > kernel:  [<ffffffff811ee3b2>] ? drain_all_pages+0x1d6/0x205
> > kernel:  [<ffffffff811f21a8>] __alloc_pages_nodemask+0x143d/0x16b6
> > kernel:  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
> > kernel:  [<ffffffff811f0d6b>] ? warn_alloc_failed+0x24c/0x24c
> > kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> > kernel:  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
> > kernel:  [<ffffffff81d32ed0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
> > kernel:  [<ffffffff81d32edc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> > kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> > kernel:  [<ffffffff8128f73a>] khugepaged+0x1d4/0x484f
> > kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> > kernel:  [<ffffffff810d5bcc>] ? finish_task_switch+0x3de/0x484
> > kernel:  [<ffffffff81d32f18>] ? _raw_spin_unlock_irq+0x27/0x45
> > kernel:  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
> > kernel:  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
> > kernel:  [<ffffffff81d28bf5>] ? __schedule+0xa4d/0xd16
> > kernel:  [<ffffffff810cd0de>] kthread+0x252/0x261
> > kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> > kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> > kernel:  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
> > kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> > -- Reboot --

^ permalink raw reply

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Minchan Kim @ 2016-06-16  4:47 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, Sergey Senozhatsky, Naoya Horiguchi,
	Jonathan Corbet, Chan Gyun Jeong, Rafael Aquini, Hugh Dickins,
	linux-kernel, dri-devel, virtualization, John Einar Reitan,
	linux-mm, Chulmin Kim, Gioh Kim, Konstantin Khlebnikov,
	Sangseok Lee, Andrew Morton, Kyeongdon Kim, Joonsoo Kim,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160616042343.GA516@swordfish>

On Thu, Jun 16, 2016 at 01:23:43PM +0900, Sergey Senozhatsky wrote:
> On (06/16/16 11:58), Minchan Kim wrote:
> [..]
> > RAX: 2065676162726166 so rax is totally garbage, I think.
> > It means obj_to_head returns garbage because get_first_obj_offset is
> > utter crap because (page_idx / class->pages_per_zspage) was totally
> > wrong.
> > 
> > > 					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >     6408:       f0 0f ba 28 00          lock btsl $0x0,(%rax)
> >  
> > <snip>
> > 
> > > > Could you test with [zsmalloc: keep first object offset in struct page]
> > > > in mmotm?
> > > 
> > > sure, I can.  will it help, tho? we have a race condition here I think.
> > 
> > I guess the root cause is get_first_obj_offset.
> 
> sounds reasonable.
> 
> > Please test with it.
> 
> 
> this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> 
> kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1

Hm, it seems double free.

It doesn't happen if you disable zram? IOW, it seems to be related
to zsmalloc migration?

How easily can you reproduce it? Could you bisect it?
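
For readers decoding the dump above, the "mapcount:-127" value is consistent with the double-free guess. The following is a minimal user-space sketch, not kernel code. It assumes the 4.7-era convention that the dump prints the raw _mapcount field plus one, and that free pages held by the buddy allocator are marked with a raw value of -128 (PAGE_BUDDY_MAPCOUNT_VALUE). The helper names are hypothetical.

```c
#include <stdio.h>

/* Assumption: raw _mapcount marker for pages sitting in the buddy
 * allocator (PAGE_BUDDY_MAPCOUNT_VALUE in 4.7-era kernels). */
#define ASSUMED_BUDDY_MAPCOUNT_VALUE (-128)

/* Hypothetical helper: the dump is assumed to print raw + 1,
 * so undo that to recover the raw field. */
static int printed_to_raw_mapcount(int printed)
{
	return printed - 1;
}

/* Hypothetical helper: does the printed mapcount match the buddy marker? */
static int looks_like_buddy_page(int printed)
{
	return printed_to_raw_mapcount(printed) == ASSUMED_BUDDY_MAPCOUNT_VALUE;
}

/* Print a one-line interpretation of a printed mapcount value. */
static void explain_mapcount(int printed)
{
	printf("printed mapcount %d -> raw %d (%s)\n",
	       printed, printed_to_raw_mapcount(printed),
	       looks_like_buddy_page(printed) ? "buddy marker" : "not buddy marker");
}
```

Calling explain_mapcount(-127) reports the buddy marker. Under the stated assumptions, the page being released was already marked as free, which is what a double free would look like.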

> kernel: flags: 0x8000000000000000()
> kernel: page dumped because: nonzero mapcount
> kernel: Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek i2c_i801 snd_hda_codec_generic r8169 mii snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich processor mfd_core sch_fq_codel sd_mod hid_generic usb
> kernel: CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160615-dbg-00005-gfd11984-dirty #491
> kernel:  0000000000000000 ffff8801124c73f8 ffffffff814d69b0 ffffea0004076e00
> kernel:  ffffffff81e658a0 ffff8801124c7420 ffffffff811e9b63 0000000000000000
> kernel:  ffffea0004076e00 ffffffff81e658a0 ffff8801124c7440 ffffffff811e9ca9
> kernel: Call Trace:
> kernel:  [<ffffffff814d69b0>] dump_stack+0x68/0x92
> kernel:  [<ffffffff811e9b63>] bad_page+0x158/0x1a2
> kernel:  [<ffffffff811e9ca9>] free_pages_check_bad+0xfc/0x101
> kernel:  [<ffffffff811ee516>] free_hot_cold_page+0x135/0x5de
> kernel:  [<ffffffff811eea26>] __free_pages+0x67/0x72
> kernel:  [<ffffffff81227c63>] release_freepages+0x13a/0x191
> kernel:  [<ffffffff8122b3c2>] compact_zone+0x845/0x1155
> kernel:  [<ffffffff8122ab7d>] ? compaction_suitable+0x76/0x76
> kernel:  [<ffffffff8122bdb2>] compact_zone_order+0xe0/0x167
> kernel:  [<ffffffff8122bcd2>] ? compact_zone+0x1155/0x1155
> kernel:  [<ffffffff8122ce88>] try_to_compact_pages+0x2f1/0x648
> kernel:  [<ffffffff8122ce88>] ? try_to_compact_pages+0x2f1/0x648
> kernel:  [<ffffffff8122cb97>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> kernel:  [<ffffffff811ef1ea>] ? get_page_from_freelist+0x2c0/0x133c
> kernel:  [<ffffffff811f0350>] __alloc_pages_direct_compact+0xea/0x30d
> kernel:  [<ffffffff811f0266>] ? get_page_from_freelist+0x133c/0x133c
> kernel:  [<ffffffff811ee3b2>] ? drain_all_pages+0x1d6/0x205
> kernel:  [<ffffffff811f21a8>] __alloc_pages_nodemask+0x143d/0x16b6
> kernel:  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
> kernel:  [<ffffffff811f0d6b>] ? warn_alloc_failed+0x24c/0x24c
> kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> kernel:  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
> kernel:  [<ffffffff81d32ed0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
> kernel:  [<ffffffff81d32edc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> kernel:  [<ffffffff8128f73a>] khugepaged+0x1d4/0x484f
> kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> kernel:  [<ffffffff810d5bcc>] ? finish_task_switch+0x3de/0x484
> kernel:  [<ffffffff81d32f18>] ? _raw_spin_unlock_irq+0x27/0x45
> kernel:  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
> kernel:  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
> kernel:  [<ffffffff81d28bf5>] ? __schedule+0xa4d/0xd16
> kernel:  [<ffffffff810cd0de>] kthread+0x252/0x261
> kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> kernel:  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
> kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> -- Reboot --
> 
> 	-ss

^ permalink raw reply

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Sergey Senozhatsky @ 2016-06-16  4:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, dri-devel, virtualization, linux-mm,
	Chulmin Kim, Sangseok Lee, Konstantin Khlebnikov, Rafael Aquini,
	Jonathan Corbet, Hugh Dickins, Gioh Kim, Joonsoo Kim, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Naoya Horiguchi, Chan Gyun Jeong,
	linux-kernel, John Einar Reitan, Sergey Senozhatsky,
	Andrew Morton, Kyeongdon Kim
In-Reply-To: <20160616025800.GO17127@bbox>

On (06/16/16 11:58), Minchan Kim wrote:
[..]
> RAX: 2065676162726166 so rax is totally garbage, I think.
> It means obj_to_head returns garbage because get_first_obj_offset is
> utter crap because (page_idx / class->pages_per_zspage) was totally
> wrong.
> 
> > 					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >     6408:       f0 0f ba 28 00          lock btsl $0x0,(%rax)
>  
> <snip>
> 
> > > Could you test with [zsmalloc: keep first object offset in struct page]
> > > in mmotm?
> > 
> > sure, I can.  will it help, tho? we have a race condition here I think.
> 
> I guess the root cause is get_first_obj_offset.

sounds reasonable.

> Please test with it.


this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.

kernel: BUG: Bad page state in process khugepaged  pfn:101db8
kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
kernel: flags: 0x8000000000000000()
kernel: page dumped because: nonzero mapcount
kernel: Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek i2c_i801 snd_hda_codec_generic r8169 mii snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich processor mfd_core sch_fq_codel sd_mod hid_generic usb
kernel: CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160615-dbg-00005-gfd11984-dirty #491
kernel:  0000000000000000 ffff8801124c73f8 ffffffff814d69b0 ffffea0004076e00
kernel:  ffffffff81e658a0 ffff8801124c7420 ffffffff811e9b63 0000000000000000
kernel:  ffffea0004076e00 ffffffff81e658a0 ffff8801124c7440 ffffffff811e9ca9
kernel: Call Trace:
kernel:  [<ffffffff814d69b0>] dump_stack+0x68/0x92
kernel:  [<ffffffff811e9b63>] bad_page+0x158/0x1a2
kernel:  [<ffffffff811e9ca9>] free_pages_check_bad+0xfc/0x101
kernel:  [<ffffffff811ee516>] free_hot_cold_page+0x135/0x5de
kernel:  [<ffffffff811eea26>] __free_pages+0x67/0x72
kernel:  [<ffffffff81227c63>] release_freepages+0x13a/0x191
kernel:  [<ffffffff8122b3c2>] compact_zone+0x845/0x1155
kernel:  [<ffffffff8122ab7d>] ? compaction_suitable+0x76/0x76
kernel:  [<ffffffff8122bdb2>] compact_zone_order+0xe0/0x167
kernel:  [<ffffffff8122bcd2>] ? compact_zone+0x1155/0x1155
kernel:  [<ffffffff8122ce88>] try_to_compact_pages+0x2f1/0x648
kernel:  [<ffffffff8122ce88>] ? try_to_compact_pages+0x2f1/0x648
kernel:  [<ffffffff8122cb97>] ? compaction_zonelist_suitable+0x3a6/0x3a6
kernel:  [<ffffffff811ef1ea>] ? get_page_from_freelist+0x2c0/0x133c
kernel:  [<ffffffff811f0350>] __alloc_pages_direct_compact+0xea/0x30d
kernel:  [<ffffffff811f0266>] ? get_page_from_freelist+0x133c/0x133c
kernel:  [<ffffffff811ee3b2>] ? drain_all_pages+0x1d6/0x205
kernel:  [<ffffffff811f21a8>] __alloc_pages_nodemask+0x143d/0x16b6
kernel:  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
kernel:  [<ffffffff811f0d6b>] ? warn_alloc_failed+0x24c/0x24c
kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
kernel:  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
kernel:  [<ffffffff81d32ed0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
kernel:  [<ffffffff81d32edc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
kernel:  [<ffffffff8128f73a>] khugepaged+0x1d4/0x484f
kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
kernel:  [<ffffffff810d5bcc>] ? finish_task_switch+0x3de/0x484
kernel:  [<ffffffff81d32f18>] ? _raw_spin_unlock_irq+0x27/0x45
kernel:  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
kernel:  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
kernel:  [<ffffffff81d28bf5>] ? __schedule+0xa4d/0xd16
kernel:  [<ffffffff810cd0de>] kthread+0x252/0x261
kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
kernel:  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
-- Reboot --

	-ss

^ permalink raw reply

* Re: [PATCH v6v3 02/12] mm: migrate: support non-lru movable page migration
From: Anshuman Khandual @ 2016-06-16  3:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Jonathan Corbet,
	Hugh Dickins, linux-kernel, dri-devel, virtualization,
	John Einar Reitan, linux-mm, Gioh Kim, Mel Gorman, Andrew Morton,
	Joonsoo Kim, Vlastimil Babka
In-Reply-To: <20160616002617.GM17127@bbox>

On 06/16/2016 05:56 AM, Minchan Kim wrote:
> On Wed, Jun 15, 2016 at 12:15:04PM +0530, Anshuman Khandual wrote:
>> On 06/15/2016 08:02 AM, Minchan Kim wrote:
>>> Hi,
>>>
>>> On Mon, Jun 13, 2016 at 03:08:19PM +0530, Anshuman Khandual wrote:
>>>>> On 05/31/2016 05:31 AM, Minchan Kim wrote:
>>>>>>> @@ -791,6 +921,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>>>>>>>  	int rc = -EAGAIN;
>>>>>>>  	int page_was_mapped = 0;
>>>>>>>  	struct anon_vma *anon_vma = NULL;
>>>>>>> +	bool is_lru = !__PageMovable(page);
>>>>>>>  
>>>>>>>  	if (!trylock_page(page)) {
>>>>>>>  		if (!force || mode == MIGRATE_ASYNC)
>>>>>>> @@ -871,6 +1002,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>>>>>>>  		goto out_unlock_both;
>>>>>>>  	}
>>>>>>>  
>>>>>>> +	if (unlikely(!is_lru)) {
>>>>>>> +		rc = move_to_new_page(newpage, page, mode);
>>>>>>> +		goto out_unlock_both;
>>>>>>> +	}
>>>>>>> +
>>>>>
>>>>> Hello Minchan,
>>>>>
>>>>> I might be missing something here but does this implementation support the
>>>>> scenario where these non LRU pages owned by the driver mapped as PTE into
>>>>> process page table ? Because the "goto out_unlock_both" statement above
>>>>> skips all the PTE unmap, putting a migration PTE and removing the migration
>>>>> PTE steps.
>>> You're right. Unfortunately, it doesn't support right now but surely,
>>> it's my TODO after landing this work.
>>>
>>> Could you share your usecase?
>>
>> Sure.
> 
> Thanks a lot!
> 
>>
>> My driver has privately managed non LRU pages which gets mapped into user space
>> process page table through f_ops->mmap() and vmops->fault() which then updates
>> the file RMAP (page->mapping->i_mmap) through page_add_file_rmap(page). One thing
> 
> Hmm, page_add_file_rmap is not an exported function. How can your driver use it?

It's not using the function directly; I just reiterated the sequence of functions
above. (do_set_pte -> page_add_file_rmap) gets called after we grab the page from
driver through (__do_fault->vma->vm_ops->fault()).

> Do you use vm_insert_pfn?
> What type is your vma? VM_PFNMAP or VM_MIXEDMAP?

I don't use vm_insert_pfn(). Here is the sequence of events by which the user space
VMA gets the non LRU pages from the driver.

- Driver registers a character device with 'struct file_operations' binding
- Then the 'fops->mmap()' just binds the incoming 'struct vma' with a 'struct
  vm_operations_struct' which provides the 'vmops->fault()' routine which
  basically traps all page faults on the VMA and provides one page at a time
  through a driver specific allocation routine which hands over non LRU pages

The VMA is not anything special as such. It's what we get when we try to do a
simple mmap() on a file descriptor pointing to a character device. I can
figure out all the VM_* flags it holds after creation.

> 
> I want to make dummy driver to simulate your case.

Sure. I hope the above mentioned steps will help you but in case you need more
information, please do let me know.

> It would be very helpful to implement/test pte-mapped non-lru page
> migration feature. That's why I ask now.
> 
>> to note here is that the page->mapping eventually points to struct address_space
>> (file->f_mapping) which belongs to the character device file (created using mknod)
>> which we are using for establishing the mmap() regions in the user space.
>>
>> Now as per this new framework, all the pages are to be made __SetPageMovable before
>> passing the list down to migrate_pages(). Now __SetPageMovable() takes *new* struct
>> address_space as an argument and replaces the existing page->mapping. Now that's the
>> problem, we have lost all our connection to the existing file RMAP information. This
> 
> We could change __SetPageMovable so that it doesn't need the mapping argument.
> Instead, it just marks PAGE_MAPPING_MOVABLE into page->mapping.
> For that, user should take care of setting page->mapping earlier than
> marking the flag.

Sounds like a good idea; that way we don't lose the reverse mapping information.
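
The idea of marking the flag without replacing the mapping can be illustrated with a minimal user-space sketch of pointer tagging. This is not the kernel implementation. The mock_* types and helpers are hypothetical, and the flag value 0x2 is an assumption about PAGE_MAPPING_MOVABLE. The point is only that OR-ing a flag into the low bits of an already-set mapping pointer keeps the original address_space reachable.

```c
#include <stdint.h>

/* Assumption: the movable flag lives in a low bit of page->mapping. */
#define MOCK_MAPPING_MOVABLE 0x2UL

/* Hypothetical stand-ins for struct address_space and struct page. */
struct mock_address_space {
	int dummy;          /* int alignment leaves the two low pointer bits free */
};

struct mock_page {
	uintptr_t mapping;  /* pointer to the address space plus flag bits */
};

/* The caller sets the mapping first, as suggested in the thread. */
static void mock_set_mapping(struct mock_page *page,
			     struct mock_address_space *as)
{
	page->mapping = (uintptr_t)as;
}

/* Mark the page movable without touching the stored pointer bits. */
static void mock_set_page_movable(struct mock_page *page)
{
	page->mapping |= MOCK_MAPPING_MOVABLE;
}

static int mock_page_movable(const struct mock_page *page)
{
	return (page->mapping & MOCK_MAPPING_MOVABLE) != 0;
}

/* Strip the flag to recover the original address space pointer. */
static struct mock_address_space *mock_page_mapping(const struct mock_page *page)
{
	return (struct mock_address_space *)(page->mapping & ~MOCK_MAPPING_MOVABLE);
}
```

In this sketch the reverse-mapping owner (the address space) survives the movable marking, which is the property the file RMAP walk would need.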

> 
>> stands as a problem when we try to migrate these non LRU pages which are PTE mapped.
>> The rmap_walk_file() never finds them in the VMA, skips all the migrate PTE steps and
>> then the migration eventually fails.
>>
>> Seems like assigning a new struct address_space to the page through __SetPageMovable()
>> is the source of the problem. Can it take the existing (file->f_mapping) as an argument

> We can set existing file->f_mapping under the page_lock.

That's another option along with what you mentioned above.

> 
>> in there ? Sure, but then can we override file system generic ->isolate(), ->putback(),
> 
> I don't get it. Why does it override file system generic functions?

Sure, it does not; it was just a wild idea to overcome the problem.

^ permalink raw reply

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Minchan Kim @ 2016-06-16  2:58 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, Sergey Senozhatsky, Naoya Horiguchi,
	Jonathan Corbet, Chan Gyun Jeong, Rafael Aquini, Hugh Dickins,
	linux-kernel, dri-devel, virtualization, John Einar Reitan,
	linux-mm, Chulmin Kim, Gioh Kim, Konstantin Khlebnikov,
	Sangseok Lee, Andrew Morton, Kyeongdon Kim, Joonsoo Kim,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160616024827.GA497@swordfish>

On Thu, Jun 16, 2016 at 11:48:27AM +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> On (06/16/16 08:12), Minchan Kim wrote:
> > > [  315.146533] kasan: CONFIG_KASAN_INLINE enabled
> > > [  315.146538] kasan: GPF could be caused by NULL-ptr deref or user memory access
> > > [  315.146546] general protection fault: 0000 [#1] PREEMPT SMP KASAN
> > > [  315.146576] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
> > > [  315.146785] CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> > > [  315.146841] task: ffff8800bfaf2900 ti: ffff880112468000 task.ti: ffff880112468000
> > > [  315.146859] RIP: 0010:[<ffffffffa02c413d>]  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
> > 
> > Thanks for the report!
> > 
> > zs_page_migrate+0x355? Could you tell me what line is it?
> > 
> > It seems to be related to obj_to_head.
> 
> reproduced. a bit different call stack this time. but the problem is
> still the same.
> 
> zs_compact()
> ...
>     6371:       e8 00 00 00 00          callq  6376 <zs_compact+0x22b>
>     6376:       0f 0b                   ud2    
>     6378:       48 8b 95 a8 fe ff ff    mov    -0x158(%rbp),%rdx
>     637f:       4d 8d 74 24 78          lea    0x78(%r12),%r14
>     6384:       4c 89 ee                mov    %r13,%rsi
>     6387:       4c 89 e7                mov    %r12,%rdi
>     638a:       e8 86 c7 ff ff          callq  2b15 <get_first_obj_offset>
>     638f:       41 89 c5                mov    %eax,%r13d
>     6392:       4c 89 f0                mov    %r14,%rax
>     6395:       48 c1 e8 03             shr    $0x3,%rax
>     6399:       8a 04 18                mov    (%rax,%rbx,1),%al
>     639c:       84 c0                   test   %al,%al
>     639e:       0f 85 f2 02 00 00       jne    6696 <zs_compact+0x54b>
>     63a4:       41 8b 44 24 78          mov    0x78(%r12),%eax
>     63a9:       41 0f af c7             imul   %r15d,%eax
>     63ad:       41 01 c5                add    %eax,%r13d
>     63b0:       4c 89 f0                mov    %r14,%rax
>     63b3:       48 c1 e8 03             shr    $0x3,%rax
>     63b7:       48 01 d8                add    %rbx,%rax
>     63ba:       48 89 85 88 fe ff ff    mov    %rax,-0x178(%rbp)
>     63c1:       41 81 fd ff 0f 00 00    cmp    $0xfff,%r13d
>     63c8:       0f 87 1a 03 00 00       ja     66e8 <zs_compact+0x59d>
>     63ce:       49 63 f5                movslq %r13d,%rsi
>     63d1:       48 03 b5 98 fe ff ff    add    -0x168(%rbp),%rsi
>     63d8:       48 8b bd a8 fe ff ff    mov    -0x158(%rbp),%rdi
>     63df:       e8 67 d9 ff ff          callq  3d4b <obj_to_head>
>     63e4:       a8 01                   test   $0x1,%al
>     63e6:       0f 84 d9 02 00 00       je     66c5 <zs_compact+0x57a>
>     63ec:       48 83 e0 fe             and    $0xfffffffffffffffe,%rax
>     63f0:       bf 01 00 00 00          mov    $0x1,%edi
>     63f5:       48 89 85 b0 fe ff ff    mov    %rax,-0x150(%rbp)
>     63fc:       e8 00 00 00 00          callq  6401 <zs_compact+0x2b6>
>     6401:       48 8b 85 b0 fe ff ff    mov    -0x150(%rbp),%rax

RAX: 2065676162726166, so RAX is total garbage, I think.
It means obj_to_head returned garbage because get_first_obj_offset
returned a bogus offset, which in turn happened because
(page_idx / class->pages_per_zspage) was totally wrong.
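
As an aside on that RAX value: read byte by byte in x86-64 memory order
(least significant byte first), it looks like printable ASCII, which
would be consistent with object payload being picked up where a handle
was expected. A minimal userspace sketch of that decoding follows;
reg_to_ascii() is a hypothetical helper written for this illustration,
not kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): write the eight bytes of
 * a 64-bit register value into buf in little-endian memory order, with
 * non-printable bytes shown as '.'. buf must hold at least 9 bytes.
 */
static void reg_to_ascii(uint64_t reg, char buf[9])
{
	for (size_t i = 0; i < 8; i++) {
		/* Byte i of the value as it would be laid out in memory. */
		unsigned char c = (unsigned char)(reg >> (8 * i));

		buf[i] = (c >= 0x20 && c <= 0x7e) ? (char)c : '.';
	}
	buf[8] = '\0';
}
```

Calling reg_to_ascii(0x2065676162726166ULL, buf) leaves "farbage " in
buf. Since the `and $0xfffffffffffffffe,%rax` in the listing clears
bit 0 before the value is saved, the unmasked value presumably read
"garbage ".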

> 					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     6408:       f0 0f ba 28 00          lock btsl $0x0,(%rax)
 
<snip>

> > Could you test with [zsmalloc: keep first object offset in struct page]
> > in mmotm?
> 
> sure, I can.  will it help, tho? we have a race condition here I think.

I guess the root cause is in get_first_obj_offset.
Please test with it.

Thanks!


* Re: [PATCH v7 00/12] Support non-lru page migration
From: Sergey Senozhatsky @ 2016-06-16  2:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, dri-devel, virtualization, linux-mm,
	Chulmin Kim, Sangseok Lee, Konstantin Khlebnikov, Rafael Aquini,
	Jonathan Corbet, Hugh Dickins, Gioh Kim, Joonsoo Kim, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Naoya Horiguchi, Chan Gyun Jeong,
	linux-kernel, John Einar Reitan, Sergey Senozhatsky,
	Andrew Morton, Kyeongdon Kim
In-Reply-To: <20160615231248.GI17127@bbox>

Hi,

On (06/16/16 08:12), Minchan Kim wrote:
> > [  315.146533] kasan: CONFIG_KASAN_INLINE enabled
> > [  315.146538] kasan: GPF could be caused by NULL-ptr deref or user memory access
> > [  315.146546] general protection fault: 0000 [#1] PREEMPT SMP KASAN
> > [  315.146576] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
> > [  315.146785] CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> > [  315.146841] task: ffff8800bfaf2900 ti: ffff880112468000 task.ti: ffff880112468000
> > [  315.146859] RIP: 0010:[<ffffffffa02c413d>]  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
> 
> Thanks for the report!
> 
> zs_page_migrate+0x355? Could you tell me what line is it?
> 
> It seems to be related to obj_to_head.

reproduced. a bit different call stack this time. but the problem is
still the same.

zs_compact()
...
    6371:       e8 00 00 00 00          callq  6376 <zs_compact+0x22b>
    6376:       0f 0b                   ud2    
    6378:       48 8b 95 a8 fe ff ff    mov    -0x158(%rbp),%rdx
    637f:       4d 8d 74 24 78          lea    0x78(%r12),%r14
    6384:       4c 89 ee                mov    %r13,%rsi
    6387:       4c 89 e7                mov    %r12,%rdi
    638a:       e8 86 c7 ff ff          callq  2b15 <get_first_obj_offset>
    638f:       41 89 c5                mov    %eax,%r13d
    6392:       4c 89 f0                mov    %r14,%rax
    6395:       48 c1 e8 03             shr    $0x3,%rax
    6399:       8a 04 18                mov    (%rax,%rbx,1),%al
    639c:       84 c0                   test   %al,%al
    639e:       0f 85 f2 02 00 00       jne    6696 <zs_compact+0x54b>
    63a4:       41 8b 44 24 78          mov    0x78(%r12),%eax
    63a9:       41 0f af c7             imul   %r15d,%eax
    63ad:       41 01 c5                add    %eax,%r13d
    63b0:       4c 89 f0                mov    %r14,%rax
    63b3:       48 c1 e8 03             shr    $0x3,%rax
    63b7:       48 01 d8                add    %rbx,%rax
    63ba:       48 89 85 88 fe ff ff    mov    %rax,-0x178(%rbp)
    63c1:       41 81 fd ff 0f 00 00    cmp    $0xfff,%r13d
    63c8:       0f 87 1a 03 00 00       ja     66e8 <zs_compact+0x59d>
    63ce:       49 63 f5                movslq %r13d,%rsi
    63d1:       48 03 b5 98 fe ff ff    add    -0x168(%rbp),%rsi
    63d8:       48 8b bd a8 fe ff ff    mov    -0x158(%rbp),%rdi
    63df:       e8 67 d9 ff ff          callq  3d4b <obj_to_head>
    63e4:       a8 01                   test   $0x1,%al
    63e6:       0f 84 d9 02 00 00       je     66c5 <zs_compact+0x57a>
    63ec:       48 83 e0 fe             and    $0xfffffffffffffffe,%rax
    63f0:       bf 01 00 00 00          mov    $0x1,%edi
    63f5:       48 89 85 b0 fe ff ff    mov    %rax,-0x150(%rbp)
    63fc:       e8 00 00 00 00          callq  6401 <zs_compact+0x2b6>
    6401:       48 8b 85 b0 fe ff ff    mov    -0x150(%rbp),%rax
					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    6408:       f0 0f ba 28 00          lock btsl $0x0,(%rax)
    640d:       0f 82 98 02 00 00       jb     66ab <zs_compact+0x560>
    6413:       48 8b 85 10 fe ff ff    mov    -0x1f0(%rbp),%rax
    641a:       48 8d b8 48 10 00 00    lea    0x1048(%rax),%rdi
    6421:       48 89 f8                mov    %rdi,%rax
    6424:       48 c1 e8 03             shr    $0x3,%rax
    6428:       8a 04 18                mov    (%rax,%rbx,1),%al
    642b:       84 c0                   test   %al,%al
    642d:       0f 85 c5 02 00 00       jne    66f8 <zs_compact+0x5ad>
    6433:       48 8b 85 10 fe ff ff    mov    -0x1f0(%rbp),%rax
    643a:       65 4c 8b 2c 25 00 00    mov    %gs:0x0,%r13
    6441:       00 00 
    6443:       49 8d bd 48 10 00 00    lea    0x1048(%r13),%rdi
    644a:       ff 88 48 10 00 00       decl   0x1048(%rax)
    6450:       48 89 f8                mov    %rdi,%rax
    6453:       48 c1 e8 03             shr    $0x3,%rax
    6457:       8a 04 18                mov    (%rax,%rbx,1),%al
    645a:       84 c0                   test   %al,%al
    645c:       0f 85 a8 02 00 00       jne    670a <zs_compact+0x5bf>
    6462:       41 83 bd 48 10 00 00    cmpl   $0x0,0x1048(%r13)


which is

_next/./arch/x86/include/asm/bitops.h:206
_next/./arch/x86/include/asm/bitops.h:219
_next/include/linux/bit_spinlock.h:44
_next/mm/zsmalloc.c:950
_next/mm/zsmalloc.c:1774
_next/mm/zsmalloc.c:1809
_next/mm/zsmalloc.c:2306
_next/mm/zsmalloc.c:2346


smells like a race condition.
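
For context, the faulting `lock btsl $0x0,(%rax)` resolves to
bit_spinlock.h in the list above, i.e. an atomic test-and-set of bit 0
in the word that RAX points to. A rough userspace model with C11 atomics
is sketched below; the model_* names are hypothetical and this is not
the kernel's implementation. It is only meant to show that the lock
operation itself dereferences the pointer, so a garbage pointer faults
right at that instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model of a bit spinlock that lives in bit 0 of a word.
 * Returns true if the caller took the lock, false if it was already held.
 */
static bool model_bit_trylock(atomic_ulong *word)
{
	/* fetch_or returns the old value; bit 0 clear means we now own it. */
	unsigned long old = atomic_fetch_or_explicit(word, 1UL,
						     memory_order_acquire);

	return (old & 1UL) == 0;
}

/* Release the lock by clearing bit 0; the other bits are left untouched. */
static void model_bit_unlock(atomic_ulong *word)
{
	atomic_fetch_and_explicit(word, ~1UL, memory_order_release);
}
```

A spinning lock would simply retry model_bit_trylock() until it returns
true. Both helpers read and write *word, so they are only safe when word
points at a live handle.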



backtraces:

[  319.363646] kasan: CONFIG_KASAN_INLINE enabled
[  319.363650] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  319.363658] general protection fault: 0000 [#1] PREEMPT SMP KASAN
[  319.363688] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek snd_hda_codec_generic r8169 mii i2c_i801 snd_hda_intel snd_hda_codec snd_hda_core snd_pcm snd_timer acpi_cpufreq snd lpc_ich soundcore mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci ehci_pci libata ehci_hcd usbcore scsi_mod usb_common
[  319.363895] CPU: 0 PID: 45 Comm: kswapd0 Not tainted 4.7.0-rc3-next-20160615-dbg-00004-g550dc8a-dirty #490
[  319.363950] task: ffff8800bfb93d80 ti: ffff880112200000 task.ti: ffff880112200000
[  319.363968] RIP: 0010:[<ffffffffa03ce408>]  [<ffffffffa03ce408>] zs_compact+0x2bd/0xf22 [zsmalloc]
[  319.364000] RSP: 0018:ffff8801122077f8  EFLAGS: 00010293
[  319.364014] RAX: 2065676162726166 RBX: dffffc0000000000 RCX: 0000000000000000
[  319.364032] RDX: 1ffffffff064c504 RSI: ffff88003217c770 RDI: ffffffff83262ae0
[  319.364049] RBP: ffff880112207a18 R08: 0000000000000001 R09: 0000000000000000
[  319.364067] R10: ffff880112207768 R11: 00000000a19f2c26 R12: ffff8800a7caab00
[  319.364085] R13: 0000000000000770 R14: ffff8800a7caab78 R15: 0000000000000000
[  319.364103] FS:  0000000000000000(0000) GS:ffff880113600000(0000) knlGS:0000000000000000
[  319.364123] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  319.364138] CR2: 00007fa154633d70 CR3: 00000000b183d000 CR4: 00000000000006f0
[  319.364154] Stack:
[  319.364160]  ffffed00163d6a81 1ffff10017f729b9 ffff8800bfb944a0 ffffed0017f729b9
[  319.364191]  ffff8800bfb93d80 ffff8800b1eb5408 ffff8800bfb93d80 ffff8800bfb94dc8
[  319.364222]  ffff8800bfb944f8 ffff880000000001 1ffff10022440f1a 0000000041b58ab3
[  319.364252] Call Trace:
[  319.364264]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  319.364284]  [<ffffffffa03ce14b>] ? zs_free+0x27a/0x27a [zsmalloc]
[  319.364303]  [<ffffffff812303e3>] ? list_lru_count_one+0x65/0x6d
[  319.364320]  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
[  319.364336]  [<ffffffff812303b7>] ? list_lru_count_one+0x39/0x6d
[  319.364353]  [<ffffffff81d32e4f>] ? _raw_spin_unlock+0x2c/0x3f
[  319.364371]  [<ffffffffa03cf0a8>] zs_shrinker_scan+0x3b/0x4e [zsmalloc]
[  319.364391]  [<ffffffff81204eef>] shrink_slab.part.5.constprop.17+0x2e4/0x432
[  319.364411]  [<ffffffff81204c0b>] ? cpu_callback+0xb0/0xb0
[  319.364426]  [<ffffffff8120bfbc>] shrink_zone+0x19b/0x416
[  319.364442]  [<ffffffff8120be21>] ? shrink_zone_memcg.isra.14+0xd08/0xd08
[  319.364461]  [<ffffffff811f0b10>] ? zone_watermark_ok_safe+0x1e9/0x1f8
[  319.364478]  [<ffffffff81205fd7>] ? zone_reclaimable+0x14b/0x170
[  319.364495]  [<ffffffff8120d2fb>] kswapd+0xaad/0xcee
[  319.364510]  [<ffffffff8120c84e>] ? try_to_free_pages+0x617/0x617
[  319.364527]  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  319.364545]  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
[  319.364564]  [<ffffffff810cd0de>] kthread+0x252/0x261
[  319.364578]  [<ffffffff8120c84e>] ? try_to_free_pages+0x617/0x617
[  319.364595]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  319.364614]  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
[  319.364629]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  319.364645] Code: ff ff e8 67 d9 ff ff a8 01 0f 84 d9 02 00 00 48 83 e0 fe bf 01 00 00 00 48 89 85 b0 fe ff ff e8 71 78 d0 e0 48 8b 85 b0 fe ff ff <f0> 0f ba 28 00 0f 82 98 02 00 00 48 8b 85 10 fe ff ff 48 8d b8 
[  319.364913] RIP  [<ffffffffa03ce408>] zs_compact+0x2bd/0xf22 [zsmalloc]
[  319.364937]  RSP <ffff8801122077f8>
[  319.372870] ---[ end trace bcefd5a456f6b462 ]---



[  319.372875] BUG: sleeping function called from invalid context at include/linux/sched.h:2960
[  319.372877] in_atomic(): 1, irqs_disabled(): 0, pid: 45, name: kswapd0
[  319.372879] INFO: lockdep is turned off.
[  319.372880] Preemption disabled at:[<ffffffffa03ce2c3>] zs_compact+0x178/0xf22 [zsmalloc]

[  319.372891] CPU: 0 PID: 45 Comm: kswapd0 Tainted: G      D         4.7.0-rc3-next-20160615-dbg-00004-g550dc8a-dirty #490
[  319.372895]  0000000000000000 ffff880112207418 ffffffff814d69b0 ffff8800bfb93d80
[  319.372901]  0000000000000003 ffff880112207458 ffffffff810d6165 0000000000000000
[  319.372906]  ffff8800bfb93d80 ffffffff81e39860 0000000000000b90 0000000000000000
[  319.372911] Call Trace:
[  319.372915]  [<ffffffff814d69b0>] dump_stack+0x68/0x92
[  319.372919]  [<ffffffff810d6165>] ___might_sleep+0x3bd/0x3c9
[  319.372922]  [<ffffffff810d62cc>] __might_sleep+0x15b/0x167
[  319.372927]  [<ffffffff810ac7bf>] exit_signals+0x7a/0x34f
[  319.372931]  [<ffffffff810ac745>] ? get_signal+0xd9b/0xd9b
[  319.372934]  [<ffffffff811af758>] ? irq_work_queue+0x101/0x11c
[  319.372938]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  319.372943]  [<ffffffff81096655>] do_exit+0x34d/0x1b4e
[  319.372947]  [<ffffffff8113119f>] ? vprintk_emit+0x4b1/0x4d3
[  319.372951]  [<ffffffff81096308>] ? is_current_pgrp_orphaned+0x8c/0x8c
[  319.372954]  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
[  319.372957]  [<ffffffff81132578>] ? kmsg_dump+0x12/0x27a
[  319.372961]  [<ffffffff811327d1>] ? kmsg_dump+0x26b/0x27a
[  319.372965]  [<ffffffff81036507>] oops_end+0x9d/0xa4
[  319.372968]  [<ffffffff81036641>] die+0x55/0x5e
[  319.372971]  [<ffffffff81032aa0>] do_general_protection+0x16c/0x337
[  319.372975]  [<ffffffff81d34bbf>] general_protection+0x1f/0x30
[  319.372981]  [<ffffffffa03ce408>] ? zs_compact+0x2bd/0xf22 [zsmalloc]
[  319.372986]  [<ffffffffa03ce401>] ? zs_compact+0x2b6/0xf22 [zsmalloc]
[  319.372989]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  319.372995]  [<ffffffffa03ce14b>] ? zs_free+0x27a/0x27a [zsmalloc]
[  319.372999]  [<ffffffff812303e3>] ? list_lru_count_one+0x65/0x6d
[  319.373002]  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
[  319.373005]  [<ffffffff812303b7>] ? list_lru_count_one+0x39/0x6d
[  319.373009]  [<ffffffff81d32e4f>] ? _raw_spin_unlock+0x2c/0x3f
[  319.373014]  [<ffffffffa03cf0a8>] zs_shrinker_scan+0x3b/0x4e [zsmalloc]
[  319.373018]  [<ffffffff81204eef>] shrink_slab.part.5.constprop.17+0x2e4/0x432
[  319.373022]  [<ffffffff81204c0b>] ? cpu_callback+0xb0/0xb0
[  319.373025]  [<ffffffff8120bfbc>] shrink_zone+0x19b/0x416
[  319.373029]  [<ffffffff8120be21>] ? shrink_zone_memcg.isra.14+0xd08/0xd08
[  319.373032]  [<ffffffff811f0b10>] ? zone_watermark_ok_safe+0x1e9/0x1f8
[  319.373036]  [<ffffffff81205fd7>] ? zone_reclaimable+0x14b/0x170
[  319.373039]  [<ffffffff8120d2fb>] kswapd+0xaad/0xcee
[  319.373043]  [<ffffffff8120c84e>] ? try_to_free_pages+0x617/0x617
[  319.373046]  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  319.373050]  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
[  319.373054]  [<ffffffff810cd0de>] kthread+0x252/0x261
[  319.373057]  [<ffffffff8120c84e>] ? try_to_free_pages+0x617/0x617
[  319.373060]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  319.373064]  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
[  319.373068]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377


[  319.373071] note: kswapd0[45] exited with preempt_count 3
[  322.891083] kmemleak: Cannot allocate a kmemleak_object structure


[  322.891091] kmemleak: Kernel memory leak detector disabled
[  322.891194] kmemleak: Automatic memory scanning thread ended


[  344.264076] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u8:3:108]
[  344.264080] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek snd_hda_codec_generic r8169 mii i2c_i801 snd_hda_intel snd_hda_codec snd_hda_core snd_pcm snd_timer acpi_cpufreq snd lpc_ich soundcore mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci ehci_pci libata ehci_hcd usbcore scsi_mod usb_common
[  344.264118] irq event stamp: 13848655
[  344.264119] hardirqs last  enabled at (13848655): [<ffffffff8127dbd8>] __slab_alloc.isra.18.constprop.23+0x53/0x61
[  344.264127] hardirqs last disabled at (13848654): [<ffffffff8127db9e>] __slab_alloc.isra.18.constprop.23+0x19/0x61
[  344.264131] softirqs last  enabled at (13848614): [<ffffffff81d3565e>] __do_softirq+0x406/0x48f
[  344.264136] softirqs last disabled at (13848593): [<ffffffff81099448>] irq_exit+0x6a/0x113
[  344.264143] CPU: 1 PID: 108 Comm: kworker/u8:3 Tainted: G      D         4.7.0-rc3-next-20160615-dbg-00004-g550dc8a-dirty #490
[  344.264151] Workqueue: writeback wb_workfn (flush-254:0)
[  344.264155] task: ffff8800ba1c2900 ti: ffff8801122a0000 task.ti: ffff8801122a0000
[  344.264157] RIP: 0010:[<ffffffff814eeae3>]  [<ffffffff814eeae3>] delay_tsc+0x81/0xa4
[  344.264162] RSP: 0018:ffff8801122a70d0  EFLAGS: 00000206
[  344.264164] RAX: 000000000000001c RBX: 000000dc3a548e47 RCX: 0000000000000000
[  344.264166] RDX: 000000dc3a548e63 RSI: ffffffff81ed2e80 RDI: ffffffff81ed2ec0
[  344.264168] RBP: ffff8801122a70f0 R08: 0000000000000001 R09: 0000000000000000
[  344.264170] R10: ffff8801122a70e8 R11: 0000000045cb5d4f R12: 000000dc3a548e63
[  344.264172] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[  344.264175] FS:  0000000000000000(0000) GS:ffff880113680000(0000) knlGS:0000000000000000
[  344.264177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  344.264179] CR2: 00007fa26a978978 CR3: 0000000002209000 CR4: 00000000000006e0
[  344.264180] Stack:
[  344.264181]  ffff8800a7caab00 ffff8800a7caab10 ffff8800a7caab08 0000000022af534e
[  344.264186]  ffff8801122a7100 ffffffff814eeb8c ffff8801122a7148 ffffffff81127ce6
[  344.264191]  ffffed0014f95560 000000009e85cd68 ffff8800a7caab00 ffff8800a7caab58
[  344.264196] Call Trace:
[  344.264199]  [<ffffffff814eeb8c>] __delay+0xa/0xc
[  344.264203]  [<ffffffff81127ce6>] do_raw_spin_lock+0x197/0x257
[  344.264206]  [<ffffffff81d32d0d>] _raw_spin_lock+0x35/0x3c
[  344.264212]  [<ffffffffa03ccd78>] ? zs_malloc+0x17e/0xb71 [zsmalloc]
[  344.264217]  [<ffffffffa03ccd78>] zs_malloc+0x17e/0xb71 [zsmalloc]
[  344.264220]  [<ffffffffa0190204>] ? lzo_decompress+0x11d/0x11d [lzo]
[  344.264223]  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
[  344.264228]  [<ffffffffa03ccbfa>] ? obj_malloc+0x372/0x372 [zsmalloc]
[  344.264233]  [<ffffffff81472ff9>] ? crypto_compress+0x87/0x93
[  344.264238]  [<ffffffffa041522d>] zram_bvec_rw+0x1073/0x1638 [zram]
[  344.264243]  [<ffffffffa04141ba>] ? zram_slot_free_notify+0x1c8/0x1c8 [zram]
[  344.264247]  [<ffffffff812fc37b>] ? wb_writeback+0x316/0x44c
[  344.264251]  [<ffffffffa0416104>] zram_make_request+0x6f5/0x89f [zram]
[  344.264255]  [<ffffffff81111ef0>] ? woken_wake_function+0x51/0x51
[  344.264260]  [<ffffffffa0415a0f>] ? zram_rw_page+0x21d/0x21d [zram]
[  344.264263]  [<ffffffff81494948>] ? blk_exit_rl+0x39/0x39
[  344.264267]  [<ffffffff81491130>] ? handle_bad_sector+0x192/0x192
[  344.264271]  [<ffffffff811506a1>] ? call_rcu+0x12/0x14
[  344.264274]  [<ffffffff8129a684>] ? put_object+0x58/0x5b
[  344.264277]  [<ffffffff81496128>] generic_make_request+0x2bc/0x496
[  344.264280]  [<ffffffff81495e6c>] ? blk_plug_queued_count+0x103/0x103
[  344.264283]  [<ffffffff814965fa>] submit_bio+0x2f8/0x324
[  344.264286]  [<ffffffff81496302>] ? generic_make_request+0x496/0x496
[  344.264289]  [<ffffffff813aa993>] ? ext4_reserve_inode_write+0x101/0x101
[  344.264292]  [<ffffffff813b44e8>] ext4_io_submit+0x12d/0x15d
[  344.264295]  [<ffffffff813ac54d>] ext4_writepages+0x15f9/0x1660
[  344.264298]  [<ffffffff813aaf54>] ? ext4_mark_inode_dirty+0x5c1/0x5c1
[  344.264301]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  344.264304]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  344.264307]  [<ffffffff8111f9a4>] ? __lock_acquire+0x59f/0x33b8
[  344.264311]  [<ffffffff811fa6ea>] do_writepages+0x93/0xa1
[  344.264315]  [<ffffffff812fb7a0>] ? writeback_sb_inodes+0x270/0x85e
[  344.264317]  [<ffffffff811fa6ea>] ? do_writepages+0x93/0xa1
[  344.264321]  [<ffffffff812fb287>] __writeback_single_inode+0x8b/0x334
[  344.264324]  [<ffffffff812fb9c9>] writeback_sb_inodes+0x499/0x85e
[  344.264327]  [<ffffffff812fb530>] ? __writeback_single_inode+0x334/0x334
[  344.264331]  [<ffffffff81115e1c>] ? down_read_trylock+0x53/0xaf
[  344.264335]  [<ffffffff812a7398>] ? trylock_super+0x16/0xaf
[  344.264338]  [<ffffffff812fbe95>] __writeback_inodes_wb+0x107/0x17d
[  344.264341]  [<ffffffff812fc37b>] wb_writeback+0x316/0x44c
[  344.264345]  [<ffffffff812fc065>] ? writeback_inodes_wb.constprop.10+0x15a/0x15a
[  344.264348]  [<ffffffff811f837f>] ? wb_over_bg_thresh+0x110/0x194
[  344.264351]  [<ffffffff811f826f>] ? balance_dirty_pages_ratelimited+0x14f5/0x14f5
[  344.264354]  [<ffffffff812fce5d>] ? wb_workfn+0x296/0x6d6
[  344.264357]  [<ffffffff812fced4>] wb_workfn+0x30d/0x6d6
[  344.264360]  [<ffffffff812fced4>] ? wb_workfn+0x30d/0x6d6
[  344.264364]  [<ffffffff812fcbc7>] ? inode_wait_for_writeback+0x2e/0x2e
[  344.264368]  [<ffffffff810be6d0>] process_one_work+0x6f4/0xb2c
[  344.264371]  [<ffffffff810bdfdc>] ? pwq_dec_nr_in_flight+0x22b/0x22b
[  344.264375]  [<ffffffff810c0de0>] worker_thread+0x5bb/0x88e
[  344.264378]  [<ffffffff810cd0de>] kthread+0x252/0x261
[  344.264381]  [<ffffffff810c0825>] ? rescuer_thread+0x879/0x879
[  344.264383]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  344.264387]  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
[  344.264390]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  344.264392] Code: 14 6a b2 7e 85 c0 75 05 e8 8b 35 b1 ff f3 90 bf 01 00 00 00 e8 a1 71 be ff e8 e6 f3 01 00 44 39 f0 74 b6 4c 29 e3 49 01 dd eb 97 <bf> 01 00 00 00 e8 4c 81 be ff 65 8b 05 dc 69 b2 7e 85 c0 75 05 


> Could you test with [zsmalloc: keep first object offset in struct page]
> in mmotm?

sure, I can.  will it help, tho? we have a race condition here I think.

	-ss


* Re: [PATCH v6v3 02/12] mm: migrate: support non-lru movable page migration
From: Minchan Kim @ 2016-06-16  0:26 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Jonathan Corbet,
	Hugh Dickins, linux-kernel, dri-devel, virtualization,
	John Einar Reitan, linux-mm, Gioh Kim, Mel Gorman, Andrew Morton,
	Joonsoo Kim, Vlastimil Babka
In-Reply-To: <5760F970.7060805@linux.vnet.ibm.com>

On Wed, Jun 15, 2016 at 12:15:04PM +0530, Anshuman Khandual wrote:
> On 06/15/2016 08:02 AM, Minchan Kim wrote:
> > Hi,
> > 
> > On Mon, Jun 13, 2016 at 03:08:19PM +0530, Anshuman Khandual wrote:
> >> > On 05/31/2016 05:31 AM, Minchan Kim wrote:
> >>> > > @@ -791,6 +921,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >>> > >  	int rc = -EAGAIN;
> >>> > >  	int page_was_mapped = 0;
> >>> > >  	struct anon_vma *anon_vma = NULL;
> >>> > > +	bool is_lru = !__PageMovable(page);
> >>> > >  
> >>> > >  	if (!trylock_page(page)) {
> >>> > >  		if (!force || mode == MIGRATE_ASYNC)
> >>> > > @@ -871,6 +1002,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >>> > >  		goto out_unlock_both;
> >>> > >  	}
> >>> > >  
> >>> > > +	if (unlikely(!is_lru)) {
> >>> > > +		rc = move_to_new_page(newpage, page, mode);
> >>> > > +		goto out_unlock_both;
> >>> > > +	}
> >>> > > +
> >> > 
> >> > Hello Minchan,
> >> > 
> >> > I might be missing something here but does this implementation support the
> >> > scenario where these non LRU pages owned by the driver mapped as PTE into
> >> > process page table ? Because the "goto out_unlock_both" statement above
> >> > skips all the PTE unmap, putting a migration PTE and removing the migration
> >> > PTE steps.
> > You're right. Unfortunately, it doesn't support right now but surely,
> > it's my TODO after landing this work.
> > 
> > Could you share your usecase?
> 
> Sure.

Thanks a lot!

> 
> My driver has privately managed non LRU pages which gets mapped into user space
> process page table through f_ops->mmap() and vmops->fault() which then updates
> the file RMAP (page->mapping->i_mmap) through page_add_file_rmap(page). One thing

Hmm, page_add_file_rmap is not an exported function. How can your driver use it?
Do you use vm_insert_pfn?
What type is your vma? VM_PFNMAP or VM_MIXEDMAP?

I want to write a dummy driver to simulate your case.
It would be very helpful for implementing and testing the pte-mapped
non-lru page migration feature. That's why I'm asking now.

> to note here is that the page->mapping eventually points to struct address_space
> (file->f_mapping) which belongs to the character device file (created using mknod)
> which we are using for establishing the mmap() regions in the user space.
> 
> Now as per this new framework, all the page's are to be made __SetPageMovable before
> passing the list down to migrate_pages(). Now __SetPageMovable() takes *new* struct
> address_space as an argument and replaces the existing page->mapping. Now thats the
> problem, we have lost all our connection to the existing file RMAP information. This

We could change __SetPageMovable so that it doesn't need the mapping argument.
Instead, it would just set PAGE_MAPPING_MOVABLE in page->mapping.
For that, the user must take care to set page->mapping before
marking the flag.
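
A rough userspace model of that idea, only to illustrate the ordering
(mapping first, flag second). Every model_* name is hypothetical, and
the flag values are assumptions standing in for PAGE_MAPPING_MOVABLE and
the low-bit mask; this is not the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, standing in for the kernel's low-bit mapping flags. */
#define MODEL_MAPPING_MOVABLE	((uintptr_t)0x2)
#define MODEL_MAPPING_FLAGS	((uintptr_t)0x3)

struct model_mapping { int id; };		/* stand-in for struct address_space */
struct model_page { uintptr_t mapping; };	/* stand-in for struct page */

/* Step 1: the owner stores its existing mapping (e.g. file->f_mapping). */
static void model_set_mapping(struct model_page *page, struct model_mapping *m)
{
	page->mapping = (uintptr_t)m;
}

/* Step 2: mark the page movable without replacing the mapping pointer. */
static void model_set_movable(struct model_page *page)
{
	page->mapping |= MODEL_MAPPING_MOVABLE;
}

static bool model_is_movable(const struct model_page *page)
{
	return (page->mapping & MODEL_MAPPING_FLAGS) == MODEL_MAPPING_MOVABLE;
}

/* Strip the flag bits to get the original mapping pointer back. */
static struct model_mapping *model_page_mapping(const struct model_page *page)
{
	return (struct model_mapping *)(page->mapping & ~MODEL_MAPPING_FLAGS);
}
```

The model relies on the mapping object being at least 4-byte aligned, so
the two low bits of its address are free to carry flags. Because the
pointer survives under the flag, the existing rmap information would not
be lost in this scheme.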

> stands as a problem when we try to migrate these non LRU pages which are PTE mapped.
> The rmap_walk_file() never finds them in the VMA, skips all the migrate PTE steps and
> then the migration eventually fails.
> 
> Seems like assigning a new struct address_space to the page through __SetPageMovable()
> is the source of the problem. Can it take the existing (file->f_mapping) as an argument
We can set the existing file->f_mapping under the page lock.

> in there ? Sure, but then can we override file system generic ->isolate(), ->putback(),

I don't get it. Why does it override file system generic functions?

> ->migratepages() functions ? I dont think so. I am sure, there must be some work around
> to fix this problem for the driver. But we need to rethink this framework from supporting
> these mapped non LRU pages point of view.
> 
> I might be missing something here, feel free to point out.
> 
> - Anshuman
> 


* Re: [PATCH v7 00/12] Support non-lru page migration
From: Minchan Kim @ 2016-06-15 23:12 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, Sergey Senozhatsky, Naoya Horiguchi,
	Jonathan Corbet, Chan Gyun Jeong, Rafael Aquini, Hugh Dickins,
	linux-kernel, dri-devel, virtualization, John Einar Reitan,
	linux-mm, Chulmin Kim, Gioh Kim, Konstantin Khlebnikov,
	Sangseok Lee, Andrew Morton, Kyeongdon Kim, Joonsoo Kim,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160615075909.GA425@swordfish>

Hi Sergey,

On Wed, Jun 15, 2016 at 04:59:09PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> -next 4.7.0-rc3-next-20160614
> 
> 
> [  315.146533] kasan: CONFIG_KASAN_INLINE enabled
> [  315.146538] kasan: GPF could be caused by NULL-ptr deref or user memory access
> [  315.146546] general protection fault: 0000 [#1] PREEMPT SMP KASAN
> [  315.146576] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
> [  315.146785] CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> [  315.146841] task: ffff8800bfaf2900 ti: ffff880112468000 task.ti: ffff880112468000
> [  315.146859] RIP: 0010:[<ffffffffa02c413d>]  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]

Thanks for the report!

zs_page_migrate+0x355? Could you tell me what line is it?

It seems to be related to obj_to_head.

Could you test with [zsmalloc: keep first object offset in struct page]
in mmotm?


> [  315.146892] RSP: 0000:ffff88011246f138  EFLAGS: 00010293
> [  315.146906] RAX: 736761742d6f6e2c RBX: ffff880017ad9a80 RCX: 0000000000000000
> [  315.146924] RDX: 1ffffffff064d704 RSI: ffff88000511469a RDI: ffffffff8326ba20
> [  315.146942] RBP: ffff88011246f328 R08: 0000000000000001 R09: 0000000000000000
> [  315.146959] R10: ffff88011246f0a8 R11: ffff8800bfc07fff R12: ffff88011246f300
> [  315.146977] R13: ffffed0015523e6f R14: ffff8800aa91f378 R15: ffffea0000144500
> [  315.146995] FS:  0000000000000000(0000) GS:ffff880113780000(0000) knlGS:0000000000000000
> [  315.147015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  315.147030] CR2: 00007f3f97911000 CR3: 0000000002209000 CR4: 00000000000006e0
> [  315.147046] Stack:
> [  315.147052]  1ffff10015523e0f ffff88011246f240 ffff880005116800 00017f80e0000000
> [  315.147083]  ffff880017ad9aa8 736761742d6f6e2c 1ffff1002248de34 ffff880017ad9a90
> [  315.147113]  0000069a1246f660 000000000000069a ffff880005114000 ffffea0002ff0180
> [  315.147143] Call Trace:
> [  315.147154]  [<ffffffffa02c3de8>] ? obj_to_head+0x9d/0x9d [zsmalloc]
> [  315.147175]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> [  315.147195]  [<ffffffff812275b1>] ? isolate_freepages_block+0x2f9/0x5a6
> [  315.147213]  [<ffffffff8127f15c>] ? kasan_poison_shadow+0x2f/0x31
> [  315.147230]  [<ffffffff8127f66a>] ? kasan_alloc_pages+0x39/0x3b
> [  315.147246]  [<ffffffff812267e6>] ? map_pages+0x1f3/0x3ad
> [  315.147262]  [<ffffffff812265f3>] ? update_pageblock_skip+0x18d/0x18d
> [  315.147280]  [<ffffffff81115972>] ? up_read+0x1a/0x30
> [  315.147296]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
> [  315.147315]  [<ffffffff812842d1>] move_to_new_page+0x4dd/0x615
> [  315.147332]  [<ffffffff81283df4>] ? migrate_page+0x75/0x75
> [  315.147347]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
> [  315.147366]  [<ffffffff812851c1>] migrate_pages+0xadd/0x131a
> [  315.147382]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
> [  315.147399]  [<ffffffff81226375>] ? kzfree+0x2b/0x2b
> [  315.147414]  [<ffffffff812846e4>] ? buffer_migrate_page+0x2db/0x2db
> [  315.147431]  [<ffffffff8122a6cf>] compact_zone+0xcdb/0x1155
> [  315.147448]  [<ffffffff812299f4>] ? compaction_suitable+0x76/0x76
> [  315.147465]  [<ffffffff8122ac29>] compact_zone_order+0xe0/0x167
> [  315.147481]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  315.147499]  [<ffffffff8122ab49>] ? compact_zone+0x1155/0x1155
> [  315.147515]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  315.147533]  [<ffffffff8122bcff>] try_to_compact_pages+0x2f1/0x648
> [  315.147550]  [<ffffffff8122bcff>] ? try_to_compact_pages+0x2f1/0x648
> [  315.147568]  [<ffffffff8122ba0e>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> [  315.147589]  [<ffffffff811ee129>] ? get_page_from_freelist+0x2c0/0x129a
> [  315.147608]  [<ffffffff811ef1ed>] __alloc_pages_direct_compact+0xea/0x30d
> [  315.147626]  [<ffffffff811ef103>] ? get_page_from_freelist+0x129a/0x129a
> [  315.147645]  [<ffffffff811f0422>] __alloc_pages_nodemask+0x840/0x16b6
> [  315.147663]  [<ffffffff810dba27>] ? try_to_wake_up+0x696/0x6c8
> [  315.149147]  [<ffffffff811efbe2>] ? warn_alloc_failed+0x226/0x226
> [  315.150615]  [<ffffffff810dba69>] ? wake_up_process+0x10/0x12
> [  315.152078]  [<ffffffff810dbaf4>] ? wake_up_q+0x89/0xa7
> [  315.153539]  [<ffffffff81128b6f>] ? rwsem_wake+0x131/0x15c
> [  315.155007]  [<ffffffff812922e7>] ? khugepaged+0x4072/0x484f
> [  315.156471]  [<ffffffff8128e449>] khugepaged+0x1d4/0x484f
> [  315.157940]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
> [  315.159402]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  315.160870]  [<ffffffff81d31df8>] ? _raw_spin_unlock_irq+0x27/0x45
> [  315.162341]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
> [  315.163814]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
> [  315.165295]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
> [  315.166763]  [<ffffffff810ccde3>] kthread+0x252/0x261
> [  315.168214]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
> [  315.169646]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  315.171056]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
> [  315.172462]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  315.173869] Code: 03 b5 60 fe ff ff e8 2e fc ff ff a8 01 74 4c 48 83 e0 fe bf 01 00 00 00 48 89 85 38 fe ff ff e8 41 18 e1 e0 48 8b 85 38 fe ff ff <f0> 0f ba 28 00 73 29 bf 01 00 00 00 41 bc f5 ff ff ff e8 ea 27 
> [  315.175573] RIP  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
> [  315.177084]  RSP <ffff88011246f138>
> [  315.186572] ---[ end trace 0962b8ee48c98bbc ]---
> 
> 
> 
> 
> [  315.186577] BUG: sleeping function called from invalid context at include/linux/sched.h:2960
> [  315.186580] in_atomic(): 1, irqs_disabled(): 0, pid: 38, name: khugepaged
> [  315.186581] INFO: lockdep is turned off.
> [  315.186583] Preemption disabled at:[<ffffffffa02c3f1d>] zs_page_migrate+0x135/0xaa0 [zsmalloc]
> 
> [  315.186594] CPU: 3 PID: 38 Comm: khugepaged Tainted: G      D         4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> [  315.186599]  0000000000000000 ffff88011246ed58 ffffffff814d56bf ffff8800bfaf2900
> [  315.186604]  0000000000000004 ffff88011246ed98 ffffffff810d5e6a 0000000000000000
> [  315.186609]  ffff8800bfaf2900 ffffffff81e39820 0000000000000b90 0000000000000000
> [  315.186614] Call Trace:
> [  315.186618]  [<ffffffff814d56bf>] dump_stack+0x68/0x92
> [  315.186622]  [<ffffffff810d5e6a>] ___might_sleep+0x3bd/0x3c9
> [  315.186625]  [<ffffffff810d5fd1>] __might_sleep+0x15b/0x167
> [  315.186630]  [<ffffffff810ac4c1>] exit_signals+0x7a/0x34f
> [  315.186633]  [<ffffffff810ac447>] ? get_signal+0xd9b/0xd9b
> [  315.186636]  [<ffffffff811aee21>] ? irq_work_queue+0x101/0x11c
> [  315.186640]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  315.186645]  [<ffffffff81096357>] do_exit+0x34d/0x1b4e
> [  315.186648]  [<ffffffff81130e16>] ? vprintk_emit+0x4b1/0x4d3
> [  315.186652]  [<ffffffff8109600a>] ? is_current_pgrp_orphaned+0x8c/0x8c
> [  315.186655]  [<ffffffff81122c56>] ? lock_acquire+0xec/0x147
> [  315.186658]  [<ffffffff811321ef>] ? kmsg_dump+0x12/0x27a
> [  315.186662]  [<ffffffff81132448>] ? kmsg_dump+0x26b/0x27a
> [  315.186666]  [<ffffffff81036507>] oops_end+0x9d/0xa4
> [  315.186669]  [<ffffffff8103662c>] die+0x55/0x5e
> [  315.186672]  [<ffffffff81032aa0>] do_general_protection+0x16c/0x337
> [  315.186676]  [<ffffffff81d33abf>] general_protection+0x1f/0x30
> [  315.186681]  [<ffffffffa02c413d>] ? zs_page_migrate+0x355/0xaa0 [zsmalloc]
> [  315.186686]  [<ffffffffa02c4136>] ? zs_page_migrate+0x34e/0xaa0 [zsmalloc]
> [  315.186691]  [<ffffffffa02c3de8>] ? obj_to_head+0x9d/0x9d [zsmalloc]
> [  315.186695]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> [  315.186699]  [<ffffffff812275b1>] ? isolate_freepages_block+0x2f9/0x5a6
> [  315.186702]  [<ffffffff8127f15c>] ? kasan_poison_shadow+0x2f/0x31
> [  315.186706]  [<ffffffff8127f66a>] ? kasan_alloc_pages+0x39/0x3b
> [  315.186709]  [<ffffffff812267e6>] ? map_pages+0x1f3/0x3ad
> [  315.186712]  [<ffffffff812265f3>] ? update_pageblock_skip+0x18d/0x18d
> [  315.186716]  [<ffffffff81115972>] ? up_read+0x1a/0x30
> [  315.186719]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
> [  315.186723]  [<ffffffff812842d1>] move_to_new_page+0x4dd/0x615
> [  315.186726]  [<ffffffff81283df4>] ? migrate_page+0x75/0x75
> [  315.186730]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
> [  315.186733]  [<ffffffff812851c1>] migrate_pages+0xadd/0x131a
> [  315.186737]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
> [  315.186740]  [<ffffffff81226375>] ? kzfree+0x2b/0x2b
> [  315.186743]  [<ffffffff812846e4>] ? buffer_migrate_page+0x2db/0x2db
> [  315.186747]  [<ffffffff8122a6cf>] compact_zone+0xcdb/0x1155
> [  315.186751]  [<ffffffff812299f4>] ? compaction_suitable+0x76/0x76
> [  315.186754]  [<ffffffff8122ac29>] compact_zone_order+0xe0/0x167
> [  315.186757]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  315.186761]  [<ffffffff8122ab49>] ? compact_zone+0x1155/0x1155
> [  315.186764]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  315.186768]  [<ffffffff8122bcff>] try_to_compact_pages+0x2f1/0x648
> [  315.186771]  [<ffffffff8122bcff>] ? try_to_compact_pages+0x2f1/0x648
> [  315.186775]  [<ffffffff8122ba0e>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> [  315.186780]  [<ffffffff811ee129>] ? get_page_from_freelist+0x2c0/0x129a
> [  315.186783]  [<ffffffff811ef1ed>] __alloc_pages_direct_compact+0xea/0x30d
> [  315.186787]  [<ffffffff811ef103>] ? get_page_from_freelist+0x129a/0x129a
> [  315.186791]  [<ffffffff811f0422>] __alloc_pages_nodemask+0x840/0x16b6
> [  315.186794]  [<ffffffff810dba27>] ? try_to_wake_up+0x696/0x6c8
> [  315.186798]  [<ffffffff811efbe2>] ? warn_alloc_failed+0x226/0x226
> [  315.186801]  [<ffffffff810dba69>] ? wake_up_process+0x10/0x12
> [  315.186804]  [<ffffffff810dbaf4>] ? wake_up_q+0x89/0xa7
> [  315.186807]  [<ffffffff81128b6f>] ? rwsem_wake+0x131/0x15c
> [  315.186811]  [<ffffffff812922e7>] ? khugepaged+0x4072/0x484f
> [  315.186815]  [<ffffffff8128e449>] khugepaged+0x1d4/0x484f
> [  315.186819]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
> [  315.186822]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  315.186826]  [<ffffffff81d31df8>] ? _raw_spin_unlock_irq+0x27/0x45
> [  315.186829]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
> [  315.186832]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
> [  315.186836]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
> [  315.186840]  [<ffffffff810ccde3>] kthread+0x252/0x261
> [  315.186843]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
> [  315.186846]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  315.186851]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
> [  315.186854]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  315.186869] note: khugepaged[38] exited with preempt_count 4
> 
> 
> 
> [  340.319852] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [jbd2/zram0-8:405]
> [  340.319856] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
> [  340.319900] irq event stamp: 834296
> [  340.319902] hardirqs last  enabled at (834295): [<ffffffff81280b07>] quarantine_put+0xa1/0xe6
> [  340.319911] hardirqs last disabled at (834296): [<ffffffff81d31e68>] _raw_write_lock_irqsave+0x13/0x4c
> [  340.319917] softirqs last  enabled at (833836): [<ffffffff81d3455e>] __do_softirq+0x406/0x48f
> [  340.319922] softirqs last disabled at (833831): [<ffffffff8109914a>] irq_exit+0x6a/0x113
> [  340.319929] CPU: 2 PID: 405 Comm: jbd2/zram0-8 Tainted: G      D         4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> [  340.319935] task: ffff8800bb512900 ti: ffff8800a69c0000 task.ti: ffff8800a69c0000
> [  340.319937] RIP: 0010:[<ffffffff814ed772>]  [<ffffffff814ed772>] delay_tsc+0x0/0xa4
> [  340.319943] RSP: 0018:ffff8800a69c70f8  EFLAGS: 00000206
> [  340.319945] RAX: 0000000000000001 RBX: ffff8800aa91f300 RCX: 0000000000000000
> [  340.319947] RDX: 0000000000000003 RSI: ffffffff81ed2840 RDI: 0000000000000001
> [  340.319949] RBP: ffff8800a69c7100 R08: 0000000000000001 R09: 0000000000000000
> [  340.319951] R10: ffff8800a69c70e8 R11: 000000007e7516b9 R12: ffff8800aa91f310
> [  340.319954] R13: ffff8800aa91f308 R14: 000000001f3306fa R15: 0000000000000000
> [  340.319956] FS:  0000000000000000(0000) GS:ffff880113700000(0000) knlGS:0000000000000000
> [  340.319959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  340.319961] CR2: 00007fc99caba080 CR3: 00000000b9796000 CR4: 00000000000006e0
> [  340.319963] Stack:
> [  340.319964]  ffffffff814ed89c ffff8800a69c7148 ffffffff8112795d ffffed0015523e60
> [  340.319970]  000000009e857390 ffff8800aa91f300 ffff8800bbe21cc0 ffff8800047d6f80
> [  340.319975]  ffff8800a69c72b0 ffff8800aa91f300 ffff8800a69c7168 ffffffff81d31bed
> [  340.319980] Call Trace:
> [  340.319983]  [<ffffffff814ed89c>] ? __delay+0xa/0xc
> [  340.319988]  [<ffffffff8112795d>] do_raw_spin_lock+0x197/0x257
> [  340.319991]  [<ffffffff81d31bed>] _raw_spin_lock+0x35/0x3c
> [  340.319998]  [<ffffffffa02c6062>] ? zs_free+0x191/0x27a [zsmalloc]
> [  340.320003]  [<ffffffffa02c6062>] zs_free+0x191/0x27a [zsmalloc]
> [  340.320008]  [<ffffffffa02c5ed1>] ? free_zspage+0xe8/0xe8 [zsmalloc]
> [  340.320012]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  340.320015]  [<ffffffff810d58a6>] ? finish_task_switch+0x3b3/0x484
> [  340.320021]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
> [  340.320024]  [<ffffffff81d28086>] ? preempt_schedule+0x1f/0x21
> [  340.320028]  [<ffffffff81d27ff9>] ? preempt_schedule_common+0xb7/0xe8
> [  340.320034]  [<ffffffffa02d3f0e>] zram_free_page+0x112/0x1f6 [zram]
> [  340.320039]  [<ffffffffa02d5e6c>] zram_make_request+0x45d/0x89f [zram]
> [  340.320045]  [<ffffffffa02d5a0f>] ? zram_rw_page+0x21d/0x21d [zram]
> [  340.320048]  [<ffffffff81493657>] ? blk_exit_rl+0x39/0x39
> [  340.320053]  [<ffffffff8148fe3f>] ? handle_bad_sector+0x192/0x192
> [  340.320056]  [<ffffffff8127f83e>] ? kasan_slab_alloc+0x12/0x14
> [  340.320059]  [<ffffffff8127ca68>] ? kmem_cache_alloc+0xf3/0x101
> [  340.320062]  [<ffffffff81494e37>] generic_make_request+0x2bc/0x496
> [  340.320066]  [<ffffffff81494b7b>] ? blk_plug_queued_count+0x103/0x103
> [  340.320069]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
> [  340.320072]  [<ffffffff81495309>] submit_bio+0x2f8/0x324
> [  340.320075]  [<ffffffff81495011>] ? generic_make_request+0x496/0x496
> [  340.320078]  [<ffffffff811190fc>] ? lockdep_init_map+0x1ef/0x4b0
> [  340.320082]  [<ffffffff814880a4>] submit_bio_wait+0xff/0x138
> [  340.320085]  [<ffffffff81487fa5>] ? bio_add_page+0x292/0x292
> [  340.320090]  [<ffffffff814ab82c>] blkdev_issue_discard+0xee/0x148
> [  340.320093]  [<ffffffff814ab73e>] ? __blkdev_issue_discard+0x399/0x399
> [  340.320097]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  340.320101]  [<ffffffff81404de8>] ext4_free_data_callback+0x2cc/0x8bc
> [  340.320104]  [<ffffffff81404de8>] ? ext4_free_data_callback+0x2cc/0x8bc
> [  340.320107]  [<ffffffff81404b1c>] ? ext4_mb_release_context+0x10aa/0x10aa
> [  340.320111]  [<ffffffff81122c56>] ? lock_acquire+0xec/0x147
> [  340.320115]  [<ffffffff813c8a6a>] ? ext4_journal_commit_callback+0x203/0x220
> [  340.320119]  [<ffffffff813c8a61>] ext4_journal_commit_callback+0x1fa/0x220
> [  340.320124]  [<ffffffff81438bf5>] jbd2_journal_commit_transaction+0x3753/0x3c20
> [  340.320128]  [<ffffffff814354a2>] ? journal_submit_commit_record+0x777/0x777
> [  340.320132]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  340.320135]  [<ffffffff811205a5>] ? __lock_acquire+0x14f9/0x33b8
> [  340.320139]  [<ffffffff81d31db0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
> [  340.320143]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
> [  340.320146]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> [  340.320151]  [<ffffffff81156945>] ? try_to_del_timer_sync+0xa5/0xce
> [  340.320154]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
> [  340.320157]  [<ffffffff8143febd>] kjournald2+0x246/0x6e1
> [  340.320160]  [<ffffffff8143febd>] ? kjournald2+0x246/0x6e1
> [  340.320163]  [<ffffffff8143fc77>] ? commit_timeout+0xb/0xb
> [  340.320167]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
> [  340.320171]  [<ffffffff810ccde3>] kthread+0x252/0x261
> [  340.320174]  [<ffffffff8143fc77>] ? commit_timeout+0xb/0xb
> [  340.320177]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  340.320181]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
> [  340.320185]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  340.320186] Code: 5c 5d c3 55 48 8d 04 bd 00 00 00 00 65 48 8b 15 8d 59 b2 7e 48 69 d2 fa 00 00 00 48 89 e5 f7 e2 48 8d 7a 01 e8 22 01 00 00 5d c3 <55> 48 89 e5 41 56 41 55 41 54 53 49 89 fd bf 01 00 00 00 e8 ed 
> 
> 	-ss

^ permalink raw reply

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: Jamal Hadi Salim @ 2016-06-15 11:55 UTC (permalink / raw)
  To: Jason Wang, mst, netdev, linux-kernel, kvm, virtualization, davem
  Cc: eric.dumazet, brouer
In-Reply-To: <5761418F.2050407@mojatatu.com>

On 16-06-15 07:52 AM, Jamal Hadi Salim wrote:
> On 16-06-15 04:38 AM, Jason Wang wrote:
>> We used to queue tx packets in sk_receive_queue, this is less
>> efficient since it requires spinlocks to synchronize between producer

> 
> So this is more exercising the skb array improvements. For tun
> it would be useful to see general performance numbers on user/kernel
> crossing (i.e tun read/write).
> If you have the cycles can you run such tests?
> 

Ignore my message - you are running pktgen from a VM towards the host.
So the numbers you posted are what I was interested in.
Thanks for the good work.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: Jamal Hadi Salim @ 2016-06-15 11:52 UTC (permalink / raw)
  To: Jason Wang, mst, netdev, linux-kernel, kvm, virtualization, davem
  Cc: eric.dumazet, brouer
In-Reply-To: <1465979897-4445-1-git-send-email-jasowang@redhat.com>

On 16-06-15 04:38 AM, Jason Wang wrote:
> We used to queue tx packets in sk_receive_queue, this is less
> efficient since it requires spinlocks to synchronize between producer
> and consumer.
> 
> This patch tries to address this by:
> 
> - introduce a new mode which will be only enabled with IFF_TX_ARRAY
>    set and switch from sk_receive_queue to a fixed size of skb
>    array with 256 entries in this mode.
> - introduce a new proto_ops peek_len which was used for peeking the
>    skb length.
> - implement a tun version of peek_len for vhost_net to use and convert
>    vhost_net to use peek_len if possible.
> 
> Pktgen test shows about 18% improvement on guest receiving pps for small
> buffers:
> 
> Before: ~1220000pps
> After : ~1440000pps
> 

So this is more exercising the skb array improvements. For tun
it would be useful to see general performance numbers on user/kernel
crossing (i.e. tun read/write).
If you have the cycles, can you run such tests?

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v5 5/6] dcdbas: make use of smp_call_on_cpu()
From: Juergen Gross @ 2016-06-15 11:19 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: jeremy, jdelvare, peterz, hpa, akataria, x86, virtualization,
	chrisw, mingo, david.vrabel, Douglas_Warzecha, pali.rohar,
	boris.ostrovsky, tglx, linux
In-Reply-To: <1459952266-3687-6-git-send-email-jgross@suse.com>

On 06/04/16 16:17, Juergen Gross wrote:
> Use smp_call_on_cpu() to raise SMI on cpu 0.
> Make call secure by adding get_online_cpus() to avoid e.g. suspend
> resume cycles in between.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>

Could some maintainer please comment on this patch?


Juergen

> ---
> V4: add call to get_online_cpus()
> ---
>   drivers/firmware/dcdbas.c | 51 ++++++++++++++++++++++++-----------------------
>   1 file changed, 26 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/firmware/dcdbas.c b/drivers/firmware/dcdbas.c
> index 829eec8..2fe1a13 100644
> --- a/drivers/firmware/dcdbas.c
> +++ b/drivers/firmware/dcdbas.c
> @@ -23,6 +23,7 @@
>   #include <linux/platform_device.h>
>   #include <linux/dma-mapping.h>
>   #include <linux/errno.h>
> +#include <linux/cpu.h>
>   #include <linux/gfp.h>
>   #include <linux/init.h>
>   #include <linux/kernel.h>
> @@ -238,33 +239,14 @@ static ssize_t host_control_on_shutdown_store(struct device *dev,
>   	return count;
>   }
>
> -/**
> - * dcdbas_smi_request: generate SMI request
> - *
> - * Called with smi_data_lock.
> - */
> -int dcdbas_smi_request(struct smi_cmd *smi_cmd)
> +static int raise_smi(void *par)
>   {
> -	cpumask_var_t old_mask;
> -	int ret = 0;
> +	struct smi_cmd *smi_cmd = par;
>
> -	if (smi_cmd->magic != SMI_CMD_MAGIC) {
> -		dev_info(&dcdbas_pdev->dev, "%s: invalid magic value\n",
> -			 __func__);
> -		return -EBADR;
> -	}
> -
> -	/* SMI requires CPU 0 */
> -	if (!alloc_cpumask_var(&old_mask, GFP_KERNEL))
> -		return -ENOMEM;
> -
> -	cpumask_copy(old_mask, &current->cpus_allowed);
> -	set_cpus_allowed_ptr(current, cpumask_of(0));
>   	if (smp_processor_id() != 0) {
>   		dev_dbg(&dcdbas_pdev->dev, "%s: failed to get CPU 0\n",
>   			__func__);
> -		ret = -EBUSY;
> -		goto out;
> +		return -EBUSY;
>   	}
>
>   	/* generate SMI */
> @@ -280,9 +262,28 @@ int dcdbas_smi_request(struct smi_cmd *smi_cmd)
>   		: "memory"
>   	);
>
> -out:
> -	set_cpus_allowed_ptr(current, old_mask);
> -	free_cpumask_var(old_mask);
> +	return 0;
> +}
> +/**
> + * dcdbas_smi_request: generate SMI request
> + *
> + * Called with smi_data_lock.
> + */
> +int dcdbas_smi_request(struct smi_cmd *smi_cmd)
> +{
> +	int ret;
> +
> +	if (smi_cmd->magic != SMI_CMD_MAGIC) {
> +		dev_info(&dcdbas_pdev->dev, "%s: invalid magic value\n",
> +			 __func__);
> +		return -EBADR;
> +	}
> +
> +	/* SMI requires CPU 0 */
> +	get_online_cpus();
> +	ret = smp_call_on_cpu(0, raise_smi, smi_cmd, true);
> +	put_online_cpus();
> +
>   	return ret;
>   }
>
>

^ permalink raw reply

* Re: [PATCH v5 2/6] virt, sched: add generic vcpu pinning support
From: Juergen Gross @ 2016-06-15 11:18 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: jeremy, jdelvare, peterz, hpa, akataria, x86, virtualization,
	chrisw, mingo, david.vrabel, Douglas_Warzecha, pali.rohar,
	boris.ostrovsky, tglx, linux
In-Reply-To: <1459952266-3687-3-git-send-email-jgross@suse.com>

On 06/04/16 16:17, Juergen Gross wrote:
> Add generic virtualization support for pinning the current vcpu to a
> specified physical cpu. As this operation isn't performance critical
> (a very limited set of operations like BIOS calls and SMIs is expected
> to need this) just add a hypervisor specific indirection.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>

Could some maintainer please comment on this patch?


Juergen

> ---
> V4: move this patch some places up in the series
>      WARN_ONCE in case platform doesn't support pinning as requested by
>      Peter Zijlstra
>
> V3: use get_cpu()/put_cpu() as suggested by David Vrabel
>
> V2: adapt to using workqueues
>      add include/linux/hypervisor.h to hide architecture specific stuff
>      from generic kernel code
>
> In case paravirt maintainers don't want to be responsible for
> include/linux/hypervisor.h I could take it.
> ---
>   MAINTAINERS                       |  1 +
>   arch/x86/include/asm/hypervisor.h |  4 ++++
>   arch/x86/kernel/cpu/hypervisor.c  | 11 +++++++++++
>   include/linux/hypervisor.h        | 17 +++++++++++++++++
>   kernel/smp.c                      |  1 +
>   kernel/up.c                       |  1 +
>   6 files changed, 35 insertions(+)
>   create mode 100644 include/linux/hypervisor.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 40eb1db..d3bde4f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8330,6 +8330,7 @@ S:	Supported
>   F:	Documentation/virtual/paravirt_ops.txt
>   F:	arch/*/kernel/paravirt*
>   F:	arch/*/include/asm/paravirt.h
> +F:	include/linux/hypervisor.h
>
>   PARIDE DRIVERS FOR PARALLEL PORT IDE DEVICES
>   M:	Tim Waugh <tim@cyberelk.net>
> diff --git a/arch/x86/include/asm/hypervisor.h b/arch/x86/include/asm/hypervisor.h
> index 055ea99..67942b6 100644
> --- a/arch/x86/include/asm/hypervisor.h
> +++ b/arch/x86/include/asm/hypervisor.h
> @@ -43,6 +43,9 @@ struct hypervisor_x86 {
>
>   	/* X2APIC detection (run once per boot) */
>   	bool		(*x2apic_available)(void);
> +
> +	/* pin current vcpu to specified physical cpu (run rarely) */
> +	void		(*pin_vcpu)(int);
>   };
>
>   extern const struct hypervisor_x86 *x86_hyper;
> @@ -56,6 +59,7 @@ extern const struct hypervisor_x86 x86_hyper_kvm;
>   extern void init_hypervisor(struct cpuinfo_x86 *c);
>   extern void init_hypervisor_platform(void);
>   extern bool hypervisor_x2apic_available(void);
> +extern void hypervisor_pin_vcpu(int cpu);
>   #else
>   static inline void init_hypervisor(struct cpuinfo_x86 *c) { }
>   static inline void init_hypervisor_platform(void) { }
> diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
> index 73d391a..ff108f8 100644
> --- a/arch/x86/kernel/cpu/hypervisor.c
> +++ b/arch/x86/kernel/cpu/hypervisor.c
> @@ -85,3 +85,14 @@ bool __init hypervisor_x2apic_available(void)
>   	       x86_hyper->x2apic_available &&
>   	       x86_hyper->x2apic_available();
>   }
> +
> +void hypervisor_pin_vcpu(int cpu)
> +{
> +	if (!x86_hyper)
> +		return;
> +
> +	if (x86_hyper->pin_vcpu)
> +		x86_hyper->pin_vcpu(cpu);
> +	else
> +		WARN_ONCE(1, "vcpu pinning requested but not supported!\n");
> +}
> diff --git a/include/linux/hypervisor.h b/include/linux/hypervisor.h
> new file mode 100644
> index 0000000..3fa5ef2
> --- /dev/null
> +++ b/include/linux/hypervisor.h
> @@ -0,0 +1,17 @@
> +#ifndef __LINUX_HYPEVISOR_H
> +#define __LINUX_HYPEVISOR_H
> +
> +/*
> + *	Generic Hypervisor support
> + *		Juergen Gross <jgross@suse.com>
> + */
> +
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +#include <asm/hypervisor.h>
> +#else
> +static inline void hypervisor_pin_vcpu(int cpu)
> +{
> +}
> +#endif
> +
> +#endif /* __LINUX_HYPEVISOR_H */
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 7416544..9388064 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -14,6 +14,7 @@
>   #include <linux/smp.h>
>   #include <linux/cpu.h>
>   #include <linux/sched.h>
> +#include <linux/hypervisor.h>
>
>   #include "smpboot.h"
>
> diff --git a/kernel/up.c b/kernel/up.c
> index 1760bf3..3ccee2b 100644
> --- a/kernel/up.c
> +++ b/kernel/up.c
> @@ -6,6 +6,7 @@
>   #include <linux/kernel.h>
>   #include <linux/export.h>
>   #include <linux/smp.h>
> +#include <linux/hypervisor.h>
>
>   int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
>   				int wait)
>

^ permalink raw reply

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: kbuild test robot @ 2016-06-15 10:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: eric.dumazet, kvm, mst, netdev, linux-kernel, virtualization,
	kbuild-all, brouer, davem
In-Reply-To: <1465979897-4445-1-git-send-email-jasowang@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1049 bytes --]

Hi,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Jason-Wang/tun-introduce-tx-skb-ring/20160615-164041
config: x86_64-randconfig-s2-06151732 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> drivers/net/tun.c:74:29: fatal error: linux/skb_array.h: No such file or directory
    #include <linux/skb_array.h>
                                ^
   compilation terminated.

vim +74 drivers/net/tun.c

    68	#include <net/net_namespace.h>
    69	#include <net/netns/generic.h>
    70	#include <net/rtnetlink.h>
    71	#include <net/sock.h>
    72	#include <linux/seq_file.h>
    73	#include <linux/uio.h>
  > 74	#include <linux/skb_array.h>
    75	
    76	#include <asm/uaccess.h>
    77	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 30315 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH net-next V2] tun: introduce tx skb ring
From: Jason Wang @ 2016-06-15  8:38 UTC (permalink / raw)
  To: mst, netdev, linux-kernel, kvm, virtualization, davem
  Cc: eric.dumazet, brouer

We used to queue tx packets in sk_receive_queue. This is less
efficient since it requires spinlocks to synchronize between producer
and consumer.

This patch tries to address this by:

- introduce a new mode, enabled only when IFF_TX_ARRAY is set, that
  switches from sk_receive_queue to a fixed-size skb array with 256
  entries.
- introduce a new proto_ops callback, peek_len, which is used for
  peeking the skb length.
- implement a tun version of peek_len for vhost_net to use and convert
  vhost_net to use peek_len if possible.
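
The three points above can be pictured with a minimal user-space sketch.
It assumes a fixed-size FIFO of pointers in which a NULL slot means
"empty". It is not the skb_array code this patch depends on: struct pkt,
struct ring and the ring_*() helpers are hypothetical names used only
for illustration, and the sketch is single-threaded, so it leaves out
the locking and memory ordering the real implementation needs.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 256	/* mirrors TUN_RING_SIZE from the patch */

/* Hypothetical stand-in for struct sk_buff: only the length matters here. */
struct pkt {
	int len;
};

/* Fixed-size FIFO of pointers; a NULL slot means "empty". */
struct ring {
	struct pkt *slot[RING_SIZE];
	unsigned int head;	/* next slot the producer fills */
	unsigned int tail;	/* next slot the consumer drains */
};

static void ring_init(struct ring *r)
{
	unsigned int i;

	for (i = 0; i < RING_SIZE; i++)
		r->slot[i] = NULL;
	r->head = 0;
	r->tail = 0;
}

/* Returns 0 on success, -1 if the ring is full (the caller drops the packet). */
static int ring_produce(struct ring *r, struct pkt *p)
{
	assert(p != NULL);	/* NULL is reserved to mark an empty slot */
	if (r->slot[r->head])
		return -1;
	r->slot[r->head] = p;
	r->head = (r->head + 1) % RING_SIZE;
	return 0;
}

/* Returns the oldest packet, or NULL if the ring is empty. */
static struct pkt *ring_consume(struct ring *r)
{
	struct pkt *p = r->slot[r->tail];

	if (!p)
		return NULL;
	r->slot[r->tail] = NULL;
	r->tail = (r->tail + 1) % RING_SIZE;
	return p;
}

/* Length of the oldest packet without removing it, 0 if the ring is empty. */
static int ring_peek_len(const struct ring *r)
{
	const struct pkt *p = r->slot[r->tail];

	return p ? p->len : 0;
}
```

In this sketch the producer only touches head and the consumer only
touches tail, which is the property that lets an array-based queue avoid
a lock shared between the two sides. A full ring makes ring_produce()
fail, matching the "goto drop" path in tun_net_xmit() below.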

Pktgen test shows about 18% improvement on guest receiving pps for small
buffers:

Before: ~1220000pps
After : ~1440000pps
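
As a quick check of the "about 18%" figure, the relative gain can be
computed from the two rates above. improvement_pct() is a hypothetical
helper written only for this calculation, and the inputs are the
rounded numbers quoted here, not raw measurements.

```c
/* Relative improvement, in percent, of after_pps over before_pps. */
static double improvement_pct(double before_pps, double after_pps)
{
	return (after_pps - before_pps) / before_pps * 100.0;
}
```

With before_pps = 1220000 and after_pps = 1440000 this gives
220000 / 1220000, i.e. roughly 18%.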

I stick to a new mode because:

- though resize is supported by skb array, in multiqueue mode, it's
  not easy to recover from a partial success of queue resizing.
- tx_queue_len is a user visible feature.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
- The patch is based on [PATCH v8 0/5] skb_array: array based FIFO for skbs

Changes from V1:
- switch to use skb array instead of a customized circular buffer
- add non-blocking support
- rename .peek to .peek_len
- drop lockless peeking since tests show very minor improvement
---
 drivers/net/tun.c           | 138 ++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/net.c         |  16 ++++-
 include/linux/net.h         |   1 +
 include/uapi/linux/if_tun.h |   1 +
 4 files changed, 143 insertions(+), 13 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e16487c..b22e475 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -71,6 +71,7 @@
 #include <net/sock.h>
 #include <linux/seq_file.h>
 #include <linux/uio.h>
+#include <linux/skb_array.h>
 
 #include <asm/uaccess.h>
 
@@ -130,6 +131,7 @@ struct tap_filter {
 #define MAX_TAP_FLOWS  4096
 
 #define TUN_FLOW_EXPIRE (3 * HZ)
+#define TUN_RING_SIZE 256
 
 struct tun_pcpu_stats {
 	u64 rx_packets;
@@ -167,6 +169,7 @@ struct tun_file {
 	};
 	struct list_head next;
 	struct tun_struct *detached;
+	struct skb_array tx_array;
 };
 
 struct tun_flow_entry {
@@ -513,8 +516,15 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile)
 	return tun;
 }
 
-static void tun_queue_purge(struct tun_file *tfile)
+static void tun_queue_purge(struct tun_struct *tun, struct tun_file *tfile)
 {
+	struct sk_buff *skb;
+
+	if (tun->flags & IFF_TX_ARRAY) {
+		while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
+			kfree_skb(skb);
+	}
+
 	skb_queue_purge(&tfile->sk.sk_receive_queue);
 	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
@@ -545,7 +555,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 		synchronize_net();
 		tun_flow_delete_by_queue(tun, tun->numqueues + 1);
 		/* Drop read queue */
-		tun_queue_purge(tfile);
+		tun_queue_purge(tun, tfile);
 		tun_set_real_num_queues(tun);
 	} else if (tfile->detached && clean) {
 		tun = tun_enable_queue(tfile);
@@ -560,6 +570,8 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 			    tun->dev->reg_state == NETREG_REGISTERED)
 				unregister_netdevice(tun->dev);
 		}
+		if (tun && tun->flags & IFF_TX_ARRAY)
+			skb_array_cleanup(&tfile->tx_array);
 		sock_put(&tfile->sk);
 	}
 }
@@ -596,12 +608,12 @@ static void tun_detach_all(struct net_device *dev)
 	for (i = 0; i < n; i++) {
 		tfile = rtnl_dereference(tun->tfiles[i]);
 		/* Drop read queue */
-		tun_queue_purge(tfile);
+		tun_queue_purge(tun, tfile);
 		sock_put(&tfile->sk);
 	}
 	list_for_each_entry_safe(tfile, tmp, &tun->disabled, next) {
 		tun_enable_queue(tfile);
-		tun_queue_purge(tfile);
+		tun_queue_purge(tun, tfile);
 		sock_put(&tfile->sk);
 	}
 	BUG_ON(tun->numdisabled != 0);
@@ -642,6 +654,13 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte
 		if (!err)
 			goto out;
 	}
+
+	if (!tfile->detached && tun->flags & IFF_TX_ARRAY &&
+	    skb_array_init(&tfile->tx_array, TUN_RING_SIZE, GFP_KERNEL)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
 	tfile->queue_index = tun->numqueues;
 	tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
 	rcu_assign_pointer(tfile->tun, tun);
@@ -891,8 +910,13 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset(skb);
 
-	/* Enqueue packet */
-	skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
+	if (tun->flags & IFF_TX_ARRAY) {
+		if (skb_array_produce(&tfile->tx_array, skb))
+			goto drop;
+	} else {
+		/* Enqueue packet */
+		skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
+	}
 
 	/* Notify and wake up reader process */
 	if (tfile->flags & TUN_FASYNC)
@@ -1088,6 +1112,17 @@ static void tun_net_init(struct net_device *dev)
 	}
 }
 
+static int tun_queue_not_empty(struct tun_struct *tun,
+			       struct tun_file *tfile)
+{
+	struct sock *sk = tfile->socket.sk;
+
+	if (tun->flags & IFF_TX_ARRAY)
+		return !skb_array_empty(&tfile->tx_array);
+	else
+		return !skb_queue_empty(&sk->sk_receive_queue);
+}
+
 /* Character device part */
 
 /* Poll */
@@ -1107,7 +1142,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
 
 	poll_wait(file, sk_sleep(sk), wait);
 
-	if (!skb_queue_empty(&sk->sk_receive_queue))
+	if (tun_queue_not_empty(tun, tfile))
 		mask |= POLLIN | POLLRDNORM;
 
 	if (sock_writeable(sk) ||
@@ -1481,6 +1516,46 @@ done:
 	return total;
 }
 
+static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
+				     int *err)
+{
+	DECLARE_WAITQUEUE(wait, current);
+	struct sk_buff *skb = NULL;
+
+	skb = skb_array_consume(&tfile->tx_array);
+	if (skb)
+		goto out;
+	if (noblock) {
+		*err = -EAGAIN;
+		goto out;
+	}
+
+	add_wait_queue(&tfile->wq.wait, &wait);
+	current->state = TASK_INTERRUPTIBLE;
+
+	while (1) {
+		skb = skb_array_consume(&tfile->tx_array);
+		if (skb)
+			break;
+		if (signal_pending(current)) {
+			*err = -ERESTARTSYS;
+			break;
+		}
+		if (tfile->socket.sk->sk_shutdown & RCV_SHUTDOWN) {
+			*err = -EFAULT;
+			break;
+		}
+
+		schedule();
+	};
+
+	current->state = TASK_RUNNING;
+	remove_wait_queue(&tfile->wq.wait, &wait);
+
+out:
+	return skb;
+}
+
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			   struct iov_iter *to,
 			   int noblock)
@@ -1494,9 +1569,13 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 	if (!iov_iter_count(to))
 		return 0;
 
-	/* Read frames from queue */
-	skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
-				  &peeked, &off, &err);
+	if (tun->flags & IFF_TX_ARRAY)
+		skb = tun_ring_recv(tfile, noblock, &err);
+	else
+		/* Read frames from queue */
+		skb = __skb_recv_datagram(tfile->socket.sk,
+					  noblock ? MSG_DONTWAIT : 0,
+					  &peeked, &off, &err);
 	if (!skb)
 		return err;
 
@@ -1629,8 +1708,39 @@ out:
 	return ret;
 }
 
+static int tun_peek_len(struct socket *sock)
+{
+	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
+	struct sock *sk = sock->sk;
+	struct tun_struct *tun;
+	int ret = 0;
+
+	tun = __tun_get(tfile);
+	if (!tun)
+		return 0;
+
+	if (tun->flags & IFF_TX_ARRAY) {
+		ret = skb_array_peek_len(&tfile->tx_array);
+	} else {
+		struct sk_buff *head;
+
+		spin_lock_bh(&sk->sk_receive_queue.lock);
+		head = skb_peek(&sk->sk_receive_queue);
+		if (likely(head)) {
+			ret = head->len;
+			if (skb_vlan_tag_present(head))
+				ret += VLAN_HLEN;
+		}
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+	}
+
+	tun_put(tun);
+	return ret;
+}
+
 /* Ops structure to mimic raw sockets with tun */
 static const struct proto_ops tun_socket_ops = {
+	.peek_len = tun_peek_len,
 	.sendmsg = tun_sendmsg,
 	.recvmsg = tun_recvmsg,
 };
@@ -1643,7 +1753,8 @@ static struct proto tun_proto = {
 
 static int tun_flags(struct tun_struct *tun)
 {
-	return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN | IFF_TAP);
+	return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN |
+			     IFF_TAP | IFF_TX_ARRAY);
 }
 
 static ssize_t tun_show_flags(struct device *dev, struct device_attribute *attr,
@@ -1755,6 +1866,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		} else
 			return -EINVAL;
 
+		if (ifr->ifr_flags & IFF_TX_ARRAY)
+			flags |= IFF_TX_ARRAY;
+
 		if (*ifr->ifr_name)
 			name = ifr->ifr_name;
 
@@ -1995,7 +2109,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF.
 		 */
-		return put_user(IFF_TUN | IFF_TAP | TUN_FEATURES,
+		return put_user(IFF_TUN | IFF_TAP | IFF_TX_ARRAY | TUN_FEATURES,
 				(unsigned int __user*)argp);
 	} else if (cmd == TUNSETQUEUE)
 		return tun_set_queue(file, &ifr);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f744eeb..236ba52 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -455,10 +455,14 @@ out:
 
 static int peek_head_len(struct sock *sk)
 {
+	struct socket *sock = sk->sk_socket;
 	struct sk_buff *head;
 	int len = 0;
 	unsigned long flags;
 
+	if (sock->ops->peek_len)
+		return sock->ops->peek_len(sock);
+
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
 	if (likely(head)) {
@@ -471,6 +475,16 @@ static int peek_head_len(struct sock *sk)
 	return len;
 }
 
+static int sk_has_rx_data(struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+
+	if (sock->ops->peek_len)
+		return sock->ops->peek_len(sock);
+
+	return !skb_queue_empty(&sk->sk_receive_queue);
+}
+
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 {
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
@@ -487,7 +501,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 		endtime = busy_clock() + vq->busyloop_timeout;
 
 		while (vhost_can_busy_poll(&net->dev, endtime) &&
-		       skb_queue_empty(&sk->sk_receive_queue) &&
+		       !sk_has_rx_data(sk) &&
 		       vhost_vq_avail_empty(&net->dev, vq))
 			cpu_relax_lowlatency();
 
diff --git a/include/linux/net.h b/include/linux/net.h
index 9aa49a0..b6b3843 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -185,6 +185,7 @@ struct proto_ops {
 	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
 				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
 	int		(*set_peek_off)(struct sock *sk, int val);
+	int		(*peek_len)(struct socket *sock);
 };
 
 #define DECLARE_SOCKADDR(type, dst, src)	\
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 3cb5e1d..080003c 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -61,6 +61,7 @@
 #define IFF_TUN		0x0001
 #define IFF_TAP		0x0002
 #define IFF_NO_PI	0x1000
+#define IFF_TX_ARRAY	0x0010
 /* This flag has no real effect */
 #define IFF_ONE_QUEUE	0x2000
 #define IFF_VNET_HDR	0x4000
-- 
2.7.4
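
The vhost side of the patch above relies on an optional peek_len callback in
proto_ops and falls back to peeking sk_receive_queue when a socket does not
provide one. A minimal userspace sketch of that dispatch pattern follows. The
toy_ types and functions are hypothetical stand-ins for illustration, not the
kernel structures, and the queue and ring are reduced to a single head length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct socket / struct proto_ops. */
struct toy_socket;

struct toy_ops {
	/* Optional: returns the length of the head packet, 0 if none. */
	int (*peek_len)(struct toy_socket *sock);
};

struct toy_socket {
	const struct toy_ops *ops;
	int ring_head_len;	/* models the head of the tx skb array */
	int queue_head_len;	/* models the head of sk_receive_queue */
};

/* Models tun_peek_len(): the ring is consulted instead of the queue. */
static int toy_ring_peek_len(struct toy_socket *sock)
{
	return sock->ring_head_len;
}

/* Models peek_head_len(): prefer the callback, else fall back. */
static int toy_peek_head_len(struct toy_socket *sock)
{
	assert(sock != NULL && sock->ops != NULL);

	if (sock->ops->peek_len)
		return sock->ops->peek_len(sock);

	return sock->queue_head_len;
}

/* Non-zero when a packet is pending; a busy-poll loop spins while it is 0. */
static int toy_has_rx_data(struct toy_socket *sock)
{
	return toy_peek_head_len(sock) != 0;
}

static const struct toy_ops toy_ring_ops = { .peek_len = toy_ring_peek_len };
static const struct toy_ops toy_plain_ops = { .peek_len = NULL };
```

A socket using toy_ring_ops reports the ring head, while one using
toy_plain_ops reports the queue head, so callers never need to know which
backing store is in use.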

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Sergey Senozhatsky @ 2016-06-15  7:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Sergey Senozhatsky, Naoya Horiguchi,
	Jonathan Corbet, Chan Gyun Jeong, Rafael Aquini, Hugh Dickins,
	linux-kernel, dri-devel, virtualization, John Einar Reitan,
	linux-mm, Chulmin Kim, Gioh Kim, Konstantin Khlebnikov,
	Sangseok Lee, Andrew Morton, Kyeongdon Kim, Joonsoo Kim,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <1464736881-24886-1-git-send-email-minchan@kernel.org>

Hello Minchan,

-next 4.7.0-rc3-next-20160614


[  315.146533] kasan: CONFIG_KASAN_INLINE enabled
[  315.146538] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  315.146546] general protection fault: 0000 [#1] PREEMPT SMP KASAN
[  315.146576] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
[  315.146785] CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
[  315.146841] task: ffff8800bfaf2900 ti: ffff880112468000 task.ti: ffff880112468000
[  315.146859] RIP: 0010:[<ffffffffa02c413d>]  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
[  315.146892] RSP: 0000:ffff88011246f138  EFLAGS: 00010293
[  315.146906] RAX: 736761742d6f6e2c RBX: ffff880017ad9a80 RCX: 0000000000000000
[  315.146924] RDX: 1ffffffff064d704 RSI: ffff88000511469a RDI: ffffffff8326ba20
[  315.146942] RBP: ffff88011246f328 R08: 0000000000000001 R09: 0000000000000000
[  315.146959] R10: ffff88011246f0a8 R11: ffff8800bfc07fff R12: ffff88011246f300
[  315.146977] R13: ffffed0015523e6f R14: ffff8800aa91f378 R15: ffffea0000144500
[  315.146995] FS:  0000000000000000(0000) GS:ffff880113780000(0000) knlGS:0000000000000000
[  315.147015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  315.147030] CR2: 00007f3f97911000 CR3: 0000000002209000 CR4: 00000000000006e0
[  315.147046] Stack:
[  315.147052]  1ffff10015523e0f ffff88011246f240 ffff880005116800 00017f80e0000000
[  315.147083]  ffff880017ad9aa8 736761742d6f6e2c 1ffff1002248de34 ffff880017ad9a90
[  315.147113]  0000069a1246f660 000000000000069a ffff880005114000 ffffea0002ff0180
[  315.147143] Call Trace:
[  315.147154]  [<ffffffffa02c3de8>] ? obj_to_head+0x9d/0x9d [zsmalloc]
[  315.147175]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
[  315.147195]  [<ffffffff812275b1>] ? isolate_freepages_block+0x2f9/0x5a6
[  315.147213]  [<ffffffff8127f15c>] ? kasan_poison_shadow+0x2f/0x31
[  315.147230]  [<ffffffff8127f66a>] ? kasan_alloc_pages+0x39/0x3b
[  315.147246]  [<ffffffff812267e6>] ? map_pages+0x1f3/0x3ad
[  315.147262]  [<ffffffff812265f3>] ? update_pageblock_skip+0x18d/0x18d
[  315.147280]  [<ffffffff81115972>] ? up_read+0x1a/0x30
[  315.147296]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
[  315.147315]  [<ffffffff812842d1>] move_to_new_page+0x4dd/0x615
[  315.147332]  [<ffffffff81283df4>] ? migrate_page+0x75/0x75
[  315.147347]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
[  315.147366]  [<ffffffff812851c1>] migrate_pages+0xadd/0x131a
[  315.147382]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
[  315.147399]  [<ffffffff81226375>] ? kzfree+0x2b/0x2b
[  315.147414]  [<ffffffff812846e4>] ? buffer_migrate_page+0x2db/0x2db
[  315.147431]  [<ffffffff8122a6cf>] compact_zone+0xcdb/0x1155
[  315.147448]  [<ffffffff812299f4>] ? compaction_suitable+0x76/0x76
[  315.147465]  [<ffffffff8122ac29>] compact_zone_order+0xe0/0x167
[  315.147481]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  315.147499]  [<ffffffff8122ab49>] ? compact_zone+0x1155/0x1155
[  315.147515]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  315.147533]  [<ffffffff8122bcff>] try_to_compact_pages+0x2f1/0x648
[  315.147550]  [<ffffffff8122bcff>] ? try_to_compact_pages+0x2f1/0x648
[  315.147568]  [<ffffffff8122ba0e>] ? compaction_zonelist_suitable+0x3a6/0x3a6
[  315.147589]  [<ffffffff811ee129>] ? get_page_from_freelist+0x2c0/0x129a
[  315.147608]  [<ffffffff811ef1ed>] __alloc_pages_direct_compact+0xea/0x30d
[  315.147626]  [<ffffffff811ef103>] ? get_page_from_freelist+0x129a/0x129a
[  315.147645]  [<ffffffff811f0422>] __alloc_pages_nodemask+0x840/0x16b6
[  315.147663]  [<ffffffff810dba27>] ? try_to_wake_up+0x696/0x6c8
[  315.149147]  [<ffffffff811efbe2>] ? warn_alloc_failed+0x226/0x226
[  315.150615]  [<ffffffff810dba69>] ? wake_up_process+0x10/0x12
[  315.152078]  [<ffffffff810dbaf4>] ? wake_up_q+0x89/0xa7
[  315.153539]  [<ffffffff81128b6f>] ? rwsem_wake+0x131/0x15c
[  315.155007]  [<ffffffff812922e7>] ? khugepaged+0x4072/0x484f
[  315.156471]  [<ffffffff8128e449>] khugepaged+0x1d4/0x484f
[  315.157940]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
[  315.159402]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  315.160870]  [<ffffffff81d31df8>] ? _raw_spin_unlock_irq+0x27/0x45
[  315.162341]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  315.163814]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
[  315.165295]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
[  315.166763]  [<ffffffff810ccde3>] kthread+0x252/0x261
[  315.168214]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
[  315.169646]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  315.171056]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
[  315.172462]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  315.173869] Code: 03 b5 60 fe ff ff e8 2e fc ff ff a8 01 74 4c 48 83 e0 fe bf 01 00 00 00 48 89 85 38 fe ff ff e8 41 18 e1 e0 48 8b 85 38 fe ff ff <f0> 0f ba 28 00 73 29 bf 01 00 00 00 41 bc f5 ff ff ff e8 ea 27 
[  315.175573] RIP  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
[  315.177084]  RSP <ffff88011246f138>
[  315.186572] ---[ end trace 0962b8ee48c98bbc ]---




[  315.186577] BUG: sleeping function called from invalid context at include/linux/sched.h:2960
[  315.186580] in_atomic(): 1, irqs_disabled(): 0, pid: 38, name: khugepaged
[  315.186581] INFO: lockdep is turned off.
[  315.186583] Preemption disabled at:[<ffffffffa02c3f1d>] zs_page_migrate+0x135/0xaa0 [zsmalloc]

[  315.186594] CPU: 3 PID: 38 Comm: khugepaged Tainted: G      D         4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
[  315.186599]  0000000000000000 ffff88011246ed58 ffffffff814d56bf ffff8800bfaf2900
[  315.186604]  0000000000000004 ffff88011246ed98 ffffffff810d5e6a 0000000000000000
[  315.186609]  ffff8800bfaf2900 ffffffff81e39820 0000000000000b90 0000000000000000
[  315.186614] Call Trace:
[  315.186618]  [<ffffffff814d56bf>] dump_stack+0x68/0x92
[  315.186622]  [<ffffffff810d5e6a>] ___might_sleep+0x3bd/0x3c9
[  315.186625]  [<ffffffff810d5fd1>] __might_sleep+0x15b/0x167
[  315.186630]  [<ffffffff810ac4c1>] exit_signals+0x7a/0x34f
[  315.186633]  [<ffffffff810ac447>] ? get_signal+0xd9b/0xd9b
[  315.186636]  [<ffffffff811aee21>] ? irq_work_queue+0x101/0x11c
[  315.186640]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  315.186645]  [<ffffffff81096357>] do_exit+0x34d/0x1b4e
[  315.186648]  [<ffffffff81130e16>] ? vprintk_emit+0x4b1/0x4d3
[  315.186652]  [<ffffffff8109600a>] ? is_current_pgrp_orphaned+0x8c/0x8c
[  315.186655]  [<ffffffff81122c56>] ? lock_acquire+0xec/0x147
[  315.186658]  [<ffffffff811321ef>] ? kmsg_dump+0x12/0x27a
[  315.186662]  [<ffffffff81132448>] ? kmsg_dump+0x26b/0x27a
[  315.186666]  [<ffffffff81036507>] oops_end+0x9d/0xa4
[  315.186669]  [<ffffffff8103662c>] die+0x55/0x5e
[  315.186672]  [<ffffffff81032aa0>] do_general_protection+0x16c/0x337
[  315.186676]  [<ffffffff81d33abf>] general_protection+0x1f/0x30
[  315.186681]  [<ffffffffa02c413d>] ? zs_page_migrate+0x355/0xaa0 [zsmalloc]
[  315.186686]  [<ffffffffa02c4136>] ? zs_page_migrate+0x34e/0xaa0 [zsmalloc]
[  315.186691]  [<ffffffffa02c3de8>] ? obj_to_head+0x9d/0x9d [zsmalloc]
[  315.186695]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
[  315.186699]  [<ffffffff812275b1>] ? isolate_freepages_block+0x2f9/0x5a6
[  315.186702]  [<ffffffff8127f15c>] ? kasan_poison_shadow+0x2f/0x31
[  315.186706]  [<ffffffff8127f66a>] ? kasan_alloc_pages+0x39/0x3b
[  315.186709]  [<ffffffff812267e6>] ? map_pages+0x1f3/0x3ad
[  315.186712]  [<ffffffff812265f3>] ? update_pageblock_skip+0x18d/0x18d
[  315.186716]  [<ffffffff81115972>] ? up_read+0x1a/0x30
[  315.186719]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
[  315.186723]  [<ffffffff812842d1>] move_to_new_page+0x4dd/0x615
[  315.186726]  [<ffffffff81283df4>] ? migrate_page+0x75/0x75
[  315.186730]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
[  315.186733]  [<ffffffff812851c1>] migrate_pages+0xadd/0x131a
[  315.186737]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
[  315.186740]  [<ffffffff81226375>] ? kzfree+0x2b/0x2b
[  315.186743]  [<ffffffff812846e4>] ? buffer_migrate_page+0x2db/0x2db
[  315.186747]  [<ffffffff8122a6cf>] compact_zone+0xcdb/0x1155
[  315.186751]  [<ffffffff812299f4>] ? compaction_suitable+0x76/0x76
[  315.186754]  [<ffffffff8122ac29>] compact_zone_order+0xe0/0x167
[  315.186757]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  315.186761]  [<ffffffff8122ab49>] ? compact_zone+0x1155/0x1155
[  315.186764]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  315.186768]  [<ffffffff8122bcff>] try_to_compact_pages+0x2f1/0x648
[  315.186771]  [<ffffffff8122bcff>] ? try_to_compact_pages+0x2f1/0x648
[  315.186775]  [<ffffffff8122ba0e>] ? compaction_zonelist_suitable+0x3a6/0x3a6
[  315.186780]  [<ffffffff811ee129>] ? get_page_from_freelist+0x2c0/0x129a
[  315.186783]  [<ffffffff811ef1ed>] __alloc_pages_direct_compact+0xea/0x30d
[  315.186787]  [<ffffffff811ef103>] ? get_page_from_freelist+0x129a/0x129a
[  315.186791]  [<ffffffff811f0422>] __alloc_pages_nodemask+0x840/0x16b6
[  315.186794]  [<ffffffff810dba27>] ? try_to_wake_up+0x696/0x6c8
[  315.186798]  [<ffffffff811efbe2>] ? warn_alloc_failed+0x226/0x226
[  315.186801]  [<ffffffff810dba69>] ? wake_up_process+0x10/0x12
[  315.186804]  [<ffffffff810dbaf4>] ? wake_up_q+0x89/0xa7
[  315.186807]  [<ffffffff81128b6f>] ? rwsem_wake+0x131/0x15c
[  315.186811]  [<ffffffff812922e7>] ? khugepaged+0x4072/0x484f
[  315.186815]  [<ffffffff8128e449>] khugepaged+0x1d4/0x484f
[  315.186819]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
[  315.186822]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  315.186826]  [<ffffffff81d31df8>] ? _raw_spin_unlock_irq+0x27/0x45
[  315.186829]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  315.186832]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
[  315.186836]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
[  315.186840]  [<ffffffff810ccde3>] kthread+0x252/0x261
[  315.186843]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
[  315.186846]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  315.186851]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
[  315.186854]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  315.186869] note: khugepaged[38] exited with preempt_count 4



[  340.319852] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [jbd2/zram0-8:405]
[  340.319856] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
[  340.319900] irq event stamp: 834296
[  340.319902] hardirqs last  enabled at (834295): [<ffffffff81280b07>] quarantine_put+0xa1/0xe6
[  340.319911] hardirqs last disabled at (834296): [<ffffffff81d31e68>] _raw_write_lock_irqsave+0x13/0x4c
[  340.319917] softirqs last  enabled at (833836): [<ffffffff81d3455e>] __do_softirq+0x406/0x48f
[  340.319922] softirqs last disabled at (833831): [<ffffffff8109914a>] irq_exit+0x6a/0x113
[  340.319929] CPU: 2 PID: 405 Comm: jbd2/zram0-8 Tainted: G      D         4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
[  340.319935] task: ffff8800bb512900 ti: ffff8800a69c0000 task.ti: ffff8800a69c0000
[  340.319937] RIP: 0010:[<ffffffff814ed772>]  [<ffffffff814ed772>] delay_tsc+0x0/0xa4
[  340.319943] RSP: 0018:ffff8800a69c70f8  EFLAGS: 00000206
[  340.319945] RAX: 0000000000000001 RBX: ffff8800aa91f300 RCX: 0000000000000000
[  340.319947] RDX: 0000000000000003 RSI: ffffffff81ed2840 RDI: 0000000000000001
[  340.319949] RBP: ffff8800a69c7100 R08: 0000000000000001 R09: 0000000000000000
[  340.319951] R10: ffff8800a69c70e8 R11: 000000007e7516b9 R12: ffff8800aa91f310
[  340.319954] R13: ffff8800aa91f308 R14: 000000001f3306fa R15: 0000000000000000
[  340.319956] FS:  0000000000000000(0000) GS:ffff880113700000(0000) knlGS:0000000000000000
[  340.319959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  340.319961] CR2: 00007fc99caba080 CR3: 00000000b9796000 CR4: 00000000000006e0
[  340.319963] Stack:
[  340.319964]  ffffffff814ed89c ffff8800a69c7148 ffffffff8112795d ffffed0015523e60
[  340.319970]  000000009e857390 ffff8800aa91f300 ffff8800bbe21cc0 ffff8800047d6f80
[  340.319975]  ffff8800a69c72b0 ffff8800aa91f300 ffff8800a69c7168 ffffffff81d31bed
[  340.319980] Call Trace:
[  340.319983]  [<ffffffff814ed89c>] ? __delay+0xa/0xc
[  340.319988]  [<ffffffff8112795d>] do_raw_spin_lock+0x197/0x257
[  340.319991]  [<ffffffff81d31bed>] _raw_spin_lock+0x35/0x3c
[  340.319998]  [<ffffffffa02c6062>] ? zs_free+0x191/0x27a [zsmalloc]
[  340.320003]  [<ffffffffa02c6062>] zs_free+0x191/0x27a [zsmalloc]
[  340.320008]  [<ffffffffa02c5ed1>] ? free_zspage+0xe8/0xe8 [zsmalloc]
[  340.320012]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  340.320015]  [<ffffffff810d58a6>] ? finish_task_switch+0x3b3/0x484
[  340.320021]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
[  340.320024]  [<ffffffff81d28086>] ? preempt_schedule+0x1f/0x21
[  340.320028]  [<ffffffff81d27ff9>] ? preempt_schedule_common+0xb7/0xe8
[  340.320034]  [<ffffffffa02d3f0e>] zram_free_page+0x112/0x1f6 [zram]
[  340.320039]  [<ffffffffa02d5e6c>] zram_make_request+0x45d/0x89f [zram]
[  340.320045]  [<ffffffffa02d5a0f>] ? zram_rw_page+0x21d/0x21d [zram]
[  340.320048]  [<ffffffff81493657>] ? blk_exit_rl+0x39/0x39
[  340.320053]  [<ffffffff8148fe3f>] ? handle_bad_sector+0x192/0x192
[  340.320056]  [<ffffffff8127f83e>] ? kasan_slab_alloc+0x12/0x14
[  340.320059]  [<ffffffff8127ca68>] ? kmem_cache_alloc+0xf3/0x101
[  340.320062]  [<ffffffff81494e37>] generic_make_request+0x2bc/0x496
[  340.320066]  [<ffffffff81494b7b>] ? blk_plug_queued_count+0x103/0x103
[  340.320069]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
[  340.320072]  [<ffffffff81495309>] submit_bio+0x2f8/0x324
[  340.320075]  [<ffffffff81495011>] ? generic_make_request+0x496/0x496
[  340.320078]  [<ffffffff811190fc>] ? lockdep_init_map+0x1ef/0x4b0
[  340.320082]  [<ffffffff814880a4>] submit_bio_wait+0xff/0x138
[  340.320085]  [<ffffffff81487fa5>] ? bio_add_page+0x292/0x292
[  340.320090]  [<ffffffff814ab82c>] blkdev_issue_discard+0xee/0x148
[  340.320093]  [<ffffffff814ab73e>] ? __blkdev_issue_discard+0x399/0x399
[  340.320097]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  340.320101]  [<ffffffff81404de8>] ext4_free_data_callback+0x2cc/0x8bc
[  340.320104]  [<ffffffff81404de8>] ? ext4_free_data_callback+0x2cc/0x8bc
[  340.320107]  [<ffffffff81404b1c>] ? ext4_mb_release_context+0x10aa/0x10aa
[  340.320111]  [<ffffffff81122c56>] ? lock_acquire+0xec/0x147
[  340.320115]  [<ffffffff813c8a6a>] ? ext4_journal_commit_callback+0x203/0x220
[  340.320119]  [<ffffffff813c8a61>] ext4_journal_commit_callback+0x1fa/0x220
[  340.320124]  [<ffffffff81438bf5>] jbd2_journal_commit_transaction+0x3753/0x3c20
[  340.320128]  [<ffffffff814354a2>] ? journal_submit_commit_record+0x777/0x777
[  340.320132]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  340.320135]  [<ffffffff811205a5>] ? __lock_acquire+0x14f9/0x33b8
[  340.320139]  [<ffffffff81d31db0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
[  340.320143]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  340.320146]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
[  340.320151]  [<ffffffff81156945>] ? try_to_del_timer_sync+0xa5/0xce
[  340.320154]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  340.320157]  [<ffffffff8143febd>] kjournald2+0x246/0x6e1
[  340.320160]  [<ffffffff8143febd>] ? kjournald2+0x246/0x6e1
[  340.320163]  [<ffffffff8143fc77>] ? commit_timeout+0xb/0xb
[  340.320167]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
[  340.320171]  [<ffffffff810ccde3>] kthread+0x252/0x261
[  340.320174]  [<ffffffff8143fc77>] ? commit_timeout+0xb/0xb
[  340.320177]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  340.320181]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
[  340.320185]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  340.320186] Code: 5c 5d c3 55 48 8d 04 bd 00 00 00 00 65 48 8b 15 8d 59 b2 7e 48 69 d2 fa 00 00 00 48 89 e5 f7 e2 48 8d 7a 01 e8 22 01 00 00 5d c3 <55> 48 89 e5 41 56 41 55 41 54 53 49 89 fd bf 01 00 00 00 e8 ed 

	-ss
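
A possible triage aid for the first oops: RAX (which also appears in the stack
dump) holds 736761742d6f6e2c, which is not a canonical kernel address. The
sketch below is a standalone userspace helper with hypothetical names that
renders a register value as bytes in x86 memory (little-endian) order, so that
string-like contents become visible. Whether this points to an overwritten or
stale object head in zsmalloc is only a guess; the log does not confirm it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Render the eight bytes of value in little-endian (x86 memory) order.
 * Non-printable bytes become '.'. out must have room for 9 chars.
 */
static void reg_to_ascii(uint64_t value, char *out)
{
	assert(out != NULL);

	for (int i = 0; i < 8; i++) {
		unsigned char byte = (unsigned char)(value >> (8 * i));

		out[i] = (byte >= 0x20 && byte < 0x7f) ? (char)byte : '.';
	}
	out[8] = '\0';
}

/* Returns 1 when all eight bytes are printable ASCII, else 0. */
static int reg_looks_like_text(uint64_t value)
{
	for (int i = 0; i < 8; i++) {
		unsigned char byte = (unsigned char)(value >> (8 * i));

		if (byte < 0x20 || byte >= 0x7f)
			return 0;
	}
	return 1;
}
```

Applied to the RAX value above, reg_to_ascii() yields ",no-tags", while an
ordinary pointer such as RBX (ffff880017ad9a80) does not pass
reg_looks_like_text().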

* Re: [PATCH v6v3 02/12] mm: migrate: support non-lru movable page migration
From: Anshuman Khandual @ 2016-06-15  6:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Jonathan Corbet,
	Hugh Dickins, linux-kernel, dri-devel, virtualization,
	John Einar Reitan, linux-mm, Gioh Kim, Mel Gorman, Andrew Morton,
	Joonsoo Kim, Vlastimil Babka
In-Reply-To: <20160615023249.GG17127@bbox>

On 06/15/2016 08:02 AM, Minchan Kim wrote:
> Hi,
> 
> On Mon, Jun 13, 2016 at 03:08:19PM +0530, Anshuman Khandual wrote:
>> > On 05/31/2016 05:31 AM, Minchan Kim wrote:
>>> > > @@ -791,6 +921,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>>> > >  	int rc = -EAGAIN;
>>> > >  	int page_was_mapped = 0;
>>> > >  	struct anon_vma *anon_vma = NULL;
>>> > > +	bool is_lru = !__PageMovable(page);
>>> > >  
>>> > >  	if (!trylock_page(page)) {
>>> > >  		if (!force || mode == MIGRATE_ASYNC)
>>> > > @@ -871,6 +1002,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>>> > >  		goto out_unlock_both;
>>> > >  	}
>>> > >  
>>> > > +	if (unlikely(!is_lru)) {
>>> > > +		rc = move_to_new_page(newpage, page, mode);
>>> > > +		goto out_unlock_both;
>>> > > +	}
>>> > > +
>> > 
>> > Hello Minchan,
>> > 
>> > I might be missing something here but does this implementation support the
>> > scenario where these non LRU pages owned by the driver mapped as PTE into
>> > process page table ? Because the "goto out_unlock_both" statement above
>> > skips all the PTE unmap, putting a migration PTE and removing the migration
>> > PTE steps.
> You're right. Unfortunately, it isn't supported right now, but
> it's on my TODO list after landing this work.
> 
> Could you share your usecase?

Sure.

My driver has privately managed non-LRU pages which get mapped into the user space
process page table through f_ops->mmap() and vmops->fault(), which then update
the file RMAP (page->mapping->i_mmap) through page_add_file_rmap(page). One thing
to note here is that page->mapping eventually points to the struct address_space
(file->f_mapping) which belongs to the character device file (created using mknod)
which we are using for establishing the mmap() regions in user space.

As per this new framework, all the pages are to be marked with __SetPageMovable() before
passing the list down to migrate_pages(). But __SetPageMovable() takes a *new* struct
address_space as an argument and replaces the existing page->mapping. That's the
problem: we have lost all our connection to the existing file RMAP information. This
becomes a problem when we try to migrate these non-LRU pages which are PTE mapped.
rmap_walk_file() never finds them in the VMA, skips all the migration PTE steps, and
then the migration eventually fails.

It seems that assigning a new struct address_space to the page through __SetPageMovable()
is the source of the problem. Can it take the existing (file->f_mapping) as an argument
there? Sure, but then can we override the generic file system ->isolate_page(),
->putback_page() and ->migratepage() functions? I don't think so. I am sure there must
be some workaround to fix this problem for the driver. But we need to rethink this
framework from the point of view of supporting these mapped non-LRU pages.

I might be missing something here, feel free to point out.

- Anshuman
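
The problem described above can be modelled in a few lines of userspace C. This
is only a toy sketch under stated assumptions: the toy_ types are hypothetical,
and the flag value 0x2 is an assumed stand-in for the low-bit tagging of
page->mapping. It shows that once the movable mapping is installed, the
original file mapping can no longer be recovered from the page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; only the fields needed for the illustration. */
struct toy_address_space {
	int id;
};

struct toy_page {
	uintptr_t mapping;	/* pointer value plus low tag bits */
};

/* Assumed tag bit, mirroring the low-bit tagging of page->mapping. */
#define TOY_MAPPING_MOVABLE	((uintptr_t)0x2)
#define TOY_MAPPING_FLAGS	((uintptr_t)0x3)

/* Models __SetPageMovable(): the previous mapping value is overwritten. */
static void toy_set_page_movable(struct toy_page *page,
				 struct toy_address_space *mapping)
{
	assert(((uintptr_t)mapping & TOY_MAPPING_FLAGS) == 0);
	page->mapping = (uintptr_t)mapping | TOY_MAPPING_MOVABLE;
}

/* Non-zero when the page carries the movable tag. */
static int toy_page_movable(const struct toy_page *page)
{
	return (page->mapping & TOY_MAPPING_MOVABLE) != 0;
}

/* Returns the address_space pointer with the tag bits stripped. */
static struct toy_address_space *toy_page_mapping(const struct toy_page *page)
{
	return (struct toy_address_space *)(page->mapping & ~TOY_MAPPING_FLAGS);
}
```

With a page that initially points at the character device's mapping, calling
toy_set_page_movable() with a driver-private mapping leaves
toy_page_mapping() returning only the new one, which is why an rmap walk keyed
on the old mapping finds nothing.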

* Re: [PATCH v6v3 02/12] mm: migrate: support non-lru movable page migration
From: Minchan Kim @ 2016-06-15  2:32 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Jonathan Corbet,
	Hugh Dickins, linux-kernel, dri-devel, virtualization,
	John Einar Reitan, linux-mm, Gioh Kim, Mel Gorman, Andrew Morton,
	Joonsoo Kim, Vlastimil Babka
In-Reply-To: <575E7F0B.8010201@linux.vnet.ibm.com>

Hi,

On Mon, Jun 13, 2016 at 03:08:19PM +0530, Anshuman Khandual wrote:
> On 05/31/2016 05:31 AM, Minchan Kim wrote:
> > @@ -791,6 +921,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >  	int rc = -EAGAIN;
> >  	int page_was_mapped = 0;
> >  	struct anon_vma *anon_vma = NULL;
> > +	bool is_lru = !__PageMovable(page);
> >  
> >  	if (!trylock_page(page)) {
> >  		if (!force || mode == MIGRATE_ASYNC)
> > @@ -871,6 +1002,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >  		goto out_unlock_both;
> >  	}
> >  
> > +	if (unlikely(!is_lru)) {
> > +		rc = move_to_new_page(newpage, page, mode);
> > +		goto out_unlock_both;
> > +	}
> > +
> 
> Hello Minchan,
> 
> I might be missing something here but does this implementation support the
> scenario where these non LRU pages owned by the driver mapped as PTE into
> process page table ? Because the "goto out_unlock_both" statement above
> skips all the PTE unmap, putting a migration PTE and removing the migration
> PTE steps.

You're right. Unfortunately, it isn't supported right now, but
it's on my TODO list after landing this work.

Could you share your usecase?

It would be helpful for merging when I send the patchset.

Thanks!

* [PULL] vhost: docs/tests
From: Michael S. Tsirkin @ 2016-06-14 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm, mst, netdev, linux-kernel, virtualization, rppt, geert

The following changes since commit af8c34ce6ae32addda3788d54a7e340cad22516b:

  Linux 4.7-rc2 (2016-06-05 14:31:26 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 139ab4d4e68b8cf2a611b06c006a2195dc6bedf1:

  tools/virtio: add noring tool (2016-06-06 13:00:11 +0300)

----------------------------------------------------------------
virtio: docs, tests for 4.7

This merely has some documentation and a new test, seems safe to merge.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Geert Uytterhoeven (1):
      MAINTAINERS: Add file patterns for virtio device tree bindings

Michael S. Tsirkin (1):
      tools/virtio: add noring tool

Mike Rapoport (2):
      tools/virtio/ringtest: add usage example to README
      tools/virtio/ringtest: fix run-on-all.sh to work without /dev/cpu

 tools/virtio/ringtest/noring.c      | 69 +++++++++++++++++++++++++++++++++++++
 MAINTAINERS                         |  1 +
 tools/virtio/ringtest/Makefile      |  4 ++-
 tools/virtio/ringtest/README        |  4 +++
 tools/virtio/ringtest/run-on-all.sh |  4 +--
 5 files changed, 79 insertions(+), 3 deletions(-)
 create mode 100644 tools/virtio/ringtest/noring.c

* Re: [PATCH] virtio-gpu: use src not crtc
From: Gerd Hoffmann @ 2016-06-14 12:57 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: David Airlie, open list, dri-devel, open list:VIRTIO GPU DRIVER
In-Reply-To: <CAJ+F1CKQNpJ+=i3wq2JDJi8Z+tyK1=wBEmYTE=NfNJ20VKEoDA@mail.gmail.com>

On Tue, 2016-06-14 at 12:13 +0200, Marc-André Lureau wrote:
> Hi
> 
> On Tue, May 31, 2016 at 2:52 PM, Gerd Hoffmann <kraxel@redhat.com> wrote:
> > Pick up the correct source rectangle from framebuffer.
> > Without this multihead setups are not working correctly.
> >
> > Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> 
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> 
> Additionally, I had to modify the page_flip() function to take the
> plane source coordinates for virgl/3d multihead to work. Feel free to
> squash.

This is in the process of being sorted out by dropping the
virtio_gpu_page_flip function altogether in favor of atomic helpers,
which will use the (already fixed) plane callbacks instead.

See nonblocking commit support patches by Daniel Vetter on this list.

cheers,
  Gerd

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

* Re: [RFC PATCH V3 0/3] basic device IOTLB support
From: Jason Wang @ 2016-06-14 10:40 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx
In-Reply-To: <1464082585-13049-1-git-send-email-jasowang@redhat.com>



On 2016-05-24 17:36, Jason Wang wrote:
> This patch tries to implement a device IOTLB for vhost. This could be
> used in co-operation with a userspace IOMMU implementation (qemu)
> for a secure DMA environment (DMAR) in guest.
>
> The idea is simple. When vhost meets an IOTLB miss, it will request
> the assistance of userspace to do the translation, this is done
> through:
>
> - when there's an IOTLB miss, it will notify userspace through
>    vhost_net fd and then userspace reads the fault address, size and
>    access from vhost fd.
> - userspace writes the translation result back to vhost fd, vhost can
>    then update its IOTLB.
>
> The code is optimized for fixed mapping users, e.g. dpdk in guest. It
> will be slow if dynamic mappings are used in guest. We could do
> optimizations on top.
>
> The code is designed to be architecture independent. It should be
> easily ported to any architecture.
>
> Stress tested with l2fwd/vfio in guest with 4K/2M/1G page size. In the 1G
> hugepage case, a 100% TLB hit rate was observed.
>
> Changes from V2:
> - introduce memory accessors for vhost
> - switch from ioctls to ordinary file read/write for iotlb miss and
>    updating
> - do not assume virtqueues are virtually mapped contiguously, all
>    virtqueue accesses are done through the IOTLB
> - verify memory access during IOTLB update and fail early
> - introduce a module parameter for the size of IOTLB
>
> Changes from V1:
> - support any size/range of updating and invalidation through
>    introducing the interval tree.
> - convert from per device iotlb request to per virtqueue iotlb
>    request, this solves the possible deadlock in V1.
> - read/write permission check support.
>
> Please review.

I ran a benchmark on this. The test was done with l2fwd in the guest.

With 2MB pages there is no difference in 64B performance, and I see a 4%-5%
drop in 1500B performance compared to UIO in the guest. We could add a
shortcut to bypass the IOTLB for virtqueue accesses, but I think that is
better done on top.

>
> Jason Wang (3):
>    vhost: introduce vhost memory accessors
>    vhost: convert pre sorted vhost memory array to interval tree
>    vhost: device IOTLB API
>
>   drivers/vhost/net.c        |  63 +++-
>   drivers/vhost/vhost.c      | 760 ++++++++++++++++++++++++++++++++++++++-------
>   drivers/vhost/vhost.h      |  60 +++-
>   include/uapi/linux/vhost.h |  28 ++
>   4 files changed, 790 insertions(+), 121 deletions(-)
>
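
The miss/update flow described in the cover letter can be sketched as a small
userspace model. This is an illustration under stated assumptions, not the
vhost implementation: the toy_ names are hypothetical, a fixed array with
round-robin replacement stands in for the interval tree, and the "message" to
userspace is reduced to a struct that records the pending miss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_IOTLB_SIZE	4
#define TOY_ACCESS_RO	1
#define TOY_ACCESS_WO	2

struct toy_iotlb_entry {
	uint64_t iova;		/* start of the I/O virtual range */
	uint64_t size;		/* length of the range in bytes */
	uint64_t uaddr;		/* userspace address backing iova */
	int perm;		/* TOY_ACCESS_* bits allowed */
	int valid;
};

/* What userspace would read after a miss: faulting address and access. */
struct toy_iotlb_miss {
	uint64_t iova;
	int access;
	int pending;
};

struct toy_iotlb {
	struct toy_iotlb_entry entries[TOY_IOTLB_SIZE];
	struct toy_iotlb_miss miss;
	size_t next;		/* round-robin replacement cursor */
};

/* Returns 0 on a hit (and fills *uaddr), -1 on a miss (and records it). */
static int toy_iotlb_translate(struct toy_iotlb *tlb, uint64_t iova,
			       int access, uint64_t *uaddr)
{
	assert(tlb != NULL && uaddr != NULL);

	for (size_t i = 0; i < TOY_IOTLB_SIZE; i++) {
		const struct toy_iotlb_entry *e = &tlb->entries[i];

		if (e->valid && iova >= e->iova && iova - e->iova < e->size &&
		    (e->perm & access) == access) {
			*uaddr = e->uaddr + (iova - e->iova);
			return 0;
		}
	}

	tlb->miss.iova = iova;
	tlb->miss.access = access;
	tlb->miss.pending = 1;
	return -1;
}

/* Models userspace writing a translation back: install it, clear the miss. */
static void toy_iotlb_update(struct toy_iotlb *tlb, uint64_t iova,
			     uint64_t size, uint64_t uaddr, int perm)
{
	struct toy_iotlb_entry *e = &tlb->entries[tlb->next];

	tlb->next = (tlb->next + 1) % TOY_IOTLB_SIZE;
	e->iova = iova;
	e->size = size;
	e->uaddr = uaddr;
	e->perm = perm;
	e->valid = 1;
	tlb->miss.pending = 0;
}
```

With a fixed mapping (the dpdk-in-guest case), the first access misses, one
toy_iotlb_update() installs the range, and every later access within the range
hits, which matches the high hit rate reported for large pages.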

* Re: [PATCH] virtio-gpu: use src not crtc
From: Marc-André Lureau @ 2016-06-14 10:13 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: David Airlie, open list, dri-devel, open list:VIRTIO GPU DRIVER
In-Reply-To: <1464699172-26755-1-git-send-email-kraxel@redhat.com>

Hi

On Tue, May 31, 2016 at 2:52 PM, Gerd Hoffmann <kraxel@redhat.com> wrote:
> Pick up the correct source rectangle from framebuffer.
> Without this multihead setups are not working correctly.
>
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

Additionally, I had to modify the page_flip() function to take the
plane source coordinates for virgl/3d multihead to work. Feel free to
squash.

--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -68,11 +68,16 @@ static int virtio_gpu_page_flip(struct drm_crtc *crtc,
                         0, 0, NULL);
        }
        virtio_gpu_cmd_set_scanout(vgdev, output->index, handle,
-                                  crtc->mode.hdisplay,
-                                  crtc->mode.vdisplay, 0, 0);
-       virtio_gpu_cmd_resource_flush(vgdev, handle, 0, 0,
-                                     crtc->mode.hdisplay,
-                                     crtc->mode.vdisplay);
+                       plane->state->src_w >> 16,
+                       plane->state->src_h >> 16,
+                       plane->state->src_x >> 16,
+                       plane->state->src_y >> 16);
+
+       virtio_gpu_cmd_resource_flush(vgdev, handle,
+                       plane->state->src_x >> 16,
+                       plane->state->src_y >> 16,
+                       plane->state->src_w >> 16,
+                       plane->state->src_h >> 16);
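
For context on the ">> 16" shifts used here: the DRM plane state stores the
source rectangle (src_x, src_y, src_w, src_h) in 16.16 fixed point, while the
crtc_* fields are in whole pixels. A minimal standalone sketch of the
conversion follows; the toy_ helper names are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Whole pixels to 16.16 fixed point. */
static uint32_t toy_to_fixed16(uint32_t pixels)
{
	assert(pixels <= 0xffff);
	return pixels << 16;
}

/* 16.16 fixed point to whole pixels, dropping the fractional part. */
static uint32_t toy_fixed16_to_pixels(uint32_t fixed)
{
	return fixed >> 16;
}
```

A 1024-pixel-wide source is stored as 1024 << 16, so passing the raw value to
a command that expects pixels would be off by a factor of 65536.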

> ---
>  drivers/gpu/drm/virtio/virtgpu_plane.c | 31 ++++++++++++++++++-------------
>  1 file changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/virtio/virtgpu_plane.c b/drivers/gpu/drm/virtio/virtgpu_plane.c
> index b7778a7..925ca25 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_plane.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_plane.c
> @@ -85,27 +85,32 @@ static void virtio_gpu_primary_plane_update(struct drm_plane *plane,
>                 if (bo->dumb) {
>                         virtio_gpu_cmd_transfer_to_host_2d
>                                 (vgdev, handle, 0,
> -                                cpu_to_le32(plane->state->crtc_w),
> -                                cpu_to_le32(plane->state->crtc_h),
> -                                plane->state->crtc_x, plane->state->crtc_y, NULL);
> +                                cpu_to_le32(plane->state->src_w >> 16),
> +                                cpu_to_le32(plane->state->src_h >> 16),
> +                                plane->state->src_x >> 16,
> +                                plane->state->src_y >> 16, NULL);
>                 }
>         } else {
>                 handle = 0;
>         }
>
> -       DRM_DEBUG("handle 0x%x, crtc %dx%d+%d+%d\n", handle,
> +       DRM_DEBUG("handle 0x%x, crtc %dx%d+%d+%d, src %dx%d+%d+%d\n", handle,
>                   plane->state->crtc_w, plane->state->crtc_h,
> -                 plane->state->crtc_x, plane->state->crtc_y);
> +                 plane->state->crtc_x, plane->state->crtc_y,
> +                 plane->state->src_w >> 16,
> +                 plane->state->src_h >> 16,
> +                 plane->state->src_x >> 16,
> +                 plane->state->src_y >> 16);
>         virtio_gpu_cmd_set_scanout(vgdev, output->index, handle,
> -                                  plane->state->crtc_w,
> -                                  plane->state->crtc_h,
> -                                  plane->state->crtc_x,
> -                                  plane->state->crtc_y);
> +                                  plane->state->src_w >> 16,
> +                                  plane->state->src_h >> 16,
> +                                  plane->state->src_x >> 16,
> +                                  plane->state->src_y >> 16);
>         virtio_gpu_cmd_resource_flush(vgdev, handle,
> -                                     plane->state->crtc_x,
> -                                     plane->state->crtc_y,
> -                                     plane->state->crtc_w,
> -                                     plane->state->crtc_h);
> +                                     plane->state->src_x >> 16,
> +                                     plane->state->src_y >> 16,
> +                                     plane->state->src_w >> 16,
> +                                     plane->state->src_h >> 16);
>  }
>
>  static void virtio_gpu_cursor_plane_update(struct drm_plane *plane,
> --
> 1.8.3.1
>



-- 
Marc-André Lureau

^ permalink raw reply

* [PATCH v8 2/5] ptr_ring: ring test
From: Michael S. Tsirkin @ 2016-06-13 20:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, Eric Dumazet, netdev, Steven Rostedt, virtualization, brouer,
	davem
In-Reply-To: <1465851234-13558-1-git-send-email-mst@redhat.com>

Add ringtest based unit test for ptr ring.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/ringtest/ptr_ring.c | 192 +++++++++++++++++++++++++++++++++++++++
 tools/virtio/ringtest/Makefile   |   5 +-
 2 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 tools/virtio/ringtest/ptr_ring.c

diff --git a/tools/virtio/ringtest/ptr_ring.c b/tools/virtio/ringtest/ptr_ring.c
new file mode 100644
index 0000000..74abd74
--- /dev/null
+++ b/tools/virtio/ringtest/ptr_ring.c
@@ -0,0 +1,192 @@
+#define _GNU_SOURCE
+#include "main.h"
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <pthread.h>
+#include <malloc.h>
+#include <assert.h>
+#include <errno.h>
+#include <limits.h>
+
+#define SMP_CACHE_BYTES 64
+#define cache_line_size() SMP_CACHE_BYTES
+#define ____cacheline_aligned_in_smp __attribute__ ((aligned (SMP_CACHE_BYTES)))
+#define unlikely(x)    (__builtin_expect(!!(x), 0))
+#define ALIGN(x, a) (((x) + (a) - 1) / (a) * (a))
+typedef pthread_spinlock_t  spinlock_t;
+
+typedef int gfp_t;
+static void *kzalloc(unsigned size, gfp_t gfp)
+{
+	void *p = memalign(64, size);
+	if (!p)
+		return p;
+	memset(p, 0, size);
+
+	return p;
+}
+
+static void kfree(void *p)
+{
+	if (p)
+		free(p);
+}
+
+static void spin_lock_init(spinlock_t *lock)
+{
+	int r = pthread_spin_init(lock, 0);
+	assert(!r);
+}
+
+static void spin_lock(spinlock_t *lock)
+{
+	int ret = pthread_spin_lock(lock);
+	assert(!ret);
+}
+
+static void spin_unlock(spinlock_t *lock)
+{
+	int ret = pthread_spin_unlock(lock);
+	assert(!ret);
+}
+
+static void spin_lock_bh(spinlock_t *lock)
+{
+	spin_lock(lock);
+}
+
+static void spin_unlock_bh(spinlock_t *lock)
+{
+	spin_unlock(lock);
+}
+
+static void spin_lock_irq(spinlock_t *lock)
+{
+	spin_lock(lock);
+}
+
+static void spin_unlock_irq(spinlock_t *lock)
+{
+	spin_unlock(lock);
+}
+
+static void spin_lock_irqsave(spinlock_t *lock, unsigned long f)
+{
+	spin_lock(lock);
+}
+
+static void spin_unlock_irqrestore(spinlock_t *lock, unsigned long f)
+{
+	spin_unlock(lock);
+}
+
+#include "../../../include/linux/ptr_ring.h"
+
+static unsigned long long headcnt, tailcnt;
+static struct ptr_ring array ____cacheline_aligned_in_smp;
+
+/* implemented by ring */
+void alloc_ring(void)
+{
+	int ret = ptr_ring_init(&array, ring_size, 0);
+	assert(!ret);
+}
+
+/* guest side */
+int add_inbuf(unsigned len, void *buf, void *datap)
+{
+	int ret;
+
+	ret = __ptr_ring_produce(&array, buf);
+	if (ret >= 0) {
+		ret = 0;
+		headcnt++;
+	}
+
+	return ret;
+}
+
+/*
+ * The ptr_ring API provides no way for the producer to find out whether a
+ * given buffer was consumed.  Our tests merely require that a successful
+ * get_buf implies that add_inbuf succeeded in the past and that add_inbuf
+ * will succeed again, so fake it accordingly.
+ */
+void *get_buf(unsigned *lenp, void **bufp)
+{
+	void *datap;
+
+	if (tailcnt == headcnt || __ptr_ring_full(&array))
+		datap = NULL;
+	else {
+		datap = "Buffer\n";
+		++tailcnt;
+	}
+
+	return datap;
+}
+
+void poll_used(void)
+{
+	void *b;
+
+	do {
+		if (tailcnt == headcnt || __ptr_ring_full(&array)) {
+			b = NULL;
+			barrier();
+		} else {
+			b = "Buffer\n";
+		}
+	} while (!b);
+}
+
+void disable_call()
+{
+	assert(0);
+}
+
+bool enable_call()
+{
+	assert(0);
+}
+
+void kick_available(void)
+{
+	assert(0);
+}
+
+/* host side */
+void disable_kick()
+{
+	assert(0);
+}
+
+bool enable_kick()
+{
+	assert(0);
+}
+
+void poll_avail(void)
+{
+	void *b;
+
+	do {
+		barrier();
+		b = __ptr_ring_peek(&array);
+	} while (!b);
+}
+
+bool use_buf(unsigned *lenp, void **bufp)
+{
+	void *ptr;
+
+	ptr = __ptr_ring_consume(&array);
+
+	return ptr;
+}
+
+void call_used(void)
+{
+	assert(0);
+}
diff --git a/tools/virtio/ringtest/Makefile b/tools/virtio/ringtest/Makefile
index 6173ada..877a8a4 100644
--- a/tools/virtio/ringtest/Makefile
+++ b/tools/virtio/ringtest/Makefile
@@ -1,6 +1,6 @@
 all:
 
-all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder noring
+all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder ptr_ring noring
 
 CFLAGS += -Wall
 CFLAGS += -pthread -O2 -ggdb
@@ -8,6 +8,7 @@ LDFLAGS += -pthread -O2 -ggdb
 
 main.o: main.c main.h
 ring.o: ring.c main.h
+ptr_ring.o: ptr_ring.c main.h ../../../include/linux/ptr_ring.h
 virtio_ring_0_9.o: virtio_ring_0_9.c main.h
 virtio_ring_poll.o: virtio_ring_poll.c virtio_ring_0_9.c main.h
 virtio_ring_inorder.o: virtio_ring_inorder.c virtio_ring_0_9.c main.h
@@ -15,6 +16,7 @@ ring: ring.o main.o
 virtio_ring_0_9: virtio_ring_0_9.o main.o
 virtio_ring_poll: virtio_ring_poll.o main.o
 virtio_ring_inorder: virtio_ring_inorder.o main.o
+ptr_ring: ptr_ring.o main.o
 noring: noring.o main.o
 clean:
 	-rm main.o
@@ -22,6 +24,7 @@ clean:
 	-rm virtio_ring_0_9.o virtio_ring_0_9
 	-rm virtio_ring_poll.o virtio_ring_poll
 	-rm virtio_ring_inorder.o virtio_ring_inorder
+	-rm ptr_ring.o ptr_ring
 	-rm noring.o noring
 
 .PHONY: all clean
-- 
MST
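
As background for the get_buf comment in the test above, here is a minimal single-threaded sketch of the idea behind a pointer ring as I understand it: a NULL slot means "empty", so the producer can tell only whether its next slot is free, not which buffer the consumer took. All names here (mini_ring and its helpers) are hypothetical. This is not the kernel's struct ptr_ring, and it leaves out the locking and memory barriers a real producer/consumer pair would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified ring; NOT the kernel's struct ptr_ring. */
struct mini_ring {
	void **queue;	/* slot is NULL when empty */
	int size;
	int producer;	/* next slot to fill */
	int consumer;	/* next slot to drain */
};

static inline int mini_ring_init(struct mini_ring *r, int size)
{
	r->queue = calloc(size, sizeof(void *));	/* all slots start NULL */
	if (!r->queue)
		return -1;
	r->size = size;
	r->producer = r->consumer = 0;
	return 0;
}

static inline void mini_ring_cleanup(struct mini_ring *r)
{
	free(r->queue);
	r->queue = NULL;
}

/* Full means the producer's next slot has not been drained yet. */
static inline int mini_ring_full(const struct mini_ring *r)
{
	return r->queue[r->producer] != NULL;
}

/* Returns 0 on success, -1 if the ring is full or ptr is NULL. */
static inline int mini_ring_produce(struct mini_ring *r, void *ptr)
{
	if (!ptr || r->queue[r->producer])
		return -1;
	r->queue[r->producer++] = ptr;
	if (r->producer >= r->size)
		r->producer = 0;
	return 0;
}

/* Returns the oldest pointer, or NULL if the ring is empty. */
static inline void *mini_ring_consume(struct mini_ring *r)
{
	void *ptr = r->queue[r->consumer];

	if (ptr) {
		r->queue[r->consumer++] = NULL;	/* hand the slot back */
		if (r->consumer >= r->size)
			r->consumer = 0;
	}
	return ptr;
}
```

Because the only feedback to the producer is "is my next slot NULL", a test harness that wants a get_buf-style completion has to track it separately, which is what the headcnt/tailcnt counters in the patch do.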

^ permalink raw reply related

