LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH v3 2/13] memory-hotplug : add physical memory hotplug code to acpi_memory_device_remove
From: Wen Congyang @ 2012-07-17  3:32 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: len.brown, linux-acpi, linux-kernel, linux-mm, paulus,
	minchan.kim, kosaki.motohiro, rientjes, cl, linuxppc-dev, akpm,
	liuj97
In-Reply-To: <5004D745.3060303@jp.fujitsu.com>

At 07/17/2012 11:08 AM, Yasuaki Ishimatsu Wrote:
> Hi Wen,
> 
> 2012/07/17 11:32, Wen Congyang wrote:
>> At 07/17/2012 09:54 AM, Yasuaki Ishimatsu Wrote:
>>> Hi Wen,
>>>
>>> 2012/07/17 10:44, Yasuaki Ishimatsu wrote:
>>>> Hi Wen,
>>>>
>>>> 2012/07/13 12:35, Wen Congyang wrote:
>>>>> At 07/09/2012 06:24 PM, Yasuaki Ishimatsu Wrote:
>>>>>> acpi_memory_device_remove() has been prepared to remove physical memory.
>>>>>> But, the function only frees acpi_memory_device currentlry.
>>>>>>
>>>>>> The patch adds following functions into acpi_memory_device_remove():
>>>>>>      - offline memory
>>>>>>      - remove physical memory (only return -EBUSY)
>>>>>>      - free acpi_memory_device
>>>>>>
>>>>>> CC: David Rientjes <rientjes@google.com>
>>>>>> CC: Jiang Liu <liuj97@gmail.com>
>>>>>> CC: Len Brown <len.brown@intel.com>
>>>>>> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>>>>> CC: Paul Mackerras <paulus@samba.org>
>>>>>> CC: Christoph Lameter <cl@linux.com>
>>>>>> Cc: Minchan Kim <minchan.kim@gmail.com>
>>>>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>>>>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>> CC: Wen Congyang <wency@cn.fujitsu.com>
>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>
>>>>>> ---
>>>>>>     drivers/acpi/acpi_memhotplug.c |   26 +++++++++++++++++++++++++-
>>>>>>     drivers/base/memory.c          |   39 +++++++++++++++++++++++++++++++++++++++
>>>>>>     include/linux/memory.h         |    5 +++++
>>>>>>     include/linux/memory_hotplug.h |    1 +
>>>>>>     mm/memory_hotplug.c            |    8 ++++++++
>>>>>>     5 files changed, 78 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> Index: linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc6.orig/drivers/acpi/acpi_memhotplug.c	2012-07-09 18:08:29.946888653 +0900
>>>>>> +++ linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c	2012-07-09 18:08:43.470719531 +0900
>>>>>> @@ -29,6 +29,7 @@
>>>>>>     #include <linux/module.h>
>>>>>>     #include <linux/init.h>
>>>>>>     #include <linux/types.h>
>>>>>> +#include <linux/memory.h>
>>>>>>     #include <linux/memory_hotplug.h>
>>>>>>     #include <linux/slab.h>
>>>>>>     #include <acpi/acpi_drivers.h>
>>>>>> @@ -452,12 +453,35 @@ static int acpi_memory_device_add(struct
>>>>>>     static int acpi_memory_device_remove(struct acpi_device *device, int type)
>>>>>>     {
>>>>>>     	struct acpi_memory_device *mem_device = NULL;
>>>>>> -
>>>>>> +	struct acpi_memory_info *info, *tmp;
>>>>>> +	int result;
>>>>>> +	int node;
>>>>>>
>>>>>>     	if (!device || !acpi_driver_data(device))
>>>>>>     		return -EINVAL;
>>>>>>
>>>>>>     	mem_device = acpi_driver_data(device);
>>>>>> +
>>>>>> +	node = acpi_get_node(mem_device->device->handle);
>>>>>> +
>>>>>> +	list_for_each_entry_safe(info, tmp, &mem_device->res_list, list) {
>>>>>> +		if (!info->enabled)
>>>>>> +			continue;
>>>>>> +
>>>>>> +		if (!is_memblk_offline(info->start_addr, info->length)) {
>>>>>> +			result = offline_memory(info->start_addr, info->length);
>>>>>> +			if (result)
>>>>>> +				return result;
>>>>>> +		}
>>>>>> +
>>>>>> +		result = remove_memory(node, info->start_addr, info->length);
>>>>>
>>>>> The user may online the memory between offline_memory() and remove_memory().
>>>>> So I think we should lock memory hotplug before check the memory's status
>>>>> and release it after remove_memory().
>>>>
>>>> How about get "mem_block->state_mutex" of removed memory? When offlining
>>>> memory, we need to change "memory_block->state" into "MEM_OFFLINE".
>>>> In this case, we get mem_block->state_mutex. So I think the mutex lock
>>>> is beneficial.
>>>
>>> It is not good idea since remove_memory frees mem_block structure...
>>> Do you have any ideas?
>>
>> Hmm, split offline_memory() to 2 functions: offline_pages() and __offline_pages()
>>
>> offline_pages()
>> 	lock_memory_hotplug();
>> 	__offline_pages();
>> 	unlock_memory_hotplug();
>>
>> and implement remove_memory() like this:
>> remove_memory()
>> 	lock_memory_hotplug()
>> 	if (!is_memblk_offline()) {
>> 		__offline_pages();
>> 	}
>> 	// cleanup
>> 	unlock_memory_hotplug();
>>
>> What about this?
> 
> I also thought about it once. But a problem remains. Current offilne_pages()
> cannot realize the memory has been removed by remove_memory(). So even if
> protecting the race by lock_memory_hotplug(), offline_pages() can offline
> the removed memory. offline_pages() should have the means to know the memory
> was removed. But I don't have good idea.

We can not online/offline part of memory block, so what about this?

remove_memory()
	lock_memory_hotplug()
	for each memory block:
		if (!is_memblk_offline()) {
			__offline_pages();
		}
	// cleanup
	unlock_memory_hotplug();

Thanks
Wen Congyang
> 
> Thanks,
> Yasuaki Ishimatsu
> 
>>
>> Thanks
>> Wen Congyang
>>>
>>> Thanks,
>>> Yasuaki Ishimatsu
>>>
>>>> Thanks,
>>>> Yasuaki Ishimatsu
>>>>
>>>>>
>>>>> Thanks
>>>>> Wen Congyang
>>>>>
>>>>>> +		if (result)
>>>>>> +			return result;
>>>>>> +
>>>>>> +		list_del(&info->list);
>>>>>> +		kfree(info);
>>>>>> +	}
>>>>>> +
>>>>>>     	kfree(mem_device);
>>>>>>
>>>>>>     	return 0;
>>>>>> Index: linux-3.5-rc6/include/linux/memory_hotplug.h
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc6.orig/include/linux/memory_hotplug.h	2012-07-09 18:08:29.955888542 +0900
>>>>>> +++ linux-3.5-rc6/include/linux/memory_hotplug.h	2012-07-09 18:08:43.471719518 +0900
>>>>>> @@ -233,6 +233,7 @@ static inline int is_mem_section_removab
>>>>>>     extern int mem_online_node(int nid);
>>>>>>     extern int add_memory(int nid, u64 start, u64 size);
>>>>>>     extern int arch_add_memory(int nid, u64 start, u64 size);
>>>>>> +extern int remove_memory(int nid, u64 start, u64 size);
>>>>>>     extern int offline_memory(u64 start, u64 size);
>>>>>>     extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
>>>>>>     								int nr_pages);
>>>>>> Index: linux-3.5-rc6/mm/memory_hotplug.c
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc6.orig/mm/memory_hotplug.c	2012-07-09 18:08:29.953888567 +0900
>>>>>> +++ linux-3.5-rc6/mm/memory_hotplug.c	2012-07-09 18:08:43.476719455 +0900
>>>>>> @@ -659,6 +659,14 @@ out:
>>>>>>     }
>>>>>>     EXPORT_SYMBOL_GPL(add_memory);
>>>>>>
>>>>>> +int remove_memory(int nid, u64 start, u64 size)
>>>>>> +{
>>>>>> +	return -EBUSY;
>>>>>> +
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(remove_memory);
>>>>>> +
>>>>>> +
>>>>>>     #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>>>     /*
>>>>>>      * A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
>>>>>> Index: linux-3.5-rc6/drivers/base/memory.c
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc6.orig/drivers/base/memory.c	2012-07-09 18:08:29.947888640 +0900
>>>>>> +++ linux-3.5-rc6/drivers/base/memory.c	2012-07-09 18:10:54.880076739 +0900
>>>>>> @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
>>>>>>     }
>>>>>>     EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>>>>>>
>>>>>> +bool is_memblk_offline(unsigned long start, unsigned long size)
>>>>>> +{
>>>>>> +	struct memory_block *mem = NULL;
>>>>>> +	struct mem_section *section;
>>>>>> +	unsigned long start_pfn, end_pfn;
>>>>>> +	unsigned long pfn, section_nr;
>>>>>> +
>>>>>> +	start_pfn = PFN_DOWN(start);
>>>>>> +	end_pfn = start_pfn + PFN_DOWN(start);
>>>>>> +
>>>>>> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>>>>> +		section_nr = pfn_to_section_nr(pfn);
>>>>>> +		if (!present_section_nr(section_nr));
>>>>>> +			continue;
>>>>>> +
>>>>>> +		section = __nr_to_section(section_nr);
>>>>>> +		/* same memblock? */
>>>>>> +		if (mem)
>>>>>> +			if((section_nr >= mem->start_section_nr) &&
>>>>>> +			   (section_nr <= mem->end_section_nr))
>>>>>> +				continue;
>>>>>> +
>>>>>> +		mem = find_memory_block_hinted(section, mem);
>>>>>> +		if (!mem)
>>>>>> +			continue;
>>>>>> +		if (mem->state == MEM_OFFLINE)
>>>>>> +			continue;
>>>>>> +
>>>>>> +		kobject_put(&mem->dev.kobj);
>>>>>> +		return false;
>>>>>> +	}
>>>>>> +
>>>>>> +	if (mem)
>>>>>> +		kobject_put(&mem->dev.kobj);
>>>>>> +
>>>>>> +	return true;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(is_memblk_offline);
>>>>>> +
>>>>>>     /*
>>>>>>      * register_memory - Setup a sysfs device for a memory block
>>>>>>      */
>>>>>> Index: linux-3.5-rc6/include/linux/memory.h
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc6.orig/include/linux/memory.h	2012-07-08 09:23:56.000000000 +0900
>>>>>> +++ linux-3.5-rc6/include/linux/memory.h	2012-07-09 18:08:43.484719355 +0900
>>>>>> @@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
>>>>>>     {
>>>>>>     	return 0;
>>>>>>     }
>>>>>> +static inline bool is_memblk_offline(unsigned long start, unsigned long size)
>>>>>> +{
>>>>>> +	return false;
>>>>>> +}
>>>>>>     #else
>>>>>>     extern int register_memory_notifier(struct notifier_block *nb);
>>>>>>     extern void unregister_memory_notifier(struct notifier_block *nb);
>>>>>> @@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
>>>>>>     extern struct memory_block *find_memory_block_hinted(struct mem_section *,
>>>>>>     							struct memory_block *);
>>>>>>     extern struct memory_block *find_memory_block(struct mem_section *);
>>>>>> +extern bool is_memblk_offline(unsigned long start, unsigned long size);
>>>>>>     #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
>>>>>>     enum mem_add_context { BOOT, HOTPLUG };
>>>>>>     #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>>> see: http://www.linux-mm.org/ .
>>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v3 3/4] fsl-dma: change release process of dma descriptor for supporting async_tx
From: Li Yang @ 2012-07-17  3:36 UTC (permalink / raw)
  To: Qiang Liu
  Cc: Ira W. Snyder, Vinod Koul, herbert, linux-crypto, Dan Williams,
	linuxppc-dev, davem
In-Reply-To: <1342411709-29895-1-git-send-email-qiang.liu@freescale.com>

On Mon, Jul 16, 2012 at 12:08 PM, Qiang Liu <qiang.liu@freescale.com> wrote:
> Fix the potential risk when enable config NET_DMA and ASYNC_TX.
> Async_tx is lack of support in current release process of dma descriptor,
> all descriptors will be released whatever is acked or no-acked by async_tx,
> so there is a potential race condition when dma engine is uesd by others
> clients (e.g. when enable NET_DMA to offload TCP).
>
> In our case, a race condition which is raised when use both of talitos
> and dmaengine to offload xor is because napi scheduler will sync all
> pending requests in dma channels, it affects the process of raid operations
> due to ack_tx is not checked in fsl dma. The no-acked descriptor is freed
> which is submitted just now, as a dependent tx, this freed descriptor trigger
> BUG_ON(async_tx_test_ack(depend_tx)) in async_tx_submit().
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Vinod Koul <vinod.koul@intel.com>
> Cc: Li Yang <leoli@freescale.com>
> Cc: Ira W. Snyder <iws@ovro.caltech.edu>
> Signed-off-by: Qiang Liu <qiang.liu@freescale.com>

Also separate the function ordering change and real code change into
different patches when you work on the next patch set.

- Leo

^ permalink raw reply

* Re: [PATCH v6 0/7] minimal alignment for p2p bars
From: Ram Pai @ 2012-07-17  3:40 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-pci, linuxram, linuxppc-dev, bhelgaas, yinghai
In-Reply-To: <1342491799-30303-1-git-send-email-shangw@linux.vnet.ibm.com>

On Tue, Jul 17, 2012 at 10:23:12AM +0800, Gavin Shan wrote:
> v1 -> v2:
> 	* Shorten the varaible names so that they looks more short.
> 	* Changelog adjustment so that they looks more meaningful.
> 
> v2 -> v3:
> 	* Rebase to 3.5.RC4
> 
> v3 -> v4:
> 	* Merge Yinghai's patches.
> 
> v3 -> v4:
> 	* Split patch for easy review.
> 	* Add function to retrieve the minimal alignment of p2p bridge. 
> 
> v4 -> v5:
> 	* Rebase to 3.5.RC7
> 	* Introduce weak function pcibios_window_alignment() to retrieve
> 	  I/O and memory alignment for P2P bridges.
> 	* Introduce pcibios_window_alignment() for ppc to override the
> 	  PCI function.
> 	* Add ppc_md.pcibios_window_alignment() for specific platform like
> 	  powernv can override ppc's pcibios_window_alignment().
> 
> v5 -> v6:
> 	* Refactor pcibios_window_alignment() so the platform-specific
> 	  implementation needn't return the default alignment according
> 	  to Bjorn's suggestion.
> 	* Simplify pbus_size_mem() according to Bjorn's suggestion: Just
> 	  check the platform required alignment at very end and adjust
> 	  the "min_align" if necessary.
> 
> Lu Yinghai(3):
>   pci: change variable name for find_pci_host_bridge
>   pci: argument pci_bus for find_pci_host_bridge
>   pci: fiddle with conversion of pci and CPU address


Gavin,
	Do you need Yinghai's first 3 patches to implement your
	functionality?

	I dont see those new functions being called from anywhere
	in your patches.

RP

^ permalink raw reply

* RE: [PATCH v3 0/4] Raid: enable talitos xor offload for improving performance
From: Liu Qiang-B32616 @ 2012-07-17  4:01 UTC (permalink / raw)
  To: Phillips Kim-R1AAHA
  Cc: herbert@gondor.hengli.com.au, dan.j.williams@intel.com,
	linuxppc-dev@lists.ozlabs.org, linux-crypto@vger.kernel.org,
	Li Yang-R58472
In-Reply-To: <20120716200423.830d174e69fc0aafb40d2858@freescale.com>

> -----Original Message-----
> From: Phillips Kim-R1AAHA
> Sent: Tuesday, July 17, 2012 9:04 AM
> To: Liu Qiang-B32616
> Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Li Yang-
> R58472; dan.j.williams@intel.com; herbert@gondor.hengli.com.au
> Subject: Re: [PATCH v3 0/4] Raid: enable talitos xor offload for
> improving performance
>=20
> On Mon, 16 Jul 2012 12:07:16 +0800
> Qiang Liu <qiang.liu@freescale.com> wrote:
>=20
> >  drivers/crypto/Kconfig   |    9 +
> >  drivers/crypto/talitos.c |  410
> +++++++++++++++++++++++++++++++++++++++++++
> >  drivers/crypto/talitos.h |   53 ++++++
> >  drivers/dma/fsldma.c     |  436 +++++++++++++++++++++++++-------------
> --------
> >  drivers/dma/fsldma.h     |    1 +
> >  5 files changed, 708 insertions(+), 201 deletions(-)
>=20
> Given the pending talitos patches, this patchseries doesn't apply
> cleanly:  can you rebase onto [1], which is based on Herbert's
> cryptodev tree and contain's Horia's four patches?  They didn't get
> any negative comments, so I assume eventually they will be applied,
> and doing so will make Herbert's life easier.
>=20
> I applied the series on Herbert's cryptodev, and while fsldma
> already had this build warning:
>=20
> drivers/dma/fsldma.c: In function 'fsl_dma_tx_submit':
> drivers/dma/fsldma.c:636:2: warning: 'cookie' may be used uninitialized
> in this function [-Wuninitialized]
Kim, I will fix it in an another separate patch.

>=20
> this patchseries introduces a new one:
>=20
> drivers/dma/fsldma.c: In function 'dma_do_tasklet':
> drivers/dma/fsldma.c:1134:16: warning: unused variable 'flags' [-Wunused-
> variable]
Sorry, my bad. I will correct it in v4 5/6 (according to Li Yang's comments=
,
v3 3/4 should be split up). Thanks.

>=20
> I'll wait for a re-post, after these and Ira's comments are
> addressed, before trying to test again.
I agree, I think his comments is very important.

>=20
> Thanks,
>=20
> Kim
>=20
> [1] git://git.freescale.com/crypto/cryptodev.git

^ permalink raw reply

* RE: [PATCH v3 3/4] fsl-dma: change release process of dma descriptor for supporting async_tx
From: Liu Qiang-B32616 @ 2012-07-17  4:03 UTC (permalink / raw)
  To: Li Yang-R58472
  Cc: Ira W. Snyder, Vinod Koul, herbert@gondor.hengli.com.au,
	linux-crypto@vger.kernel.org, Dan Williams,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <CADRPPNS-cOwCTVYnjq==W-f1-AAWctyOFg_6bkSDZPtM8Jf+Lw@mail.gmail.com>

PiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiBGcm9tOiBsaW51eC1jcnlwdG8tb3duZXJA
dmdlci5rZXJuZWwub3JnIFttYWlsdG86bGludXgtY3J5cHRvLQ0KPiBvd25lckB2Z2VyLmtlcm5l
bC5vcmddIE9uIEJlaGFsZiBPZiBMaSBZYW5nDQo+IFNlbnQ6IFR1ZXNkYXksIEp1bHkgMTcsIDIw
MTIgMTE6MzcgQU0NCj4gVG86IExpdSBRaWFuZy1CMzI2MTYNCj4gQ2M6IGxpbnV4LWNyeXB0b0B2
Z2VyLmtlcm5lbC5vcmc7IGxpbnV4cHBjLWRldkBsaXN0cy5vemxhYnMub3JnOyBJcmEgVy4NCj4g
U255ZGVyOyBWaW5vZCBLb3VsOyBoZXJiZXJ0QGdvbmRvci5oZW5nbGkuY29tLmF1OyBEYW4gV2ls
bGlhbXM7DQo+IGRhdmVtQGRhdmVtbG9mdC5uZXQNCj4gU3ViamVjdDogUmU6IFtQQVRDSCB2MyAz
LzRdIGZzbC1kbWE6IGNoYW5nZSByZWxlYXNlIHByb2Nlc3Mgb2YgZG1hDQo+IGRlc2NyaXB0b3Ig
Zm9yIHN1cHBvcnRpbmcgYXN5bmNfdHgNCj4gDQo+IE9uIE1vbiwgSnVsIDE2LCAyMDEyIGF0IDEy
OjA4IFBNLCBRaWFuZyBMaXUgPHFpYW5nLmxpdUBmcmVlc2NhbGUuY29tPg0KPiB3cm90ZToNCj4g
PiBGaXggdGhlIHBvdGVudGlhbCByaXNrIHdoZW4gZW5hYmxlIGNvbmZpZyBORVRfRE1BIGFuZCBB
U1lOQ19UWC4NCj4gPiBBc3luY190eCBpcyBsYWNrIG9mIHN1cHBvcnQgaW4gY3VycmVudCByZWxl
YXNlIHByb2Nlc3Mgb2YgZG1hDQo+ID4gZGVzY3JpcHRvciwgYWxsIGRlc2NyaXB0b3JzIHdpbGwg
YmUgcmVsZWFzZWQgd2hhdGV2ZXIgaXMgYWNrZWQgb3INCj4gPiBuby1hY2tlZCBieSBhc3luY190
eCwgc28gdGhlcmUgaXMgYSBwb3RlbnRpYWwgcmFjZSBjb25kaXRpb24gd2hlbiBkbWENCj4gPiBl
bmdpbmUgaXMgdWVzZCBieSBvdGhlcnMgY2xpZW50cyAoZS5nLiB3aGVuIGVuYWJsZSBORVRfRE1B
IHRvIG9mZmxvYWQNCj4gVENQKS4NCj4gPg0KPiA+IEluIG91ciBjYXNlLCBhIHJhY2UgY29uZGl0
aW9uIHdoaWNoIGlzIHJhaXNlZCB3aGVuIHVzZSBib3RoIG9mIHRhbGl0b3MNCj4gPiBhbmQgZG1h
ZW5naW5lIHRvIG9mZmxvYWQgeG9yIGlzIGJlY2F1c2UgbmFwaSBzY2hlZHVsZXIgd2lsbCBzeW5j
IGFsbA0KPiA+IHBlbmRpbmcgcmVxdWVzdHMgaW4gZG1hIGNoYW5uZWxzLCBpdCBhZmZlY3RzIHRo
ZSBwcm9jZXNzIG9mIHJhaWQNCj4gPiBvcGVyYXRpb25zIGR1ZSB0byBhY2tfdHggaXMgbm90IGNo
ZWNrZWQgaW4gZnNsIGRtYS4gVGhlIG5vLWFja2VkDQo+ID4gZGVzY3JpcHRvciBpcyBmcmVlZCB3
aGljaCBpcyBzdWJtaXR0ZWQganVzdCBub3csIGFzIGEgZGVwZW5kZW50IHR4LA0KPiA+IHRoaXMg
ZnJlZWQgZGVzY3JpcHRvciB0cmlnZ2VyDQo+ID4gQlVHX09OKGFzeW5jX3R4X3Rlc3RfYWNrKGRl
cGVuZF90eCkpIGluIGFzeW5jX3R4X3N1Ym1pdCgpLg0KPiA+DQo+ID4gQ2M6IERhbiBXaWxsaWFt
cyA8ZGFuLmoud2lsbGlhbXNAaW50ZWwuY29tPg0KPiA+IENjOiBWaW5vZCBLb3VsIDx2aW5vZC5r
b3VsQGludGVsLmNvbT4NCj4gPiBDYzogTGkgWWFuZyA8bGVvbGlAZnJlZXNjYWxlLmNvbT4NCj4g
PiBDYzogSXJhIFcuIFNueWRlciA8aXdzQG92cm8uY2FsdGVjaC5lZHU+DQo+ID4gU2lnbmVkLW9m
Zi1ieTogUWlhbmcgTGl1IDxxaWFuZy5saXVAZnJlZXNjYWxlLmNvbT4NCj4gDQo+IEFsc28gc2Vw
YXJhdGUgdGhlIGZ1bmN0aW9uIG9yZGVyaW5nIGNoYW5nZSBhbmQgcmVhbCBjb2RlIGNoYW5nZSBp
bnRvDQo+IGRpZmZlcmVudCBwYXRjaGVzIHdoZW4geW91IHdvcmsgb24gdGhlIG5leHQgcGF0Y2gg
c2V0Lg0KQWNjZXB0LCBJIHdpbGwgc3BpbHQgaXQgdXAgaW4gdjQuIFRoYW5rcy4NCg0KPiANCj4g
LSBMZW8NCj4gLS0NCj4gVG8gdW5zdWJzY3JpYmUgZnJvbSB0aGlzIGxpc3Q6IHNlbmQgdGhlIGxp
bmUgInVuc3Vic2NyaWJlIGxpbnV4LWNyeXB0byINCj4gaW4gdGhlIGJvZHkgb2YgYSBtZXNzYWdl
IHRvIG1ham9yZG9tb0B2Z2VyLmtlcm5lbC5vcmcgTW9yZSBtYWpvcmRvbW8gaW5mbw0KPiBhdCAg
aHR0cDovL3ZnZXIua2VybmVsLm9yZy9tYWpvcmRvbW8taW5mby5odG1sDQoNCg==

^ permalink raw reply

* RE: [linuxppc-release] [PATCH v3 4/4] fsl-dma: use spin_lock_bh to instead of spin_lock_irqsave
From: Liu Qiang-B32616 @ 2012-07-17  4:23 UTC (permalink / raw)
  To: Tabi Timur-B04825
  Cc: Li Yang-R58472, Vinod Koul, herbert@gondor.hengli.com.au,
	linux-crypto@vger.kernel.org, Dan Williams,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <5004244D.4000106@freescale.com>

> -----Original Message-----
> From: Tabi Timur-B04825
> Sent: Monday, July 16, 2012 10:25 PM
> To: Liu Qiang-B32616
> Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Vinod
> Koul; herbert@gondor.hengli.com.au; Dan Williams; Li Yang-R58472;
> davem@davemloft.net
> Subject: Re: [linuxppc-release] [PATCH v3 4/4] fsl-dma: use spin_lock_bh
> to instead of spin_lock_irqsave
>=20
> Qiang Liu wrote:
> > Use spin_lock_bh to instead of spin_lock_irqsave for improving
> performance.
>=20
> You forgot to include the evidence that performance has improved, as well
> as an explanation why it's okay to use spin_lock_bh, and why it's faster.
>   I told you to respin the patch with that information in the patch
> description.
I attached the test result in v3 0/4, performance is improved by 2%.
For my understanding, there is not any place to access descriptor lists
in fsl-dma interrupt service handler except its tasklet, spin_lock_bh()
is born for this. Interrupts will be turned off and context will be save in
irqsave, there is needless to use irqsave in our case. You can refer to the
implement of mv_xor.c or ioap-adma.c.
If you think my explanation is ok, I can add it in the patch.
Thanks.

>=20
> --
> Timur Tabi
> Linux kernel developer at Freescale

^ permalink raw reply

* Re: [RFC PATCH v3 2/13] memory-hotplug : add physical memory hotplug code to acpi_memory_device_remove
From: Yasuaki Ishimatsu @ 2012-07-17  4:51 UTC (permalink / raw)
  To: Wen Congyang
  Cc: len.brown, linux-acpi, linux-kernel, linux-mm, paulus,
	minchan.kim, kosaki.motohiro, rientjes, cl, linuxppc-dev, akpm,
	liuj97
In-Reply-To: <5004DCC2.4030905@cn.fujitsu.com>

Hi Wen,

2012/07/17 12:32, Wen Congyang wrote:
> At 07/17/2012 11:08 AM, Yasuaki Ishimatsu Wrote:
>> Hi Wen,
>>
>> 2012/07/17 11:32, Wen Congyang wrote:
>>> At 07/17/2012 09:54 AM, Yasuaki Ishimatsu Wrote:
>>>> Hi Wen,
>>>>
>>>> 2012/07/17 10:44, Yasuaki Ishimatsu wrote:
>>>>> Hi Wen,
>>>>>
>>>>> 2012/07/13 12:35, Wen Congyang wrote:
>>>>>> At 07/09/2012 06:24 PM, Yasuaki Ishimatsu Wrote:
>>>>>>> acpi_memory_device_remove() has been prepared to remove physical memory.
>>>>>>> But, the function only frees acpi_memory_device currentlry.
>>>>>>>
>>>>>>> The patch adds following functions into acpi_memory_device_remove():
>>>>>>>       - offline memory
>>>>>>>       - remove physical memory (only return -EBUSY)
>>>>>>>       - free acpi_memory_device
>>>>>>>
>>>>>>> CC: David Rientjes <rientjes@google.com>
>>>>>>> CC: Jiang Liu <liuj97@gmail.com>
>>>>>>> CC: Len Brown <len.brown@intel.com>
>>>>>>> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>>>>>> CC: Paul Mackerras <paulus@samba.org>
>>>>>>> CC: Christoph Lameter <cl@linux.com>
>>>>>>> Cc: Minchan Kim <minchan.kim@gmail.com>
>>>>>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>>>>>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>>> CC: Wen Congyang <wency@cn.fujitsu.com>
>>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>>
>>>>>>> ---
>>>>>>>      drivers/acpi/acpi_memhotplug.c |   26 +++++++++++++++++++++++++-
>>>>>>>      drivers/base/memory.c          |   39 +++++++++++++++++++++++++++++++++++++++
>>>>>>>      include/linux/memory.h         |    5 +++++
>>>>>>>      include/linux/memory_hotplug.h |    1 +
>>>>>>>      mm/memory_hotplug.c            |    8 ++++++++
>>>>>>>      5 files changed, 78 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> Index: linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c
>>>>>>> ===================================================================
>>>>>>> --- linux-3.5-rc6.orig/drivers/acpi/acpi_memhotplug.c	2012-07-09 18:08:29.946888653 +0900
>>>>>>> +++ linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c	2012-07-09 18:08:43.470719531 +0900
>>>>>>> @@ -29,6 +29,7 @@
>>>>>>>      #include <linux/module.h>
>>>>>>>      #include <linux/init.h>
>>>>>>>      #include <linux/types.h>
>>>>>>> +#include <linux/memory.h>
>>>>>>>      #include <linux/memory_hotplug.h>
>>>>>>>      #include <linux/slab.h>
>>>>>>>      #include <acpi/acpi_drivers.h>
>>>>>>> @@ -452,12 +453,35 @@ static int acpi_memory_device_add(struct
>>>>>>>      static int acpi_memory_device_remove(struct acpi_device *device, int type)
>>>>>>>      {
>>>>>>>      	struct acpi_memory_device *mem_device = NULL;
>>>>>>> -
>>>>>>> +	struct acpi_memory_info *info, *tmp;
>>>>>>> +	int result;
>>>>>>> +	int node;
>>>>>>>
>>>>>>>      	if (!device || !acpi_driver_data(device))
>>>>>>>      		return -EINVAL;
>>>>>>>
>>>>>>>      	mem_device = acpi_driver_data(device);
>>>>>>> +
>>>>>>> +	node = acpi_get_node(mem_device->device->handle);
>>>>>>> +
>>>>>>> +	list_for_each_entry_safe(info, tmp, &mem_device->res_list, list) {
>>>>>>> +		if (!info->enabled)
>>>>>>> +			continue;
>>>>>>> +
>>>>>>> +		if (!is_memblk_offline(info->start_addr, info->length)) {
>>>>>>> +			result = offline_memory(info->start_addr, info->length);
>>>>>>> +			if (result)
>>>>>>> +				return result;
>>>>>>> +		}
>>>>>>> +
>>>>>>> +		result = remove_memory(node, info->start_addr, info->length);
>>>>>>
>>>>>> The user may online the memory between offline_memory() and remove_memory().
>>>>>> So I think we should lock memory hotplug before check the memory's status
>>>>>> and release it after remove_memory().
>>>>>
>>>>> How about get "mem_block->state_mutex" of removed memory? When offlining
>>>>> memory, we need to change "memory_block->state" into "MEM_OFFLINE".
>>>>> In this case, we get mem_block->state_mutex. So I think the mutex lock
>>>>> is beneficial.
>>>>
>>>> It is not good idea since remove_memory frees mem_block structure...
>>>> Do you have any ideas?
>>>
>>> Hmm, split offline_memory() to 2 functions: offline_pages() and __offline_pages()
>>>
>>> offline_pages()
>>> 	lock_memory_hotplug();
>>> 	__offline_pages();
>>> 	unlock_memory_hotplug();
>>>
>>> and implement remove_memory() like this:
>>> remove_memory()
>>> 	lock_memory_hotplug()
>>> 	if (!is_memblk_offline()) {
>>> 		__offline_pages();
>>> 	}
>>> 	// cleanup
>>> 	unlock_memory_hotplug();
>>>
>>> What about this?
>>
>> I also thought about it once. But a problem remains. Current offilne_pages()
>> cannot realize the memory has been removed by remove_memory(). So even if
>> protecting the race by lock_memory_hotplug(), offline_pages() can offline
>> the removed memory. offline_pages() should have the means to know the memory
>> was removed. But I don't have good idea.
> 
> We can not online/offline part of memory block, so what about this?

It seems you do not understand my concern.
When memory_remove() and offline_pages() run to same memory simultaneously,
offline_pages runs to removed memory.

memory_remove()              | offline_pages()
-----------------------------------------------------------
lock_memory_hotplug()        |
                             | wait at lock_memory_hotplug()
remove memory                |
unlock_memory_hotplug()      |
                             | wake up and start offline_pages()
			     | offline page
			     | => but the memory has already removed
			     |    by memory_remove()

In this case, offline_page() may access removed memory.

Thanks,
Yasuaki Ishimatsu

> 
> remove_memory()
> 	lock_memory_hotplug()
> 	for each memory block:
> 		if (!is_memblk_offline()) {
> 			__offline_pages();
> 		}
> 	// cleanup
> 	unlock_memory_hotplug();
> 
> Thanks
> Wen Congyang
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>>>
>>> Thanks
>>> Wen Congyang
>>>>
>>>> Thanks,
>>>> Yasuaki Ishimatsu
>>>>
>>>>> Thanks,
>>>>> Yasuaki Ishimatsu
>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Wen Congyang
>>>>>>
>>>>>>> +		if (result)
>>>>>>> +			return result;
>>>>>>> +
>>>>>>> +		list_del(&info->list);
>>>>>>> +		kfree(info);
>>>>>>> +	}
>>>>>>> +
>>>>>>>      	kfree(mem_device);
>>>>>>>
>>>>>>>      	return 0;
>>>>>>> Index: linux-3.5-rc6/include/linux/memory_hotplug.h
>>>>>>> ===================================================================
>>>>>>> --- linux-3.5-rc6.orig/include/linux/memory_hotplug.h	2012-07-09 18:08:29.955888542 +0900
>>>>>>> +++ linux-3.5-rc6/include/linux/memory_hotplug.h	2012-07-09 18:08:43.471719518 +0900
>>>>>>> @@ -233,6 +233,7 @@ static inline int is_mem_section_removab
>>>>>>>      extern int mem_online_node(int nid);
>>>>>>>      extern int add_memory(int nid, u64 start, u64 size);
>>>>>>>      extern int arch_add_memory(int nid, u64 start, u64 size);
>>>>>>> +extern int remove_memory(int nid, u64 start, u64 size);
>>>>>>>      extern int offline_memory(u64 start, u64 size);
>>>>>>>      extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
>>>>>>>      								int nr_pages);
>>>>>>> Index: linux-3.5-rc6/mm/memory_hotplug.c
>>>>>>> ===================================================================
>>>>>>> --- linux-3.5-rc6.orig/mm/memory_hotplug.c	2012-07-09 18:08:29.953888567 +0900
>>>>>>> +++ linux-3.5-rc6/mm/memory_hotplug.c	2012-07-09 18:08:43.476719455 +0900
>>>>>>> @@ -659,6 +659,14 @@ out:
>>>>>>>      }
>>>>>>>      EXPORT_SYMBOL_GPL(add_memory);
>>>>>>>
>>>>>>> +int remove_memory(int nid, u64 start, u64 size)
>>>>>>> +{
>>>>>>> +	return -EBUSY;
>>>>>>> +
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(remove_memory);
>>>>>>> +
>>>>>>> +
>>>>>>>      #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>>>>      /*
>>>>>>>       * A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
>>>>>>> Index: linux-3.5-rc6/drivers/base/memory.c
>>>>>>> ===================================================================
>>>>>>> --- linux-3.5-rc6.orig/drivers/base/memory.c	2012-07-09 18:08:29.947888640 +0900
>>>>>>> +++ linux-3.5-rc6/drivers/base/memory.c	2012-07-09 18:10:54.880076739 +0900
>>>>>>> @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
>>>>>>>      }
>>>>>>>      EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>>>>>>>
>>>>>>> +bool is_memblk_offline(unsigned long start, unsigned long size)
>>>>>>> +{
>>>>>>> +	struct memory_block *mem = NULL;
>>>>>>> +	struct mem_section *section;
>>>>>>> +	unsigned long start_pfn, end_pfn;
>>>>>>> +	unsigned long pfn, section_nr;
>>>>>>> +
>>>>>>> +	start_pfn = PFN_DOWN(start);
>>>>>>> +	end_pfn = start_pfn + PFN_DOWN(start);
>>>>>>> +
>>>>>>> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>>>>>> +		section_nr = pfn_to_section_nr(pfn);
>>>>>>> +		if (!present_section_nr(section_nr));
>>>>>>> +			continue;
>>>>>>> +
>>>>>>> +		section = __nr_to_section(section_nr);
>>>>>>> +		/* same memblock? */
>>>>>>> +		if (mem)
>>>>>>> +			if((section_nr >= mem->start_section_nr) &&
>>>>>>> +			   (section_nr <= mem->end_section_nr))
>>>>>>> +				continue;
>>>>>>> +
>>>>>>> +		mem = find_memory_block_hinted(section, mem);
>>>>>>> +		if (!mem)
>>>>>>> +			continue;
>>>>>>> +		if (mem->state == MEM_OFFLINE)
>>>>>>> +			continue;
>>>>>>> +
>>>>>>> +		kobject_put(&mem->dev.kobj);
>>>>>>> +		return false;
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	if (mem)
>>>>>>> +		kobject_put(&mem->dev.kobj);
>>>>>>> +
>>>>>>> +	return true;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL(is_memblk_offline);
>>>>>>> +
>>>>>>>      /*
>>>>>>>       * register_memory - Setup a sysfs device for a memory block
>>>>>>>       */
>>>>>>> Index: linux-3.5-rc6/include/linux/memory.h
>>>>>>> ===================================================================
>>>>>>> --- linux-3.5-rc6.orig/include/linux/memory.h	2012-07-08 09:23:56.000000000 +0900
>>>>>>> +++ linux-3.5-rc6/include/linux/memory.h	2012-07-09 18:08:43.484719355 +0900
>>>>>>> @@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
>>>>>>>      {
>>>>>>>      	return 0;
>>>>>>>      }
>>>>>>> +static inline bool is_memblk_offline(unsigned long start, unsigned long size)
>>>>>>> +{
>>>>>>> +	return false;
>>>>>>> +}
>>>>>>>      #else
>>>>>>>      extern int register_memory_notifier(struct notifier_block *nb);
>>>>>>>      extern void unregister_memory_notifier(struct notifier_block *nb);
>>>>>>> @@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
>>>>>>>      extern struct memory_block *find_memory_block_hinted(struct mem_section *,
>>>>>>>      							struct memory_block *);
>>>>>>>      extern struct memory_block *find_memory_block(struct mem_section *);
>>>>>>> +extern bool is_memblk_offline(unsigned long start, unsigned long size);
>>>>>>>      #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
>>>>>>>      enum mem_add_context { BOOT, HOTPLUG };
>>>>>>>      #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>>>> see: http://www.linux-mm.org/ .
>>>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>>
>>
>>
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Ram Pai @ 2012-07-17  5:05 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-pci, linuxram, linuxppc-dev, bhelgaas, yinghai
In-Reply-To: <1342491799-30303-6-git-send-email-shangw@linux.vnet.ibm.com>

On Tue, Jul 17, 2012 at 10:23:17AM +0800, Gavin Shan wrote:
> The patch changes function pbus_size_io() and pbus_size_mem() to
> do resource (I/O, memory and prefetchable memory) reassignment
> based on the minimal alignments for the p2p bridge, which was
> retrieved by function window_alignment().
> 
> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> ---
>  drivers/pci/setup-bus.c |   13 +++++++++----
>  1 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index c0fb9da..a29483a 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -731,6 +731,7 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
>  	struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
>  	unsigned long size = 0, size0 = 0, size1 = 0;
>  	resource_size_t children_add_size = 0;
> +	resource_size_t io_align;
> 
>  	if (!b_res)
>   		return;
> @@ -756,13 +757,15 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
>  				children_add_size += get_res_add_size(realloc_head, r);
>  		}
>  	}
> +
> +	io_align = window_alignment(bus, IORESOURCE_IO);

this should also be
	io_align = max(4096, window_alignment(bus, IORESOURCE_IO));

right?


>  	size0 = calculate_iosize(size, min_size, size1,
> -			resource_size(b_res), 4096);
> +			resource_size(b_res), io_align);
>  	if (children_add_size > add_size)
>  		add_size = children_add_size;
>  	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
>  		calculate_iosize(size, min_size, add_size + size1,
> -			resource_size(b_res), 4096);
> +			resource_size(b_res), io_align);
>  	if (!size0 && !size1) {
>  		if (b_res->start || b_res->end)
>  			dev_info(&bus->self->dev, "disabling bridge window "
> @@ -772,11 +775,11 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
>  		return;
>  	}
>  	/* Alignment of the IO window is always 4K */
> -	b_res->start = 4096;
> +	b_res->start = io_align;
>  	b_res->end = b_res->start + size0 - 1;
>  	b_res->flags |= IORESOURCE_STARTALIGN;
>  	if (size1 > size0 && realloc_head) {
> -		add_to_list(realloc_head, bus->self, b_res, size1-size0, 4096);
> +		add_to_list(realloc_head, bus->self, b_res, size1-size0, io_align);
>  		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window "
>  				 "%pR to [bus %02x-%02x] add_size %lx\n", b_res,
>  				 bus->secondary, bus->subordinate, size1-size0);
> @@ -875,6 +878,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>  			min_align = align1 >> 1;
>  		align += aligns[order];
>  	}
> +
> +	min_align = max(min_align, window_alignment(bus, type));

  'type' can sometimes be (IORESOURCE_MEM | IORESOURCE_PREFETCH), which
  can lead to unpredictable results depending on how window_alignment()
  is implemented... Hence to be on the safer side I suggest

	min_align = max(min_align, window_alignment(bus, b_res->flags & mask));

  

RP

^ permalink raw reply

* Re: [RFC PATCH v3 2/13] memory-hotplug : add physical memory hotplug code to acpi_memory_device_remove
From: Wen Congyang @ 2012-07-17  5:17 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: len.brown, linux-acpi, linux-kernel, linux-mm, paulus,
	minchan.kim, kosaki.motohiro, rientjes, cl, linuxppc-dev, akpm,
	liuj97
In-Reply-To: <5004EF68.5050708@jp.fujitsu.com>

At 07/17/2012 12:51 PM, Yasuaki Ishimatsu Wrote:
> Hi Wen,
> 
> 2012/07/17 12:32, Wen Congyang wrote:
>> At 07/17/2012 11:08 AM, Yasuaki Ishimatsu Wrote:
>>> Hi Wen,
>>>
>>> 2012/07/17 11:32, Wen Congyang wrote:
>>>> At 07/17/2012 09:54 AM, Yasuaki Ishimatsu Wrote:
>>>>> Hi Wen,
>>>>>
>>>>> 2012/07/17 10:44, Yasuaki Ishimatsu wrote:
>>>>>> Hi Wen,
>>>>>>
>>>>>> 2012/07/13 12:35, Wen Congyang wrote:
>>>>>>> At 07/09/2012 06:24 PM, Yasuaki Ishimatsu Wrote:
>>>>>>>> acpi_memory_device_remove() has been prepared to remove physical memory.
>>>>>>>> But, the function only frees acpi_memory_device currentlry.
>>>>>>>>
>>>>>>>> The patch adds following functions into acpi_memory_device_remove():
>>>>>>>>       - offline memory
>>>>>>>>       - remove physical memory (only return -EBUSY)
>>>>>>>>       - free acpi_memory_device
>>>>>>>>
>>>>>>>> CC: David Rientjes <rientjes@google.com>
>>>>>>>> CC: Jiang Liu <liuj97@gmail.com>
>>>>>>>> CC: Len Brown <len.brown@intel.com>
>>>>>>>> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>>>>>>> CC: Paul Mackerras <paulus@samba.org>
>>>>>>>> CC: Christoph Lameter <cl@linux.com>
>>>>>>>> Cc: Minchan Kim <minchan.kim@gmail.com>
>>>>>>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>>>>>>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>>>> CC: Wen Congyang <wency@cn.fujitsu.com>
>>>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>      drivers/acpi/acpi_memhotplug.c |   26 +++++++++++++++++++++++++-
>>>>>>>>      drivers/base/memory.c          |   39 +++++++++++++++++++++++++++++++++++++++
>>>>>>>>      include/linux/memory.h         |    5 +++++
>>>>>>>>      include/linux/memory_hotplug.h |    1 +
>>>>>>>>      mm/memory_hotplug.c            |    8 ++++++++
>>>>>>>>      5 files changed, 78 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> Index: linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c
>>>>>>>> ===================================================================
>>>>>>>> --- linux-3.5-rc6.orig/drivers/acpi/acpi_memhotplug.c	2012-07-09 18:08:29.946888653 +0900
>>>>>>>> +++ linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c	2012-07-09 18:08:43.470719531 +0900
>>>>>>>> @@ -29,6 +29,7 @@
>>>>>>>>      #include <linux/module.h>
>>>>>>>>      #include <linux/init.h>
>>>>>>>>      #include <linux/types.h>
>>>>>>>> +#include <linux/memory.h>
>>>>>>>>      #include <linux/memory_hotplug.h>
>>>>>>>>      #include <linux/slab.h>
>>>>>>>>      #include <acpi/acpi_drivers.h>
>>>>>>>> @@ -452,12 +453,35 @@ static int acpi_memory_device_add(struct
>>>>>>>>      static int acpi_memory_device_remove(struct acpi_device *device, int type)
>>>>>>>>      {
>>>>>>>>      	struct acpi_memory_device *mem_device = NULL;
>>>>>>>> -
>>>>>>>> +	struct acpi_memory_info *info, *tmp;
>>>>>>>> +	int result;
>>>>>>>> +	int node;
>>>>>>>>
>>>>>>>>      	if (!device || !acpi_driver_data(device))
>>>>>>>>      		return -EINVAL;
>>>>>>>>
>>>>>>>>      	mem_device = acpi_driver_data(device);
>>>>>>>> +
>>>>>>>> +	node = acpi_get_node(mem_device->device->handle);
>>>>>>>> +
>>>>>>>> +	list_for_each_entry_safe(info, tmp, &mem_device->res_list, list) {
>>>>>>>> +		if (!info->enabled)
>>>>>>>> +			continue;
>>>>>>>> +
>>>>>>>> +		if (!is_memblk_offline(info->start_addr, info->length)) {
>>>>>>>> +			result = offline_memory(info->start_addr, info->length);
>>>>>>>> +			if (result)
>>>>>>>> +				return result;
>>>>>>>> +		}
>>>>>>>> +
>>>>>>>> +		result = remove_memory(node, info->start_addr, info->length);
>>>>>>>
>>>>>>> The user may online the memory between offline_memory() and remove_memory().
>>>>>>> So I think we should lock memory hotplug before check the memory's status
>>>>>>> and release it after remove_memory().
>>>>>>
>>>>>> How about get "mem_block->state_mutex" of removed memory? When offlining
>>>>>> memory, we need to change "memory_block->state" into "MEM_OFFLINE".
>>>>>> In this case, we get mem_block->state_mutex. So I think the mutex lock
>>>>>> is beneficial.
>>>>>
>>>>> It is not good idea since remove_memory frees mem_block structure...
>>>>> Do you have any ideas?
>>>>
>>>> Hmm, split offline_memory() to 2 functions: offline_pages() and __offline_pages()
>>>>
>>>> offline_pages()
>>>> 	lock_memory_hotplug();
>>>> 	__offline_pages();
>>>> 	unlock_memory_hotplug();
>>>>
>>>> and implement remove_memory() like this:
>>>> remove_memory()
>>>> 	lock_memory_hotplug()
>>>> 	if (!is_memblk_offline()) {
>>>> 		__offline_pages();
>>>> 	}
>>>> 	// cleanup
>>>> 	unlock_memory_hotplug();
>>>>
>>>> What about this?
>>>
>>> I also thought about it once. But a problem remains. Current offilne_pages()
>>> cannot realize the memory has been removed by remove_memory(). So even if
>>> protecting the race by lock_memory_hotplug(), offline_pages() can offline
>>> the removed memory. offline_pages() should have the means to know the memory
>>> was removed. But I don't have good idea.
>>
>> We can not online/offline part of memory block, so what about this?
> 
> It seems you do not understand my concern.
> When memory_remove() and offline_pages() run to same memory simultaneously,
> offline_pages runs to removed memory.
> 
> memory_remove()              | offline_pages()
> -----------------------------------------------------------
> lock_memory_hotplug()        |
>                              | wait at lock_memory_hotplug()
> remove memory                |
> unlock_memory_hotplug()      |
>                              | wake up and start offline_pages()
> 			     | offline page
> 			     | => but the memory has already removed
> 			     |    by memory_remove()
> 
> In this case, offline_page() may access removed memory.

Yes, in this case, the kernel may panic.

I think we can call pfn_present() in online_pages()/offline_pages()
to check whether the memory is removed.

Thanks
Wen Congyang

> 
> Thanks,
> Yasuaki Ishimatsu
> 
>>
>> remove_memory()
>> 	lock_memory_hotplug()
>> 	for each memory block:
>> 		if (!is_memblk_offline()) {
>> 			__offline_pages();
>> 		}
>> 	// cleanup
>> 	unlock_memory_hotplug();
>>
>> Thanks
>> Wen Congyang
>>>
>>> Thanks,
>>> Yasuaki Ishimatsu
>>>
>>>>
>>>> Thanks
>>>> Wen Congyang
>>>>>
>>>>> Thanks,
>>>>> Yasuaki Ishimatsu
>>>>>
>>>>>> Thanks,
>>>>>> Yasuaki Ishimatsu
>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wen Congyang
>>>>>>>
>>>>>>>> +		if (result)
>>>>>>>> +			return result;
>>>>>>>> +
>>>>>>>> +		list_del(&info->list);
>>>>>>>> +		kfree(info);
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>>      	kfree(mem_device);
>>>>>>>>
>>>>>>>>      	return 0;
>>>>>>>> Index: linux-3.5-rc6/include/linux/memory_hotplug.h
>>>>>>>> ===================================================================
>>>>>>>> --- linux-3.5-rc6.orig/include/linux/memory_hotplug.h	2012-07-09 18:08:29.955888542 +0900
>>>>>>>> +++ linux-3.5-rc6/include/linux/memory_hotplug.h	2012-07-09 18:08:43.471719518 +0900
>>>>>>>> @@ -233,6 +233,7 @@ static inline int is_mem_section_removab
>>>>>>>>      extern int mem_online_node(int nid);
>>>>>>>>      extern int add_memory(int nid, u64 start, u64 size);
>>>>>>>>      extern int arch_add_memory(int nid, u64 start, u64 size);
>>>>>>>> +extern int remove_memory(int nid, u64 start, u64 size);
>>>>>>>>      extern int offline_memory(u64 start, u64 size);
>>>>>>>>      extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
>>>>>>>>      								int nr_pages);
>>>>>>>> Index: linux-3.5-rc6/mm/memory_hotplug.c
>>>>>>>> ===================================================================
>>>>>>>> --- linux-3.5-rc6.orig/mm/memory_hotplug.c	2012-07-09 18:08:29.953888567 +0900
>>>>>>>> +++ linux-3.5-rc6/mm/memory_hotplug.c	2012-07-09 18:08:43.476719455 +0900
>>>>>>>> @@ -659,6 +659,14 @@ out:
>>>>>>>>      }
>>>>>>>>      EXPORT_SYMBOL_GPL(add_memory);
>>>>>>>>
>>>>>>>> +int remove_memory(int nid, u64 start, u64 size)
>>>>>>>> +{
>>>>>>>> +	return -EBUSY;
>>>>>>>> +
>>>>>>>> +}
>>>>>>>> +EXPORT_SYMBOL_GPL(remove_memory);
>>>>>>>> +
>>>>>>>> +
>>>>>>>>      #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>>>>>      /*
>>>>>>>>       * A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
>>>>>>>> Index: linux-3.5-rc6/drivers/base/memory.c
>>>>>>>> ===================================================================
>>>>>>>> --- linux-3.5-rc6.orig/drivers/base/memory.c	2012-07-09 18:08:29.947888640 +0900
>>>>>>>> +++ linux-3.5-rc6/drivers/base/memory.c	2012-07-09 18:10:54.880076739 +0900
>>>>>>>> @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
>>>>>>>>      }
>>>>>>>>      EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>>>>>>>>
>>>>>>>> +bool is_memblk_offline(unsigned long start, unsigned long size)
>>>>>>>> +{
>>>>>>>> +	struct memory_block *mem = NULL;
>>>>>>>> +	struct mem_section *section;
>>>>>>>> +	unsigned long start_pfn, end_pfn;
>>>>>>>> +	unsigned long pfn, section_nr;
>>>>>>>> +
>>>>>>>> +	start_pfn = PFN_DOWN(start);
>>>>>>>> +	end_pfn = start_pfn + PFN_DOWN(start);
>>>>>>>> +
>>>>>>>> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>>>>>>> +		section_nr = pfn_to_section_nr(pfn);
>>>>>>>> +		if (!present_section_nr(section_nr));
>>>>>>>> +			continue;
>>>>>>>> +
>>>>>>>> +		section = __nr_to_section(section_nr);
>>>>>>>> +		/* same memblock? */
>>>>>>>> +		if (mem)
>>>>>>>> +			if((section_nr >= mem->start_section_nr) &&
>>>>>>>> +			   (section_nr <= mem->end_section_nr))
>>>>>>>> +				continue;
>>>>>>>> +
>>>>>>>> +		mem = find_memory_block_hinted(section, mem);
>>>>>>>> +		if (!mem)
>>>>>>>> +			continue;
>>>>>>>> +		if (mem->state == MEM_OFFLINE)
>>>>>>>> +			continue;
>>>>>>>> +
>>>>>>>> +		kobject_put(&mem->dev.kobj);
>>>>>>>> +		return false;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	if (mem)
>>>>>>>> +		kobject_put(&mem->dev.kobj);
>>>>>>>> +
>>>>>>>> +	return true;
>>>>>>>> +}
>>>>>>>> +EXPORT_SYMBOL(is_memblk_offline);
>>>>>>>> +
>>>>>>>>      /*
>>>>>>>>       * register_memory - Setup a sysfs device for a memory block
>>>>>>>>       */
>>>>>>>> Index: linux-3.5-rc6/include/linux/memory.h
>>>>>>>> ===================================================================
>>>>>>>> --- linux-3.5-rc6.orig/include/linux/memory.h	2012-07-08 09:23:56.000000000 +0900
>>>>>>>> +++ linux-3.5-rc6/include/linux/memory.h	2012-07-09 18:08:43.484719355 +0900
>>>>>>>> @@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
>>>>>>>>      {
>>>>>>>>      	return 0;
>>>>>>>>      }
>>>>>>>> +static inline bool is_memblk_offline(unsigned long start, unsigned long size)
>>>>>>>> +{
>>>>>>>> +	return false;
>>>>>>>> +}
>>>>>>>>      #else
>>>>>>>>      extern int register_memory_notifier(struct notifier_block *nb);
>>>>>>>>      extern void unregister_memory_notifier(struct notifier_block *nb);
>>>>>>>> @@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
>>>>>>>>      extern struct memory_block *find_memory_block_hinted(struct mem_section *,
>>>>>>>>      							struct memory_block *);
>>>>>>>>      extern struct memory_block *find_memory_block(struct mem_section *);
>>>>>>>> +extern bool is_memblk_offline(unsigned long start, unsigned long size);
>>>>>>>>      #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
>>>>>>>>      enum mem_add_context { BOOT, HOTPLUG };
>>>>>>>>      #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>>>>> see: http://www.linux-mm.org/ .
>>>>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply

* Re: [RFC PATCH v3 2/13] memory-hotplug : add physical memory hotplug code to acpi_memory_device_remove
From: Yasuaki Ishimatsu @ 2012-07-17  5:19 UTC (permalink / raw)
  To: Wen Congyang
  Cc: len.brown, linux-acpi, linux-kernel, linux-mm, paulus,
	minchan.kim, kosaki.motohiro, rientjes, cl, linuxppc-dev, akpm,
	liuj97
In-Reply-To: <5004F570.80508@cn.fujitsu.com>

Hi Wen,

2012/07/17 14:17, Wen Congyang wrote:
> At 07/17/2012 12:51 PM, Yasuaki Ishimatsu Wrote:
>> Hi Wen,
>>
>> 2012/07/17 12:32, Wen Congyang wrote:
>>> At 07/17/2012 11:08 AM, Yasuaki Ishimatsu Wrote:
>>>> Hi Wen,
>>>>
>>>> 2012/07/17 11:32, Wen Congyang wrote:
>>>>> At 07/17/2012 09:54 AM, Yasuaki Ishimatsu Wrote:
>>>>>> Hi Wen,
>>>>>>
>>>>>> 2012/07/17 10:44, Yasuaki Ishimatsu wrote:
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> 2012/07/13 12:35, Wen Congyang wrote:
>>>>>>>> At 07/09/2012 06:24 PM, Yasuaki Ishimatsu Wrote:
>>>>>>>>> acpi_memory_device_remove() has been prepared to remove physical memory.
>>>>>>>>> But, the function only frees acpi_memory_device currentlry.
>>>>>>>>>
>>>>>>>>> The patch adds following functions into acpi_memory_device_remove():
>>>>>>>>>        - offline memory
>>>>>>>>>        - remove physical memory (only return -EBUSY)
>>>>>>>>>        - free acpi_memory_device
>>>>>>>>>
>>>>>>>>> CC: David Rientjes <rientjes@google.com>
>>>>>>>>> CC: Jiang Liu <liuj97@gmail.com>
>>>>>>>>> CC: Len Brown <len.brown@intel.com>
>>>>>>>>> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>>>>>>>> CC: Paul Mackerras <paulus@samba.org>
>>>>>>>>> CC: Christoph Lameter <cl@linux.com>
>>>>>>>>> Cc: Minchan Kim <minchan.kim@gmail.com>
>>>>>>>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>>>>>>>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>>>>> CC: Wen Congyang <wency@cn.fujitsu.com>
>>>>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>       drivers/acpi/acpi_memhotplug.c |   26 +++++++++++++++++++++++++-
>>>>>>>>>       drivers/base/memory.c          |   39 +++++++++++++++++++++++++++++++++++++++
>>>>>>>>>       include/linux/memory.h         |    5 +++++
>>>>>>>>>       include/linux/memory_hotplug.h |    1 +
>>>>>>>>>       mm/memory_hotplug.c            |    8 ++++++++
>>>>>>>>>       5 files changed, 78 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> Index: linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c
>>>>>>>>> ===================================================================
>>>>>>>>> --- linux-3.5-rc6.orig/drivers/acpi/acpi_memhotplug.c	2012-07-09 18:08:29.946888653 +0900
>>>>>>>>> +++ linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c	2012-07-09 18:08:43.470719531 +0900
>>>>>>>>> @@ -29,6 +29,7 @@
>>>>>>>>>       #include <linux/module.h>
>>>>>>>>>       #include <linux/init.h>
>>>>>>>>>       #include <linux/types.h>
>>>>>>>>> +#include <linux/memory.h>
>>>>>>>>>       #include <linux/memory_hotplug.h>
>>>>>>>>>       #include <linux/slab.h>
>>>>>>>>>       #include <acpi/acpi_drivers.h>
>>>>>>>>> @@ -452,12 +453,35 @@ static int acpi_memory_device_add(struct
>>>>>>>>>       static int acpi_memory_device_remove(struct acpi_device *device, int type)
>>>>>>>>>       {
>>>>>>>>>       	struct acpi_memory_device *mem_device = NULL;
>>>>>>>>> -
>>>>>>>>> +	struct acpi_memory_info *info, *tmp;
>>>>>>>>> +	int result;
>>>>>>>>> +	int node;
>>>>>>>>>
>>>>>>>>>       	if (!device || !acpi_driver_data(device))
>>>>>>>>>       		return -EINVAL;
>>>>>>>>>
>>>>>>>>>       	mem_device = acpi_driver_data(device);
>>>>>>>>> +
>>>>>>>>> +	node = acpi_get_node(mem_device->device->handle);
>>>>>>>>> +
>>>>>>>>> +	list_for_each_entry_safe(info, tmp, &mem_device->res_list, list) {
>>>>>>>>> +		if (!info->enabled)
>>>>>>>>> +			continue;
>>>>>>>>> +
>>>>>>>>> +		if (!is_memblk_offline(info->start_addr, info->length)) {
>>>>>>>>> +			result = offline_memory(info->start_addr, info->length);
>>>>>>>>> +			if (result)
>>>>>>>>> +				return result;
>>>>>>>>> +		}
>>>>>>>>> +
>>>>>>>>> +		result = remove_memory(node, info->start_addr, info->length);
>>>>>>>>
>>>>>>>> The user may online the memory between offline_memory() and remove_memory().
>>>>>>>> So I think we should lock memory hotplug before check the memory's status
>>>>>>>> and release it after remove_memory().
>>>>>>>
>>>>>>> How about get "mem_block->state_mutex" of removed memory? When offlining
>>>>>>> memory, we need to change "memory_block->state" into "MEM_OFFLINE".
>>>>>>> In this case, we get mem_block->state_mutex. So I think the mutex lock
>>>>>>> is beneficial.
>>>>>>
>>>>>> It is not good idea since remove_memory frees mem_block structure...
>>>>>> Do you have any ideas?
>>>>>
>>>>> Hmm, split offline_memory() to 2 functions: offline_pages() and __offline_pages()
>>>>>
>>>>> offline_pages()
>>>>> 	lock_memory_hotplug();
>>>>> 	__offline_pages();
>>>>> 	unlock_memory_hotplug();
>>>>>
>>>>> and implement remove_memory() like this:
>>>>> remove_memory()
>>>>> 	lock_memory_hotplug()
>>>>> 	if (!is_memblk_offline()) {
>>>>> 		__offline_pages();
>>>>> 	}
>>>>> 	// cleanup
>>>>> 	unlock_memory_hotplug();
>>>>>
>>>>> What about this?
>>>>
>>>> I also thought about it once. But a problem remains. Current offilne_pages()
>>>> cannot realize the memory has been removed by remove_memory(). So even if
>>>> protecting the race by lock_memory_hotplug(), offline_pages() can offline
>>>> the removed memory. offline_pages() should have the means to know the memory
>>>> was removed. But I don't have good idea.
>>>
>>> We can not online/offline part of memory block, so what about this?
>>
>> It seems you do not understand my concern.
>> When memory_remove() and offline_pages() run to same memory simultaneously,
>> offline_pages runs to removed memory.
>>
>> memory_remove()              | offline_pages()
>> -----------------------------------------------------------
>> lock_memory_hotplug()        |
>>                               | wait at lock_memory_hotplug()
>> remove memory                |
>> unlock_memory_hotplug()      |
>>                               | wake up and start offline_pages()
>> 			     | offline page
>> 			     | => but the memory has already removed
>> 			     |    by memory_remove()
>>
>> In this case, offline_page() may access removed memory.
> 
> Yes, in this case, the kernel may panic.
> 
> I think we can call pfn_present() in online_pages()/offline_pages()
> to check whether the memory is removed.

Thank you for good idea. I'll add it.

Thanks,
Yasuaki Ishimatsu

> Thanks
> Wen Congyang
> 
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>>>
>>> remove_memory()
>>> 	lock_memory_hotplug()
>>> 	for each memory block:
>>> 		if (!is_memblk_offline()) {
>>> 			__offline_pages();
>>> 		}
>>> 	// cleanup
>>> 	unlock_memory_hotplug();
>>>
>>> Thanks
>>> Wen Congyang
>>>>
>>>> Thanks,
>>>> Yasuaki Ishimatsu
>>>>
>>>>>
>>>>> Thanks
>>>>> Wen Congyang
>>>>>>
>>>>>> Thanks,
>>>>>> Yasuaki Ishimatsu
>>>>>>
>>>>>>> Thanks,
>>>>>>> Yasuaki Ishimatsu
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wen Congyang
>>>>>>>>
>>>>>>>>> +		if (result)
>>>>>>>>> +			return result;
>>>>>>>>> +
>>>>>>>>> +		list_del(&info->list);
>>>>>>>>> +		kfree(info);
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>>       	kfree(mem_device);
>>>>>>>>>
>>>>>>>>>       	return 0;
>>>>>>>>> Index: linux-3.5-rc6/include/linux/memory_hotplug.h
>>>>>>>>> ===================================================================
>>>>>>>>> --- linux-3.5-rc6.orig/include/linux/memory_hotplug.h	2012-07-09 18:08:29.955888542 +0900
>>>>>>>>> +++ linux-3.5-rc6/include/linux/memory_hotplug.h	2012-07-09 18:08:43.471719518 +0900
>>>>>>>>> @@ -233,6 +233,7 @@ static inline int is_mem_section_removab
>>>>>>>>>       extern int mem_online_node(int nid);
>>>>>>>>>       extern int add_memory(int nid, u64 start, u64 size);
>>>>>>>>>       extern int arch_add_memory(int nid, u64 start, u64 size);
>>>>>>>>> +extern int remove_memory(int nid, u64 start, u64 size);
>>>>>>>>>       extern int offline_memory(u64 start, u64 size);
>>>>>>>>>       extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
>>>>>>>>>       								int nr_pages);
>>>>>>>>> Index: linux-3.5-rc6/mm/memory_hotplug.c
>>>>>>>>> ===================================================================
>>>>>>>>> --- linux-3.5-rc6.orig/mm/memory_hotplug.c	2012-07-09 18:08:29.953888567 +0900
>>>>>>>>> +++ linux-3.5-rc6/mm/memory_hotplug.c	2012-07-09 18:08:43.476719455 +0900
>>>>>>>>> @@ -659,6 +659,14 @@ out:
>>>>>>>>>       }
>>>>>>>>>       EXPORT_SYMBOL_GPL(add_memory);
>>>>>>>>>
>>>>>>>>> +int remove_memory(int nid, u64 start, u64 size)
>>>>>>>>> +{
>>>>>>>>> +	return -EBUSY;
>>>>>>>>> +
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(remove_memory);
>>>>>>>>> +
>>>>>>>>> +
>>>>>>>>>       #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>>>>>>       /*
>>>>>>>>>        * A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
>>>>>>>>> Index: linux-3.5-rc6/drivers/base/memory.c
>>>>>>>>> ===================================================================
>>>>>>>>> --- linux-3.5-rc6.orig/drivers/base/memory.c	2012-07-09 18:08:29.947888640 +0900
>>>>>>>>> +++ linux-3.5-rc6/drivers/base/memory.c	2012-07-09 18:10:54.880076739 +0900
>>>>>>>>> @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
>>>>>>>>>       }
>>>>>>>>>       EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>>>>>>>>>
>>>>>>>>> +bool is_memblk_offline(unsigned long start, unsigned long size)
>>>>>>>>> +{
>>>>>>>>> +	struct memory_block *mem = NULL;
>>>>>>>>> +	struct mem_section *section;
>>>>>>>>> +	unsigned long start_pfn, end_pfn;
>>>>>>>>> +	unsigned long pfn, section_nr;
>>>>>>>>> +
>>>>>>>>> +	start_pfn = PFN_DOWN(start);
>>>>>>>>> +	end_pfn = start_pfn + PFN_DOWN(start);
>>>>>>>>> +
>>>>>>>>> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>>>>>>>> +		section_nr = pfn_to_section_nr(pfn);
>>>>>>>>> +		if (!present_section_nr(section_nr));
>>>>>>>>> +			continue;
>>>>>>>>> +
>>>>>>>>> +		section = __nr_to_section(section_nr);
>>>>>>>>> +		/* same memblock? */
>>>>>>>>> +		if (mem)
>>>>>>>>> +			if((section_nr >= mem->start_section_nr) &&
>>>>>>>>> +			   (section_nr <= mem->end_section_nr))
>>>>>>>>> +				continue;
>>>>>>>>> +
>>>>>>>>> +		mem = find_memory_block_hinted(section, mem);
>>>>>>>>> +		if (!mem)
>>>>>>>>> +			continue;
>>>>>>>>> +		if (mem->state == MEM_OFFLINE)
>>>>>>>>> +			continue;
>>>>>>>>> +
>>>>>>>>> +		kobject_put(&mem->dev.kobj);
>>>>>>>>> +		return false;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	if (mem)
>>>>>>>>> +		kobject_put(&mem->dev.kobj);
>>>>>>>>> +
>>>>>>>>> +	return true;
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL(is_memblk_offline);
>>>>>>>>> +
>>>>>>>>>       /*
>>>>>>>>>        * register_memory - Setup a sysfs device for a memory block
>>>>>>>>>        */
>>>>>>>>> Index: linux-3.5-rc6/include/linux/memory.h
>>>>>>>>> ===================================================================
>>>>>>>>> --- linux-3.5-rc6.orig/include/linux/memory.h	2012-07-08 09:23:56.000000000 +0900
>>>>>>>>> +++ linux-3.5-rc6/include/linux/memory.h	2012-07-09 18:08:43.484719355 +0900
>>>>>>>>> @@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
>>>>>>>>>       {
>>>>>>>>>       	return 0;
>>>>>>>>>       }
>>>>>>>>> +static inline bool is_memblk_offline(unsigned long start, unsigned long size)
>>>>>>>>> +{
>>>>>>>>> +	return false;
>>>>>>>>> +}
>>>>>>>>>       #else
>>>>>>>>>       extern int register_memory_notifier(struct notifier_block *nb);
>>>>>>>>>       extern void unregister_memory_notifier(struct notifier_block *nb);
>>>>>>>>> @@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
>>>>>>>>>       extern struct memory_block *find_memory_block_hinted(struct mem_section *,
>>>>>>>>>       							struct memory_block *);
>>>>>>>>>       extern struct memory_block *find_memory_block(struct mem_section *);
>>>>>>>>> +extern bool is_memblk_offline(unsigned long start, unsigned long size);
>>>>>>>>>       #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
>>>>>>>>>       enum mem_add_context { BOOT, HOTPLUG };
>>>>>>>>>       #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>>>>>> see: http://www.linux-mm.org/ .
>>>>>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
> 

^ permalink raw reply

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Ram Pai @ 2012-07-17  5:23 UTC (permalink / raw)
  To: Ram Pai; +Cc: Gavin Shan, linux-pci, linuxppc-dev, bhelgaas, yinghai
In-Reply-To: <20120717050547.GD2369@ram-ThinkPad-T61>

On Tue, Jul 17, 2012 at 01:05:47PM +0800, Ram Pai wrote:
> On Tue, Jul 17, 2012 at 10:23:17AM +0800, Gavin Shan wrote:
> > The patch changes function pbus_size_io() and pbus_size_mem() to
> > do resource (I/O, memory and prefetchable memory) reassignment
> > based on the minimal alignments for the p2p bridge, which was
> > retrieved by function window_alignment().
> > 
> > Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> > ---
> >  drivers/pci/setup-bus.c |   13 +++++++++----
> >  1 files changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> > index c0fb9da..a29483a 100644
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -731,6 +731,7 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
> >  	struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
> >  	unsigned long size = 0, size0 = 0, size1 = 0;
> >  	resource_size_t children_add_size = 0;
> > +	resource_size_t io_align;
> > 
> >  	if (!b_res)
> >   		return;
> > @@ -756,13 +757,15 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
> >  				children_add_size += get_res_add_size(realloc_head, r);
> >  		}
> >  	}
> > +
> > +	io_align = window_alignment(bus, IORESOURCE_IO);
> 
> this should also be
> 	io_align = max(4096, window_alignment(bus, IORESOURCE_IO));
> 
> right?
> 
> 
> >  	size0 = calculate_iosize(size, min_size, size1,
> > -			resource_size(b_res), 4096);
> > +			resource_size(b_res), io_align);
> >  	if (children_add_size > add_size)
> >  		add_size = children_add_size;
> >  	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
> >  		calculate_iosize(size, min_size, add_size + size1,
> > -			resource_size(b_res), 4096);
> > +			resource_size(b_res), io_align);
> >  	if (!size0 && !size1) {
> >  		if (b_res->start || b_res->end)
> >  			dev_info(&bus->self->dev, "disabling bridge window "
> > @@ -772,11 +775,11 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
> >  		return;
> >  	}
> >  	/* Alignment of the IO window is always 4K */
> > -	b_res->start = 4096;
> > +	b_res->start = io_align;
> >  	b_res->end = b_res->start + size0 - 1;
> >  	b_res->flags |= IORESOURCE_STARTALIGN;
> >  	if (size1 > size0 && realloc_head) {
> > -		add_to_list(realloc_head, bus->self, b_res, size1-size0, 4096);
> > +		add_to_list(realloc_head, bus->self, b_res, size1-size0, io_align);
> >  		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window "
> >  				 "%pR to [bus %02x-%02x] add_size %lx\n", b_res,
> >  				 bus->secondary, bus->subordinate, size1-size0);
> > @@ -875,6 +878,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> >  			min_align = align1 >> 1;
> >  		align += aligns[order];
> >  	}
> > +
> > +	min_align = max(min_align, window_alignment(bus, type));
> 
>   'type' can sometimes be (IORESOURCE_MEM | IORESOURCE_PREFETCH), which
>   can lead to unpredictable results depending on how window_alignment()
>   is implemented... Hence to be on the safer side I suggest
> 
> 	min_align = max(min_align, window_alignment(bus, b_res->flags & mask));

While you are at it, can we move the min_align calculation into
a separate function?

Somewhat around the following patch


diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 8fa2d4b..426c8ad6 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -762,6 +762,25 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
 	}
 }
 
+static inline calculate_min_align(resource_size_t aligns *aligns, int max_order)
+{
+	resource_size_t align = 0;
+	resource_size_t min_align = 0;
+	int order;
+	for (order = 0; order <= max_order; order++) {
+		resource_size_t align1 = 1;
+
+		align1 <<= (order + 20);
+
+		if (!align)
+			min_align = align1;
+		else if (ALIGN(align + min_align, min_align) < align1)
+			min_align = align1 >> 1;
+		align += aligns[order];
+	}
+	return min_align;
+}
+
 /**
  * pbus_size_mem() - size the memory window of a given bus
  *
@@ -841,19 +860,9 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 				children_add_size += get_res_add_size(realloc_head, r);
 		}
 	}
-	align = 0;
-	min_align = 0;
-	for (order = 0; order <= max_order; order++) {
-		resource_size_t align1 = 1;
 
-		align1 <<= (order + 20);
+	min_align = calculate_min_align(aligns, max_order);
 
-		if (!align)
-			min_align = align1;
-		else if (ALIGN(align + min_align, min_align) < align1)
-			min_align = align1 >> 1;
-		align += aligns[order];
-	}
 	size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
 	if (children_add_size > add_size)
 		add_size = children_add_size;
@@ -880,6 +889,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	return 1;
 }

 RP

^ permalink raw reply related

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Ram Pai @ 2012-07-17  5:57 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-pci, Ram Pai, linuxppc-dev, bhelgaas, yinghai
In-Reply-To: <20120717053648.GA18497@shangw>

On Tue, Jul 17, 2012 at 01:36:49PM +0800, Gavin Shan wrote:
> On Tue, Jul 17, 2012 at 01:23:33PM +0800, Ram Pai wrote:
> >On Tue, Jul 17, 2012 at 01:05:47PM +0800, Ram Pai wrote:
> >> On Tue, Jul 17, 2012 at 10:23:17AM +0800, Gavin Shan wrote:
> >> > The patch changes function pbus_size_io() and pbus_size_mem() to
> >> > do resource (I/O, memory and prefetchable memory) reassignment
> >> > based on the minimal alignments for the p2p bridge, which was
> >> > retrieved by function window_alignment().
> >> > 
> >> > Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> 
> [snip]
> 
> >> > @@ -772,11 +775,11 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
> >> >  		return;
> >> >  	}
> >> >  	/* Alignment of the IO window is always 4K */
> >> > -	b_res->start = 4096;
> >> > +	b_res->start = io_align;
> >> >  	b_res->end = b_res->start + size0 - 1;
> >> >  	b_res->flags |= IORESOURCE_STARTALIGN;
> >> >  	if (size1 > size0 && realloc_head) {
> >> > -		add_to_list(realloc_head, bus->self, b_res, size1-size0, 4096);
> >> > +		add_to_list(realloc_head, bus->self, b_res, size1-size0, io_align);
> >> >  		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window "
> >> >  				 "%pR to [bus %02x-%02x] add_size %lx\n", b_res,
> >> >  				 bus->secondary, bus->subordinate, size1-size0);
> >> > @@ -875,6 +878,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> >> >  			min_align = align1 >> 1;
> >> >  		align += aligns[order];
> >> >  	}
> >> > +
> >> > +	min_align = max(min_align, window_alignment(bus, type));
> >> 
> >>   'type' can sometimes be (IORESOURCE_MEM | IORESOURCE_PREFETCH), which
> >>   can lead to unpredictable results depending on how window_alignment()
> >>   is implemented... Hence to be on the safer side I suggest
> >> 
> >> 	min_align = max(min_align, window_alignment(bus, b_res->flags & mask));
> 
> Sorry, Ram. I didn't see your concern in last reply. So I have to
> cover your conver in this reply.
> 
> I think it'd better to pass "type" directly because platform (e.g. powernv)
> expects both IORESOURCE_MEM as well as IORESOURCE_PREFETCH. 
> In future, powernv platform will return M32 segment size for IORESOURCE_MEM, but
> might return M64 segment size for (IORESOURCE_MEM | IORESOURCE_PREFETCH).

Hmm.. this code is not about determining what kind of segment the
platform is returning. This code is about using the right alignment
constraints for the type of segment from which resource will be
allocated. right?

b_res is the resource that is being sized. b_res already knows
what kind of resource it is, i.e IORESOURCE_MEM or IORESOURCE_PREFETCH.
Hence we should be exactly using the same alignment constraints as
that dictated by the type of b_res. no?

RP

^ permalink raw reply

* RE: [PATCH] powerpc/85xx: workaround for chips with MSI hareware errata to support MSI-X
From: Jia Hongtao-B38951 @ 2012-07-17  6:18 UTC (permalink / raw)
  To: Wood Scott-B07421; +Cc: linuxppc-dev@lists.ozlabs.org, soniccat.liu@gmail.com
In-Reply-To: <5004A1B0.6070004@freescale.com>

DQo+IC0tLS0tT3JpZ2luYWwgTWVzc2FnZS0tLS0tDQo+IEZyb206IFdvb2QgU2NvdHQtQjA3NDIx
DQo+IFNlbnQ6IFR1ZXNkYXksIEp1bHkgMTcsIDIwMTIgNzoyMCBBTQ0KPiBUbzogSmlhIEhvbmd0
YW8tQjM4OTUxDQo+IENjOiBsaW51eHBwYy1kZXZAbGlzdHMub3psYWJzLm9yZzsgZ2FsYWtAa2Vy
bmVsLmNyYXNoaW5nLm9yZzsNCj4gc29uaWNjYXQubGl1QGdtYWlsLmNvbQ0KPiBTdWJqZWN0OiBS
ZTogW1BBVENIXSBwb3dlcnBjLzg1eHg6IHdvcmthcm91bmQgZm9yIGNoaXBzIHdpdGggTVNJIGhh
cmV3YXJlDQo+IGVycmF0YSB0byBzdXBwb3J0IE1TSS1YDQo+ID4gZGlmZiAtLWdpdCBhL2FyY2gv
cG93ZXJwYy9zeXNkZXYvZnNsX21zaS5oDQo+IGIvYXJjaC9wb3dlcnBjL3N5c2Rldi9mc2xfbXNp
LmgNCj4gPiBpbmRleCA4MjI1Zjg2Li4zNTRkNTQ2IDEwMDY0NA0KPiA+IC0tLSBhL2FyY2gvcG93
ZXJwYy9zeXNkZXYvZnNsX21zaS5oDQo+ID4gKysrIGIvYXJjaC9wb3dlcnBjL3N5c2Rldi9mc2xf
bXNpLmgNCj4gPiBAQCAtMjUsNiArMjUsOSBAQA0KPiA+ICAjZGVmaW5lIEZTTF9QSUNfSVBfSVBJ
QyAgIDB4MDAwMDAwMDINCj4gPiAgI2RlZmluZSBGU0xfUElDX0lQX1ZNUElDICAweDAwMDAwMDAz
DQo+ID4NCj4gPiArI2RlZmluZSBNU0lfSFdfRVJSQVRBX01BU0sgICAweDAwMDAwMEYwDQo+ID4g
KyNkZWZpbmUgTVNJX0hXX0VSUkFUQV9FTkRJQU4gMHgwMDAwMDAxMA0KPiANCj4gV2h5IGRvIHdl
IG5lZWQgYSBtYXNrIGZvciB0aGlzPw0KPiANCg0KSXQgc2VlbXMgdGhpcyBtYXNrIGlzIG5vdCBu
ZWVkZWQuDQpJIHdpbGwgZml4IHRoaXMgaW4gdGhlIG5leHQgdmVyc2lvbi4NCg0KVGhhbmtzLg0K
LUhvbmd0YW8uDQo=

^ permalink raw reply

* Re: [linuxppc-release] [PATCH v3 4/4] fsl-dma: use spin_lock_bh to instead of spin_lock_irqsave
From: Li Yang @ 2012-07-17  6:48 UTC (permalink / raw)
  To: Liu Qiang-B32616
  Cc: Li Yang-R58472, Vinod Koul, Tabi Timur-B04825,
	herbert@gondor.hengli.com.au, linux-crypto@vger.kernel.org,
	Dan Williams, linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <BCB48C05FCE8BC4D9E61E841ECBE6DB70C2788@039-SN2MPN1-011.039d.mgd.msft.net>

On Tue, Jul 17, 2012 at 12:23 PM, Liu Qiang-B32616 <B32616@freescale.com> wrote:
>> -----Original Message-----
>> From: Tabi Timur-B04825
>> Sent: Monday, July 16, 2012 10:25 PM
>> To: Liu Qiang-B32616
>> Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Vinod
>> Koul; herbert@gondor.hengli.com.au; Dan Williams; Li Yang-R58472;
>> davem@davemloft.net
>> Subject: Re: [linuxppc-release] [PATCH v3 4/4] fsl-dma: use spin_lock_bh
>> to instead of spin_lock_irqsave
>>
>> Qiang Liu wrote:
>> > Use spin_lock_bh to instead of spin_lock_irqsave for improving
>> performance.
>>
>> You forgot to include the evidence that performance has improved, as well
>> as an explanation why it's okay to use spin_lock_bh, and why it's faster.
>>   I told you to respin the patch with that information in the patch
>> description.
> I attached the test result in v3 0/4, performance is improved by 2%.
> For my understanding, there is not any place to access descriptor lists
> in fsl-dma interrupt service handler except its tasklet, spin_lock_bh()
> is born for this. Interrupts will be turned off and context will be save in
> irqsave, there is needless to use irqsave in our case. You can refer to the
> implement of mv_xor.c or ioap-adma.c.
> If you think my explanation is ok, I can add it in the patch.

Disabling bottom half is not faster than disabling the interrupt.  But
by using it we reduce the time when interrupt is disabled.  We can get
better real time performance in term of interrupt handling latency.

Leo

^ permalink raw reply

* RE: [PATCH v3 3/4] fsl-dma: change release process of dma descriptor for supporting async_tx
From: Liu Qiang-B32616 @ 2012-07-17  7:06 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Li Yang-R58472, Vinod Koul, Phillips Kim-R1AAHA,
	herbert@gondor.hengli.com.au, linux-crypto@vger.kernel.org,
	Dan Williams, linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <20120716200124.GD1227@ovro.caltech.edu>

> -----Original Message-----
> From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> Sent: Tuesday, July 17, 2012 4:01 AM
> To: Liu Qiang-B32616
> Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Phillips
> Kim-R1AAHA; herbert@gondor.hengli.com.au; davem@davemloft.net; Dan
> Williams; Vinod Koul; Li Yang-R58472
> Subject: Re: [PATCH v3 3/4] fsl-dma: change release process of dma
> descriptor for supporting async_tx
>=20
> On Mon, Jul 16, 2012 at 12:08:29PM +0800, Qiang Liu wrote:
> > Fix the potential risk when enable config NET_DMA and ASYNC_TX.
> > Async_tx is lack of support in current release process of dma
> > descriptor, all descriptors will be released whatever is acked or
> > no-acked by async_tx, so there is a potential race condition when dma
> > engine is uesd by others clients (e.g. when enable NET_DMA to offload
> TCP).
> >
> > In our case, a race condition which is raised when use both of talitos
> > and dmaengine to offload xor is because napi scheduler will sync all
> > pending requests in dma channels, it affects the process of raid
> > operations due to ack_tx is not checked in fsl dma. The no-acked
> > descriptor is freed which is submitted just now, as a dependent tx,
> > this freed descriptor trigger
> > BUG_ON(async_tx_test_ack(depend_tx)) in async_tx_submit().
> >
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Vinod Koul <vinod.koul@intel.com>
> > Cc: Li Yang <leoli@freescale.com>
> > Cc: Ira W. Snyder <iws@ovro.caltech.edu>
> > Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
> > ---
> >  drivers/dma/fsldma.c |  378 +++++++++++++++++++++++++++++-------------
> -------
> >  drivers/dma/fsldma.h |    1 +
> >  2 files changed, 225 insertions(+), 154 deletions(-)
> >
> > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index
> > 4f2f212..4ee1b8f 100644
> > --- a/drivers/dma/fsldma.c
> > +++ b/drivers/dma/fsldma.c
> > @@ -400,6 +400,217 @@ out_splice:
> >  	list_splice_tail_init(&desc->tx_list, &chan->ld_pending);  }
> >
> > +/**
> > + * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > + * @chan : Freescale DMA channel
> > + *
> > + * HARDWARE STATE: idle
> > + * LOCKING: must hold chan->desc_lock  */ static void
> > +fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) {
> > +	struct fsl_desc_sw *desc;
> > +
> > +	/*
> > +	 * If the list of pending descriptors is empty, then we
> > +	 * don't need to do any work at all
> > +	 */
> > +	if (list_empty(&chan->ld_pending)) {
> > +		chan_dbg(chan, "no pending LDs\n");
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * The DMA controller is not idle, which means that the interrupt
> > +	 * handler will start any queued transactions when it runs after
> > +	 * this transaction finishes
> > +	 */
> > +	if (!chan->idle) {
> > +		chan_dbg(chan, "DMA controller still busy\n");
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * If there are some link descriptors which have not been
> > +	 * transferred, we need to start the controller
> > +	 */
> > +
> > +	/*
> > +	 * Move all elements from the queue of pending transactions
> > +	 * onto the list of running transactions
> > +	 */
> > +	chan_dbg(chan, "idle, starting controller\n");
> > +	desc =3D list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> node);
> > +	list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > +
> > +	/*
> > +	 * The 85xx DMA controller doesn't clear the channel start bit
> > +	 * automatically at the end of a transfer. Therefore we must clear
> > +	 * it in software before starting the transfer.
> > +	 */
> > +	if ((chan->feature & FSL_DMA_IP_MASK) =3D=3D FSL_DMA_IP_85XX) {
> > +		u32 mode;
> > +
> > +		mode =3D DMA_IN(chan, &chan->regs->mr, 32);
> > +		mode &=3D ~FSL_DMA_MR_CS;
> > +		DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > +	}
> > +
> > +	/*
> > +	 * Program the descriptor's address into the DMA controller,
> > +	 * then start the DMA transaction
> > +	 */
> > +	set_cdar(chan, desc->async_tx.phys);
> > +	get_cdar(chan);
> > +
> > +	dma_start(chan);
> > +	chan->idle =3D false;
> > +}
> > +
>=20
> Please add a note about the locking requirements here, and for the other
> new functions you've added.
>=20
> You call this function from two places:
>=20
> 1) fsldma_cleanup_descriptor() - called with mod->desc_lock held
> 2) fsl_tx_status() - WITHOUT mod->desc_lock held
>=20
> One of them is definitely wrong, and I'd bet that it is #2. You're
> modifying ld_completed without a lock.
Yes, My bad, I will correct it.

>=20
> > +static int
> > +fsldma_clean_completed_descriptor(struct fsldma_chan *chan) {
> > +	struct fsl_desc_sw *desc, *_desc;
> > +
> > +	/* Run the callback for each descriptor, in order */
> > +	list_for_each_entry_safe(desc, _desc, &chan->ld_completed, node) {
> > +
> > +		if (async_tx_test_ack(&desc->async_tx)) {
> > +			/* Remove from the list of transactions */
> > +			list_del(&desc->node);
> > +#ifdef FSL_DMA_LD_DEBUG
> > +			chan_dbg(chan, "LD %p free\n", desc); #endif
> > +			dma_pool_free(chan->desc_pool, desc,
> > +					desc->async_tx.phys);
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * fsldma_run_tx_complete_actions - cleanup and free a single link
> > +descriptor
> > + * @chan: Freescale DMA channel
> > + * @desc: descriptor to cleanup and free
> > + * @cookie: Freescale DMA transaction identifier
> > + *
> > + * This function is used on a descriptor which has been executed by
> > +the DMA
> > + * controller. It will run any callbacks, submit any dependencies,
> > +and then
> > + * free the descriptor.
> > + */
> > +static dma_cookie_t fsldma_run_tx_complete_actions(struct fsl_desc_sw
> *desc,
> > +		struct fsldma_chan *chan, dma_cookie_t cookie) {
> > +	struct dma_async_tx_descriptor *txd =3D &desc->async_tx;
> > +	struct device *dev =3D chan->common.device->dev;
> > +	dma_addr_t src =3D get_desc_src(chan, desc);
> > +	dma_addr_t dst =3D get_desc_dst(chan, desc);
> > +	u32 len =3D get_desc_cnt(chan, desc);
> > +
> > +	BUG_ON(txd->cookie < 0);
> > +
> > +	if (txd->cookie > 0) {
> > +		cookie =3D txd->cookie;
> > +
> > +		/* Run the link descriptor callback function */
> > +		if (txd->callback) {
> > +#ifdef FSL_DMA_LD_DEBUG
> > +			chan_dbg(chan, "LD %p callback\n", desc); #endif
> > +			txd->callback(txd->callback_param);
> > +		}
> > +
> > +		/* Unmap the dst buffer, if requested */
> > +		if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > +			if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > +				dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);
> > +			else
> > +				dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
> > +		}
> > +
> > +		/* Unmap the src buffer, if requested */
> > +		if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > +			if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > +				dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
> > +			else
> > +				dma_unmap_page(dev, src, len, DMA_TO_DEVICE);
> > +		}
> > +	}
> > +
> > +	/* Run any dependencies */
> > +	dma_run_dependencies(txd);
> > +
> > +	return cookie;
> > +}
> > +
> > +static int
> > +fsldma_clean_running_descriptor(struct fsl_desc_sw *desc,
> > +			struct fsldma_chan *chan)
> > +{
> > +	/* Remove from the list of transactions */
> > +	list_del(&desc->node);
> > +	/* the client is allowed to attach dependent operations
> > +	 * until 'ack' is set
> > +	 */
>=20
> This comment is does not follow the coding style. It should be:
>=20
> /*
>  * the client is allowed to attech dependent operations
>  * until 'ack' is set
>  */
Fine, I will correct it in v4. Include another 2 places related to coding s=
tyle.
Thanks.

>=20
> > +	if (!async_tx_test_ack(&desc->async_tx)) {
> > +		/* move this slot to the completed */
>=20
> Perhaps a better comment would be:
>=20
> Move this descriptor to the list of descriptors which is complete, but
> still awaiting the 'ack' bit to be set.
Accept.

>=20
> > +		list_add_tail(&desc->node, &chan->ld_completed);
> > +		return 0;
> > +	}
> > +
> > +	dma_pool_free(chan->desc_pool, desc,
> > +			desc->async_tx.phys);
>=20
> This should all be on one line. It is less than 80 characters wide.
>=20
Accept.

> > +	return 0;
> > +}
> > +
>=20
> Locking notes please.
I will add it in v4, please check. Thanks.

>=20
> > +static void fsldma_cleanup_descriptor(struct fsldma_chan *chan) {
> > +	struct fsl_desc_sw *desc, *_desc;
> > +	dma_cookie_t cookie =3D 0;
> > +	dma_addr_t curr_phys =3D get_cdar(chan);
> > +	int idle =3D dma_is_idle(chan);
> > +	int seen_current =3D 0;
> > +
> > +	fsldma_clean_completed_descriptor(chan);
> > +
> > +	/* Run the callback for each descriptor, in order */
> > +	list_for_each_entry_safe(desc, _desc, &chan->ld_running, node) {
> > +		/* do not advance past the current descriptor loaded into the
> > +		 * hardware channel, subsequent descriptors are either in
> > +		 * process or have not been submitted
> > +		 */
>=20
> Coding style.
>=20
> > +		if (seen_current)
> > +			break;
> > +
> > +		/* stop the search if we reach the current descriptor and the
> > +		 * channel is busy
> > +		 */
>=20
> Coding style.
>=20
> > +		if (desc->async_tx.phys =3D=3D curr_phys) {
> > +			seen_current =3D 1;
> > +			if (!idle)
> > +				break;
> > +		}
> > +
>=20
> This trick where you try to look at the hardware state to determine how
> much to clean up has been a source of headaches in the past versions of
> this driver.
I know a little about the history of this driver. Could you tell me what ki=
nd
headaches in the past versions, I can have a test and fix it in my patch. A=
nd
I also think use current phys addr should be more explicit.
Thanks.

>=20
> It is much easier to reason about the state of the hardware and avoid
> race conditions if you complete entire blocks of descriptors after the
> hardware interrupts you to tell you it is finished.
In current driver, it's ok, but there is a potential issue when enable
async_tx api. How can you determine these descriptors in ld_running whether
have been completed in fsl_tx_status()? (I mean in my patch.)

>=20
> This is the philosophy employed by the driver before your modifications:
> ld_pending: a block of descriptors which is ready to run as soon as the
> hardware becomes idle.
> ld_running: a block of descriptors which is being executed by the
> hardware.
These 2 lists are not enough, async_tx descriptors must be kept order as
expected of its submitter, we only can process the descriptor which has bee=
n
acked by async_tx api, but not free all descriptors in ld_running.
For example,
Step 1, in async_tx memcpy api,
tx2 =3D device->device_prep_dma_memcpy(...),
tx2->depend_tx =3D tx1;
step 2, in async_tx submit api,
async_tx_submit() will check tx2->depend_tx whether has been acked,
unfortunately, tx2->depend_tx is freed before step 2 (dma channels is flush=
ed
by other drivers), the contents of tx2->depend_tx is set to POOL_POISON_FRE=
ED
if DMAPOOL_DEBUG is enabled (illegal pointer will be used if disable this c=
onfig
option);
step 3, BUG_ON(async_tx_test_ack(depend_tx)) is triggered.

For resolve this potential risk, first, ack flag must be checked in dma dri=
ver,
second, ld_completed is added to restore all descriptors which have been ac=
ked
and completed.
In my following answer, most of all are around this idea.

>=20
> > +		cookie =3D fsldma_run_tx_complete_actions(desc, chan, cookie);
> > +
> > +		if (fsldma_clean_running_descriptor(desc, chan))
> > +			break;
> > +
> > +	}
> > +
> > +	/*
> > +	 * Start any pending transactions automatically
> > +	 *
> > +	 * In the ideal case, we keep the DMA controller busy while we go
> > +	 * ahead and free the descriptors below.
> > +	 */
> > +	fsl_chan_xfer_ld_queue(chan);
> > +
> > +	if (cookie > 0)
> > +		chan->common.completed_cookie =3D cookie; }
> > +
> >  static dma_cookie_t fsl_dma_tx_submit(struct dma_async_tx_descriptor
> > *tx)  {
> >  	struct fsldma_chan *chan =3D to_fsl_chan(tx->chan); @@ -534,8 +745,10
> > @@ static void fsl_dma_free_chan_resources(struct dma_chan *dchan)
> >
> >  	chan_dbg(chan, "free all channel resources\n");
> >  	spin_lock_irqsave(&chan->desc_lock, flags);
> > +	fsldma_cleanup_descriptor(chan);
> >  	fsldma_free_desc_list(chan, &chan->ld_pending);
> >  	fsldma_free_desc_list(chan, &chan->ld_running);
> > +	fsldma_free_desc_list(chan, &chan->ld_completed);
> >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> >
> >  	dma_pool_destroy(chan->desc_pool);
> > @@ -811,124 +1024,6 @@ static int fsl_dma_device_control(struct
> > dma_chan *dchan,  }
> >
> >  /**
> > - * fsldma_cleanup_descriptor - cleanup and free a single link
> > descriptor
> > - * @chan: Freescale DMA channel
> > - * @desc: descriptor to cleanup and free
> > - *
> > - * This function is used on a descriptor which has been executed by
> > the DMA
> > - * controller. It will run any callbacks, submit any dependencies,
> > and then
> > - * free the descriptor.
> > - */
> > -static void fsldma_cleanup_descriptor(struct fsldma_chan *chan,
> > -				      struct fsl_desc_sw *desc)
> > -{
> > -	struct dma_async_tx_descriptor *txd =3D &desc->async_tx;
> > -	struct device *dev =3D chan->common.device->dev;
> > -	dma_addr_t src =3D get_desc_src(chan, desc);
> > -	dma_addr_t dst =3D get_desc_dst(chan, desc);
> > -	u32 len =3D get_desc_cnt(chan, desc);
> > -
> > -	/* Run the link descriptor callback function */
> > -	if (txd->callback) {
> > -#ifdef FSL_DMA_LD_DEBUG
> > -		chan_dbg(chan, "LD %p callback\n", desc);
> > -#endif
> > -		txd->callback(txd->callback_param);
> > -	}
> > -
> > -	/* Run any dependencies */
> > -	dma_run_dependencies(txd);
> > -
> > -	/* Unmap the dst buffer, if requested */
> > -	if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > -		if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > -			dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);
> > -		else
> > -			dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
> > -	}
> > -
> > -	/* Unmap the src buffer, if requested */
> > -	if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > -		if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > -			dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
> > -		else
> > -			dma_unmap_page(dev, src, len, DMA_TO_DEVICE);
> > -	}
> > -
> > -#ifdef FSL_DMA_LD_DEBUG
> > -	chan_dbg(chan, "LD %p free\n", desc);
> > -#endif
> > -	dma_pool_free(chan->desc_pool, desc, txd->phys);
> > -}
> > -
> > -/**
> > - * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > - * @chan : Freescale DMA channel
> > - *
> > - * HARDWARE STATE: idle
> > - * LOCKING: must hold chan->desc_lock
> > - */
> > -static void fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) -{
> > -	struct fsl_desc_sw *desc;
> > -
> > -	/*
> > -	 * If the list of pending descriptors is empty, then we
> > -	 * don't need to do any work at all
> > -	 */
> > -	if (list_empty(&chan->ld_pending)) {
> > -		chan_dbg(chan, "no pending LDs\n");
> > -		return;
> > -	}
> > -
> > -	/*
> > -	 * The DMA controller is not idle, which means that the interrupt
> > -	 * handler will start any queued transactions when it runs after
> > -	 * this transaction finishes
> > -	 */
> > -	if (!chan->idle) {
> > -		chan_dbg(chan, "DMA controller still busy\n");
> > -		return;
> > -	}
> > -
> > -	/*
> > -	 * If there are some link descriptors which have not been
> > -	 * transferred, we need to start the controller
> > -	 */
> > -
> > -	/*
> > -	 * Move all elements from the queue of pending transactions
> > -	 * onto the list of running transactions
> > -	 */
> > -	chan_dbg(chan, "idle, starting controller\n");
> > -	desc =3D list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> node);
> > -	list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > -
> > -	/*
> > -	 * The 85xx DMA controller doesn't clear the channel start bit
> > -	 * automatically at the end of a transfer. Therefore we must clear
> > -	 * it in software before starting the transfer.
> > -	 */
> > -	if ((chan->feature & FSL_DMA_IP_MASK) =3D=3D FSL_DMA_IP_85XX) {
> > -		u32 mode;
> > -
> > -		mode =3D DMA_IN(chan, &chan->regs->mr, 32);
> > -		mode &=3D ~FSL_DMA_MR_CS;
> > -		DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > -	}
> > -
> > -	/*
> > -	 * Program the descriptor's address into the DMA controller,
> > -	 * then start the DMA transaction
> > -	 */
> > -	set_cdar(chan, desc->async_tx.phys);
> > -	get_cdar(chan);
> > -
> > -	dma_start(chan);
> > -	chan->idle =3D false;
> > -}
> > -
> > -/**
> >   * fsl_dma_memcpy_issue_pending - Issue the DMA start command
> >   * @chan : Freescale DMA channel
> >   */
> > @@ -954,11 +1049,17 @@ static enum dma_status fsl_tx_status(struct
> dma_chan *dchan,
> >  	enum dma_status ret;
> >  	unsigned long flags;
> >
> > -	spin_lock_irqsave(&chan->desc_lock, flags);
> >  	ret =3D dma_cookie_status(dchan, cookie, txstate);
> > +	if (ret =3D=3D DMA_SUCCESS) {
> > +		fsldma_clean_completed_descriptor(chan);
> > +		return ret;
> > +	}
> > +
> > +	spin_lock_irqsave(&chan->desc_lock, flags);
> > +	fsldma_cleanup_descriptor(chan);
>=20
> You've made status from a very quick operation (compare two cookies) into
> a potentially long running operation. It may now run callbacks, unmap
> pages, etc.
Yes, that's right, callbacks and unmap pages will be invoked if DMA status
is not DMA_SUCCESS.

>=20
> I note that Documentation/crypto/async-tx-api.txt section 3.6
> "Constraints" does say that async_*() functions cannot be called from IRQ
> context. However, the DMAEngine API itself does not have anything to say
> about IRQ context.
>=20
> I expect fsl_tx_status() to be quick, and not potentially call other
> arbitrary code.
fsl_tx_status() is not in IRQ context. It's reasonable to unmap pages and
run dependency of descriptor as fsldma_run_tx_complete_actions() does.

>=20
> Is it possible to leave this function unchanged?
According to my knowledge, it is unlikely to be left unchanged, or it will =
violate
the design of this interface. If DMA status is not DMA_SUCCESS, that means =
there
are new descriptors completed, we should move them to ld_completed, and do =
actions
to its async_tx descriptors, then dma descriptor will be freed at another t=
ime.
Most of my idea is from iop-adma.c, the original interface is not align wit=
h it.
Async_tx flags (expecially ack_test) is not considered when dma descriptor =
is freed,
so some random errors happened, I think you must familiar with this history=
.
As an important "synchronization flag", async_tx api must control the order=
 of tx
descriptor. Dma driver also should make sure keeping order when free the de=
scriptor.
That's the reason why I add ld_completed list.

>=20
> Also note that if you leave this function unchanged, the locking error
> pointed out above (fsldma_clean_completed_descriptor() is not called with
> the lock held) goes away.
I know. Thanks.

>=20
> >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> >
> > -	return ret;
> > +	return dma_cookie_status(dchan, cookie, txstate);
> >  }
> >
> >
> > /*--------------------------------------------------------------------
> > --------*/ @@ -1035,53 +1136,21 @@ static irqreturn_t
> > fsldma_chan_irq(int irq, void *data)  static void
> > dma_do_tasklet(unsigned long data)  {
> >  	struct fsldma_chan *chan =3D (struct fsldma_chan *)data;
> > -	struct fsl_desc_sw *desc, *_desc;
> > -	LIST_HEAD(ld_cleanup);
> >  	unsigned long flags;
> >
> >  	chan_dbg(chan, "tasklet entry\n");
> >
> >  	spin_lock_irqsave(&chan->desc_lock, flags);
> >
> > -	/* update the cookie if we have some descriptors to cleanup */
> > -	if (!list_empty(&chan->ld_running)) {
> > -		dma_cookie_t cookie;
> > -
> > -		desc =3D to_fsl_desc(chan->ld_running.prev);
> > -		cookie =3D desc->async_tx.cookie;
> > -		dma_cookie_complete(&desc->async_tx);
> > -
> > -		chan_dbg(chan, "completed_cookie=3D%d\n", cookie);
> > -	}
> > -
> > -	/*
> > -	 * move the descriptors to a temporary list so we can drop the lock
> > -	 * during the entire cleanup operation
> > -	 */
> > -	list_splice_tail_init(&chan->ld_running, &ld_cleanup);
> > +	/* Run all cleanup for this descriptor */
> > +	fsldma_cleanup_descriptor(chan);
> >
> >  	/* the hardware is now idle and ready for more */
> >  	chan->idle =3D true;
> >
> > -	/*
> > -	 * Start any pending transactions automatically
> > -	 *
> > -	 * In the ideal case, we keep the DMA controller busy while we go
> > -	 * ahead and free the descriptors below.
> > -	 */
> >  	fsl_chan_xfer_ld_queue(chan);
> >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> >
> > -	/* Run the callback for each descriptor, in order */
> > -	list_for_each_entry_safe(desc, _desc, &ld_cleanup, node) {
> > -
> > -		/* Remove from the list of transactions */
> > -		list_del(&desc->node);
> > -
> > -		/* Run all cleanup for this descriptor */
> > -		fsldma_cleanup_descriptor(chan, desc);
> > -	}
> > -
> >  	chan_dbg(chan, "tasklet exit\n");
> >  }
> >
> > @@ -1262,6 +1331,7 @@ static int __devinit fsl_dma_chan_probe(struct
> fsldma_device *fdev,
> >  	spin_lock_init(&chan->desc_lock);
> >  	INIT_LIST_HEAD(&chan->ld_pending);
> >  	INIT_LIST_HEAD(&chan->ld_running);
> > +	INIT_LIST_HEAD(&chan->ld_completed);
> >  	chan->idle =3D true;
> >
> >  	chan->common.device =3D &fdev->common; diff --git
> > a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h index f5c3879..7ede908
> > 100644
> > --- a/drivers/dma/fsldma.h
> > +++ b/drivers/dma/fsldma.h
> > @@ -140,6 +140,7 @@ struct fsldma_chan {
> >  	spinlock_t desc_lock;		/* Descriptor operation lock */
> >  	struct list_head ld_pending;	/* Link descriptors queue */
> >  	struct list_head ld_running;	/* Link descriptors queue */
> > +	struct list_head ld_completed;	/* Link descriptors queue */
> >  	struct dma_chan common;		/* DMA common channel */
> >  	struct dma_pool *desc_pool;	/* Descriptors pool */
> >  	struct device *dev;		/* Channel device */
> > --
> > 1.7.5.1
> >
> >

^ permalink raw reply

* [PATCH] powerpc/p3041: change espi input-clock from 40MHz to 35MHz
From: Shaohui Xie @ 2012-07-17  7:18 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Shaohui Xie

Default CCB on P3041 is 750MHz, but espi cannot work at 40MHz with this CCB,
so we need to slow down the clock rate of espi to 35MHz to make it work stable
with the CCB.

Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
---
 arch/powerpc/boot/dts/p3041ds.dts |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/boot/dts/p3041ds.dts b/arch/powerpc/boot/dts/p3041ds.dts
index 22a215e..6cdcadc 100644
--- a/arch/powerpc/boot/dts/p3041ds.dts
+++ b/arch/powerpc/boot/dts/p3041ds.dts
@@ -58,7 +58,7 @@
 				#size-cells = <1>;
 				compatible = "spansion,s25sl12801";
 				reg = <0>;
-				spi-max-frequency = <40000000>; /* input clock */
+				spi-max-frequency = <35000000>; /* input clock */
 				partition@u-boot {
 					label = "u-boot";
 					reg = <0x00000000 0x00100000>;
-- 
1.6.4

^ permalink raw reply related

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Benjamin Herrenschmidt @ 2012-07-17  9:16 UTC (permalink / raw)
  To: Ram Pai; +Cc: bhelgaas, linux-pci, yinghai, Gavin Shan, linuxppc-dev
In-Reply-To: <20120717055715.GF2369@ram-ThinkPad-T61>

On Tue, 2012-07-17 at 13:57 +0800, Ram Pai wrote:
> Hmm.. this code is not about determining what kind of segment the
> platform is returning. This code is about using the right alignment
> constraints for the type of segment from which resource will be
> allocated. right?
> 
> b_res is the resource that is being sized. b_res already knows
> what kind of resource it is, i.e IORESOURCE_MEM or
> IORESOURCE_PREFETCH.
> Hence we should be exactly using the same alignment constraints as
> that dictated by the type of b_res. no? 

This is unclear.... ideally we want to know which of the host bridge
"apertures" is about to be used...

IE. A prefetchable resource can very well be allocated to a
non-prefetchable window, though the other way isn't supposed to happen.

Additionally, our PHB doesn't actually differenciate prefetchable and
non-prefetchable windows (whether you can prefetch or not is an
attribute of the CPU mapping, but basically non-cachable mappings are
never prefetchable for us).

So we can be lax in how we assign things between our single 32-bit
window divided in 128 segments and our 16x64-bit windows divided in 8
segments (and future HW will do thins differently even).

For example we would like in some cases to use M64's (64-bit windows) to
map SR-IOV BARs regardless of the "prefetchability" though that can only
work if we are not behind a PCIe switch, as those are technically
allowed to prefetch :-)

Worst is that the alignment constraint is based on the segment size, and
while we more/less fix the size of the 32-bit window, we plan to
dynamically allocate/resize the 64-bit ones which will mean variable
segment sizes as well.

So the more information you can get at that point, the better. The type
is useful because it allows us to know if you are trying to put a
prefetchable memory BAR inside a non-prefetchable region, in which case
we know it has to be in M32.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Ram Pai @ 2012-07-17 10:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Gavin Shan, linux-pci, Ram Pai, linuxppc-dev, bhelgaas, yinghai
In-Reply-To: <1342516619.3669.5.camel@pasglop>

On Tue, Jul 17, 2012 at 07:16:59PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-07-17 at 13:57 +0800, Ram Pai wrote:
> > Hmm.. this code is not about determining what kind of segment the
> > platform is returning. This code is about using the right alignment
> > constraints for the type of segment from which resource will be
> > allocated. right?
> > 
> > b_res is the resource that is being sized. b_res already knows
> > what kind of resource it is, i.e IORESOURCE_MEM or
> > IORESOURCE_PREFETCH.
> > Hence we should be exactly using the same alignment constraints as
> > that dictated by the type of b_res. no? 
> 
> This is unclear.... ideally we want to know which of the host bridge
> "apertures" is about to be used...
> 
> IE. A prefetchable resource can very well be allocated to a
> non-prefetchable window, though the other way isn't supposed to happen.
> 
> Additionally, our PHB doesn't actually differenciate prefetchable and
> non-prefetchable windows (whether you can prefetch or not is an
> attribute of the CPU mapping, but basically non-cachable mappings are
> never prefetchable for us).
> 
> So we can be lax in how we assign things between our single 32-bit
> window divided in 128 segments and our 16x64-bit windows divided in 8
> segments (and future HW will do thins differently even).
> 
> For example we would like in some cases to use M64's (64-bit windows) to
> map SR-IOV BARs regardless of the "prefetchability" though that can only
> work if we are not behind a PCIe switch, as those are technically
> allowed to prefetch :-)
> 
> Worst is that the alignment constraint is based on the segment size, and
> while we more/less fix the size of the 32-bit window, we plan to
> dynamically allocate/resize the 64-bit ones which will mean variable
> segment sizes as well.
> 
> So the more information you can get at that point, the better. The type
> is useful because it allows us to know if you are trying to put a
> prefetchable memory BAR inside a non-prefetchable region, in which case
> we know it has to be in M32.


Ben,
	Lets say we passed that 'type' flag to size the minimum
	alignment constraints for that b_res.  And window_alignment(bus,
	type) of your platform  used that 'type' information to
	determine whether to use the alignment constraints of 32-bit
	window or 64-bit window.

	However, later when the b_res is actually allocated a resource,
	the pci_assign_resource() has no idea whether to allocate 32-bit
	window resource or 64-bit window resource, because the 'type'
	information is not captured anywhere in b_res.

	You would basically have a disconnect between what is sized and 
	what is allocated. Unless offcourse you pass that 'type' to
	the b_res->flags, which is currently not the case.

RP

^ permalink raw reply

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Benjamin Herrenschmidt @ 2012-07-17 10:38 UTC (permalink / raw)
  To: Ram Pai; +Cc: bhelgaas, linux-pci, yinghai, Gavin Shan, linuxppc-dev
In-Reply-To: <20120717100314.GB25613@ram-ThinkPad-T61>

On Tue, 2012-07-17 at 18:03 +0800, Ram Pai wrote:
>         Lets say we passed that 'type' flag to size the minimum
>         alignment constraints for that b_res.  And window_alignment(bus,
>         type) of your platform  used that 'type' information to
>         determine whether to use the alignment constraints of 32-bit
>         window or 64-bit window.
> 
>         However, later when the b_res is actually allocated a resource,
>         the pci_assign_resource() has no idea whether to allocate 32-bit
>         window resource or 64-bit window resource, because the 'type'
>         information is not captured anywhere in b_res.
> 
>         You would basically have a disconnect between what is sized and 
>         what is allocated. Unless offcourse you pass that 'type' to
>         the b_res->flags, which is currently not the case. 

Right, we ideally would need the core to query the alignment once per
"apertures" it tries as a candidate to allocate a given resource... but
that's for later.

For now we can probably live with giving out the max of the minimum
alignment we support for M64 and our M32 segment size.

Cheers,
Ben.

^ permalink raw reply

* RE: [PATCH v7 2/5] powerpc/85xx: add HOTPLUG_CPU support
From: Zhao Chenhui-B35336 @ 2012-07-17 11:39 UTC (permalink / raw)
  To: Kumar Gala
  Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org, Li Yang-R58472
In-Reply-To: <0090811B-107A-4496-ADF1-DD07081BF6B9@kernel.crashing.org>

> -----Original Message-----
> From: Kumar Gala [mailto:galak@kernel.crashing.org]
> Sent: Friday, July 13, 2012 8:15 PM
> To: Zhao Chenhui-B35336
> Cc: linuxppc-dev@lists.ozlabs.org; Wood Scott-B07421; linux-kernel@vger.k=
ernel.org; Li Yang-R58472
> Subject: Re: [PATCH v7 2/5] powerpc/85xx: add HOTPLUG_CPU support
>=20
>=20
> On Jul 3, 2012, at 5:21 AM, Zhao Chenhui wrote:
>=20
> > From: Li Yang <leoli@freescale.com>
> >
> > Add support to disable and re-enable individual cores at runtime
> > on MPC85xx/QorIQ SMP machines. Currently support e500v1/e500v2 core.
> >
> > MPC85xx machines use ePAPR spin-table in boot page for CPU kick-off.
> > This patch uses the boot page from bootloader to boot core at runtime.
> > It supports 32-bit and 36-bit physical address.
> >
> > Add generic_set_cpu_up() to set cpu_state as CPU_UP_PREPARE in kick_cpu=
().
>=20
> Shouldn't the generic_setup_cpu_up() be a separate patch, and refactor sm=
p_generic_kick_cpu() to use
> it.
>=20
> Also, we should pull the conversion of the spintable from #defines to str=
uct into a separate patch
> before this one.
>=20

Ok. I will split this patch.

-Chenhui

^ permalink raw reply

* Re: [PATCH] powerpc/p3041: change espi input-clock from 40MHz to 35MHz
From: Kumar Gala @ 2012-07-17 13:39 UTC (permalink / raw)
  To: Shaohui Xie; +Cc: linuxppc-dev
In-Reply-To: <1342509511-13942-1-git-send-email-Shaohui.Xie@freescale.com>


On Jul 17, 2012, at 2:18 AM, Shaohui Xie wrote:

> Default CCB on P3041 is 750MHz, but espi cannot work at 40MHz with =
this CCB,
> so we need to slow down the clock rate of espi to 35MHz to make it =
work stable
> with the CCB.
>=20
> Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
> ---
> arch/powerpc/boot/dts/p3041ds.dts |    2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)

applied to next

- k=

^ permalink raw reply

* Re: [PATCH] powerpc/85xx: remove P1020RDB and P2020RDB CAMP device trees
From: Kumar Gala @ 2012-07-17 13:38 UTC (permalink / raw)
  To: Timur Tabi; +Cc: linuxppc-dev
In-Reply-To: <1342219243-8061-1-git-send-email-timur@freescale.com>


On Jul 13, 2012, at 5:40 PM, Timur Tabi wrote:

> We only need two examples of CAMP device trees in the upstream kernel.
>=20
> Co-operative Asymmetric Multi-Processing (CAMP) is a technique where =
two
> or more operating systems (typically multiple copies of the same Linux =
kernel)
> are loaded into memory, and each kernel is given a subset of the =
available
> cores to execute on.  For example, on a four-core system, one kernel =
runs on
> cores 0 and 1, and the other runs on cores 2 and 3.
>=20
> The devices are also partitioned among the operating systems, and this =
is
> done with customized device trees.  Each kernel gets its own device =
tree
> that has only the devices that it should know about.
>=20
> Unfortunately, this approach is very hackish.  The kernels are trusted =
to
> only access devices in their respective device trees, and the =
partitioning
> only works for devices that can be handled.  Crafting the device trees =
is a
> tricky process, and getting U-Boot to load and start all kernels is =
cumbersome.
>=20
> But most importantly, each CAMP setup is very application-specific, =
since
> the actual partitioning of resources is done in the DTS by the system
> designer.  Therefore, it doesn't make a lot of sense to have a lot of =
CAMP
> device trees, since we only expect them to be used as examples.
>=20
> Signed-off-by: Timur Tabi <timur@freescale.com>
> ---
> arch/powerpc/boot/dts/p1020rdb_camp_core0.dts |   63 -----------
> arch/powerpc/boot/dts/p1020rdb_camp_core1.dts |  141 =
-------------------------
> arch/powerpc/boot/dts/p2020rdb_camp_core0.dts |   67 ------------
> arch/powerpc/boot/dts/p2020rdb_camp_core1.dts |  125 =
----------------------
> 4 files changed, 0 insertions(+), 396 deletions(-)
> delete mode 100644 arch/powerpc/boot/dts/p1020rdb_camp_core0.dts
> delete mode 100644 arch/powerpc/boot/dts/p1020rdb_camp_core1.dts
> delete mode 100644 arch/powerpc/boot/dts/p2020rdb_camp_core0.dts
> delete mode 100644 arch/powerpc/boot/dts/p2020rdb_camp_core1.dts

applied to next

- k=

^ permalink raw reply

* Re: [PATCH] powerpc/85xx: p1022ds: disable the NAND flash node if video is enabled
From: Kumar Gala @ 2012-07-17 13:38 UTC (permalink / raw)
  To: Timur Tabi; +Cc: linuxppc-dev
In-Reply-To: <1342207722-2315-1-git-send-email-timur@freescale.com>


On Jul 13, 2012, at 2:28 PM, Timur Tabi wrote:

> The Freescale P1022 has a unique pin muxing "feature" where the DIU =
video
> controller's video signals are muxed with 24 of the local bus address =
signals.
> When the DIU is enabled, the bulk of the local bus is disabled, =
preventing
> access to memory-mapped devices like NAND flash and the pixis FPGA.
>=20
> Therefore, if the DIU is going to be enabled, then memory-mapped =
devices on
> the localbus, like NAND flash, need to be disabled.
>=20
> This patch is similar to "powerpc/85xx: p1022ds: disable the NOR flash =
node
> if video is enabled", except that it disables the NAND flash node =
instead.
> This PIXIS node needs to remain enabled because it is used by platform =
code
> to switch into indirect mode.
>=20
> Signed-off-by: Timur Tabi <timur@freescale.com>
> ---
> arch/powerpc/platforms/85xx/p1022_ds.c |   58 =
+++++++++++++++++++++-----------
> 1 files changed, 38 insertions(+), 20 deletions(-)

applied to next

- k=

^ permalink raw reply

* Re: [linuxppc-release] [PATCH v3 4/4] fsl-dma: use spin_lock_bh to instead of spin_lock_irqsave
From: Timur Tabi @ 2012-07-17 15:29 UTC (permalink / raw)
  To: Liu Qiang-B32616
  Cc: Li Yang-R58472, Vinod Koul, herbert@gondor.hengli.com.au,
	linux-crypto@vger.kernel.org, Dan Williams,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <BCB48C05FCE8BC4D9E61E841ECBE6DB70C2788@039-SN2MPN1-011.039d.mgd.msft.net>

Liu Qiang-B32616 wrote:
> I attached the test result in v3 0/4, performance is improved by 2%.
> For my understanding, there is not any place to access descriptor lists
> in fsl-dma interrupt service handler except its tasklet, spin_lock_bh()
> is born for this. Interrupts will be turned off and context will be save in
> irqsave, there is needless to use irqsave in our case. You can refer to the
> implement of mv_xor.c or ioap-adma.c.
> If you think my explanation is ok, I can add it in the patch.

Yes, the explanation is ok.  Please make it part of the patch description,
so that it is saved in the git history.

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [PATCH v3 3/4] fsl-dma: change release process of dma descriptor for supporting async_tx
From: Ira W. Snyder @ 2012-07-17 16:16 UTC (permalink / raw)
  To: Liu Qiang-B32616
  Cc: Li Yang-R58472, Vinod Koul, Phillips Kim-R1AAHA,
	herbert@gondor.hengli.com.au, linux-crypto@vger.kernel.org,
	Dan Williams, linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <BCB48C05FCE8BC4D9E61E841ECBE6DB70C283D@039-SN2MPN1-011.039d.mgd.msft.net>

On Tue, Jul 17, 2012 at 07:06:33AM +0000, Liu Qiang-B32616 wrote:
> > -----Original Message-----
> > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> > Sent: Tuesday, July 17, 2012 4:01 AM
> > To: Liu Qiang-B32616
> > Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Phillips
> > Kim-R1AAHA; herbert@gondor.hengli.com.au; davem@davemloft.net; Dan
> > Williams; Vinod Koul; Li Yang-R58472
> > Subject: Re: [PATCH v3 3/4] fsl-dma: change release process of dma
> > descriptor for supporting async_tx
> > 
> > On Mon, Jul 16, 2012 at 12:08:29PM +0800, Qiang Liu wrote:
> > > Fix the potential risk when enable config NET_DMA and ASYNC_TX.
> > > Async_tx is lack of support in current release process of dma
> > > descriptor, all descriptors will be released whatever is acked or
> > > no-acked by async_tx, so there is a potential race condition when dma
> > > engine is uesd by others clients (e.g. when enable NET_DMA to offload
> > TCP).
> > >
> > > In our case, a race condition which is raised when use both of talitos
> > > and dmaengine to offload xor is because napi scheduler will sync all
> > > pending requests in dma channels, it affects the process of raid
> > > operations due to ack_tx is not checked in fsl dma. The no-acked
> > > descriptor is freed which is submitted just now, as a dependent tx,
> > > this freed descriptor trigger
> > > BUG_ON(async_tx_test_ack(depend_tx)) in async_tx_submit().
> > >
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: Vinod Koul <vinod.koul@intel.com>
> > > Cc: Li Yang <leoli@freescale.com>
> > > Cc: Ira W. Snyder <iws@ovro.caltech.edu>
> > > Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
> > > ---
> > >  drivers/dma/fsldma.c |  378 +++++++++++++++++++++++++++++-------------
> > -------
> > >  drivers/dma/fsldma.h |    1 +
> > >  2 files changed, 225 insertions(+), 154 deletions(-)
> > >
> > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index
> > > 4f2f212..4ee1b8f 100644
> > > --- a/drivers/dma/fsldma.c
> > > +++ b/drivers/dma/fsldma.c
> > > @@ -400,6 +400,217 @@ out_splice:
> > >  	list_splice_tail_init(&desc->tx_list, &chan->ld_pending);  }
> > >
> > > +/**
> > > + * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > > + * @chan : Freescale DMA channel
> > > + *
> > > + * HARDWARE STATE: idle
> > > + * LOCKING: must hold chan->desc_lock  */ static void
> > > +fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) {
> > > +	struct fsl_desc_sw *desc;
> > > +
> > > +	/*
> > > +	 * If the list of pending descriptors is empty, then we
> > > +	 * don't need to do any work at all
> > > +	 */
> > > +	if (list_empty(&chan->ld_pending)) {
> > > +		chan_dbg(chan, "no pending LDs\n");
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > > +	 * The DMA controller is not idle, which means that the interrupt
> > > +	 * handler will start any queued transactions when it runs after
> > > +	 * this transaction finishes
> > > +	 */
> > > +	if (!chan->idle) {
> > > +		chan_dbg(chan, "DMA controller still busy\n");
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > > +	 * If there are some link descriptors which have not been
> > > +	 * transferred, we need to start the controller
> > > +	 */
> > > +
> > > +	/*
> > > +	 * Move all elements from the queue of pending transactions
> > > +	 * onto the list of running transactions
> > > +	 */
> > > +	chan_dbg(chan, "idle, starting controller\n");
> > > +	desc = list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> > node);
> > > +	list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > > +
> > > +	/*
> > > +	 * The 85xx DMA controller doesn't clear the channel start bit
> > > +	 * automatically at the end of a transfer. Therefore we must clear
> > > +	 * it in software before starting the transfer.
> > > +	 */
> > > +	if ((chan->feature & FSL_DMA_IP_MASK) == FSL_DMA_IP_85XX) {
> > > +		u32 mode;
> > > +
> > > +		mode = DMA_IN(chan, &chan->regs->mr, 32);
> > > +		mode &= ~FSL_DMA_MR_CS;
> > > +		DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Program the descriptor's address into the DMA controller,
> > > +	 * then start the DMA transaction
> > > +	 */
> > > +	set_cdar(chan, desc->async_tx.phys);
> > > +	get_cdar(chan);
> > > +
> > > +	dma_start(chan);
> > > +	chan->idle = false;
> > > +}
> > > +
> > 
> > Please add a note about the locking requirements here, and for the other
> > new functions you've added.
> > 
> > You call this function from two places:
> > 
> > 1) fsldma_cleanup_descriptor() - called with mod->desc_lock held
> > 2) fsl_tx_status() - WITHOUT mod->desc_lock held
> > 
> > One of them is definitely wrong, and I'd bet that it is #2. You're
> > modifying ld_completed without a lock.
> Yes, My bad, I will correct it.
> 
> > 
> > > +static int
> > > +fsldma_clean_completed_descriptor(struct fsldma_chan *chan) {
> > > +	struct fsl_desc_sw *desc, *_desc;
> > > +
> > > +	/* Run the callback for each descriptor, in order */
> > > +	list_for_each_entry_safe(desc, _desc, &chan->ld_completed, node) {
> > > +
> > > +		if (async_tx_test_ack(&desc->async_tx)) {
> > > +			/* Remove from the list of transactions */
> > > +			list_del(&desc->node);
> > > +#ifdef FSL_DMA_LD_DEBUG
> > > +			chan_dbg(chan, "LD %p free\n", desc); #endif
> > > +			dma_pool_free(chan->desc_pool, desc,
> > > +					desc->async_tx.phys);
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * fsldma_run_tx_complete_actions - cleanup and free a single link
> > > +descriptor
> > > + * @chan: Freescale DMA channel
> > > + * @desc: descriptor to cleanup and free
> > > + * @cookie: Freescale DMA transaction identifier
> > > + *
> > > + * This function is used on a descriptor which has been executed by
> > > +the DMA
> > > + * controller. It will run any callbacks, submit any dependencies,
> > > +and then
> > > + * free the descriptor.
> > > + */
> > > +static dma_cookie_t fsldma_run_tx_complete_actions(struct fsl_desc_sw
> > *desc,
> > > +		struct fsldma_chan *chan, dma_cookie_t cookie) {
> > > +	struct dma_async_tx_descriptor *txd = &desc->async_tx;
> > > +	struct device *dev = chan->common.device->dev;
> > > +	dma_addr_t src = get_desc_src(chan, desc);
> > > +	dma_addr_t dst = get_desc_dst(chan, desc);
> > > +	u32 len = get_desc_cnt(chan, desc);
> > > +
> > > +	BUG_ON(txd->cookie < 0);
> > > +
> > > +	if (txd->cookie > 0) {
> > > +		cookie = txd->cookie;
> > > +
> > > +		/* Run the link descriptor callback function */
> > > +		if (txd->callback) {
> > > +#ifdef FSL_DMA_LD_DEBUG
> > > +			chan_dbg(chan, "LD %p callback\n", desc); #endif
> > > +			txd->callback(txd->callback_param);
> > > +		}
> > > +
> > > +		/* Unmap the dst buffer, if requested */
> > > +		if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > > +			if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > > +				dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);
> > > +			else
> > > +				dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
> > > +		}
> > > +
> > > +		/* Unmap the src buffer, if requested */
> > > +		if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > > +			if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > > +				dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
> > > +			else
> > > +				dma_unmap_page(dev, src, len, DMA_TO_DEVICE);
> > > +		}
> > > +	}
> > > +
> > > +	/* Run any dependencies */
> > > +	dma_run_dependencies(txd);
> > > +
> > > +	return cookie;
> > > +}
> > > +
> > > +static int
> > > +fsldma_clean_running_descriptor(struct fsl_desc_sw *desc,
> > > +			struct fsldma_chan *chan)
> > > +{
> > > +	/* Remove from the list of transactions */
> > > +	list_del(&desc->node);
> > > +	/* the client is allowed to attach dependent operations
> > > +	 * until 'ack' is set
> > > +	 */
> > 
> > This comment is does not follow the coding style. It should be:
> > 
> > /*
> >  * the client is allowed to attech dependent operations
> >  * until 'ack' is set
> >  */
> Fine, I will correct it in v4. Include another 2 places related to coding style.
> Thanks.
> 
> > 
> > > +	if (!async_tx_test_ack(&desc->async_tx)) {
> > > +		/* move this slot to the completed */
> > 
> > Perhaps a better comment would be:
> > 
> > Move this descriptor to the list of descriptors which is complete, but
> > still awaiting the 'ack' bit to be set.
> Accept.
> 
> > 
> > > +		list_add_tail(&desc->node, &chan->ld_completed);
> > > +		return 0;
> > > +	}
> > > +
> > > +	dma_pool_free(chan->desc_pool, desc,
> > > +			desc->async_tx.phys);
> > 
> > This should all be on one line. It is less than 80 characters wide.
> > 
> Accept.
> 
> > > +	return 0;
> > > +}
> > > +
> > 
> > Locking notes please.
> I will add it in v4, please check. Thanks.
> 
> > 
> > > +static void fsldma_cleanup_descriptor(struct fsldma_chan *chan) {
> > > +	struct fsl_desc_sw *desc, *_desc;
> > > +	dma_cookie_t cookie = 0;
> > > +	dma_addr_t curr_phys = get_cdar(chan);
> > > +	int idle = dma_is_idle(chan);
> > > +	int seen_current = 0;
> > > +
> > > +	fsldma_clean_completed_descriptor(chan);
> > > +
> > > +	/* Run the callback for each descriptor, in order */
> > > +	list_for_each_entry_safe(desc, _desc, &chan->ld_running, node) {
> > > +		/* do not advance past the current descriptor loaded into the
> > > +		 * hardware channel, subsequent descriptors are either in
> > > +		 * process or have not been submitted
> > > +		 */
> > 
> > Coding style.
> > 
> > > +		if (seen_current)
> > > +			break;
> > > +
> > > +		/* stop the search if we reach the current descriptor and the
> > > +		 * channel is busy
> > > +		 */
> > 
> > Coding style.
> > 
> > > +		if (desc->async_tx.phys == curr_phys) {
> > > +			seen_current = 1;
> > > +			if (!idle)
> > > +				break;
> > > +		}
> > > +
> > 
> > This trick where you try to look at the hardware state to determine how
> > much to clean up has been a source of headaches in the past versions of
> > this driver.
> I know a little about the history of this driver. Could you tell me what kind
> headaches in the past versions, I can have a test and fix it in my patch. And
> I also think use current phys addr should be more explicit.
> Thanks.
> 

It has been a very long time since I last worked on this code, but I
will try to remember.

I remember one problem where the currently running descriptor was free'd
while the hardware was still executing it.

There was another related problem where the DMA hardware would lock up,
and it is impossible to recover without a hard reset. The CB (channel
busy) bit gets stuck on, and the running transfer never completes. This
may have been due to a programming issue in a user of the DMAEngine API,
and not this driver itself. I don't remember for sure anymore.

Our current system here at work has 120 boards with this DMA controller.
Each one runs 100+ DMA operations per second, 24/7/365. They've been
online for more than two years. I know for sure that the current code
doesn't lock up the DMA controller. :)

I think your bug is real, and needs to be fixed. I'm just trying to make
sure I understand the change, so it won't break other users. I'm only
trying to help.

> > 
> > It is much easier to reason about the state of the hardware and avoid
> > race conditions if you complete entire blocks of descriptors after the
> > hardware interrupts you to tell you it is finished.
> In current driver, it's ok, but there is a potential issue when enable
> async_tx api. How can you determine these descriptors in ld_running whether
> have been completed in fsl_tx_status()? (I mean in my patch.)
> 

I think your idea using an ld_completed list for descriptors which have
been run by the hardware, but not yet 'ack'ed by software is fine.

Would something similar to the following work?

ld_pending: build a list of descriptors until issue_pending() is called.
This is exactly what it does now, no changes needed.

ld_running: this list of descriptors is currently executing. Mostly
unchanged from how it works now.

When you get an interrupt, you know the descriptors in ld_running are
finished. Update the cookie to the highest number from ld_running. Free
the descriptors which have been 'ack'ed immediately. Move the ones which
are not 'ack'ed onto ld_completed. As part of this process, you should
run dma_run_dependencies() and callbacks as needed.

I think that this solves the problem you are seeing, which is that
descriptors which have not been 'ack'ed are free'd.

If this is unclear, I can write some code to demonstrate. If you could
write a small test driver, similar to drivers/dma/dmatest.c, which tests
the async_tx API without any extra hardware needed, I can run it. I have
never used the async_tx API before.

> > 
> > This is the philosophy employed by the driver before your modifications:
> > ld_pending: a block of descriptors which is ready to run as soon as the
> > hardware becomes idle.
> > ld_running: a block of descriptors which is being executed by the
> > hardware.
> These 2 lists are not enough, async_tx descriptors must be kept order as
> expected of its submitter, we only can process the descriptor which has been
> acked by async_tx api, but not free all descriptors in ld_running.
> For example,
> Step 1, in async_tx memcpy api,
> tx2 = device->device_prep_dma_memcpy(...),
> tx2->depend_tx = tx1;
> step 2, in async_tx submit api,
> async_tx_submit() will check tx2->depend_tx whether has been acked,
> unfortunately, tx2->depend_tx is freed before step 2 (dma channels is flushed
> by other drivers), the contents of tx2->depend_tx is set to POOL_POISON_FREED
> if DMAPOOL_DEBUG is enabled (illegal pointer will be used if disable this config
> option);
> step 3, BUG_ON(async_tx_test_ack(depend_tx)) is triggered.
> 
> For resolve this potential risk, first, ack flag must be checked in dma driver,
> second, ld_completed is added to restore all descriptors which have been acked
> and completed.
> In my following answer, most of all are around this idea.
> 
> > 
> > > +		cookie = fsldma_run_tx_complete_actions(desc, chan, cookie);
> > > +
> > > +		if (fsldma_clean_running_descriptor(desc, chan))
> > > +			break;
> > > +
> > > +	}
> > > +
> > > +	/*
> > > +	 * Start any pending transactions automatically
> > > +	 *
> > > +	 * In the ideal case, we keep the DMA controller busy while we go
> > > +	 * ahead and free the descriptors below.
> > > +	 */
> > > +	fsl_chan_xfer_ld_queue(chan);
> > > +
> > > +	if (cookie > 0)
> > > +		chan->common.completed_cookie = cookie; }
> > > +
> > >  static dma_cookie_t fsl_dma_tx_submit(struct dma_async_tx_descriptor
> > > *tx)  {
> > >  	struct fsldma_chan *chan = to_fsl_chan(tx->chan); @@ -534,8 +745,10
> > > @@ static void fsl_dma_free_chan_resources(struct dma_chan *dchan)
> > >
> > >  	chan_dbg(chan, "free all channel resources\n");
> > >  	spin_lock_irqsave(&chan->desc_lock, flags);
> > > +	fsldma_cleanup_descriptor(chan);
> > >  	fsldma_free_desc_list(chan, &chan->ld_pending);
> > >  	fsldma_free_desc_list(chan, &chan->ld_running);
> > > +	fsldma_free_desc_list(chan, &chan->ld_completed);
> > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > >
> > >  	dma_pool_destroy(chan->desc_pool);
> > > @@ -811,124 +1024,6 @@ static int fsl_dma_device_control(struct
> > > dma_chan *dchan,  }
> > >
> > >  /**
> > > - * fsldma_cleanup_descriptor - cleanup and free a single link
> > > descriptor
> > > - * @chan: Freescale DMA channel
> > > - * @desc: descriptor to cleanup and free
> > > - *
> > > - * This function is used on a descriptor which has been executed by
> > > the DMA
> > > - * controller. It will run any callbacks, submit any dependencies,
> > > and then
> > > - * free the descriptor.
> > > - */
> > > -static void fsldma_cleanup_descriptor(struct fsldma_chan *chan,
> > > -				      struct fsl_desc_sw *desc)
> > > -{
> > > -	struct dma_async_tx_descriptor *txd = &desc->async_tx;
> > > -	struct device *dev = chan->common.device->dev;
> > > -	dma_addr_t src = get_desc_src(chan, desc);
> > > -	dma_addr_t dst = get_desc_dst(chan, desc);
> > > -	u32 len = get_desc_cnt(chan, desc);
> > > -
> > > -	/* Run the link descriptor callback function */
> > > -	if (txd->callback) {
> > > -#ifdef FSL_DMA_LD_DEBUG
> > > -		chan_dbg(chan, "LD %p callback\n", desc);
> > > -#endif
> > > -		txd->callback(txd->callback_param);
> > > -	}
> > > -
> > > -	/* Run any dependencies */
> > > -	dma_run_dependencies(txd);
> > > -
> > > -	/* Unmap the dst buffer, if requested */
> > > -	if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > > -		if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > > -			dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);
> > > -		else
> > > -			dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
> > > -	}
> > > -
> > > -	/* Unmap the src buffer, if requested */
> > > -	if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > > -		if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > > -			dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
> > > -		else
> > > -			dma_unmap_page(dev, src, len, DMA_TO_DEVICE);
> > > -	}
> > > -
> > > -#ifdef FSL_DMA_LD_DEBUG
> > > -	chan_dbg(chan, "LD %p free\n", desc);
> > > -#endif
> > > -	dma_pool_free(chan->desc_pool, desc, txd->phys);
> > > -}
> > > -
> > > -/**
> > > - * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > > - * @chan : Freescale DMA channel
> > > - *
> > > - * HARDWARE STATE: idle
> > > - * LOCKING: must hold chan->desc_lock
> > > - */
> > > -static void fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) -{
> > > -	struct fsl_desc_sw *desc;
> > > -
> > > -	/*
> > > -	 * If the list of pending descriptors is empty, then we
> > > -	 * don't need to do any work at all
> > > -	 */
> > > -	if (list_empty(&chan->ld_pending)) {
> > > -		chan_dbg(chan, "no pending LDs\n");
> > > -		return;
> > > -	}
> > > -
> > > -	/*
> > > -	 * The DMA controller is not idle, which means that the interrupt
> > > -	 * handler will start any queued transactions when it runs after
> > > -	 * this transaction finishes
> > > -	 */
> > > -	if (!chan->idle) {
> > > -		chan_dbg(chan, "DMA controller still busy\n");
> > > -		return;
> > > -	}
> > > -
> > > -	/*
> > > -	 * If there are some link descriptors which have not been
> > > -	 * transferred, we need to start the controller
> > > -	 */
> > > -
> > > -	/*
> > > -	 * Move all elements from the queue of pending transactions
> > > -	 * onto the list of running transactions
> > > -	 */
> > > -	chan_dbg(chan, "idle, starting controller\n");
> > > -	desc = list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> > node);
> > > -	list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > > -
> > > -	/*
> > > -	 * The 85xx DMA controller doesn't clear the channel start bit
> > > -	 * automatically at the end of a transfer. Therefore we must clear
> > > -	 * it in software before starting the transfer.
> > > -	 */
> > > -	if ((chan->feature & FSL_DMA_IP_MASK) == FSL_DMA_IP_85XX) {
> > > -		u32 mode;
> > > -
> > > -		mode = DMA_IN(chan, &chan->regs->mr, 32);
> > > -		mode &= ~FSL_DMA_MR_CS;
> > > -		DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > > -	}
> > > -
> > > -	/*
> > > -	 * Program the descriptor's address into the DMA controller,
> > > -	 * then start the DMA transaction
> > > -	 */
> > > -	set_cdar(chan, desc->async_tx.phys);
> > > -	get_cdar(chan);
> > > -
> > > -	dma_start(chan);
> > > -	chan->idle = false;
> > > -}
> > > -
> > > -/**
> > >   * fsl_dma_memcpy_issue_pending - Issue the DMA start command
> > >   * @chan : Freescale DMA channel
> > >   */
> > > @@ -954,11 +1049,17 @@ static enum dma_status fsl_tx_status(struct
> > dma_chan *dchan,
> > >  	enum dma_status ret;
> > >  	unsigned long flags;
> > >
> > > -	spin_lock_irqsave(&chan->desc_lock, flags);
> > >  	ret = dma_cookie_status(dchan, cookie, txstate);
> > > +	if (ret == DMA_SUCCESS) {
> > > +		fsldma_clean_completed_descriptor(chan);
> > > +		return ret;
> > > +	}
> > > +
> > > +	spin_lock_irqsave(&chan->desc_lock, flags);
> > > +	fsldma_cleanup_descriptor(chan);
> > 
> > You've made status from a very quick operation (compare two cookies) into
> > a potentially long running operation. It may now run callbacks, unmap
> > pages, etc.
> Yes, that's right, callbacks and unmap pages will be invoked if DMA status
> is not DMA_SUCCESS.
> 
> > 
> > I note that Documentation/crypto/async-tx-api.txt section 3.6
> > "Constraints" does say that async_*() functions cannot be called from IRQ
> > context. However, the DMAEngine API itself does not have anything to say
> > about IRQ context.
> > 
> > I expect fsl_tx_status() to be quick, and not potentially call other
> > arbitrary code.
> fsl_tx_status() is not in IRQ context. It's reasonable to unmap pages and
> run dependency of descriptor as fsldma_run_tx_complete_actions() does.
> 

I looked at several other DMAEngine API drivers. Some of them do run
dependencies and callbacks from tx_status(). You are correct, this is
fine.

> > 
> > Is it possible to leave this function unchanged?
> According to my knowledge, it is unlikely to be left unchanged, or it will violate
> the design of this interface. If DMA status is not DMA_SUCCESS, that means there
> are new descriptors completed, we should move them to ld_completed, and do actions
> to its async_tx descriptors, then dma descriptor will be freed at another time.
> Most of my idea is from iop-adma.c, the original interface is not align with it.
> Async_tx flags (expecially ack_test) is not considered when dma descriptor is freed,
> so some random errors happened, I think you must familiar with this history.
> As an important "synchronization flag", async_tx api must control the order of tx
> descriptor. Dma driver also should make sure keeping order when free the descriptor.
> That's the reason why I add ld_completed list.
> 

Is it possible to put the descriptors into three different states?

1) ld_pending: ready to run, but not yet executed on the hardware

Descriptors move to #2 through either dma_async_issue_pending() or by
the interrupt which happens when the hardware completes executing a set
of DMA descriptors.

2) ld_running: currently executing on the hardware

Descriptors move to #3 through the interrupt which happens when the
hardware completes executing a set of DMA descriptors.

Dependencies and callbacks are executed as the descriptors are moved
onto ld_completed.

3) ld_completed: finished executing on the hardware, but not yet 'ack'ed

When they have been 'ack'ed, the descriptors are free'd.


As noted above, I think I could code this up fairly quickly if this
explanation is unclear.

Also note that I may not understand your bug completely, and my idea may
be completely wrong!

Ira

> > 
> > Also note that if you leave this function unchanged, the locking error
> > pointed out above (fsldma_clean_completed_descriptor() is not called with
> > the lock held) goes away.
> I know. Thanks.
> 
> > 
> > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > >
> > > -	return ret;
> > > +	return dma_cookie_status(dchan, cookie, txstate);
> > >  }
> > >
> > >
> > > /*--------------------------------------------------------------------
> > > --------*/ @@ -1035,53 +1136,21 @@ static irqreturn_t
> > > fsldma_chan_irq(int irq, void *data)  static void
> > > dma_do_tasklet(unsigned long data)  {
> > >  	struct fsldma_chan *chan = (struct fsldma_chan *)data;
> > > -	struct fsl_desc_sw *desc, *_desc;
> > > -	LIST_HEAD(ld_cleanup);
> > >  	unsigned long flags;
> > >
> > >  	chan_dbg(chan, "tasklet entry\n");
> > >
> > >  	spin_lock_irqsave(&chan->desc_lock, flags);
> > >
> > > -	/* update the cookie if we have some descriptors to cleanup */
> > > -	if (!list_empty(&chan->ld_running)) {
> > > -		dma_cookie_t cookie;
> > > -
> > > -		desc = to_fsl_desc(chan->ld_running.prev);
> > > -		cookie = desc->async_tx.cookie;
> > > -		dma_cookie_complete(&desc->async_tx);
> > > -
> > > -		chan_dbg(chan, "completed_cookie=%d\n", cookie);
> > > -	}
> > > -
> > > -	/*
> > > -	 * move the descriptors to a temporary list so we can drop the lock
> > > -	 * during the entire cleanup operation
> > > -	 */
> > > -	list_splice_tail_init(&chan->ld_running, &ld_cleanup);
> > > +	/* Run all cleanup for this descriptor */
> > > +	fsldma_cleanup_descriptor(chan);
> > >
> > >  	/* the hardware is now idle and ready for more */
> > >  	chan->idle = true;
> > >
> > > -	/*
> > > -	 * Start any pending transactions automatically
> > > -	 *
> > > -	 * In the ideal case, we keep the DMA controller busy while we go
> > > -	 * ahead and free the descriptors below.
> > > -	 */
> > >  	fsl_chan_xfer_ld_queue(chan);
> > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > >
> > > -	/* Run the callback for each descriptor, in order */
> > > -	list_for_each_entry_safe(desc, _desc, &ld_cleanup, node) {
> > > -
> > > -		/* Remove from the list of transactions */
> > > -		list_del(&desc->node);
> > > -
> > > -		/* Run all cleanup for this descriptor */
> > > -		fsldma_cleanup_descriptor(chan, desc);
> > > -	}
> > > -
> > >  	chan_dbg(chan, "tasklet exit\n");
> > >  }
> > >
> > > @@ -1262,6 +1331,7 @@ static int __devinit fsl_dma_chan_probe(struct
> > fsldma_device *fdev,
> > >  	spin_lock_init(&chan->desc_lock);
> > >  	INIT_LIST_HEAD(&chan->ld_pending);
> > >  	INIT_LIST_HEAD(&chan->ld_running);
> > > +	INIT_LIST_HEAD(&chan->ld_completed);
> > >  	chan->idle = true;
> > >
> > >  	chan->common.device = &fdev->common; diff --git
> > > a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h index f5c3879..7ede908
> > > 100644
> > > --- a/drivers/dma/fsldma.h
> > > +++ b/drivers/dma/fsldma.h
> > > @@ -140,6 +140,7 @@ struct fsldma_chan {
> > >  	spinlock_t desc_lock;		/* Descriptor operation lock */
> > >  	struct list_head ld_pending;	/* Link descriptors queue */
> > >  	struct list_head ld_running;	/* Link descriptors queue */
> > > +	struct list_head ld_completed;	/* Link descriptors queue */
> > >  	struct dma_chan common;		/* DMA common channel */
> > >  	struct dma_pool *desc_pool;	/* Descriptors pool */
> > >  	struct device *dev;		/* Channel device */
> > > --
> > > 1.7.5.1
> > >
> > >
> 
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox