LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v3 3/4] fsl-dma: change release process of dma descriptor for supporting async_tx
From: Ira W. Snyder @ 2012-07-17 16:16 UTC (permalink / raw)
  To: Liu Qiang-B32616
  Cc: Li Yang-R58472, Vinod Koul, Phillips Kim-R1AAHA,
	herbert@gondor.hengli.com.au, linux-crypto@vger.kernel.org,
	Dan Williams, linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <BCB48C05FCE8BC4D9E61E841ECBE6DB70C283D@039-SN2MPN1-011.039d.mgd.msft.net>

On Tue, Jul 17, 2012 at 07:06:33AM +0000, Liu Qiang-B32616 wrote:
> > -----Original Message-----
> > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> > Sent: Tuesday, July 17, 2012 4:01 AM
> > To: Liu Qiang-B32616
> > Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Phillips
> > Kim-R1AAHA; herbert@gondor.hengli.com.au; davem@davemloft.net; Dan
> > Williams; Vinod Koul; Li Yang-R58472
> > Subject: Re: [PATCH v3 3/4] fsl-dma: change release process of dma
> > descriptor for supporting async_tx
> > 
> > On Mon, Jul 16, 2012 at 12:08:29PM +0800, Qiang Liu wrote:
> > > Fix the potential risk when enable config NET_DMA and ASYNC_TX.
> > > Async_tx is lack of support in current release process of dma
> > > descriptor, all descriptors will be released whatever is acked or
> > > no-acked by async_tx, so there is a potential race condition when dma
> > > engine is uesd by others clients (e.g. when enable NET_DMA to offload
> > TCP).
> > >
> > > In our case, a race condition which is raised when use both of talitos
> > > and dmaengine to offload xor is because napi scheduler will sync all
> > > pending requests in dma channels, it affects the process of raid
> > > operations due to ack_tx is not checked in fsl dma. The no-acked
> > > descriptor is freed which is submitted just now, as a dependent tx,
> > > this freed descriptor trigger
> > > BUG_ON(async_tx_test_ack(depend_tx)) in async_tx_submit().
> > >
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: Vinod Koul <vinod.koul@intel.com>
> > > Cc: Li Yang <leoli@freescale.com>
> > > Cc: Ira W. Snyder <iws@ovro.caltech.edu>
> > > Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
> > > ---
> > >  drivers/dma/fsldma.c |  378 +++++++++++++++++++++++++++++-------------
> > -------
> > >  drivers/dma/fsldma.h |    1 +
> > >  2 files changed, 225 insertions(+), 154 deletions(-)
> > >
> > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index
> > > 4f2f212..4ee1b8f 100644
> > > --- a/drivers/dma/fsldma.c
> > > +++ b/drivers/dma/fsldma.c
> > > @@ -400,6 +400,217 @@ out_splice:
> > >  	list_splice_tail_init(&desc->tx_list, &chan->ld_pending);  }
> > >
> > > +/**
> > > + * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > > + * @chan : Freescale DMA channel
> > > + *
> > > + * HARDWARE STATE: idle
> > > + * LOCKING: must hold chan->desc_lock  */ static void
> > > +fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) {
> > > +	struct fsl_desc_sw *desc;
> > > +
> > > +	/*
> > > +	 * If the list of pending descriptors is empty, then we
> > > +	 * don't need to do any work at all
> > > +	 */
> > > +	if (list_empty(&chan->ld_pending)) {
> > > +		chan_dbg(chan, "no pending LDs\n");
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > > +	 * The DMA controller is not idle, which means that the interrupt
> > > +	 * handler will start any queued transactions when it runs after
> > > +	 * this transaction finishes
> > > +	 */
> > > +	if (!chan->idle) {
> > > +		chan_dbg(chan, "DMA controller still busy\n");
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > > +	 * If there are some link descriptors which have not been
> > > +	 * transferred, we need to start the controller
> > > +	 */
> > > +
> > > +	/*
> > > +	 * Move all elements from the queue of pending transactions
> > > +	 * onto the list of running transactions
> > > +	 */
> > > +	chan_dbg(chan, "idle, starting controller\n");
> > > +	desc = list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> > node);
> > > +	list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > > +
> > > +	/*
> > > +	 * The 85xx DMA controller doesn't clear the channel start bit
> > > +	 * automatically at the end of a transfer. Therefore we must clear
> > > +	 * it in software before starting the transfer.
> > > +	 */
> > > +	if ((chan->feature & FSL_DMA_IP_MASK) == FSL_DMA_IP_85XX) {
> > > +		u32 mode;
> > > +
> > > +		mode = DMA_IN(chan, &chan->regs->mr, 32);
> > > +		mode &= ~FSL_DMA_MR_CS;
> > > +		DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Program the descriptor's address into the DMA controller,
> > > +	 * then start the DMA transaction
> > > +	 */
> > > +	set_cdar(chan, desc->async_tx.phys);
> > > +	get_cdar(chan);
> > > +
> > > +	dma_start(chan);
> > > +	chan->idle = false;
> > > +}
> > > +
> > 
> > Please add a note about the locking requirements here, and for the other
> > new functions you've added.
> > 
> > You call this function from two places:
> > 
> > 1) fsldma_cleanup_descriptor() - called with mod->desc_lock held
> > 2) fsl_tx_status() - WITHOUT mod->desc_lock held
> > 
> > One of them is definitely wrong, and I'd bet that it is #2. You're
> > modifying ld_completed without a lock.
> Yes, My bad, I will correct it.
> 
> > 
> > > +static int
> > > +fsldma_clean_completed_descriptor(struct fsldma_chan *chan) {
> > > +	struct fsl_desc_sw *desc, *_desc;
> > > +
> > > +	/* Run the callback for each descriptor, in order */
> > > +	list_for_each_entry_safe(desc, _desc, &chan->ld_completed, node) {
> > > +
> > > +		if (async_tx_test_ack(&desc->async_tx)) {
> > > +			/* Remove from the list of transactions */
> > > +			list_del(&desc->node);
> > > +#ifdef FSL_DMA_LD_DEBUG
> > > +			chan_dbg(chan, "LD %p free\n", desc); #endif
> > > +			dma_pool_free(chan->desc_pool, desc,
> > > +					desc->async_tx.phys);
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * fsldma_run_tx_complete_actions - cleanup and free a single link
> > > +descriptor
> > > + * @chan: Freescale DMA channel
> > > + * @desc: descriptor to cleanup and free
> > > + * @cookie: Freescale DMA transaction identifier
> > > + *
> > > + * This function is used on a descriptor which has been executed by
> > > +the DMA
> > > + * controller. It will run any callbacks, submit any dependencies,
> > > +and then
> > > + * free the descriptor.
> > > + */
> > > +static dma_cookie_t fsldma_run_tx_complete_actions(struct fsl_desc_sw
> > *desc,
> > > +		struct fsldma_chan *chan, dma_cookie_t cookie) {
> > > +	struct dma_async_tx_descriptor *txd = &desc->async_tx;
> > > +	struct device *dev = chan->common.device->dev;
> > > +	dma_addr_t src = get_desc_src(chan, desc);
> > > +	dma_addr_t dst = get_desc_dst(chan, desc);
> > > +	u32 len = get_desc_cnt(chan, desc);
> > > +
> > > +	BUG_ON(txd->cookie < 0);
> > > +
> > > +	if (txd->cookie > 0) {
> > > +		cookie = txd->cookie;
> > > +
> > > +		/* Run the link descriptor callback function */
> > > +		if (txd->callback) {
> > > +#ifdef FSL_DMA_LD_DEBUG
> > > +			chan_dbg(chan, "LD %p callback\n", desc); #endif
> > > +			txd->callback(txd->callback_param);
> > > +		}
> > > +
> > > +		/* Unmap the dst buffer, if requested */
> > > +		if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > > +			if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > > +				dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);
> > > +			else
> > > +				dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
> > > +		}
> > > +
> > > +		/* Unmap the src buffer, if requested */
> > > +		if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > > +			if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > > +				dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
> > > +			else
> > > +				dma_unmap_page(dev, src, len, DMA_TO_DEVICE);
> > > +		}
> > > +	}
> > > +
> > > +	/* Run any dependencies */
> > > +	dma_run_dependencies(txd);
> > > +
> > > +	return cookie;
> > > +}
> > > +
> > > +static int
> > > +fsldma_clean_running_descriptor(struct fsl_desc_sw *desc,
> > > +			struct fsldma_chan *chan)
> > > +{
> > > +	/* Remove from the list of transactions */
> > > +	list_del(&desc->node);
> > > +	/* the client is allowed to attach dependent operations
> > > +	 * until 'ack' is set
> > > +	 */
> > 
> > This comment is does not follow the coding style. It should be:
> > 
> > /*
> >  * the client is allowed to attech dependent operations
> >  * until 'ack' is set
> >  */
> Fine, I will correct it in v4. Include another 2 places related to coding style.
> Thanks.
> 
> > 
> > > +	if (!async_tx_test_ack(&desc->async_tx)) {
> > > +		/* move this slot to the completed */
> > 
> > Perhaps a better comment would be:
> > 
> > Move this descriptor to the list of descriptors which is complete, but
> > still awaiting the 'ack' bit to be set.
> Accept.
> 
> > 
> > > +		list_add_tail(&desc->node, &chan->ld_completed);
> > > +		return 0;
> > > +	}
> > > +
> > > +	dma_pool_free(chan->desc_pool, desc,
> > > +			desc->async_tx.phys);
> > 
> > This should all be on one line. It is less than 80 characters wide.
> > 
> Accept.
> 
> > > +	return 0;
> > > +}
> > > +
> > 
> > Locking notes please.
> I will add it in v4, please check. Thanks.
> 
> > 
> > > +static void fsldma_cleanup_descriptor(struct fsldma_chan *chan) {
> > > +	struct fsl_desc_sw *desc, *_desc;
> > > +	dma_cookie_t cookie = 0;
> > > +	dma_addr_t curr_phys = get_cdar(chan);
> > > +	int idle = dma_is_idle(chan);
> > > +	int seen_current = 0;
> > > +
> > > +	fsldma_clean_completed_descriptor(chan);
> > > +
> > > +	/* Run the callback for each descriptor, in order */
> > > +	list_for_each_entry_safe(desc, _desc, &chan->ld_running, node) {
> > > +		/* do not advance past the current descriptor loaded into the
> > > +		 * hardware channel, subsequent descriptors are either in
> > > +		 * process or have not been submitted
> > > +		 */
> > 
> > Coding style.
> > 
> > > +		if (seen_current)
> > > +			break;
> > > +
> > > +		/* stop the search if we reach the current descriptor and the
> > > +		 * channel is busy
> > > +		 */
> > 
> > Coding style.
> > 
> > > +		if (desc->async_tx.phys == curr_phys) {
> > > +			seen_current = 1;
> > > +			if (!idle)
> > > +				break;
> > > +		}
> > > +
> > 
> > This trick where you try to look at the hardware state to determine how
> > much to clean up has been a source of headaches in the past versions of
> > this driver.
> I know a little about the history of this driver. Could you tell me what kind
> headaches in the past versions, I can have a test and fix it in my patch. And
> I also think use current phys addr should be more explicit.
> Thanks.
> 

It has been a very long time since I last worked on this code, but I
will try to remember.

I remember one problem where the currently running descriptor was free'd
while the hardware was still executing it.

There was another related problem where the DMA hardware would lock up,
and it is impossible to recover without a hard reset. The CB (channel
busy) bit gets stuck on, and the running transfer never completes. This
may have been due to a programming issue in a user of the DMAEngine API,
and not this driver itself. I don't remember for sure anymore.

Our current system here at work has 120 boards with this DMA controller.
Each one runs 100+ DMA operations per second, 24/7/365. They've been
online for more than two years. I know for sure that the current code
doesn't lock up the DMA controller. :)

I think your bug is real, and needs to be fixed. I'm just trying to make
sure I understand the change, so it won't break other users. I'm only
trying to help.

> > 
> > It is much easier to reason about the state of the hardware and avoid
> > race conditions if you complete entire blocks of descriptors after the
> > hardware interrupts you to tell you it is finished.
> In current driver, it's ok, but there is a potential issue when enable
> async_tx api. How can you determine these descriptors in ld_running whether
> have been completed in fsl_tx_status()? (I mean in my patch.)
> 

I think your idea using an ld_completed list for descriptors which have
been run by the hardware, but not yet 'ack'ed by software is fine.

Would something similar to the following work?

ld_pending: build a list of descriptors until issue_pending() is called.
This is exactly what it does now, no changes needed.

ld_running: this list of descriptors is currently executing. Mostly
unchanged from how it works now.

When you get an interrupt, you know the descriptors in ld_running are
finished. Update the cookie to the highest number from ld_running. Free
the descriptors which have been 'ack'ed immediately. Move the ones which
are not 'ack'ed onto ld_completed. As part of this process, you should
run dma_run_dependencies() and callbacks as needed.

I think that this solves the problem you are seeing, which is that
descriptors which have not been 'ack'ed are free'd.

If this is unclear, I can write some code to demonstrate. If you could
write a small test driver, similar to drivers/dma/dmatest.c, which tests
the async_tx API without any extra hardware needed, I can run it. I have
never used the async_tx API before.

> > 
> > This is the philosophy employed by the driver before your modifications:
> > ld_pending: a block of descriptors which is ready to run as soon as the
> > hardware becomes idle.
> > ld_running: a block of descriptors which is being executed by the
> > hardware.
> These 2 lists are not enough, async_tx descriptors must be kept order as
> expected of its submitter, we only can process the descriptor which has been
> acked by async_tx api, but not free all descriptors in ld_running.
> For example,
> Step 1, in async_tx memcpy api,
> tx2 = device->device_prep_dma_memcpy(...),
> tx2->depend_tx = tx1;
> step 2, in async_tx submit api,
> async_tx_submit() will check tx2->depend_tx whether has been acked,
> unfortunately, tx2->depend_tx is freed before step 2 (dma channels is flushed
> by other drivers), the contents of tx2->depend_tx is set to POOL_POISON_FREED
> if DMAPOOL_DEBUG is enabled (illegal pointer will be used if disable this config
> option);
> step 3, BUG_ON(async_tx_test_ack(depend_tx)) is triggered.
> 
> For resolve this potential risk, first, ack flag must be checked in dma driver,
> second, ld_completed is added to restore all descriptors which have been acked
> and completed.
> In my following answer, most of all are around this idea.
> 
> > 
> > > +		cookie = fsldma_run_tx_complete_actions(desc, chan, cookie);
> > > +
> > > +		if (fsldma_clean_running_descriptor(desc, chan))
> > > +			break;
> > > +
> > > +	}
> > > +
> > > +	/*
> > > +	 * Start any pending transactions automatically
> > > +	 *
> > > +	 * In the ideal case, we keep the DMA controller busy while we go
> > > +	 * ahead and free the descriptors below.
> > > +	 */
> > > +	fsl_chan_xfer_ld_queue(chan);
> > > +
> > > +	if (cookie > 0)
> > > +		chan->common.completed_cookie = cookie; }
> > > +
> > >  static dma_cookie_t fsl_dma_tx_submit(struct dma_async_tx_descriptor
> > > *tx)  {
> > >  	struct fsldma_chan *chan = to_fsl_chan(tx->chan); @@ -534,8 +745,10
> > > @@ static void fsl_dma_free_chan_resources(struct dma_chan *dchan)
> > >
> > >  	chan_dbg(chan, "free all channel resources\n");
> > >  	spin_lock_irqsave(&chan->desc_lock, flags);
> > > +	fsldma_cleanup_descriptor(chan);
> > >  	fsldma_free_desc_list(chan, &chan->ld_pending);
> > >  	fsldma_free_desc_list(chan, &chan->ld_running);
> > > +	fsldma_free_desc_list(chan, &chan->ld_completed);
> > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > >
> > >  	dma_pool_destroy(chan->desc_pool);
> > > @@ -811,124 +1024,6 @@ static int fsl_dma_device_control(struct
> > > dma_chan *dchan,  }
> > >
> > >  /**
> > > - * fsldma_cleanup_descriptor - cleanup and free a single link
> > > descriptor
> > > - * @chan: Freescale DMA channel
> > > - * @desc: descriptor to cleanup and free
> > > - *
> > > - * This function is used on a descriptor which has been executed by
> > > the DMA
> > > - * controller. It will run any callbacks, submit any dependencies,
> > > and then
> > > - * free the descriptor.
> > > - */
> > > -static void fsldma_cleanup_descriptor(struct fsldma_chan *chan,
> > > -				      struct fsl_desc_sw *desc)
> > > -{
> > > -	struct dma_async_tx_descriptor *txd = &desc->async_tx;
> > > -	struct device *dev = chan->common.device->dev;
> > > -	dma_addr_t src = get_desc_src(chan, desc);
> > > -	dma_addr_t dst = get_desc_dst(chan, desc);
> > > -	u32 len = get_desc_cnt(chan, desc);
> > > -
> > > -	/* Run the link descriptor callback function */
> > > -	if (txd->callback) {
> > > -#ifdef FSL_DMA_LD_DEBUG
> > > -		chan_dbg(chan, "LD %p callback\n", desc);
> > > -#endif
> > > -		txd->callback(txd->callback_param);
> > > -	}
> > > -
> > > -	/* Run any dependencies */
> > > -	dma_run_dependencies(txd);
> > > -
> > > -	/* Unmap the dst buffer, if requested */
> > > -	if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > > -		if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > > -			dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);
> > > -		else
> > > -			dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
> > > -	}
> > > -
> > > -	/* Unmap the src buffer, if requested */
> > > -	if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > > -		if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > > -			dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
> > > -		else
> > > -			dma_unmap_page(dev, src, len, DMA_TO_DEVICE);
> > > -	}
> > > -
> > > -#ifdef FSL_DMA_LD_DEBUG
> > > -	chan_dbg(chan, "LD %p free\n", desc);
> > > -#endif
> > > -	dma_pool_free(chan->desc_pool, desc, txd->phys);
> > > -}
> > > -
> > > -/**
> > > - * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > > - * @chan : Freescale DMA channel
> > > - *
> > > - * HARDWARE STATE: idle
> > > - * LOCKING: must hold chan->desc_lock
> > > - */
> > > -static void fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) -{
> > > -	struct fsl_desc_sw *desc;
> > > -
> > > -	/*
> > > -	 * If the list of pending descriptors is empty, then we
> > > -	 * don't need to do any work at all
> > > -	 */
> > > -	if (list_empty(&chan->ld_pending)) {
> > > -		chan_dbg(chan, "no pending LDs\n");
> > > -		return;
> > > -	}
> > > -
> > > -	/*
> > > -	 * The DMA controller is not idle, which means that the interrupt
> > > -	 * handler will start any queued transactions when it runs after
> > > -	 * this transaction finishes
> > > -	 */
> > > -	if (!chan->idle) {
> > > -		chan_dbg(chan, "DMA controller still busy\n");
> > > -		return;
> > > -	}
> > > -
> > > -	/*
> > > -	 * If there are some link descriptors which have not been
> > > -	 * transferred, we need to start the controller
> > > -	 */
> > > -
> > > -	/*
> > > -	 * Move all elements from the queue of pending transactions
> > > -	 * onto the list of running transactions
> > > -	 */
> > > -	chan_dbg(chan, "idle, starting controller\n");
> > > -	desc = list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> > node);
> > > -	list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > > -
> > > -	/*
> > > -	 * The 85xx DMA controller doesn't clear the channel start bit
> > > -	 * automatically at the end of a transfer. Therefore we must clear
> > > -	 * it in software before starting the transfer.
> > > -	 */
> > > -	if ((chan->feature & FSL_DMA_IP_MASK) == FSL_DMA_IP_85XX) {
> > > -		u32 mode;
> > > -
> > > -		mode = DMA_IN(chan, &chan->regs->mr, 32);
> > > -		mode &= ~FSL_DMA_MR_CS;
> > > -		DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > > -	}
> > > -
> > > -	/*
> > > -	 * Program the descriptor's address into the DMA controller,
> > > -	 * then start the DMA transaction
> > > -	 */
> > > -	set_cdar(chan, desc->async_tx.phys);
> > > -	get_cdar(chan);
> > > -
> > > -	dma_start(chan);
> > > -	chan->idle = false;
> > > -}
> > > -
> > > -/**
> > >   * fsl_dma_memcpy_issue_pending - Issue the DMA start command
> > >   * @chan : Freescale DMA channel
> > >   */
> > > @@ -954,11 +1049,17 @@ static enum dma_status fsl_tx_status(struct
> > dma_chan *dchan,
> > >  	enum dma_status ret;
> > >  	unsigned long flags;
> > >
> > > -	spin_lock_irqsave(&chan->desc_lock, flags);
> > >  	ret = dma_cookie_status(dchan, cookie, txstate);
> > > +	if (ret == DMA_SUCCESS) {
> > > +		fsldma_clean_completed_descriptor(chan);
> > > +		return ret;
> > > +	}
> > > +
> > > +	spin_lock_irqsave(&chan->desc_lock, flags);
> > > +	fsldma_cleanup_descriptor(chan);
> > 
> > You've made status from a very quick operation (compare two cookies) into
> > a potentially long running operation. It may now run callbacks, unmap
> > pages, etc.
> Yes, that's right, callbacks and unmap pages will be invoked if DMA status
> is not DMA_SUCCESS.
> 
> > 
> > I note that Documentation/crypto/async-tx-api.txt section 3.6
> > "Constraints" does say that async_*() functions cannot be called from IRQ
> > context. However, the DMAEngine API itself does not have anything to say
> > about IRQ context.
> > 
> > I expect fsl_tx_status() to be quick, and not potentially call other
> > arbitrary code.
> fsl_tx_status() is not in IRQ context. It's reasonable to unmap pages and
> run dependency of descriptor as fsldma_run_tx_complete_actions() does.
> 

I looked at several other DMAEngine API drivers. Some of them do run
dependencies and callbacks from tx_status(). You are correct, this is
fine.

> > 
> > Is it possible to leave this function unchanged?
> According to my knowledge, it is unlikely to be left unchanged, or it will violate
> the design of this interface. If DMA status is not DMA_SUCCESS, that means there
> are new descriptors completed, we should move them to ld_completed, and do actions
> to its async_tx descriptors, then dma descriptor will be freed at another time.
> Most of my idea is from iop-adma.c, the original interface is not align with it.
> Async_tx flags (expecially ack_test) is not considered when dma descriptor is freed,
> so some random errors happened, I think you must familiar with this history.
> As an important "synchronization flag", async_tx api must control the order of tx
> descriptor. Dma driver also should make sure keeping order when free the descriptor.
> That's the reason why I add ld_completed list.
> 

Is it possible to put the descriptors into three different states?

1) ld_pending: ready to run, but not yet executed on the hardware

Descriptors move to #2 through either dma_async_issue_pending() or by
the interrupt which happens when the hardware completes executing a set
of DMA descriptors.

2) ld_running: currently executing on the hardware

Descriptors move to #3 through the interrupt which happens when the
hardware completes executing a set of DMA descriptors.

Dependencies and callbacks are executed as the descriptors are moved
onto ld_completed.

3) ld_completed: finished executing on the hardware, but not yet 'ack'ed

When they have been 'ack'ed, the descriptors are free'd.


As noted above, I think I could code this up fairly quickly if this
explanation is unclear.

Also note that I may not understand your bug completely, and my idea may
be completely wrong!

Ira

> > 
> > Also note that if you leave this function unchanged, the locking error
> > pointed out above (fsldma_clean_completed_descriptor() is not called with
> > the lock held) goes away.
> I know. Thanks.
> 
> > 
> > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > >
> > > -	return ret;
> > > +	return dma_cookie_status(dchan, cookie, txstate);
> > >  }
> > >
> > >
> > > /*--------------------------------------------------------------------
> > > --------*/ @@ -1035,53 +1136,21 @@ static irqreturn_t
> > > fsldma_chan_irq(int irq, void *data)  static void
> > > dma_do_tasklet(unsigned long data)  {
> > >  	struct fsldma_chan *chan = (struct fsldma_chan *)data;
> > > -	struct fsl_desc_sw *desc, *_desc;
> > > -	LIST_HEAD(ld_cleanup);
> > >  	unsigned long flags;
> > >
> > >  	chan_dbg(chan, "tasklet entry\n");
> > >
> > >  	spin_lock_irqsave(&chan->desc_lock, flags);
> > >
> > > -	/* update the cookie if we have some descriptors to cleanup */
> > > -	if (!list_empty(&chan->ld_running)) {
> > > -		dma_cookie_t cookie;
> > > -
> > > -		desc = to_fsl_desc(chan->ld_running.prev);
> > > -		cookie = desc->async_tx.cookie;
> > > -		dma_cookie_complete(&desc->async_tx);
> > > -
> > > -		chan_dbg(chan, "completed_cookie=%d\n", cookie);
> > > -	}
> > > -
> > > -	/*
> > > -	 * move the descriptors to a temporary list so we can drop the lock
> > > -	 * during the entire cleanup operation
> > > -	 */
> > > -	list_splice_tail_init(&chan->ld_running, &ld_cleanup);
> > > +	/* Run all cleanup for this descriptor */
> > > +	fsldma_cleanup_descriptor(chan);
> > >
> > >  	/* the hardware is now idle and ready for more */
> > >  	chan->idle = true;
> > >
> > > -	/*
> > > -	 * Start any pending transactions automatically
> > > -	 *
> > > -	 * In the ideal case, we keep the DMA controller busy while we go
> > > -	 * ahead and free the descriptors below.
> > > -	 */
> > >  	fsl_chan_xfer_ld_queue(chan);
> > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > >
> > > -	/* Run the callback for each descriptor, in order */
> > > -	list_for_each_entry_safe(desc, _desc, &ld_cleanup, node) {
> > > -
> > > -		/* Remove from the list of transactions */
> > > -		list_del(&desc->node);
> > > -
> > > -		/* Run all cleanup for this descriptor */
> > > -		fsldma_cleanup_descriptor(chan, desc);
> > > -	}
> > > -
> > >  	chan_dbg(chan, "tasklet exit\n");
> > >  }
> > >
> > > @@ -1262,6 +1331,7 @@ static int __devinit fsl_dma_chan_probe(struct
> > fsldma_device *fdev,
> > >  	spin_lock_init(&chan->desc_lock);
> > >  	INIT_LIST_HEAD(&chan->ld_pending);
> > >  	INIT_LIST_HEAD(&chan->ld_running);
> > > +	INIT_LIST_HEAD(&chan->ld_completed);
> > >  	chan->idle = true;
> > >
> > >  	chan->common.device = &fdev->common; diff --git
> > > a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h index f5c3879..7ede908
> > > 100644
> > > --- a/drivers/dma/fsldma.h
> > > +++ b/drivers/dma/fsldma.h
> > > @@ -140,6 +140,7 @@ struct fsldma_chan {
> > >  	spinlock_t desc_lock;		/* Descriptor operation lock */
> > >  	struct list_head ld_pending;	/* Link descriptors queue */
> > >  	struct list_head ld_running;	/* Link descriptors queue */
> > > +	struct list_head ld_completed;	/* Link descriptors queue */
> > >  	struct dma_chan common;		/* DMA common channel */
> > >  	struct dma_pool *desc_pool;	/* Descriptors pool */
> > >  	struct device *dev;		/* Channel device */
> > > --
> > > 1.7.5.1
> > >
> > >
> 
> 

^ permalink raw reply

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Bjorn Helgaas @ 2012-07-17 17:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: yinghai, linux-pci, Ram Pai, Gavin Shan, linuxppc-dev
In-Reply-To: <1342521498.3669.7.camel@pasglop>

On Tue, Jul 17, 2012 at 4:38 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2012-07-17 at 18:03 +0800, Ram Pai wrote:
>>         Lets say we passed that 'type' flag to size the minimum
>>         alignment constraints for that b_res.  And window_alignment(bus,
>>         type) of your platform  used that 'type' information to
>>         determine whether to use the alignment constraints of 32-bit
>>         window or 64-bit window.
>>
>>         However, later when the b_res is actually allocated a resource,
>>         the pci_assign_resource() has no idea whether to allocate 32-bit
>>         window resource or 64-bit window resource, because the 'type'
>>         information is not captured anywhere in b_res.
>>
>>         You would basically have a disconnect between what is sized and
>>         what is allocated. Unless offcourse you pass that 'type' to
>>         the b_res->flags, which is currently not the case.
>
> Right, we ideally would need the core to query the alignment once per
> "apertures" it tries as a candidate to allocate a given resource... but
> that's for later.
>
> For now we can probably live with giving out the max of the minimum
> alignment we support for M64 and our M32 segment size.

We already know the aperture we're proposing to allocate from (the
result of find_free_bus_resource()), don't we?  What if we passed it
to pcibios_window_alignment() along with the struct pci_bus *, e.g.:

  resource_size_t pcibios_window_alignment(struct pci_bus *bus, struct
resource *window)

^ permalink raw reply

* Re: [PATCH] PCI: use dev->irq instead of dev->pin to enable non MSI/INTx interrupt
From: Scott Wood @ 2012-07-17 17:41 UTC (permalink / raw)
  To: Shengzhou Liu; +Cc: bhelgaas, linux-pci, linuxppc-dev
In-Reply-To: <1342492551-13766-1-git-send-email-Shengzhou.Liu@freescale.com>

On 07/16/2012 09:35 PM, Shengzhou Liu wrote:
> On some platforms, root port has neither MSI/MSI-X nor INTx interrupt
> generated in RC mode. In this case, we have to use other interrupt(i.e.
> system shared interrupt) for port service irq to have AER, Hot-plug, etc,
> services to work.
> 
> Signed-off-by: Shengzhou Liu <Shengzhou.Liu@freescale.com>
> ---
>  drivers/pci/pcie/portdrv_core.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 75915b3..a855254 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -212,8 +212,13 @@ static int init_service_irqs(struct pci_dev *dev, int *irqs, int mask)
>  	if (!pcie_port_enable_msix(dev, irqs, mask))
>  		return 0;
>  
> -	/* We're not going to use MSI-X, so try MSI and fall back to INTx */
> -	if (!pci_enable_msi(dev) || dev->pin)
> +	/*
> +	 * We're not going to use MSI-X, so try MSI and fall back to INTx.
> +	 * If neither MSI/MSI-X nor INTx available, try other interrupt. (On
> +	 * some platforms, root port doesn't support generating MSI/MSI-X/INTx
> +	 * in RC mode)
> +	 */
> +	if (!pci_enable_msi(dev) || dev->irq)
>  		irq = dev->irq;

What about the other usage of dev->pin a few lines up?

-Scott

^ permalink raw reply

* [PATCH] of: require a match on all fields of of_device_id
From: Scott Wood @ 2012-07-18  1:11 UTC (permalink / raw)
  To: Thierry Reding, Rob Herring; +Cc: devicetree-discuss, linuxppc-dev

Commit 107a84e61cdd3406c842a0e4be7efffd3a05dba6 ("of: match by compatible
property first") breaks the gianfar ethernet driver found on various
Freescale PPC chips.

There are, for unfortunate historical reasons, two nodes with a
compatible of "gianfar".  One has a device_type of "network" and the
other has device_type of "mdio".  The match entries look like this:

>         {
>                 .type = "mdio",
>                 .compatible = "gianfar",
>         },

and

>         {
>                 .type = "network",
>                 .compatible = "gianfar",
>         },

With the above patch, both nodes get probed by the first driver, because
nothing else in the match struct is looked at if there's a compatible
match.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 drivers/of/base.c |   44 ++++++++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index bc86ea2..4e707cc 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -511,14 +511,37 @@ out:
 }
 EXPORT_SYMBOL(of_find_node_with_property);
 
-static const struct of_device_id *of_match_compat(const struct of_device_id *matches,
-						  const char *compat)
+/*
+ * Tell if an device_node matches the non-compatible fields of
+ * a specific of_match element.
+ */
+static bool of_match_one_noncompat(const struct of_device_id *match,
+				   const struct device_node *node)
+{
+	bool is_match = true;
+
+	if (match->name[0])
+		is_match &= node->name && !strcmp(match->name, node->name);
+	if (match->type[0])
+		is_match &= node->type && !strcmp(match->type, node->type);
+
+	return is_match;
+}
+
+/*
+ * Find an OF match using the supplied compatible string, rather than
+ * the node's entire string list.
+ */
+static const struct of_device_id *of_match_compat(
+	const struct of_device_id *matches, const char *compat,
+	const struct device_node *node)
 {
 	while (matches->name[0] || matches->type[0] || matches->compatible[0]) {
 		const char *cp = matches->compatible;
 		int len = strlen(cp);
 
-		if (len > 0 && of_compat_cmp(compat, cp, len) == 0)
+		if (len > 0 && of_compat_cmp(compat, cp, len) == 0 &&
+		    of_match_one_noncompat(matches, node))
 			return matches;
 
 		matches++;
@@ -544,23 +567,20 @@ const struct of_device_id *of_match_node(const struct of_device_id *matches,
 		return NULL;
 
 	of_property_for_each_string(node, "compatible", prop, cp) {
-		const struct of_device_id *match = of_match_compat(matches, cp);
+		const struct of_device_id *match =
+			of_match_compat(matches, cp, node);
 		if (match)
 			return match;
 	}
 
 	while (matches->name[0] || matches->type[0] || matches->compatible[0]) {
-		int match = 1;
-		if (matches->name[0])
-			match &= node->name
-				&& !strcmp(matches->name, node->name);
-		if (matches->type[0])
-			match &= node->type
-				&& !strcmp(matches->type, node->type);
-		if (match && !matches->compatible[0])
+		if (of_match_one_noncompat(matches, node) &&
+		    !matches->compatible[0])
 			return matches;
+
 		matches++;
 	}
+
 	return NULL;
 }
 EXPORT_SYMBOL(of_match_node);
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH] of: require a match on all fields of of_device_id
From: Tabi Timur-B04825 @ 2012-07-18  1:57 UTC (permalink / raw)
  To: Wood Scott-B07421
  Cc: devicetree-discuss@lists.ozlabs.org, Thierry Reding,
	linuxppc-dev@lists.ozlabs.org, Rob Herring
In-Reply-To: <20120718011151.GA6119@tyr.buserror.net>

On Tue, Jul 17, 2012 at 8:11 PM, Scott Wood <scottwood@freescale.com> wrote=
:
> Commit 107a84e61cdd3406c842a0e4be7efffd3a05dba6 ("of: match by compatible
> property first") breaks the gianfar ethernet driver found on various
> Freescale PPC chips.
>
> There are, for unfortunate historical reasons, two nodes with a
> compatible of "gianfar".

Would it be worth updating the binding for the two nodes to make the
compatible property different?  We could do something like this:

ethernet@24000 {
                        reg =3D <0x24000 0x1000>;
                        compatible =3D "fsl,gianfar-eth", "gianfar";
                        ...
};

(and something similar for MDIO nodes)

and update all the drivers to look for both strings.  After a few
years, we can delete "gianfar".

--=20
Timur Tabi
Linux kernel developer at Freescale=

^ permalink raw reply

* Re: [PATCH] of: require a match on all fields of of_device_id
From: Rob Herring @ 2012-07-18  2:38 UTC (permalink / raw)
  To: Scott Wood; +Cc: devicetree-discuss, Thierry Reding, linuxppc-dev
In-Reply-To: <20120718011151.GA6119@tyr.buserror.net>

On 07/17/2012 08:11 PM, Scott Wood wrote:
> Commit 107a84e61cdd3406c842a0e4be7efffd3a05dba6 ("of: match by compatible
> property first") breaks the gianfar ethernet driver found on various
> Freescale PPC chips.

You do know this is reverted, right?

> 
> There are, for unfortunate historical reasons, two nodes with a
> compatible of "gianfar".  One has a device_type of "network" and the
> other has device_type of "mdio".  The match entries look like this:
> 
>>         {
>>                 .type = "mdio",
>>                 .compatible = "gianfar",
>>         },
> 
> and
> 
>>         {
>>                 .type = "network",
>>                 .compatible = "gianfar",
>>         },
> 
> With the above patch, both nodes get probed by the first driver, because
> nothing else in the match struct is looked at if there's a compatible
> match.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>

Here's my fix (untested) which is a bit simpler. I'm assuming if we care
about which compatible string we are matching to, then we require name
and type are blank and we only care about compatible strings.

diff --git a/drivers/of/base.c b/drivers/of/base.c
index eada3f4..6e10004 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -518,6 +518,9 @@ static const struct of_device_id
*of_match_compat(const struct of_device_id *mat
                const char *cp = matches->compatible;
                int len = strlen(cp);

+               if (matches->name[0] || matches->type[0])
+                       continue;
+
                if (len > 0 && of_compat_cmp(compat, cp, len) == 0)
                        return matches;

@@ -557,7 +560,10 @@ const struct of_device_id *of_match_node(const
struct of_device_id *matches,
                if (matches->type[0])
                        match &= node->type
                                && !strcmp(matches->type, node->type);
-               if (match && !matches->compatible[0])
+               if (matches->compatible[0])
+                       match &= of_device_is_compatible(node,
+                                               matches->compatible);
+               if (match)
                        return matches;
                matches++;
        }

> ---
>  drivers/of/base.c |   44 ++++++++++++++++++++++++++++++++------------
>  1 file changed, 32 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index bc86ea2..4e707cc 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -511,14 +511,37 @@ out:
>  }
>  EXPORT_SYMBOL(of_find_node_with_property);
>  
> -static const struct of_device_id *of_match_compat(const struct of_device_id *matches,
> -						  const char *compat)
> +/*
> + * Tell if an device_node matches the non-compatible fields of
> + * a specific of_match element.
> + */
> +static bool of_match_one_noncompat(const struct of_device_id *match,
> +				   const struct device_node *node)
> +{
> +	bool is_match = true;
> +
> +	if (match->name[0])
> +		is_match &= node->name && !strcmp(match->name, node->name);
> +	if (match->type[0])
> +		is_match &= node->type && !strcmp(match->type, node->type);
> +
> +	return is_match;
> +}
> +
> +/*
> + * Find an OF match using the supplied compatible string, rather than
> + * the node's entire string list.
> + */
> +static const struct of_device_id *of_match_compat(
> +	const struct of_device_id *matches, const char *compat,
> +	const struct device_node *node)
>  {
>  	while (matches->name[0] || matches->type[0] || matches->compatible[0]) {
>  		const char *cp = matches->compatible;
>  		int len = strlen(cp);
>  
> -		if (len > 0 && of_compat_cmp(compat, cp, len) == 0)
> +		if (len > 0 && of_compat_cmp(compat, cp, len) == 0 &&
> +		    of_match_one_noncompat(matches, node))
>  			return matches;
>  
>  		matches++;
> @@ -544,23 +567,20 @@ const struct of_device_id *of_match_node(const struct of_device_id *matches,
>  		return NULL;
>  
>  	of_property_for_each_string(node, "compatible", prop, cp) {
> -		const struct of_device_id *match = of_match_compat(matches, cp);
> +		const struct of_device_id *match =
> +			of_match_compat(matches, cp, node);
>  		if (match)
>  			return match;
>  	}
>  
>  	while (matches->name[0] || matches->type[0] || matches->compatible[0]) {
> -		int match = 1;
> -		if (matches->name[0])
> -			match &= node->name
> -				&& !strcmp(matches->name, node->name);
> -		if (matches->type[0])
> -			match &= node->type
> -				&& !strcmp(matches->type, node->type);
> -		if (match && !matches->compatible[0])
> +		if (of_match_one_noncompat(matches, node) &&
> +		    !matches->compatible[0])
>  			return matches;
> +
>  		matches++;
>  	}
> +
>  	return NULL;
>  }
>  EXPORT_SYMBOL(of_match_node);
> 

^ permalink raw reply related

* RE: [PATCH v3 3/4] fsl-dma: change release process of dma descriptor for supporting async_tx
From: Liu Qiang-B32616 @ 2012-07-18  3:31 UTC (permalink / raw)
  To: Ira W. Snyder, Dan Williams
  Cc: Li Yang-R58472, Vinod Koul, Phillips Kim-R1AAHA,
	herbert@gondor.hengli.com.au, linux-crypto@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <20120717161653.GA20669@ovro.caltech.edu>

[-- Attachment #1: Type: text/plain, Size: 34498 bytes --]

Hi Ira,

My comments inline.

Hi Ira and Dan,

Here is a introduction about my scenario of this case.

My case is use async_tx API to offload raid5,
Use fsl-talitos to compute parity calculation, fsl-dma to process memcpy.

What bug occurred to me?
Exception is thrown by BUG_ON(async_tx_ack_test(depend_tx)) in async_tx_submit().

TASK = ee1a94a0[1390] 'md0_raid5' THREAD: ecf40000 CPU: 0
GPR00: 00000001 ecf41ca0 ee44/921a94a0 0000003f 00000001 c00593e4 00000000 00000001
GPR08: 00000000 a7a7a7a7 00000001 045/920000002 42028042 100a38d4 ed576d98 00000000
GPR16: ed5a11b0 00000000 2b162000 00000200 046/920000000 2d555000 ed3015e8 c15a7aa0
GPR24: 00000000 c155fc40 00000000 ecb63220 ecf41d28 e47/92f640bb0 ef640c30 ecf41ca0
NIP [c02b048c] async_tx_submit+0x6c/0x2b4
LR [c02b068c] async_tx_submit+0x26c/0x2b4
Call Trace:
[ecf41ca0] [c02b068c] async_tx_submit+0x26c/0x2b448/92 (unreliable)
[ecf41cd0] [c02b0a4c] async_memcpy+0x240/0x25c
[ecf41d20] [c0421064] async_copy_data+0xa0/0x17c
[ecf41d70] [c0421cf4] __raid_run_ops+0x874/0xe10
[ecf41df0] [c0426ee4] handle_stripe+0x820/0x25e8
[ecf41e90] [c0429080] raid5d+0x3d4/0x5b4
[ecf41f40] [c04329b8] md_thread+0x138/0x16c
[ecf41f90] [c008277c] kthread+0x8c/0x90
[ecf41ff0] [c0011630] kernel_thread+0x4c/0x68

What is the root cause of this bug?
The async_tx descriptor(depend_tx) is free'ed before async_tx_submit ack it.
That means async_tx descriptor should be acked before it's free'ed. In other
word, only ack'ed async_tx descriptor should be free'ed by dma engine.

How to fix it?
Add a queue ld_completed, move all completed descriptors in this queue, free
these descriptors at a proper time (these descriptors also should be acked).

Dan, do you think my understand about "ack" is right? Thanks.

- Qiang

> -----Original Message-----
> From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> Sent: Wednesday, July 18, 2012 12:17 AM
> To: Liu Qiang-B32616
> Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Phillips
> Kim-R1AAHA; herbert@gondor.hengli.com.au; davem@davemloft.net; Dan
> Williams; Vinod Koul; Li Yang-R58472
> Subject: Re: [PATCH v3 3/4] fsl-dma: change release process of dma
> descriptor for supporting async_tx
>
> On Tue, Jul 17, 2012 at 07:06:33AM +0000, Liu Qiang-B32616 wrote:
> > > -----Original Message-----
> > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> > > Sent: Tuesday, July 17, 2012 4:01 AM
> > > To: Liu Qiang-B32616
> > > Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> Phillips
> > > Kim-R1AAHA; herbert@gondor.hengli.com.au; davem@davemloft.net; Dan
> > > Williams; Vinod Koul; Li Yang-R58472
> > > Subject: Re: [PATCH v3 3/4] fsl-dma: change release process of dma
> > > descriptor for supporting async_tx
> > >
> > > On Mon, Jul 16, 2012 at 12:08:29PM +0800, Qiang Liu wrote:
> > > > Fix the potential risk when enable config NET_DMA and ASYNC_TX.
> > > > Async_tx is lack of support in current release process of dma
> > > > descriptor, all descriptors will be released whatever is acked or
> > > > no-acked by async_tx, so there is a potential race condition when
> dma
> > > > engine is uesd by others clients (e.g. when enable NET_DMA to
> offload
> > > TCP).
> > > >
> > > > In our case, a race condition which is raised when use both of
> talitos
> > > > and dmaengine to offload xor is because napi scheduler will sync
> all
> > > > pending requests in dma channels, it affects the process of raid
> > > > operations due to ack_tx is not checked in fsl dma. The no-acked
> > > > descriptor is freed which is submitted just now, as a dependent tx,
> > > > this freed descriptor trigger
> > > > BUG_ON(async_tx_test_ack(depend_tx)) in async_tx_submit().
> > > >
> > > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > > Cc: Vinod Koul <vinod.koul@intel.com>
> > > > Cc: Li Yang <leoli@freescale.com>
> > > > Cc: Ira W. Snyder <iws@ovro.caltech.edu>
> > > > Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
> > > > ---
> > > >  drivers/dma/fsldma.c |  378 +++++++++++++++++++++++++++++---------
> ----
> > > -------
> > > >  drivers/dma/fsldma.h |    1 +
> > > >  2 files changed, 225 insertions(+), 154 deletions(-)
> > > >
> > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index
> > > > 4f2f212..4ee1b8f 100644
> > > > --- a/drivers/dma/fsldma.c
> > > > +++ b/drivers/dma/fsldma.c
> > > > @@ -400,6 +400,217 @@ out_splice:
> > > >         list_splice_tail_init(&desc->tx_list, &chan->ld_pending);  }
> > > >
> > > > +/**
> > > > + * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > > > + * @chan : Freescale DMA channel
> > > > + *
> > > > + * HARDWARE STATE: idle
> > > > + * LOCKING: must hold chan->desc_lock  */ static void
> > > > +fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) {
> > > > +       struct fsl_desc_sw *desc;
> > > > +
> > > > +       /*
> > > > +        * If the list of pending descriptors is empty, then we
> > > > +        * don't need to do any work at all
> > > > +        */
> > > > +       if (list_empty(&chan->ld_pending)) {
> > > > +               chan_dbg(chan, "no pending LDs\n");
> > > > +               return;
> > > > +       }
> > > > +
> > > > +       /*
> > > > +        * The DMA controller is not idle, which means that the
> interrupt
> > > > +        * handler will start any queued transactions when it runs
> after
> > > > +        * this transaction finishes
> > > > +        */
> > > > +       if (!chan->idle) {
> > > > +               chan_dbg(chan, "DMA controller still busy\n");
> > > > +               return;
> > > > +       }
> > > > +
> > > > +       /*
> > > > +        * If there are some link descriptors which have not been
> > > > +        * transferred, we need to start the controller
> > > > +        */
> > > > +
> > > > +       /*
> > > > +        * Move all elements from the queue of pending transactions
> > > > +        * onto the list of running transactions
> > > > +        */
> > > > +       chan_dbg(chan, "idle, starting controller\n");
> > > > +       desc = list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> > > node);
> > > > +       list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > > > +
> > > > +       /*
> > > > +        * The 85xx DMA controller doesn't clear the channel start
> bit
> > > > +        * automatically at the end of a transfer. Therefore we must
> clear
> > > > +        * it in software before starting the transfer.
> > > > +        */
> > > > +       if ((chan->feature & FSL_DMA_IP_MASK) == FSL_DMA_IP_85XX) {
> > > > +               u32 mode;
> > > > +
> > > > +               mode = DMA_IN(chan, &chan->regs->mr, 32);
> > > > +               mode &= ~FSL_DMA_MR_CS;
> > > > +               DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > > > +       }
> > > > +
> > > > +       /*
> > > > +        * Program the descriptor's address into the DMA controller,
> > > > +        * then start the DMA transaction
> > > > +        */
> > > > +       set_cdar(chan, desc->async_tx.phys);
> > > > +       get_cdar(chan);
> > > > +
> > > > +       dma_start(chan);
> > > > +       chan->idle = false;
> > > > +}
> > > > +
> > >
> > > Please add a note about the locking requirements here, and for the
> other
> > > new functions you've added.
> > >
> > > You call this function from two places:
> > >
> > > 1) fsldma_cleanup_descriptor() - called with mod->desc_lock held
> > > 2) fsl_tx_status() - WITHOUT mod->desc_lock held
> > >
> > > One of them is definitely wrong, and I'd bet that it is #2. You're
> > > modifying ld_completed without a lock.
> > Yes, My bad, I will correct it.
> >
> > >
> > > > +static int
> > > > +fsldma_clean_completed_descriptor(struct fsldma_chan *chan) {
> > > > +       struct fsl_desc_sw *desc, *_desc;
> > > > +
> > > > +       /* Run the callback for each descriptor, in order */
> > > > +       list_for_each_entry_safe(desc, _desc, &chan->ld_completed,
> node) {
> > > > +
> > > > +               if (async_tx_test_ack(&desc->async_tx)) {
> > > > +                       /* Remove from the list of transactions */
> > > > +                       list_del(&desc->node);
> > > > +#ifdef FSL_DMA_LD_DEBUG
> > > > +                       chan_dbg(chan, "LD %p free\n", desc); #endif
> > > > +                       dma_pool_free(chan->desc_pool, desc,
> > > > +                                       desc->async_tx.phys);
> > > > +               }
> > > > +       }
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * fsldma_run_tx_complete_actions - cleanup and free a single link
> > > > +descriptor
> > > > + * @chan: Freescale DMA channel
> > > > + * @desc: descriptor to cleanup and free
> > > > + * @cookie: Freescale DMA transaction identifier
> > > > + *
> > > > + * This function is used on a descriptor which has been executed
> by
> > > > +the DMA
> > > > + * controller. It will run any callbacks, submit any dependencies,
> > > > +and then
> > > > + * free the descriptor.
> > > > + */
> > > > +static dma_cookie_t fsldma_run_tx_complete_actions(struct
> fsl_desc_sw
> > > *desc,
> > > > +               struct fsldma_chan *chan, dma_cookie_t cookie) {
> > > > +       struct dma_async_tx_descriptor *txd = &desc->async_tx;
> > > > +       struct device *dev = chan->common.device->dev;
> > > > +       dma_addr_t src = get_desc_src(chan, desc);
> > > > +       dma_addr_t dst = get_desc_dst(chan, desc);
> > > > +       u32 len = get_desc_cnt(chan, desc);
> > > > +
> > > > +       BUG_ON(txd->cookie < 0);
> > > > +
> > > > +       if (txd->cookie > 0) {
> > > > +               cookie = txd->cookie;
> > > > +
> > > > +               /* Run the link descriptor callback function */
> > > > +               if (txd->callback) {
> > > > +#ifdef FSL_DMA_LD_DEBUG
> > > > +                       chan_dbg(chan, "LD %p callback\n", desc); #endif
> > > > +                       txd->callback(txd->callback_param);
> > > > +               }
> > > > +
> > > > +               /* Unmap the dst buffer, if requested */
> > > > +               if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > > > +                       if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > > > +                               dma_unmap_single(dev, dst, len,
> DMA_FROM_DEVICE);
> > > > +                       else
> > > > +                               dma_unmap_page(dev, dst, len,
> DMA_FROM_DEVICE);
> > > > +               }
> > > > +
> > > > +               /* Unmap the src buffer, if requested */
> > > > +               if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > > > +                       if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > > > +                               dma_unmap_single(dev, src, len,
> DMA_TO_DEVICE);
> > > > +                       else
> > > > +                               dma_unmap_page(dev, src, len,
> DMA_TO_DEVICE);
> > > > +               }
> > > > +       }
> > > > +
> > > > +       /* Run any dependencies */
> > > > +       dma_run_dependencies(txd);
> > > > +
> > > > +       return cookie;
> > > > +}
> > > > +
> > > > +static int
> > > > +fsldma_clean_running_descriptor(struct fsl_desc_sw *desc,
> > > > +                       struct fsldma_chan *chan)
> > > > +{
> > > > +       /* Remove from the list of transactions */
> > > > +       list_del(&desc->node);
> > > > +       /* the client is allowed to attach dependent operations
> > > > +        * until 'ack' is set
> > > > +        */
> > >
> > > This comment is does not follow the coding style. It should be:
> > >
> > > /*
> > >  * the client is allowed to attech dependent operations
> > >  * until 'ack' is set
> > >  */
> > Fine, I will correct it in v4. Include another 2 places related to
> coding style.
> > Thanks.
> >
> > >
> > > > +       if (!async_tx_test_ack(&desc->async_tx)) {
> > > > +               /* move this slot to the completed */
> > >
> > > Perhaps a better comment would be:
> > >
> > > Move this descriptor to the list of descriptors which is complete,
> but
> > > still awaiting the 'ack' bit to be set.
> > Accept.
> >
> > >
> > > > +               list_add_tail(&desc->node, &chan->ld_completed);
> > > > +               return 0;
> > > > +       }
> > > > +
> > > > +       dma_pool_free(chan->desc_pool, desc,
> > > > +                       desc->async_tx.phys);
> > >
> > > This should all be on one line. It is less than 80 characters wide.
> > >
> > Accept.
> >
> > > > +       return 0;
> > > > +}
> > > > +
> > >
> > > Locking notes please.
> > I will add it in v4, please check. Thanks.
> >
> > >
> > > > +static void fsldma_cleanup_descriptor(struct fsldma_chan *chan) {
> > > > +       struct fsl_desc_sw *desc, *_desc;
> > > > +       dma_cookie_t cookie = 0;
> > > > +       dma_addr_t curr_phys = get_cdar(chan);
> > > > +       int idle = dma_is_idle(chan);
> > > > +       int seen_current = 0;
> > > > +
> > > > +       fsldma_clean_completed_descriptor(chan);
> > > > +
> > > > +       /* Run the callback for each descriptor, in order */
> > > > +       list_for_each_entry_safe(desc, _desc, &chan->ld_running, node)
> {
> > > > +               /* do not advance past the current descriptor loaded
> into the
> > > > +                * hardware channel, subsequent descriptors are either
> in
> > > > +                * process or have not been submitted
> > > > +                */
> > >
> > > Coding style.
> > >
> > > > +               if (seen_current)
> > > > +                       break;
> > > > +
> > > > +               /* stop the search if we reach the current descriptor
> and the
> > > > +                * channel is busy
> > > > +                */
> > >
> > > Coding style.
> > >
> > > > +               if (desc->async_tx.phys == curr_phys) {
> > > > +                       seen_current = 1;
> > > > +                       if (!idle)
> > > > +                               break;
> > > > +               }
> > > > +
> > >
> > > This trick where you try to look at the hardware state to determine
> how
> > > much to clean up has been a source of headaches in the past versions
> of
> > > this driver.
> > I know a little about the history of this driver. Could you tell me
> what kind
> > headaches in the past versions, I can have a test and fix it in my
> patch. And
> > I also think use current phys addr should be more explicit.
> > Thanks.
> >
>
> It has been a very long time since I last worked on this code, but I
> will try to remember.
>
> I remember one problem where the currently running descriptor was free'd
> while the hardware was still executing it.
>
> There was another related problem where the DMA hardware would lock up,
> and it is impossible to recover without a hard reset. The CB (channel
> busy) bit gets stuck on, and the running transfer never completes. This
> may have been due to a programming issue in a user of the DMAEngine API,
> and not this driver itself. I don't remember for sure anymore.
>
> Our current system here at work has 120 boards with this DMA controller.
> Each one runs 100+ DMA operations per second, 24/7/365. They've been
> online for more than two years. I know for sure that the current code
> doesn't lock up the DMA controller. :)
>
> I think your bug is real, and needs to be fixed. I'm just trying to make
> sure I understand the change, so it won't break other users. I'm only
> trying to help.
Thanks for your information. I will do more explanation of my idea in the
following answer.
Could you provide some sample code for reproduce bug, I don't know dmatest.c
whether is enough. I will add dma self test in fsldma.c.

>
> > >
> > > It is much easier to reason about the state of the hardware and avoid
> > > race conditions if you complete entire blocks of descriptors after
> the
> > > hardware interrupts you to tell you it is finished.
> > In current driver, it's ok, but there is a potential issue when enable
> > async_tx api. How can you determine these descriptors in ld_running
> whether
> > have been completed in fsl_tx_status()? (I mean in my patch.)
> >
>
> I think your idea using an ld_completed list for descriptors which have
> been run by the hardware, but not yet 'ack'ed by software is fine.
>
> Would something similar to the following work?
>
> ld_pending: build a list of descriptors until issue_pending() is called.
> This is exactly what it does now, no changes needed.
Yes, no changes.

>
> ld_running: this list of descriptors is currently executing. Mostly
> unchanged from how it works now.
Yes, no changes.
I will add descriptions about all new interfaces.

>
> When you get an interrupt, you know the descriptors in ld_running are
> finished. Update the cookie to the highest number from ld_running. Free
> the descriptors which have been 'ack'ed immediately. Move the ones which
> are not 'ack'ed onto ld_completed. As part of this process, you should
> run dma_run_dependencies() and callbacks as needed.
Right, below is new implement of dma_do_tasklet();
First, dma_do_tasklet()
        -> fsldma_cleanup_descriptor()
                -> fsldma_clean_completed_descriptor(), free the descriptors which has been ack'ed (I mean in queue ld_completed, not include other queues);
Second, transverse the queue ld_running
        -> fsldma_run_tx_complete_actions(), find the last descriptor which is completed, update the cookie to the highest number, run callback, unmap dma pages, and run dma_run_dependencies();
        -> fsldma_clean_running_descriptor(), move this descriptor from ld_running to ld_completed.
Last,
        -> fsl_chan_xfer_ld_queue(), submit much more descriptors from ld_pending to ld_running.
I hope my description is clear enough.

>
> I think that this solves the problem you are seeing, which is that
> descriptors which have not been 'ack'ed are free'd.
>
> If this is unclear, I can write some code to demonstrate. If you could
> write a small test driver, similar to drivers/dma/dmatest.c, which tests
> the async_tx API without any extra hardware needed, I can run it. I have
> never used the async_tx API before.
Your description is clear, I agree with you, I will add fsldma self test in
the file.

>
> > >
> > > This is the philosophy employed by the driver before your
> modifications:
> > > ld_pending: a block of descriptors which is ready to run as soon as
> the
> > > hardware becomes idle.
> > > ld_running: a block of descriptors which is being executed by the
> > > hardware.
> > These 2 lists are not enough, async_tx descriptors must be kept order
> as
> > expected of its submitter, we only can process the descriptor which has
> been
> > acked by async_tx api, but not free all descriptors in ld_running.
> > For example,
> > Step 1, in async_tx memcpy api,
> > tx2 = device->device_prep_dma_memcpy(...),
> > tx2->depend_tx = tx1;
> > step 2, in async_tx submit api,
> > async_tx_submit() will check tx2->depend_tx whether has been acked,
> > unfortunately, tx2->depend_tx is freed before step 2 (dma channels is
> flushed
> > by other drivers), the contents of tx2->depend_tx is set to
> POOL_POISON_FREED
> > if DMAPOOL_DEBUG is enabled (illegal pointer will be used if disable
> this config
> > option);
> > step 3, BUG_ON(async_tx_test_ack(depend_tx)) is triggered.
> >
> > For resolve this potential risk, first, ack flag must be checked in dma
> driver,
> > second, ld_completed is added to restore all descriptors which have
> been acked
> > and completed.
> > In my following answer, most of all are around this idea.
> >
> > >
> > > > +               cookie = fsldma_run_tx_complete_actions(desc, chan,
> cookie);
> > > > +
> > > > +               if (fsldma_clean_running_descriptor(desc, chan))
> > > > +                       break;
> > > > +
> > > > +       }
> > > > +
> > > > +       /*
> > > > +        * Start any pending transactions automatically
> > > > +        *
> > > > +        * In the ideal case, we keep the DMA controller busy while
> we go
> > > > +        * ahead and free the descriptors below.
> > > > +        */
> > > > +       fsl_chan_xfer_ld_queue(chan);
> > > > +
> > > > +       if (cookie > 0)
> > > > +               chan->common.completed_cookie = cookie; }
> > > > +
> > > >  static dma_cookie_t fsl_dma_tx_submit(struct
> dma_async_tx_descriptor
> > > > *tx)  {
> > > >         struct fsldma_chan *chan = to_fsl_chan(tx->chan); @@ -534,8
> +745,10
> > > > @@ static void fsl_dma_free_chan_resources(struct dma_chan *dchan)
> > > >
> > > >         chan_dbg(chan, "free all channel resources\n");
> > > >         spin_lock_irqsave(&chan->desc_lock, flags);
> > > > +       fsldma_cleanup_descriptor(chan);
> > > >         fsldma_free_desc_list(chan, &chan->ld_pending);
> > > >         fsldma_free_desc_list(chan, &chan->ld_running);
> > > > +       fsldma_free_desc_list(chan, &chan->ld_completed);
> > > >         spin_unlock_irqrestore(&chan->desc_lock, flags);
> > > >
> > > >         dma_pool_destroy(chan->desc_pool);
> > > > @@ -811,124 +1024,6 @@ static int fsl_dma_device_control(struct
> > > > dma_chan *dchan,  }
> > > >
> > > >  /**
> > > > - * fsldma_cleanup_descriptor - cleanup and free a single link
> > > > descriptor
> > > > - * @chan: Freescale DMA channel
> > > > - * @desc: descriptor to cleanup and free
> > > > - *
> > > > - * This function is used on a descriptor which has been executed
> by
> > > > the DMA
> > > > - * controller. It will run any callbacks, submit any dependencies,
> > > > and then
> > > > - * free the descriptor.
> > > > - */
> > > > -static void fsldma_cleanup_descriptor(struct fsldma_chan *chan,
> > > > -                                     struct fsl_desc_sw *desc)
> > > > -{
> > > > -       struct dma_async_tx_descriptor *txd = &desc->async_tx;
> > > > -       struct device *dev = chan->common.device->dev;
> > > > -       dma_addr_t src = get_desc_src(chan, desc);
> > > > -       dma_addr_t dst = get_desc_dst(chan, desc);
> > > > -       u32 len = get_desc_cnt(chan, desc);
> > > > -
> > > > -       /* Run the link descriptor callback function */
> > > > -       if (txd->callback) {
> > > > -#ifdef FSL_DMA_LD_DEBUG
> > > > -               chan_dbg(chan, "LD %p callback\n", desc);
> > > > -#endif
> > > > -               txd->callback(txd->callback_param);
> > > > -       }
> > > > -
> > > > -       /* Run any dependencies */
> > > > -       dma_run_dependencies(txd);
> > > > -
> > > > -       /* Unmap the dst buffer, if requested */
> > > > -       if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > > > -               if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > > > -                       dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);
> > > > -               else
> > > > -                       dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
> > > > -       }
> > > > -
> > > > -       /* Unmap the src buffer, if requested */
> > > > -       if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > > > -               if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > > > -                       dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
> > > > -               else
> > > > -                       dma_unmap_page(dev, src, len, DMA_TO_DEVICE);
> > > > -       }
> > > > -
> > > > -#ifdef FSL_DMA_LD_DEBUG
> > > > -       chan_dbg(chan, "LD %p free\n", desc);
> > > > -#endif
> > > > -       dma_pool_free(chan->desc_pool, desc, txd->phys);
> > > > -}
> > > > -
> > > > -/**
> > > > - * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > > > - * @chan : Freescale DMA channel
> > > > - *
> > > > - * HARDWARE STATE: idle
> > > > - * LOCKING: must hold chan->desc_lock
> > > > - */
> > > > -static void fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) -{
> > > > -       struct fsl_desc_sw *desc;
> > > > -
> > > > -       /*
> > > > -        * If the list of pending descriptors is empty, then we
> > > > -        * don't need to do any work at all
> > > > -        */
> > > > -       if (list_empty(&chan->ld_pending)) {
> > > > -               chan_dbg(chan, "no pending LDs\n");
> > > > -               return;
> > > > -       }
> > > > -
> > > > -       /*
> > > > -        * The DMA controller is not idle, which means that the
> interrupt
> > > > -        * handler will start any queued transactions when it runs
> after
> > > > -        * this transaction finishes
> > > > -        */
> > > > -       if (!chan->idle) {
> > > > -               chan_dbg(chan, "DMA controller still busy\n");
> > > > -               return;
> > > > -       }
> > > > -
> > > > -       /*
> > > > -        * If there are some link descriptors which have not been
> > > > -        * transferred, we need to start the controller
> > > > -        */
> > > > -
> > > > -       /*
> > > > -        * Move all elements from the queue of pending transactions
> > > > -        * onto the list of running transactions
> > > > -        */
> > > > -       chan_dbg(chan, "idle, starting controller\n");
> > > > -       desc = list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> > > node);
> > > > -       list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > > > -
> > > > -       /*
> > > > -        * The 85xx DMA controller doesn't clear the channel start
> bit
> > > > -        * automatically at the end of a transfer. Therefore we must
> clear
> > > > -        * it in software before starting the transfer.
> > > > -        */
> > > > -       if ((chan->feature & FSL_DMA_IP_MASK) == FSL_DMA_IP_85XX) {
> > > > -               u32 mode;
> > > > -
> > > > -               mode = DMA_IN(chan, &chan->regs->mr, 32);
> > > > -               mode &= ~FSL_DMA_MR_CS;
> > > > -               DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > > > -       }
> > > > -
> > > > -       /*
> > > > -        * Program the descriptor's address into the DMA controller,
> > > > -        * then start the DMA transaction
> > > > -        */
> > > > -       set_cdar(chan, desc->async_tx.phys);
> > > > -       get_cdar(chan);
> > > > -
> > > > -       dma_start(chan);
> > > > -       chan->idle = false;
> > > > -}
> > > > -
> > > > -/**
> > > >   * fsl_dma_memcpy_issue_pending - Issue the DMA start command
> > > >   * @chan : Freescale DMA channel
> > > >   */
> > > > @@ -954,11 +1049,17 @@ static enum dma_status fsl_tx_status(struct
> > > dma_chan *dchan,
> > > >         enum dma_status ret;
> > > >         unsigned long flags;
> > > >
> > > > -       spin_lock_irqsave(&chan->desc_lock, flags);
> > > >         ret = dma_cookie_status(dchan, cookie, txstate);
> > > > +       if (ret == DMA_SUCCESS) {
> > > > +               fsldma_clean_completed_descriptor(chan);
> > > > +               return ret;
> > > > +       }
> > > > +
> > > > +       spin_lock_irqsave(&chan->desc_lock, flags);
> > > > +       fsldma_cleanup_descriptor(chan);
> > >
> > > You've made status from a very quick operation (compare two cookies)
> into
> > > a potentially long running operation. It may now run callbacks, unmap
> > > pages, etc.
> > Yes, that's right, callbacks and unmap pages will be invoked if DMA
> status
> > is not DMA_SUCCESS.
> >
> > >
> > > I note that Documentation/crypto/async-tx-api.txt section 3.6
> > > "Constraints" does say that async_*() functions cannot be called from
> IRQ
> > > context. However, the DMAEngine API itself does not have anything to
> say
> > > about IRQ context.
> > >
> > > I expect fsl_tx_status() to be quick, and not potentially call other
> > > arbitrary code.
> > fsl_tx_status() is not in IRQ context. It's reasonable to unmap pages
> and
> > run dependency of descriptor as fsldma_run_tx_complete_actions() does.
> >
>
> I looked at several other DMAEngine API drivers. Some of them do run
> dependencies and callbacks from tx_status(). You are correct, this is
> fine.
>
> > >
> > > Is it possible to leave this function unchanged?
> > According to my knowledge, it is unlikely to be left unchanged, or it
> will violate
> > the design of this interface. If DMA status is not DMA_SUCCESS, that
> means there
> > are new descriptors completed, we should move them to ld_completed, and
> do actions
> > to its async_tx descriptors, then dma descriptor will be freed at
> another time.
> > Most of my idea is from iop-adma.c, the original interface is not align
> with it.
> > Async_tx flags (expecially ack_test) is not considered when dma
> descriptor is freed,
> > so some random errors happened, I think you must familiar with this
> history.
> > As an important "synchronization flag", async_tx api must control the
> order of tx
> > descriptor. Dma driver also should make sure keeping order when free
> the descriptor.
> > That's the reason why I add ld_completed list.
> >
>
> Is it possible to put the descriptors into three different states?
>
> 1) ld_pending: ready to run, but not yet executed on the hardware
>
> Descriptors move to #2 through either dma_async_issue_pending() or by
> the interrupt which happens when the hardware completes executing a set
> of DMA descriptors.
>
> 2) ld_running: currently executing on the hardware
>
> Descriptors move to #3 through the interrupt which happens when the
> hardware completes executing a set of DMA descriptors.
>
> Dependencies and callbacks are executed as the descriptors are moved
> onto ld_completed.
>
> 3) ld_completed: finished executing on the hardware, but not yet 'ack'ed
>
> When they have been 'ack'ed, the descriptors are free'd.
>
>
> As noted above, I think I could code this up fairly quickly if this
> explanation is unclear.
Totally right. I think you understand my idea.

>
> Also note that I may not understand your bug completely, and my idea may
> be completely wrong!
I hope you've already understand my issue:)
Thanks.

>
> Ira
>
> > >
> > > Also note that if you leave this function unchanged, the locking
> error
> > > pointed out above (fsldma_clean_completed_descriptor() is not called
> with
> > > the lock held) goes away.
> > I know. Thanks.
> >
> > >
> > > >         spin_unlock_irqrestore(&chan->desc_lock, flags);
> > > >
> > > > -       return ret;
> > > > +       return dma_cookie_status(dchan, cookie, txstate);
> > > >  }
> > > >
> > > >
> > > > /*-----------------------------------------------------------------
> ---
> > > > --------*/ @@ -1035,53 +1136,21 @@ static irqreturn_t
> > > > fsldma_chan_irq(int irq, void *data)  static void
> > > > dma_do_tasklet(unsigned long data)  {
> > > >         struct fsldma_chan *chan = (struct fsldma_chan *)data;
> > > > -       struct fsl_desc_sw *desc, *_desc;
> > > > -       LIST_HEAD(ld_cleanup);
> > > >         unsigned long flags;
> > > >
> > > >         chan_dbg(chan, "tasklet entry\n");
> > > >
> > > >         spin_lock_irqsave(&chan->desc_lock, flags);
> > > >
> > > > -       /* update the cookie if we have some descriptors to cleanup
> */
> > > > -       if (!list_empty(&chan->ld_running)) {
> > > > -               dma_cookie_t cookie;
> > > > -
> > > > -               desc = to_fsl_desc(chan->ld_running.prev);
> > > > -               cookie = desc->async_tx.cookie;
> > > > -               dma_cookie_complete(&desc->async_tx);
> > > > -
> > > > -               chan_dbg(chan, "completed_cookie=%d\n", cookie);
> > > > -       }
> > > > -
> > > > -       /*
> > > > -        * move the descriptors to a temporary list so we can drop
> the lock
> > > > -        * during the entire cleanup operation
> > > > -        */
> > > > -       list_splice_tail_init(&chan->ld_running, &ld_cleanup);
> > > > +       /* Run all cleanup for this descriptor */
> > > > +       fsldma_cleanup_descriptor(chan);
> > > >
> > > >         /* the hardware is now idle and ready for more */
> > > >         chan->idle = true;
> > > >
> > > > -       /*
> > > > -        * Start any pending transactions automatically
> > > > -        *
> > > > -        * In the ideal case, we keep the DMA controller busy while
> we go
> > > > -        * ahead and free the descriptors below.
> > > > -        */
> > > >         fsl_chan_xfer_ld_queue(chan);
> > > >         spin_unlock_irqrestore(&chan->desc_lock, flags);
> > > >
> > > > -       /* Run the callback for each descriptor, in order */
> > > > -       list_for_each_entry_safe(desc, _desc, &ld_cleanup, node) {
> > > > -
> > > > -               /* Remove from the list of transactions */
> > > > -               list_del(&desc->node);
> > > > -
> > > > -               /* Run all cleanup for this descriptor */
> > > > -               fsldma_cleanup_descriptor(chan, desc);
> > > > -       }
> > > > -
> > > >         chan_dbg(chan, "tasklet exit\n");
> > > >  }
> > > >
> > > > @@ -1262,6 +1331,7 @@ static int __devinit
> fsl_dma_chan_probe(struct
> > > fsldma_device *fdev,
> > > >         spin_lock_init(&chan->desc_lock);
> > > >         INIT_LIST_HEAD(&chan->ld_pending);
> > > >         INIT_LIST_HEAD(&chan->ld_running);
> > > > +       INIT_LIST_HEAD(&chan->ld_completed);
> > > >         chan->idle = true;
> > > >
> > > >         chan->common.device = &fdev->common; diff --git
> > > > a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h index
> f5c3879..7ede908
> > > > 100644
> > > > --- a/drivers/dma/fsldma.h
> > > > +++ b/drivers/dma/fsldma.h
> > > > @@ -140,6 +140,7 @@ struct fsldma_chan {
> > > >         spinlock_t desc_lock;           /* Descriptor operation lock */
> > > >         struct list_head ld_pending;    /* Link descriptors queue */
> > > >         struct list_head ld_running;    /* Link descriptors queue */
> > > > +       struct list_head ld_completed;  /* Link descriptors queue
> */
> > > >         struct dma_chan common;         /* DMA common channel */
> > > >         struct dma_pool *desc_pool;     /* Descriptors pool */
> > > >         struct device *dev;             /* Channel device */
> > > > --
> > > > 1.7.5.1
> > > >
> > > >
> >
> >



[-- Attachment #2: Type: text/html, Size: 69949 bytes --]

^ permalink raw reply

* RE: [PATCH v3 3/4] fsl-dma: change release process of dma descriptor for supporting async_tx
From: Liu Qiang-B32616 @ 2012-07-18  3:44 UTC (permalink / raw)
  To: Ira W. Snyder, Dan Williams
  Cc: Li Yang-R58472, Vinod Koul, Phillips Kim-R1AAHA,
	herbert@gondor.hengli.com.au, linux-crypto@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <20120717161653.GA20669@ovro.caltech.edu>

Hi Ira,

My comments inline.

Hi Ira and Dan,

Here is a introduction about my scenario of this case.

My case is use async_tx API to offload raid5,
Use fsl-talitos to compute parity calculation, fsl-dma to process memcpy.

What bug occurred to me?
Exception is thrown by BUG_ON(async_tx_ack_test(depend_tx)) in async_tx_sub=
mit().

TASK =3D ee1a94a0[1390] 'md0_raid5' THREAD: ecf40000 CPU: 0
GPR00: 00000001 ecf41ca0 ee44/921a94a0 0000003f 00000001 c00593e4 00000000 =
00000001=20
GPR08: 00000000 a7a7a7a7 00000001 045/920000002 42028042 100a38d4 ed576d98 =
00000000=20
GPR16: ed5a11b0 00000000 2b162000 00000200 046/920000000 2d555000 ed3015e8 =
c15a7aa0=20
GPR24: 00000000 c155fc40 00000000 ecb63220 ecf41d28 e47/92f640bb0 ef640c30 =
ecf41ca0=20
NIP [c02b048c] async_tx_submit+0x6c/0x2b4
LR [c02b068c] async_tx_submit+0x26c/0x2b4
Call Trace:
[ecf41ca0] [c02b068c] async_tx_submit+0x26c/0x2b448/92 (unreliable)
[ecf41cd0] [c02b0a4c] async_memcpy+0x240/0x25c
[ecf41d20] [c0421064] async_copy_data+0xa0/0x17c
[ecf41d70] [c0421cf4] __raid_run_ops+0x874/0xe10
[ecf41df0] [c0426ee4] handle_stripe+0x820/0x25e8
[ecf41e90] [c0429080] raid5d+0x3d4/0x5b4
[ecf41f40] [c04329b8] md_thread+0x138/0x16c
[ecf41f90] [c008277c] kthread+0x8c/0x90
[ecf41ff0] [c0011630] kernel_thread+0x4c/0x68

What is the root cause of this bug?
The async_tx descriptor(depend_tx) is free'ed before async_tx_submit ack it=
.
That means async_tx descriptor should be acked before it's free'ed. In othe=
r
word, only ack'ed async_tx descriptor should be free'ed by dma engine.

How to fix it?
Add a queue ld_completed, move all completed descriptors in this queue, fre=
e
these descriptors at a proper time (these descriptors also should be acked)=
.

Dan, do you think my understand about "ack" is right? Thanks.=20

- Qiang

> -----Original Message-----
> From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> Sent: Wednesday, July 18, 2012 12:17 AM
> To: Liu Qiang-B32616
> Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Phillips
> Kim-R1AAHA; herbert@gondor.hengli.com.au; davem@davemloft.net; Dan
> Williams; Vinod Koul; Li Yang-R58472
> Subject: Re: [PATCH v3 3/4] fsl-dma: change release process of dma
> descriptor for supporting async_tx
>=20
> On Tue, Jul 17, 2012 at 07:06:33AM +0000, Liu Qiang-B32616 wrote:
> > > -----Original Message-----
> > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> > > Sent: Tuesday, July 17, 2012 4:01 AM
> > > To: Liu Qiang-B32616
> > > Cc: linux-crypto@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> Phillips
> > > Kim-R1AAHA; herbert@gondor.hengli.com.au; davem@davemloft.net; Dan
> > > Williams; Vinod Koul; Li Yang-R58472
> > > Subject: Re: [PATCH v3 3/4] fsl-dma: change release process of dma
> > > descriptor for supporting async_tx
> > >
> > > On Mon, Jul 16, 2012 at 12:08:29PM +0800, Qiang Liu wrote:
> > > > Fix the potential risk when enable config NET_DMA and ASYNC_TX.
> > > > Async_tx is lack of support in current release process of dma
> > > > descriptor, all descriptors will be released whatever is acked or
> > > > no-acked by async_tx, so there is a potential race condition when
> dma
> > > > engine is uesd by others clients (e.g. when enable NET_DMA to
> offload
> > > TCP).
> > > >
> > > > In our case, a race condition which is raised when use both of
> talitos
> > > > and dmaengine to offload xor is because napi scheduler will sync
> all
> > > > pending requests in dma channels, it affects the process of raid
> > > > operations due to ack_tx is not checked in fsl dma. The no-acked
> > > > descriptor is freed which is submitted just now, as a dependent tx,
> > > > this freed descriptor trigger
> > > > BUG_ON(async_tx_test_ack(depend_tx)) in async_tx_submit().
> > > >
> > > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > > Cc: Vinod Koul <vinod.koul@intel.com>
> > > > Cc: Li Yang <leoli@freescale.com>
> > > > Cc: Ira W. Snyder <iws@ovro.caltech.edu>
> > > > Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
> > > > ---
> > > >  drivers/dma/fsldma.c |  378 +++++++++++++++++++++++++++++---------
> ----
> > > -------
> > > >  drivers/dma/fsldma.h |    1 +
> > > >  2 files changed, 225 insertions(+), 154 deletions(-)
> > > >
> > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index
> > > > 4f2f212..4ee1b8f 100644
> > > > --- a/drivers/dma/fsldma.c
> > > > +++ b/drivers/dma/fsldma.c
> > > > @@ -400,6 +400,217 @@ out_splice:
> > > >  	list_splice_tail_init(&desc->tx_list, &chan->ld_pending);  }
> > > >
> > > > +/**
> > > > + * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > > > + * @chan : Freescale DMA channel
> > > > + *
> > > > + * HARDWARE STATE: idle
> > > > + * LOCKING: must hold chan->desc_lock  */ static void
> > > > +fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) {
> > > > +	struct fsl_desc_sw *desc;
> > > > +
> > > > +	/*
> > > > +	 * If the list of pending descriptors is empty, then we
> > > > +	 * don't need to do any work at all
> > > > +	 */
> > > > +	if (list_empty(&chan->ld_pending)) {
> > > > +		chan_dbg(chan, "no pending LDs\n");
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * The DMA controller is not idle, which means that the
> interrupt
> > > > +	 * handler will start any queued transactions when it runs
> after
> > > > +	 * this transaction finishes
> > > > +	 */
> > > > +	if (!chan->idle) {
> > > > +		chan_dbg(chan, "DMA controller still busy\n");
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * If there are some link descriptors which have not been
> > > > +	 * transferred, we need to start the controller
> > > > +	 */
> > > > +
> > > > +	/*
> > > > +	 * Move all elements from the queue of pending transactions
> > > > +	 * onto the list of running transactions
> > > > +	 */
> > > > +	chan_dbg(chan, "idle, starting controller\n");
> > > > +	desc =3D list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> > > node);
> > > > +	list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > > > +
> > > > +	/*
> > > > +	 * The 85xx DMA controller doesn't clear the channel start
> bit
> > > > +	 * automatically at the end of a transfer. Therefore we must
> clear
> > > > +	 * it in software before starting the transfer.
> > > > +	 */
> > > > +	if ((chan->feature & FSL_DMA_IP_MASK) =3D=3D FSL_DMA_IP_85XX) {
> > > > +		u32 mode;
> > > > +
> > > > +		mode =3D DMA_IN(chan, &chan->regs->mr, 32);
> > > > +		mode &=3D ~FSL_DMA_MR_CS;
> > > > +		DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Program the descriptor's address into the DMA controller,
> > > > +	 * then start the DMA transaction
> > > > +	 */
> > > > +	set_cdar(chan, desc->async_tx.phys);
> > > > +	get_cdar(chan);
> > > > +
> > > > +	dma_start(chan);
> > > > +	chan->idle =3D false;
> > > > +}
> > > > +
> > >
> > > Please add a note about the locking requirements here, and for the
> other
> > > new functions you've added.
> > >
> > > You call this function from two places:
> > >
> > > 1) fsldma_cleanup_descriptor() - called with mod->desc_lock held
> > > 2) fsl_tx_status() - WITHOUT mod->desc_lock held
> > >
> > > One of them is definitely wrong, and I'd bet that it is #2. You're
> > > modifying ld_completed without a lock.
> > Yes, My bad, I will correct it.
> >
> > >
> > > > +static int
> > > > +fsldma_clean_completed_descriptor(struct fsldma_chan *chan) {
> > > > +	struct fsl_desc_sw *desc, *_desc;
> > > > +
> > > > +	/* Run the callback for each descriptor, in order */
> > > > +	list_for_each_entry_safe(desc, _desc, &chan->ld_completed,
> node) {
> > > > +
> > > > +		if (async_tx_test_ack(&desc->async_tx)) {
> > > > +			/* Remove from the list of transactions */
> > > > +			list_del(&desc->node);
> > > > +#ifdef FSL_DMA_LD_DEBUG
> > > > +			chan_dbg(chan, "LD %p free\n", desc); #endif
> > > > +			dma_pool_free(chan->desc_pool, desc,
> > > > +					desc->async_tx.phys);
> > > > +		}
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * fsldma_run_tx_complete_actions - cleanup and free a single link
> > > > +descriptor
> > > > + * @chan: Freescale DMA channel
> > > > + * @desc: descriptor to cleanup and free
> > > > + * @cookie: Freescale DMA transaction identifier
> > > > + *
> > > > + * This function is used on a descriptor which has been executed
> by
> > > > +the DMA
> > > > + * controller. It will run any callbacks, submit any dependencies,
> > > > +and then
> > > > + * free the descriptor.
> > > > + */
> > > > +static dma_cookie_t fsldma_run_tx_complete_actions(struct
> fsl_desc_sw
> > > *desc,
> > > > +		struct fsldma_chan *chan, dma_cookie_t cookie) {
> > > > +	struct dma_async_tx_descriptor *txd =3D &desc->async_tx;
> > > > +	struct device *dev =3D chan->common.device->dev;
> > > > +	dma_addr_t src =3D get_desc_src(chan, desc);
> > > > +	dma_addr_t dst =3D get_desc_dst(chan, desc);
> > > > +	u32 len =3D get_desc_cnt(chan, desc);
> > > > +
> > > > +	BUG_ON(txd->cookie < 0);
> > > > +
> > > > +	if (txd->cookie > 0) {
> > > > +		cookie =3D txd->cookie;
> > > > +
> > > > +		/* Run the link descriptor callback function */
> > > > +		if (txd->callback) {
> > > > +#ifdef FSL_DMA_LD_DEBUG
> > > > +			chan_dbg(chan, "LD %p callback\n", desc); #endif
> > > > +			txd->callback(txd->callback_param);
> > > > +		}
> > > > +
> > > > +		/* Unmap the dst buffer, if requested */
> > > > +		if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > > > +			if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > > > +				dma_unmap_single(dev, dst, len,
> DMA_FROM_DEVICE);
> > > > +			else
> > > > +				dma_unmap_page(dev, dst, len,
> DMA_FROM_DEVICE);
> > > > +		}
> > > > +
> > > > +		/* Unmap the src buffer, if requested */
> > > > +		if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > > > +			if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > > > +				dma_unmap_single(dev, src, len,
> DMA_TO_DEVICE);
> > > > +			else
> > > > +				dma_unmap_page(dev, src, len,
> DMA_TO_DEVICE);
> > > > +		}
> > > > +	}
> > > > +
> > > > +	/* Run any dependencies */
> > > > +	dma_run_dependencies(txd);
> > > > +
> > > > +	return cookie;
> > > > +}
> > > > +
> > > > +static int
> > > > +fsldma_clean_running_descriptor(struct fsl_desc_sw *desc,
> > > > +			struct fsldma_chan *chan)
> > > > +{
> > > > +	/* Remove from the list of transactions */
> > > > +	list_del(&desc->node);
> > > > +	/* the client is allowed to attach dependent operations
> > > > +	 * until 'ack' is set
> > > > +	 */
> > >
> > > This comment is does not follow the coding style. It should be:
> > >
> > > /*
> > >  * the client is allowed to attech dependent operations
> > >  * until 'ack' is set
> > >  */
> > Fine, I will correct it in v4. Include another 2 places related to
> coding style.
> > Thanks.
> >
> > >
> > > > +	if (!async_tx_test_ack(&desc->async_tx)) {
> > > > +		/* move this slot to the completed */
> > >
> > > Perhaps a better comment would be:
> > >
> > > Move this descriptor to the list of descriptors which is complete,
> but
> > > still awaiting the 'ack' bit to be set.
> > Accept.
> >
> > >
> > > > +		list_add_tail(&desc->node, &chan->ld_completed);
> > > > +		return 0;
> > > > +	}
> > > > +
> > > > +	dma_pool_free(chan->desc_pool, desc,
> > > > +			desc->async_tx.phys);
> > >
> > > This should all be on one line. It is less than 80 characters wide.
> > >
> > Accept.
> >
> > > > +	return 0;
> > > > +}
> > > > +
> > >
> > > Locking notes please.
> > I will add it in v4, please check. Thanks.
> >
> > >
> > > > +static void fsldma_cleanup_descriptor(struct fsldma_chan *chan) {
> > > > +	struct fsl_desc_sw *desc, *_desc;
> > > > +	dma_cookie_t cookie =3D 0;
> > > > +	dma_addr_t curr_phys =3D get_cdar(chan);
> > > > +	int idle =3D dma_is_idle(chan);
> > > > +	int seen_current =3D 0;
> > > > +
> > > > +	fsldma_clean_completed_descriptor(chan);
> > > > +
> > > > +	/* Run the callback for each descriptor, in order */
> > > > +	list_for_each_entry_safe(desc, _desc, &chan->ld_running, node)
> {
> > > > +		/* do not advance past the current descriptor loaded
> into the
> > > > +		 * hardware channel, subsequent descriptors are either
> in
> > > > +		 * process or have not been submitted
> > > > +		 */
> > >
> > > Coding style.
> > >
> > > > +		if (seen_current)
> > > > +			break;
> > > > +
> > > > +		/* stop the search if we reach the current descriptor
> and the
> > > > +		 * channel is busy
> > > > +		 */
> > >
> > > Coding style.
> > >
> > > > +		if (desc->async_tx.phys =3D=3D curr_phys) {
> > > > +			seen_current =3D 1;
> > > > +			if (!idle)
> > > > +				break;
> > > > +		}
> > > > +
> > >
> > > This trick where you try to look at the hardware state to determine
> how
> > > much to clean up has been a source of headaches in the past versions
> of
> > > this driver.
> > I know a little about the history of this driver. Could you tell me
> what kind
> > headaches in the past versions, I can have a test and fix it in my
> patch. And
> > I also think use current phys addr should be more explicit.
> > Thanks.
> >
>=20
> It has been a very long time since I last worked on this code, but I
> will try to remember.
>=20
> I remember one problem where the currently running descriptor was free'd
> while the hardware was still executing it.
>=20
> There was another related problem where the DMA hardware would lock up,
> and it is impossible to recover without a hard reset. The CB (channel
> busy) bit gets stuck on, and the running transfer never completes. This
> may have been due to a programming issue in a user of the DMAEngine API,
> and not this driver itself. I don't remember for sure anymore.
>=20
> Our current system here at work has 120 boards with this DMA controller.
> Each one runs 100+ DMA operations per second, 24/7/365. They've been
> online for more than two years. I know for sure that the current code
> doesn't lock up the DMA controller. :)
>=20
> I think your bug is real, and needs to be fixed. I'm just trying to make
> sure I understand the change, so it won't break other users. I'm only
> trying to help.
Thanks for your information. I will do more explanation of my idea in the
following answer.
Could you provide some sample code for reproduce bug, I don't know dmatest.=
c
whether is enough. I will add dma self test in fsldma.c.

>=20
> > >
> > > It is much easier to reason about the state of the hardware and avoid
> > > race conditions if you complete entire blocks of descriptors after
> the
> > > hardware interrupts you to tell you it is finished.
> > In current driver, it's ok, but there is a potential issue when enable
> > async_tx api. How can you determine these descriptors in ld_running
> whether
> > have been completed in fsl_tx_status()? (I mean in my patch.)
> >
>=20
> I think your idea using an ld_completed list for descriptors which have
> been run by the hardware, but not yet 'ack'ed by software is fine.
>=20
> Would something similar to the following work?
>=20
> ld_pending: build a list of descriptors until issue_pending() is called.
> This is exactly what it does now, no changes needed.
Yes, no changes.

>=20
> ld_running: this list of descriptors is currently executing. Mostly
> unchanged from how it works now.
Yes, no changes.
I will add descriptions about all new interfaces.

>=20
> When you get an interrupt, you know the descriptors in ld_running are
> finished. Update the cookie to the highest number from ld_running. Free
> the descriptors which have been 'ack'ed immediately. Move the ones which
> are not 'ack'ed onto ld_completed. As part of this process, you should
> run dma_run_dependencies() and callbacks as needed.
Right, below is new implement of dma_do_tasklet();
First, dma_do_tasklet()
	-> fsldma_cleanup_descriptor()
		-> fsldma_clean_completed_descriptor(), free the descriptors which has be=
en ack'ed (I mean in queue ld_completed, not include other queues);
Second, transverse the queue ld_running
	-> fsldma_run_tx_complete_actions(), find the last descriptor which is com=
pleted, update the cookie to the highest number, run callback, unmap dma pa=
ges, and run dma_run_dependencies();
	-> fsldma_clean_running_descriptor(), move this descriptor from ld_running=
 to ld_completed.
Last,
	-> fsl_chan_xfer_ld_queue(), submit much more descriptors from ld_pending =
to ld_running.
I hope my description is clear enough.

>=20
> I think that this solves the problem you are seeing, which is that
> descriptors which have not been 'ack'ed are free'd.
>=20
> If this is unclear, I can write some code to demonstrate. If you could
> write a small test driver, similar to drivers/dma/dmatest.c, which tests
> the async_tx API without any extra hardware needed, I can run it. I have
> never used the async_tx API before.
Your description is clear, I agree with you, I will add fsldma self test in
the file.

>=20
> > >
> > > This is the philosophy employed by the driver before your
> modifications:
> > > ld_pending: a block of descriptors which is ready to run as soon as
> the
> > > hardware becomes idle.
> > > ld_running: a block of descriptors which is being executed by the
> > > hardware.
> > These 2 lists are not enough, async_tx descriptors must be kept order
> as
> > expected of its submitter, we only can process the descriptor which has
> been
> > acked by async_tx api, but not free all descriptors in ld_running.
> > For example,
> > Step 1, in async_tx memcpy api,
> > tx2 =3D device->device_prep_dma_memcpy(...),
> > tx2->depend_tx =3D tx1;
> > step 2, in async_tx submit api,
> > async_tx_submit() will check tx2->depend_tx whether has been acked,
> > unfortunately, tx2->depend_tx is freed before step 2 (dma channels is
> flushed
> > by other drivers), the contents of tx2->depend_tx is set to
> POOL_POISON_FREED
> > if DMAPOOL_DEBUG is enabled (illegal pointer will be used if disable
> this config
> > option);
> > step 3, BUG_ON(async_tx_test_ack(depend_tx)) is triggered.
> >
> > For resolve this potential risk, first, ack flag must be checked in dma
> driver,
> > second, ld_completed is added to restore all descriptors which have
> been acked
> > and completed.
> > In my following answer, most of all are around this idea.
> >
> > >
> > > > +		cookie =3D fsldma_run_tx_complete_actions(desc, chan,
> cookie);
> > > > +
> > > > +		if (fsldma_clean_running_descriptor(desc, chan))
> > > > +			break;
> > > > +
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Start any pending transactions automatically
> > > > +	 *
> > > > +	 * In the ideal case, we keep the DMA controller busy while
> we go
> > > > +	 * ahead and free the descriptors below.
> > > > +	 */
> > > > +	fsl_chan_xfer_ld_queue(chan);
> > > > +
> > > > +	if (cookie > 0)
> > > > +		chan->common.completed_cookie =3D cookie; }
> > > > +
> > > >  static dma_cookie_t fsl_dma_tx_submit(struct
> dma_async_tx_descriptor
> > > > *tx)  {
> > > >  	struct fsldma_chan *chan =3D to_fsl_chan(tx->chan); @@ -534,8
> +745,10
> > > > @@ static void fsl_dma_free_chan_resources(struct dma_chan *dchan)
> > > >
> > > >  	chan_dbg(chan, "free all channel resources\n");
> > > >  	spin_lock_irqsave(&chan->desc_lock, flags);
> > > > +	fsldma_cleanup_descriptor(chan);
> > > >  	fsldma_free_desc_list(chan, &chan->ld_pending);
> > > >  	fsldma_free_desc_list(chan, &chan->ld_running);
> > > > +	fsldma_free_desc_list(chan, &chan->ld_completed);
> > > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > > >
> > > >  	dma_pool_destroy(chan->desc_pool);
> > > > @@ -811,124 +1024,6 @@ static int fsl_dma_device_control(struct
> > > > dma_chan *dchan,  }
> > > >
> > > >  /**
> > > > - * fsldma_cleanup_descriptor - cleanup and free a single link
> > > > descriptor
> > > > - * @chan: Freescale DMA channel
> > > > - * @desc: descriptor to cleanup and free
> > > > - *
> > > > - * This function is used on a descriptor which has been executed
> by
> > > > the DMA
> > > > - * controller. It will run any callbacks, submit any dependencies,
> > > > and then
> > > > - * free the descriptor.
> > > > - */
> > > > -static void fsldma_cleanup_descriptor(struct fsldma_chan *chan,
> > > > -				      struct fsl_desc_sw *desc)
> > > > -{
> > > > -	struct dma_async_tx_descriptor *txd =3D &desc->async_tx;
> > > > -	struct device *dev =3D chan->common.device->dev;
> > > > -	dma_addr_t src =3D get_desc_src(chan, desc);
> > > > -	dma_addr_t dst =3D get_desc_dst(chan, desc);
> > > > -	u32 len =3D get_desc_cnt(chan, desc);
> > > > -
> > > > -	/* Run the link descriptor callback function */
> > > > -	if (txd->callback) {
> > > > -#ifdef FSL_DMA_LD_DEBUG
> > > > -		chan_dbg(chan, "LD %p callback\n", desc);
> > > > -#endif
> > > > -		txd->callback(txd->callback_param);
> > > > -	}
> > > > -
> > > > -	/* Run any dependencies */
> > > > -	dma_run_dependencies(txd);
> > > > -
> > > > -	/* Unmap the dst buffer, if requested */
> > > > -	if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
> > > > -		if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
> > > > -			dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);
> > > > -		else
> > > > -			dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
> > > > -	}
> > > > -
> > > > -	/* Unmap the src buffer, if requested */
> > > > -	if (!(txd->flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
> > > > -		if (txd->flags & DMA_COMPL_SRC_UNMAP_SINGLE)
> > > > -			dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
> > > > -		else
> > > > -			dma_unmap_page(dev, src, len, DMA_TO_DEVICE);
> > > > -	}
> > > > -
> > > > -#ifdef FSL_DMA_LD_DEBUG
> > > > -	chan_dbg(chan, "LD %p free\n", desc);
> > > > -#endif
> > > > -	dma_pool_free(chan->desc_pool, desc, txd->phys);
> > > > -}
> > > > -
> > > > -/**
> > > > - * fsl_chan_xfer_ld_queue - transfer any pending transactions
> > > > - * @chan : Freescale DMA channel
> > > > - *
> > > > - * HARDWARE STATE: idle
> > > > - * LOCKING: must hold chan->desc_lock
> > > > - */
> > > > -static void fsl_chan_xfer_ld_queue(struct fsldma_chan *chan) -{
> > > > -	struct fsl_desc_sw *desc;
> > > > -
> > > > -	/*
> > > > -	 * If the list of pending descriptors is empty, then we
> > > > -	 * don't need to do any work at all
> > > > -	 */
> > > > -	if (list_empty(&chan->ld_pending)) {
> > > > -		chan_dbg(chan, "no pending LDs\n");
> > > > -		return;
> > > > -	}
> > > > -
> > > > -	/*
> > > > -	 * The DMA controller is not idle, which means that the
> interrupt
> > > > -	 * handler will start any queued transactions when it runs
> after
> > > > -	 * this transaction finishes
> > > > -	 */
> > > > -	if (!chan->idle) {
> > > > -		chan_dbg(chan, "DMA controller still busy\n");
> > > > -		return;
> > > > -	}
> > > > -
> > > > -	/*
> > > > -	 * If there are some link descriptors which have not been
> > > > -	 * transferred, we need to start the controller
> > > > -	 */
> > > > -
> > > > -	/*
> > > > -	 * Move all elements from the queue of pending transactions
> > > > -	 * onto the list of running transactions
> > > > -	 */
> > > > -	chan_dbg(chan, "idle, starting controller\n");
> > > > -	desc =3D list_first_entry(&chan->ld_pending, struct fsl_desc_sw,
> > > node);
> > > > -	list_splice_tail_init(&chan->ld_pending, &chan->ld_running);
> > > > -
> > > > -	/*
> > > > -	 * The 85xx DMA controller doesn't clear the channel start
> bit
> > > > -	 * automatically at the end of a transfer. Therefore we must
> clear
> > > > -	 * it in software before starting the transfer.
> > > > -	 */
> > > > -	if ((chan->feature & FSL_DMA_IP_MASK) =3D=3D FSL_DMA_IP_85XX) {
> > > > -		u32 mode;
> > > > -
> > > > -		mode =3D DMA_IN(chan, &chan->regs->mr, 32);
> > > > -		mode &=3D ~FSL_DMA_MR_CS;
> > > > -		DMA_OUT(chan, &chan->regs->mr, mode, 32);
> > > > -	}
> > > > -
> > > > -	/*
> > > > -	 * Program the descriptor's address into the DMA controller,
> > > > -	 * then start the DMA transaction
> > > > -	 */
> > > > -	set_cdar(chan, desc->async_tx.phys);
> > > > -	get_cdar(chan);
> > > > -
> > > > -	dma_start(chan);
> > > > -	chan->idle =3D false;
> > > > -}
> > > > -
> > > > -/**
> > > >   * fsl_dma_memcpy_issue_pending - Issue the DMA start command
> > > >   * @chan : Freescale DMA channel
> > > >   */
> > > > @@ -954,11 +1049,17 @@ static enum dma_status fsl_tx_status(struct
> > > dma_chan *dchan,
> > > >  	enum dma_status ret;
> > > >  	unsigned long flags;
> > > >
> > > > -	spin_lock_irqsave(&chan->desc_lock, flags);
> > > >  	ret =3D dma_cookie_status(dchan, cookie, txstate);
> > > > +	if (ret =3D=3D DMA_SUCCESS) {
> > > > +		fsldma_clean_completed_descriptor(chan);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	spin_lock_irqsave(&chan->desc_lock, flags);
> > > > +	fsldma_cleanup_descriptor(chan);
> > >
> > > You've made status from a very quick operation (compare two cookies)
> into
> > > a potentially long running operation. It may now run callbacks, unmap
> > > pages, etc.
> > Yes, that's right, callbacks and unmap pages will be invoked if DMA
> status
> > is not DMA_SUCCESS.
> >
> > >
> > > I note that Documentation/crypto/async-tx-api.txt section 3.6
> > > "Constraints" does say that async_*() functions cannot be called from
> IRQ
> > > context. However, the DMAEngine API itself does not have anything to
> say
> > > about IRQ context.
> > >
> > > I expect fsl_tx_status() to be quick, and not potentially call other
> > > arbitrary code.
> > fsl_tx_status() is not in IRQ context. It's reasonable to unmap pages
> and
> > run dependency of descriptor as fsldma_run_tx_complete_actions() does.
> >
>=20
> I looked at several other DMAEngine API drivers. Some of them do run
> dependencies and callbacks from tx_status(). You are correct, this is
> fine.
>=20
> > >
> > > Is it possible to leave this function unchanged?
> > According to my knowledge, it is unlikely to be left unchanged, or it
> will violate
> > the design of this interface. If DMA status is not DMA_SUCCESS, that
> means there
> > are new descriptors completed, we should move them to ld_completed, and
> do actions
> > to its async_tx descriptors, then dma descriptor will be freed at
> another time.
> > Most of my idea is from iop-adma.c, the original interface is not align
> with it.
> > Async_tx flags (expecially ack_test) is not considered when dma
> descriptor is freed,
> > so some random errors happened, I think you must familiar with this
> history.
> > As an important "synchronization flag", async_tx api must control the
> order of tx
> > descriptor. Dma driver also should make sure keeping order when free
> the descriptor.
> > That's the reason why I add ld_completed list.
> >
>=20
> Is it possible to put the descriptors into three different states?
>=20
> 1) ld_pending: ready to run, but not yet executed on the hardware
>=20
> Descriptors move to #2 through either dma_async_issue_pending() or by
> the interrupt which happens when the hardware completes executing a set
> of DMA descriptors.
>=20
> 2) ld_running: currently executing on the hardware
>=20
> Descriptors move to #3 through the interrupt which happens when the
> hardware completes executing a set of DMA descriptors.
>=20
> Dependencies and callbacks are executed as the descriptors are moved
> onto ld_completed.
>=20
> 3) ld_completed: finished executing on the hardware, but not yet 'ack'ed
>=20
> When they have been 'ack'ed, the descriptors are free'd.
>=20
>=20
> As noted above, I think I could code this up fairly quickly if this
> explanation is unclear.
Totally right. I think you understand my idea.

>=20
> Also note that I may not understand your bug completely, and my idea may
> be completely wrong!
I hope you've already understand my issue:)
Thanks.

>=20
> Ira
>=20
> > >
> > > Also note that if you leave this function unchanged, the locking
> error
> > > pointed out above (fsldma_clean_completed_descriptor() is not called
> with
> > > the lock held) goes away.
> > I know. Thanks.
> >
> > >
> > > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > > >
> > > > -	return ret;
> > > > +	return dma_cookie_status(dchan, cookie, txstate);
> > > >  }
> > > >
> > > >
> > > > /*-----------------------------------------------------------------
> ---
> > > > --------*/ @@ -1035,53 +1136,21 @@ static irqreturn_t
> > > > fsldma_chan_irq(int irq, void *data)  static void
> > > > dma_do_tasklet(unsigned long data)  {
> > > >  	struct fsldma_chan *chan =3D (struct fsldma_chan *)data;
> > > > -	struct fsl_desc_sw *desc, *_desc;
> > > > -	LIST_HEAD(ld_cleanup);
> > > >  	unsigned long flags;
> > > >
> > > >  	chan_dbg(chan, "tasklet entry\n");
> > > >
> > > >  	spin_lock_irqsave(&chan->desc_lock, flags);
> > > >
> > > > -	/* update the cookie if we have some descriptors to cleanup
> */
> > > > -	if (!list_empty(&chan->ld_running)) {
> > > > -		dma_cookie_t cookie;
> > > > -
> > > > -		desc =3D to_fsl_desc(chan->ld_running.prev);
> > > > -		cookie =3D desc->async_tx.cookie;
> > > > -		dma_cookie_complete(&desc->async_tx);
> > > > -
> > > > -		chan_dbg(chan, "completed_cookie=3D%d\n", cookie);
> > > > -	}
> > > > -
> > > > -	/*
> > > > -	 * move the descriptors to a temporary list so we can drop
> the lock
> > > > -	 * during the entire cleanup operation
> > > > -	 */
> > > > -	list_splice_tail_init(&chan->ld_running, &ld_cleanup);
> > > > +	/* Run all cleanup for this descriptor */
> > > > +	fsldma_cleanup_descriptor(chan);
> > > >
> > > >  	/* the hardware is now idle and ready for more */
> > > >  	chan->idle =3D true;
> > > >
> > > > -	/*
> > > > -	 * Start any pending transactions automatically
> > > > -	 *
> > > > -	 * In the ideal case, we keep the DMA controller busy while
> we go
> > > > -	 * ahead and free the descriptors below.
> > > > -	 */
> > > >  	fsl_chan_xfer_ld_queue(chan);
> > > >  	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > > >
> > > > -	/* Run the callback for each descriptor, in order */
> > > > -	list_for_each_entry_safe(desc, _desc, &ld_cleanup, node) {
> > > > -
> > > > -		/* Remove from the list of transactions */
> > > > -		list_del(&desc->node);
> > > > -
> > > > -		/* Run all cleanup for this descriptor */
> > > > -		fsldma_cleanup_descriptor(chan, desc);
> > > > -	}
> > > > -
> > > >  	chan_dbg(chan, "tasklet exit\n");
> > > >  }
> > > >
> > > > @@ -1262,6 +1331,7 @@ static int __devinit
> fsl_dma_chan_probe(struct
> > > fsldma_device *fdev,
> > > >  	spin_lock_init(&chan->desc_lock);
> > > >  	INIT_LIST_HEAD(&chan->ld_pending);
> > > >  	INIT_LIST_HEAD(&chan->ld_running);
> > > > +	INIT_LIST_HEAD(&chan->ld_completed);
> > > >  	chan->idle =3D true;
> > > >
> > > >  	chan->common.device =3D &fdev->common; diff --git
> > > > a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h index
> f5c3879..7ede908
> > > > 100644
> > > > --- a/drivers/dma/fsldma.h
> > > > +++ b/drivers/dma/fsldma.h
> > > > @@ -140,6 +140,7 @@ struct fsldma_chan {
> > > >  	spinlock_t desc_lock;		/* Descriptor operation lock */
> > > >  	struct list_head ld_pending;	/* Link descriptors queue */
> > > >  	struct list_head ld_running;	/* Link descriptors queue */
> > > > +	struct list_head ld_completed;	/* Link descriptors queue
> */
> > > >  	struct dma_chan common;		/* DMA common channel */
> > > >  	struct dma_pool *desc_pool;	/* Descriptors pool */
> > > >  	struct device *dev;		/* Channel device */
> > > > --
> > > > 1.7.5.1
> > > >
> > > >
> >
> >

^ permalink raw reply

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Ram Pai @ 2012-07-18  4:25 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Gavin Shan, Ram Pai, linuxppc-dev, linux-pci, yinghai
In-Reply-To: <CAErSpo4AS=QCRE3w-QBzuuSSEK8VeQ+PXxyp3mTyDAEvASxQcQ@mail.gmail.com>

On Tue, Jul 17, 2012 at 11:14:51AM -0600, Bjorn Helgaas wrote:
> On Tue, Jul 17, 2012 at 4:38 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Tue, 2012-07-17 at 18:03 +0800, Ram Pai wrote:
> >>         Lets say we passed that 'type' flag to size the minimum
> >>         alignment constraints for that b_res.  And window_alignment(bus,
> >>         type) of your platform  used that 'type' information to
> >>         determine whether to use the alignment constraints of 32-bit
> >>         window or 64-bit window.
> >>
> >>         However, later when the b_res is actually allocated a resource,
> >>         the pci_assign_resource() has no idea whether to allocate 32-bit
> >>         window resource or 64-bit window resource, because the 'type'
> >>         information is not captured anywhere in b_res.
> >>
> >>         You would basically have a disconnect between what is sized and
> >>         what is allocated. Unless offcourse you pass that 'type' to
> >>         the b_res->flags, which is currently not the case.
> >
> > Right, we ideally would need the core to query the alignment once per
> > "apertures" it tries as a candidate to allocate a given resource... but
> > that's for later.
> >
> > For now we can probably live with giving out the max of the minimum
> > alignment we support for M64 and our M32 segment size.
> 
> We already know the aperture we're proposing to allocate from (the
> result of find_free_bus_resource()), don't we?  What if we passed it
> to pcibios_window_alignment() along with the struct pci_bus *, e.g.:
> 
>   resource_size_t pcibios_window_alignment(struct pci_bus *bus, struct
> resource *window)

Hmm..'struct resource *window' may not yet know which aperature it is
allocated from. Will it? We are still in the sizing process, the allocation will
be done much later.

RP

^ permalink raw reply

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Ram Pai @ 2012-07-18  4:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Gavin Shan, linux-pci, Ram Pai, linuxppc-dev, bhelgaas, yinghai
In-Reply-To: <1342521498.3669.7.camel@pasglop>

On Tue, Jul 17, 2012 at 08:38:18PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-07-17 at 18:03 +0800, Ram Pai wrote:
> >         Lets say we passed that 'type' flag to size the minimum
> >         alignment constraints for that b_res.  And window_alignment(bus,
> >         type) of your platform  used that 'type' information to
> >         determine whether to use the alignment constraints of 32-bit
> >         window or 64-bit window.
> > 
> >         However, later when the b_res is actually allocated a resource,
> >         the pci_assign_resource() has no idea whether to allocate 32-bit
> >         window resource or 64-bit window resource, because the 'type'
> >         information is not captured anywhere in b_res.
> > 
> >         You would basically have a disconnect between what is sized and 
> >         what is allocated. Unless offcourse you pass that 'type' to
> >         the b_res->flags, which is currently not the case. 
> 
> Right, we ideally would need the core to query the alignment once per
> "apertures" it tries as a candidate to allocate a given resource... but
> that's for later.
> 
> For now we can probably live with giving out the max of the minimum
> alignment we support for M64 and our M32 segment size.

Its an approximation, which may not be terribly bad. But not comforting enough.

RP

^ permalink raw reply

* Re: [PATCH 05/15] pci: resource assignment based on p2p alignment
From: Ram Pai @ 2012-07-18  5:02 UTC (permalink / raw)
  To: Gavin Shan; +Cc: Ram Pai, linuxppc-dev, linux-pci, Bjorn Helgaas, yinghai
In-Reply-To: <20120718010746.GA4238@shangw>

On Wed, Jul 18, 2012 at 09:07:46AM +0800, Gavin Shan wrote:
> 4. Either "b_res->flags & mask" or "type" to be used for window_alignment(),
>    I don't think it's big deal because "b_res->flags & mask == type" for
>    current implementation. However, I'm not sure I still need change it
>    to "type"?
> 
>    min_align = max(min_align, window_alignment(bus, b_res->flags & mask));

Ah. you are right.

(b_res->flags & mask) or type, either one is ok. I had a wrong
assumption about b_res->flags. I thought it has either IORESOURCE_MEM
set or IORESOURCE_PREFETCH set, but not both.

Whatever concern I had, i dont have it any more.
RP

^ permalink raw reply

* [PATCH v2] PCI: use dev->irq instead of dev->pin to enable non MSI/INTx interrupt
From: Shengzhou Liu @ 2012-07-18  6:06 UTC (permalink / raw)
  To: bhelgaas, linux-pci; +Cc: linuxppc-dev, Shengzhou Liu

On some platforms, root port has neither MSI/MSI-X nor INTx interrupt
generated in RC mode. In this case, we have to use other interrupt(e.g.
system shared interrupt) for port service irq to have AER, Hot-plug, etc,
services to work.

Signed-off-by: Shengzhou Liu <Shengzhou.Liu@freescale.com>
---
 drivers/pci/pcie/portdrv_core.c |   16 ++++++++++++----
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 75915b3..49acf72 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -200,10 +200,13 @@ static int init_service_irqs(struct pci_dev *dev, int *irqs, int mask)
 {
 	int i, irq = -1;
 
-	/* We have to use INTx if MSI cannot be used for PCIe PME or pciehp. */
+	/*
+	 * We have to use INTx or other interrupts(e.g. system shared interrupt)
+	 * if MSI cannot be used for PCIe PME or pciehp.
+	 */
 	if (((mask & PCIE_PORT_SERVICE_PME) && pcie_pme_no_msi()) ||
 	    ((mask & PCIE_PORT_SERVICE_HP) && pciehp_no_msi())) {
-		if (dev->pin)
+		if (dev->irq)
 			irq = dev->irq;
 		goto no_msi;
 	}
@@ -212,8 +215,13 @@ static int init_service_irqs(struct pci_dev *dev, int *irqs, int mask)
 	if (!pcie_port_enable_msix(dev, irqs, mask))
 		return 0;
 
-	/* We're not going to use MSI-X, so try MSI and fall back to INTx */
-	if (!pci_enable_msi(dev) || dev->pin)
+	/*
+	 * We're not going to use MSI-X, so try MSI and fall back to INTx.
+	 * If neither MSI/MSI-X nor INTx available, try other interrupt. (On
+	 * some platforms, root port doesn't support generating MSI/MSI-X/INTx
+	 * in RC mode)
+	 */
+	if (!pci_enable_msi(dev) || dev->irq)
 		irq = dev->irq;
 
  no_msi:
-- 
1.6.4

^ permalink raw reply related

* RE: [PATCH] PCI: use dev->irq instead of dev->pin to enable non MSI/INTx interrupt
From: Liu Shengzhou-B36685 @ 2012-07-18  6:36 UTC (permalink / raw)
  To: Wood Scott-B07421
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <5005A3D2.5060504@freescale.com>

DQo+IC0tLS0tT3JpZ2luYWwgTWVzc2FnZS0tLS0tDQo+IEZyb206IFdvb2QgU2NvdHQtQjA3NDIx
DQo+IFNlbnQ6IFdlZG5lc2RheSwgSnVseSAxOCwgMjAxMiAxOjQyIEFNDQo+IFRvOiBMaXUgU2hl
bmd6aG91LUIzNjY4NQ0KPiBDYzogYmhlbGdhYXNAZ29vZ2xlLmNvbTsgbGludXgtcGNpQHZnZXIu
a2VybmVsLm9yZzsgbGludXhwcGMtDQo+IGRldkBsaXN0cy5vemxhYnMub3JnDQo+IFN1YmplY3Q6
IFJlOiBbUEFUQ0hdIFBDSTogdXNlIGRldi0+aXJxIGluc3RlYWQgb2YgZGV2LT5waW4gdG8gZW5h
YmxlIG5vbg0KPiBNU0kvSU5UeCBpbnRlcnJ1cHQNCj4gDQo+IE9uIDA3LzE2LzIwMTIgMDk6MzUg
UE0sIFNoZW5nemhvdSBMaXUgd3JvdGU6DQo+ID4gT24gc29tZSBwbGF0Zm9ybXMsIHJvb3QgcG9y
dCBoYXMgbmVpdGhlciBNU0kvTVNJLVggbm9yIElOVHggaW50ZXJydXB0DQo+ID4gZ2VuZXJhdGVk
IGluIFJDIG1vZGUuIEluIHRoaXMgY2FzZSwgd2UgaGF2ZSB0byB1c2Ugb3RoZXIgaW50ZXJydXB0
KGkuZS4NCj4gPiBzeXN0ZW0gc2hhcmVkIGludGVycnVwdCkgZm9yIHBvcnQgc2VydmljZSBpcnEg
dG8gaGF2ZSBBRVIsIEhvdC1wbHVnLA0KPiA+IGV0Yywgc2VydmljZXMgdG8gd29yay4NCj4gPg0K
PiA+IFNpZ25lZC1vZmYtYnk6IFNoZW5nemhvdSBMaXUgPFNoZW5nemhvdS5MaXVAZnJlZXNjYWxl
LmNvbT4NCj4gPiAtLS0NCj4gPiAgZHJpdmVycy9wY2kvcGNpZS9wb3J0ZHJ2X2NvcmUuYyB8ICAg
IDkgKysrKysrKy0tDQo+ID4gIDEgZmlsZXMgY2hhbmdlZCwgNyBpbnNlcnRpb25zKCspLCAyIGRl
bGV0aW9ucygtKQ0KPiA+DQo+ID4gZGlmZiAtLWdpdCBhL2RyaXZlcnMvcGNpL3BjaWUvcG9ydGRy
dl9jb3JlLmMNCj4gPiBiL2RyaXZlcnMvcGNpL3BjaWUvcG9ydGRydl9jb3JlLmMgaW5kZXggNzU5
MTViMy4uYTg1NTI1NCAxMDA2NDQNCj4gPiAtLS0gYS9kcml2ZXJzL3BjaS9wY2llL3BvcnRkcnZf
Y29yZS5jDQo+ID4gKysrIGIvZHJpdmVycy9wY2kvcGNpZS9wb3J0ZHJ2X2NvcmUuYw0KPiA+IEBA
IC0yMTIsOCArMjEyLDEzIEBAIHN0YXRpYyBpbnQgaW5pdF9zZXJ2aWNlX2lycXMoc3RydWN0IHBj
aV9kZXYgKmRldiwgaW50DQo+ICppcnFzLCBpbnQgbWFzaykNCj4gPiAgCWlmICghcGNpZV9wb3J0
X2VuYWJsZV9tc2l4KGRldiwgaXJxcywgbWFzaykpDQo+ID4gIAkJcmV0dXJuIDA7DQo+ID4NCj4g
PiAtCS8qIFdlJ3JlIG5vdCBnb2luZyB0byB1c2UgTVNJLVgsIHNvIHRyeSBNU0kgYW5kIGZhbGwg
YmFjayB0byBJTlR4ICovDQo+ID4gLQlpZiAoIXBjaV9lbmFibGVfbXNpKGRldikgfHwgZGV2LT5w
aW4pDQo+ID4gKwkvKg0KPiA+ICsJICogV2UncmUgbm90IGdvaW5nIHRvIHVzZSBNU0ktWCwgc28g
dHJ5IE1TSSBhbmQgZmFsbCBiYWNrIHRvIElOVHguDQo+ID4gKwkgKiBJZiBuZWl0aGVyIE1TSS9N
U0ktWCBub3IgSU5UeCBhdmFpbGFibGUsIHRyeSBvdGhlciBpbnRlcnJ1cHQuIChPbg0KPiA+ICsJ
ICogc29tZSBwbGF0Zm9ybXMsIHJvb3QgcG9ydCBkb2Vzbid0IHN1cHBvcnQgZ2VuZXJhdGluZyBN
U0kvTVNJLVgvSU5UeA0KPiA+ICsJICogaW4gUkMgbW9kZSkNCj4gPiArCSAqLw0KPiA+ICsJaWYg
KCFwY2lfZW5hYmxlX21zaShkZXYpIHx8IGRldi0+aXJxKQ0KPiA+ICAJCWlycSA9IGRldi0+aXJx
Ow0KPiANCj4gV2hhdCBhYm91dCB0aGUgb3RoZXIgdXNhZ2Ugb2YgZGV2LT5waW4gYSBmZXcgbGlu
ZXMgdXA/DQo+IA0KPiAtU2NvdHQNClllcywgaXQgc2hvdWxkIGJlIGNvbnNpc3RlbnQgd2l0aCBi
b3R0b20sIGRvbmUgaW4gdjIuDQpUaGFua3MsDQpTaGVuZ3pob3UNCg==

^ permalink raw reply

* Re: [PATCH] mm: setup pageblock_order before it's used by sparse
From: Benjamin Herrenschmidt @ 2012-07-18  7:15 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tony Luck, Jiang Liu, linux-mm, linux-kernel, stable, Minchan Kim,
	Keping Chen, Mel Gorman, KOSAKI Motohiro, David Rientjes,
	Xishi Qiu, Andrew Morton, David Gibson, linuxppc-dev,
	KAMEZAWA Hiroyuki, Jiang Liu
In-Reply-To: <CAE9FiQVxY9E3L_xmRj10+9D6NVbKaxaAd2oJ6EFe1D+Gy2971w@mail.gmail.com>

On Mon, 2012-07-02 at 20:25 -0700, Yinghai Lu wrote:
> > That means pageblock_order is always set to "MAX_ORDER - 1", not sure
> > whether this is intended. And it has the same issue as IA64 of wasting
> > memory if CONFIG_SPARSE is enabled.
> 
> adding BenH, need to know if it is powerpc intended.
> 
> >
> > So it would be better to keep function set_pageblock_order(), it will
> > fix the memory wasting on both IA64 and PowerPC.
> 
> Should setup pageblock_order as early as possible to avoid confusing.

Hrm, HPAGE_SHIFT is initially 0 because we only know at runtime what
huge page sizes are going to be supported (if any).

The business with pageblock_order is new to me and does look bogus today
indeed. But not a huge deal either. Our MAX_ORDER is typically 9 (64K
pages) or 13 (4K pages) and our standard huge page size is generally 16M
so there isn't a big difference here.

Still, maybe something worth looking into...

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] mm: setup pageblock_order before it's used by sparse
From: Benjamin Herrenschmidt @ 2012-07-18  7:17 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Tony Luck, linux-mm, linuxppc-dev, linux-kernel, stable,
	Minchan Kim, Keping Chen, Mel Gorman, KOSAKI Motohiro,
	David Rientjes, Xishi Qiu, Andrew Morton, David Gibson,
	Yinghai Lu, KAMEZAWA Hiroyuki, Jiang Liu
In-Reply-To: <4FF26726.8010904@huawei.com>

On Tue, 2012-07-03 at 11:29 +0800, Jiang Liu wrote:
> OK, waiting response from PPC. If we could find some ways to set
> HPAGE_SIZE
> early on PPC too, we can setup pageblock_order in arch instead of
> page_alloc.c
> as early as possible. 

We could split our hugetlbpage_init() into two, with the bit that sets
HPAGE_SHIFT called earlier if it's really an issue.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] MIPS: fix bug.h MIPS build regression
From: Geert Uytterhoeven @ 2012-07-18  8:35 UTC (permalink / raw)
  To: David Daney, Andrew Morton
  Cc: Linux MIPS Mailing List, Linux-sh list, linux-kernel,
	Ralf Baechle, Linuxppc-dev, Paul Mundt, Chris Zankel,
	Yoichi Yuasa
In-Reply-To: <CAMuHMdXSGgH+M_2+xuzY2t_sGZto24=atv3Kaj+R-dWAUPzW7w@mail.gmail.com>

On Mon, Jul 16, 2012 at 9:27 PM, Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> On Fri, Jun 22, 2012 at 7:54 PM, David Daney <ddaney@caviumnetworks.com> wrote:
>> On 06/20/2012 09:12 AM, Ralf Baechle wrote:
>>>
>>> On Wed, Jun 20, 2012 at 03:27:59PM +0900, Yoichi Yuasa wrote:
>>>
>>>> Commit: 3777808873b0c49c5cf27e44c948dfb02675d578 breaks all MIPS builds.
>>>
>>>
>>> Thanks, fix applied.
>>>
>>
>> Where was it applied?
>>
>> It doesn't show up in linux-next for 20120622, which is where it is needed.
>
> It's also desperately needed in mainline for 3.5.
>
> Ralf?

Andrew? This prevents any green MIPS builds.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* mpc85xx_edac.c: Should the mpc85xx_l2_isr be shared irqs?
From: Xufeng Zhang @ 2012-07-18  9:00 UTC (permalink / raw)
  To: linuxppc-dev

Hi All,

I detected below error when booting p1021mds after enabled EDAC feature:
EDAC MC: Ver: 2.1.0 Jul 17 2012
Freescale(R) MPC85xx EDAC driver, (C) 2006 Montavista Software
EDAC MC0: Giving out device to 'MPC85xx_edac' 'mpc85xx_mc_err': DEV mpc85xx_mc_e
rr
IRQ 45/[EDAC] MC err: IRQF_DISABLED is not guaranteed on shared IRQs
MPC85xx_edac acquired irq 45 for MC
MPC85xx_edac MC err registered
EDAC DEVICE0: Giving out device to module 'MPC85xx_edac' controller 'mpc85xx_l2_
err': DEV 'mpc85xx_l2_err' (INTERRUPT)
mpc85xx_l2_err_probe: Unable to requiest irq 45 for MPC85xx L2 err

Then kernel hang.

When request irq for l2-cache, since it share the same irq with memory
controller,
I think the code should be:
--- a/drivers/edac/mpc85xx_edac.c
+++ b/drivers/edac/mpc85xx_edac.c
@@ -577,7 +577,7 @@ static int __devinit mpc85xx_l2_err_probe(struct
of_device *op,
 	if (edac_op_state == EDAC_OPSTATE_INT) {
 		pdata->irq = irq_of_parse_and_map(op->node, 0);
 		res = devm_request_irq(&op->dev, pdata->irq,
-				       mpc85xx_l2_isr, IRQF_DISABLED,
+				       mpc85xx_l2_isr, IRQF_DISABLED | IRQF_SHARED,
 				       "[EDAC] L2 err", edac_dev);
 		if (res < 0) {
 			printk(KERN_ERR





Thanks,
Xufeng Zhang

^ permalink raw reply

* Where can i find patches for Freescale MPC5125 Tower?
From: Mahesh Bajaj @ 2012-07-18  9:44 UTC (permalink / raw)
  To: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 378 bytes --]

Hello,

We have bought these evaluation boards for MPC5125 from Freescale. We did
receive the linux source code for this device. But i have noticed that
there are some patches available for many files. How do i get these patch
files?

Any help on this will be greatly appreciated. Support for Freescale is
NONE.

I saw the patch related info here...

Thanks in advance,
Mahesh.

[-- Attachment #2: Type: text/html, Size: 492 bytes --]

^ permalink raw reply

* [RFC PATCH v4 0/13] memory-hotplug : hot-remove physical memory
From: Yasuaki Ishimatsu @ 2012-07-18 10:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linuxppc-dev, linux-acpi
  Cc: len.brown, wency, paulus, minchan.kim, kosaki.motohiro, rientjes,
	cl, akpm, liuj97

This patch series aims to support physical memory hot-remove.

  [RFC PATCH v4 1/13] memory-hotplug : rename remove_memory to offline_memory
  [RFC PATCH v4 2/13] memory-hotplug : add physical memory hotplug code to acpi_memory_device_remove
  [RFC PATCH v4 3/13] memory-hotplug : check whether memory is present or not
  [RFC PATCH v4 4/13] memory-hotplug : remove /sys/firmware/memmap/X sysfs
  [RFC PATCH v4 5/13] memory-hotplug : does not release memory region in PAGES_PER_SECTION chunks
  [RFC PATCH v4 6/13] memory-hotplug : add memory_block_release
  [RFC PATCH v4 7/13] memory-hotplug : remove_memory calls __remove_pages
  [RFC PATCH v4 8/13] memory-hotplug : check page type in get_page_bootmem
  [RFC PATCH v4 9/13] memory-hotplug : move register_page_bootmem_info_node and put_page_bootmem for
sparse-vmemmap4
  [RFC PATCH v4 10/13] memory-hotplug : implement register_page_bootmem_info_section of sparse-vmemmap
  [RFC PATCH v4 11/13] memory-hotplug : free memmap of sparse-vmemmap
  [RFC PATCH v4 12/13] memory-hotplug : add node_device_release
  [RFC PATCH v4 13/13] memory-hotplug : remove sysfs file of node

Even if you apply these patches, you cannot remove the physical memory
completely since these patches are still under development. But other
components can be removed. I want you to cooperate to improve the
physical memory hot-remove. So please review these patches and give
your comment/idea.

The patches can free/remove following things:

  - acpi_memory_info                          : [RFC PATCH 2/13]
  - /sys/firmware/memmap/X/{end, start, type} : [RFC PATCH 4/13]
  - iomem_resource                            : [RFC PATCH 5/13]
  - mem_section and related sysfs files       : [RFC PATCH 6-11/13]
  - node and related sysfs files              : [RFC PATCH 12-13/13]

The patches cannot do following things yet:

  - page table of removed memory

If you find lack of function for physical memory hot-remove, please let me
know.

change log of v4:
 * remove "memory-hotplug : unify argument of firmware_map_add_early/hotplug"
   from the patch series, since the patch is a bugfix. It is being disccussed
   on other thread. But for testing the patch series, the patch is needed.
   So I added the patch as [PATCH 0/13].

 [RFC PATCH v4 2/13]
   * check memory is online or not at remove_memory()
   * add memory_add_physaddr_to_nid() to acpi_memory_device_remove() for
     getting node id
 
 [RFC PATCH v4 3/13]
   * create new patch : check memory is online or not at online_pages()

 [RFC PATCH v4 4/13]
   * add __ref section to remove_memory()
   * call firmware_map_remove_entry() before remove_sysfs_fw_map_entry()

 [RFC PATCH v4 11/13]
   * rewrite register_page_bootmem_memmap() for removing page used as PT/PMD

change log of v3:
 * rebase to 3.5.0-rc6

 [RFC PATCH v2 2/13]
   * remove extra kobject_put()

   * The patch was commented by Wen. Wen's comment is
     "acpi_memory_device_remove() should ignore a return value of
     remove_memory() since caller does not care the return value".
     But I did not change it since I think caller should care the
     return value. And I am trying to fix it as follow:

     https://lkml.org/lkml/2012/7/5/624

 [RFC PATCH v2 4/13]
   * remove a firmware_memmap_entry allocated by kzmalloc()

change log of v2:
 [RFC PATCH v2 2/13]
   * check whether memory block is offline or not before calling offline_memory()
   * check whether section is valid or not in is_memblk_offline()
   * call kobject_put() for each memory_block in is_memblk_offline()

 [RFC PATCH v2 3/13]
   * unify the end argument of firmware_map_add_early/hotplug

 [RFC PATCH v2 4/13]
   * add release_firmware_map_entry() for freeing firmware_map_entry

 [RFC PATCH v2 6/13]
  * add release_memory_block() for freeing memory_block

 [RFC PATCH v2 11/13]
  * fix wrong arguments of free_pages()

---
 arch/powerpc/platforms/pseries/hotplug-memory.c |   16 +-
 arch/x86/mm/init_64.c                           |  144 ++++++++++++++++++++++++
 drivers/acpi/acpi_memhotplug.c                  |   28 ++++
 drivers/base/memory.c                           |   54 ++++++++-
 drivers/base/node.c                             |    7 +
 drivers/firmware/memmap.c                       |   78 ++++++++++++-
 include/linux/firmware-map.h                    |    6 +
 include/linux/memory.h                          |    5 
 include/linux/memory_hotplug.h                  |   17 --
 include/linux/mm.h                              |    5 
 mm/memory_hotplug.c                             |   98 ++++++++++++----
 mm/sparse.c                                     |    5 
 12 files changed, 414 insertions(+), 49 deletions(-)

^ permalink raw reply

* [RFC PATCH 0/13] firmware_map : unify argument of firmware_map_add_early/hotplug
From: Yasuaki Ishimatsu @ 2012-07-18 10:04 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linuxppc-dev, linux-acpi
  Cc: len.brown, wency, paulus, minchan.kim, kosaki.motohiro, rientjes,
	cl, akpm, liuj97
In-Reply-To: <50068974.1070409@jp.fujitsu.com>

There are two ways to create /sys/firmware/memmap/X sysfs:

  - firmware_map_add_early
    When the system starts, it is calledd from e820_reserve_resources()
  - firmware_map_add_hotplug
    When the memory is hot plugged, it is called from add_memory()

But these functions are called without unifying value of end argument as below:

  - end argument of firmware_map_add_early()   : start + size - 1
  - end argument of firmware_map_add_hogplug() : start + size

The patch unifies them to "start + size". Even if applying the patch,
/sys/firmware/memmap/X/end file content does not change.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@kernel.org>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Tejun Heo <tj@kernel.org>
CC: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 arch/x86/kernel/e820.c    |    2 +-
 drivers/firmware/memmap.c |    8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

Index: linux-3.5-rc6/arch/x86/kernel/e820.c
===================================================================
--- linux-3.5-rc6.orig/arch/x86/kernel/e820.c	2012-07-18 17:19:38.391365260 +0900
+++ linux-3.5-rc6/arch/x86/kernel/e820.c	2012-07-18 17:19:43.616300222 +0900
@@ -944,7 +944,7 @@ void __init e820_reserve_resources(void)
 	for (i = 0; i < e820_saved.nr_map; i++) {
 		struct e820entry *entry = &e820_saved.map[i];
 		firmware_map_add_early(entry->addr,
-			entry->addr + entry->size - 1,
+			entry->addr + entry->size,
 			e820_type_to_string(entry->type));
 	}
 }
Index: linux-3.5-rc6/drivers/firmware/memmap.c
===================================================================
--- linux-3.5-rc6.orig/drivers/firmware/memmap.c	2012-07-18 17:19:38.388365299 +0900
+++ linux-3.5-rc6/drivers/firmware/memmap.c	2012-07-18 18:30:47.608390251 +0900
@@ -98,7 +98,7 @@ static LIST_HEAD(map_entries);
 /**
  * firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
  * @start: Start of the memory range.
- * @end:   End of the memory range (inclusive).
+ * @end:   End of the memory range.
  * @type:  Type of the memory range.
  * @entry: Pre-allocated (either kmalloc() or bootmem allocator), uninitialised
  *         entry.
@@ -113,7 +113,7 @@ static int firmware_map_add_entry(u64 st
 	BUG_ON(start > end);
 
 	entry->start = start;
-	entry->end = end;
+	entry->end = end - 1;
 	entry->type = type;
 	INIT_LIST_HEAD(&entry->list);
 	kobject_init(&entry->kobj, &memmap_ktype);
@@ -148,7 +148,7 @@ static int add_sysfs_fw_map_entry(struct
  * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
  * memory hotplug.
  * @start: Start of the memory range.
- * @end:   End of the memory range (inclusive).
+ * @end:   End of the memory range.
  * @type:  Type of the memory range.
  *
  * Adds a firmware mapping entry. This function is for memory hotplug, it is
@@ -175,7 +175,7 @@ int __meminit firmware_map_add_hotplug(u
 /**
  * firmware_map_add_early() - Adds a firmware mapping entry.
  * @start: Start of the memory range.
- * @end:   End of the memory range (inclusive).
+ * @end:   End of the memory range.
  * @type:  Type of the memory range.
  *
  * Adds a firmware mapping entry. This function uses the bootmem allocator

^ permalink raw reply

* [RFC PATCH v4 1/13] memory-hotplug : rename remove_memory to offline_memory
From: Yasuaki Ishimatsu @ 2012-07-18 10:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linuxppc-dev, linux-acpi
  Cc: len.brown, wency, paulus, minchan.kim, kosaki.motohiro, rientjes,
	cl, akpm, liuj97
In-Reply-To: <50068974.1070409@jp.fujitsu.com>

remove_memory() does not remove memory but just offlines memory. The patch
changes name of it to offline_memory().

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org> 
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> 
CC: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 drivers/acpi/acpi_memhotplug.c |    2 +-
 drivers/base/memory.c          |    4 ++--
 include/linux/memory_hotplug.h |    2 +-
 mm/memory_hotplug.c            |    6 +++---
 4 files changed, 7 insertions(+), 7 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/acpi_memhotplug.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/acpi_memhotplug.c	2012-07-03 14:21:46.102416917 +0900
+++ linux-3.5-rc4/drivers/acpi/acpi_memhotplug.c	2012-07-03 14:21:49.458374960 +0900
@@ -318,7 +318,7 @@ static int acpi_memory_disable_device(st
 	 */
 	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
 		if (info->enabled) {
-			result = remove_memory(info->start_addr, info->length);
+			result = offline_memory(info->start_addr, info->length);
 			if (result)
 				return result;
 		}
Index: linux-3.5-rc4/drivers/base/memory.c
===================================================================
--- linux-3.5-rc4.orig/drivers/base/memory.c	2012-07-03 14:21:46.095417003 +0900
+++ linux-3.5-rc4/drivers/base/memory.c	2012-07-03 14:21:49.459374948 +0900
@@ -266,8 +266,8 @@ memory_block_action(unsigned long phys_i
 			break;
 		case MEM_OFFLINE:
 			start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
-			ret = remove_memory(start_paddr,
-					    nr_pages << PAGE_SHIFT);
+			ret = offline_memory(start_paddr,
+					     nr_pages << PAGE_SHIFT);
 			break;
 		default:
 			WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
Index: linux-3.5-rc4/mm/memory_hotplug.c
===================================================================
--- linux-3.5-rc4.orig/mm/memory_hotplug.c	2012-07-03 14:21:46.102416917 +0900
+++ linux-3.5-rc4/mm/memory_hotplug.c	2012-07-03 14:21:49.466374860 +0900
@@ -990,7 +990,7 @@ out:
 	return ret;
 }
 
-int remove_memory(u64 start, u64 size)
+int offline_memory(u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 
@@ -999,9 +999,9 @@ int remove_memory(u64 start, u64 size)
 	return offline_pages(start_pfn, end_pfn, 120 * HZ);
 }
 #else
-int remove_memory(u64 start, u64 size)
+int offline_memory(u64 start, u64 size)
 {
 	return -EINVAL;
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-EXPORT_SYMBOL_GPL(remove_memory);
+EXPORT_SYMBOL_GPL(offline_memory);
Index: linux-3.5-rc4/include/linux/memory_hotplug.h
===================================================================
--- linux-3.5-rc4.orig/include/linux/memory_hotplug.h	2012-07-03 14:21:46.102416917 +0900
+++ linux-3.5-rc4/include/linux/memory_hotplug.h	2012-07-03 14:21:49.471374796 +0900
@@ -233,7 +233,7 @@ static inline int is_mem_section_removab
 extern int mem_online_node(int nid);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
-extern int remove_memory(u64 start, u64 size);
+extern int offline_memory(u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 								int nr_pages);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);

^ permalink raw reply

* [RFC PATCH v4 2/13] memory-hotplug : add physical memory hotplug code to acpi_memory_device_remove
From: Yasuaki Ishimatsu @ 2012-07-18 10:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linuxppc-dev, linux-acpi
  Cc: len.brown, wency, paulus, minchan.kim, kosaki.motohiro, rientjes,
	cl, akpm, liuj97
In-Reply-To: <50068974.1070409@jp.fujitsu.com>

acpi_memory_device_remove() has been prepared to remove physical memory.
But, the function only frees acpi_memory_device currentlry. 

The patch adds following functions into acpi_memory_device_remove():
  - offline memory
  - remove physical memory. It only check whether memory is online or not.
  - free acpi_memory_device

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org> 
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> 
CC: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 drivers/acpi/acpi_memhotplug.c |   27 ++++++++++++++++++++++++++-
 drivers/base/memory.c          |   39 +++++++++++++++++++++++++++++++++++++++
 include/linux/memory.h         |    5 +++++
 include/linux/memory_hotplug.h |    5 +++++
 mm/memory_hotplug.c            |   22 ++++++++++++++++++++++
 5 files changed, 97 insertions(+), 1 deletion(-)

Index: linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c
===================================================================
--- linux-3.5-rc6.orig/drivers/acpi/acpi_memhotplug.c	2012-07-17 11:20:15.117796971 +0900
+++ linux-3.5-rc6/drivers/acpi/acpi_memhotplug.c	2012-07-17 13:36:30.325594022 +0900
@@ -29,6 +29,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/types.h>
+#include <linux/memory.h>
 #include <linux/memory_hotplug.h>
 #include <linux/slab.h>
 #include <acpi/acpi_drivers.h>
@@ -452,12 +453,36 @@ static int acpi_memory_device_add(struct
 static int acpi_memory_device_remove(struct acpi_device *device, int type)
 {
 	struct acpi_memory_device *mem_device = NULL;
-
+	struct acpi_memory_info *info, *tmp;
+	int result;
+	int node;
 
 	if (!device || !acpi_driver_data(device))
 		return -EINVAL;
 
 	mem_device = acpi_driver_data(device);
+
+	node = acpi_get_node(mem_device->device->handle);
+	list_for_each_entry_safe(info, tmp, &mem_device->res_list, list) {
+		if (!info->enabled)
+			continue;
+
+		if (!is_memblk_offline(info->start_addr, info->length)) {
+			result = offline_memory(info->start_addr, info->length);
+			if (result)
+				return result;
+		}
+		if (node < 0)
+			node = memory_add_physaddr_to_nid(info->start_addr);
+
+		result = remove_memory(node, info->start_addr, info->length);
+		if (result)
+			return result;
+
+		list_del(&info->list);
+		kfree(info);
+	}
+
 	kfree(mem_device);
 
 	return 0;
Index: linux-3.5-rc6/include/linux/memory_hotplug.h
===================================================================
--- linux-3.5-rc6.orig/include/linux/memory_hotplug.h	2012-07-17 11:20:15.133796772 +0900
+++ linux-3.5-rc6/include/linux/memory_hotplug.h	2012-07-17 11:29:41.490716352 +0900
@@ -221,6 +221,7 @@ static inline void unlock_memory_hotplug
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
 extern int is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
+extern int remove_memory(int nid, u64 start, u64 size);
 
 #else
 static inline int is_mem_section_removable(unsigned long pfn,
@@ -228,6 +229,10 @@ static inline int is_mem_section_removab
 {
 	return 0;
 }
+static inline int remove_memory(int nid, u64 start, u64 size)
+{
+	return -EBUSY;
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 extern int mem_online_node(int nid);
Index: linux-3.5-rc6/mm/memory_hotplug.c
===================================================================
--- linux-3.5-rc6.orig/mm/memory_hotplug.c	2012-07-17 11:20:15.129796821 +0900
+++ linux-3.5-rc6/mm/memory_hotplug.c	2012-07-17 13:25:18.952986069 +0900
@@ -998,6 +998,28 @@ int offline_memory(u64 start, u64 size)
 	end_pfn = start_pfn + PFN_DOWN(size);
 	return offline_pages(start_pfn, end_pfn, 120 * HZ);
 }
+
+int remove_memory(int nid, u64 start, u64 size)
+{
+	int ret = -EBUSY;
+	lock_memory_hotplug();
+	/*
+	 * The memory might become online by other task, even if you offine it.
+	 * So we check whether the cpu has been onlined or not.
+	 */
+	if (!is_memblk_offline(start, size)) {
+		pr_warn("memory removing [mem %#010llx-%#010llx] failed, "
+			"because the memmory range is online\n",
+			start, start + size);
+		ret = -EAGAIN;
+	}
+
+	unlock_memory_hotplug();
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(remove_memory);
+
 #else
 int offline_memory(u64 start, u64 size)
 {
Index: linux-3.5-rc6/drivers/base/memory.c
===================================================================
--- linux-3.5-rc6.orig/drivers/base/memory.c	2012-07-17 11:20:15.120796934 +0900
+++ linux-3.5-rc6/drivers/base/memory.c	2012-07-17 11:20:54.626302995 +0900
@@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
 }
 EXPORT_SYMBOL(unregister_memory_isolate_notifier);
 
+bool is_memblk_offline(unsigned long start, unsigned long size)
+{
+	struct memory_block *mem = NULL;
+	struct mem_section *section;
+	unsigned long start_pfn, end_pfn;
+	unsigned long pfn, section_nr;
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = start_pfn + PFN_DOWN(start);
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		section_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(section_nr));
+			continue;
+
+		section = __nr_to_section(section_nr);
+		/* same memblock? */
+		if (mem)
+			if((section_nr >= mem->start_section_nr) &&
+			   (section_nr <= mem->end_section_nr))
+				continue;
+
+		mem = find_memory_block_hinted(section, mem);
+		if (!mem)
+			continue;
+		if (mem->state == MEM_OFFLINE)
+			continue;
+
+		kobject_put(&mem->dev.kobj);
+		return false;
+	}
+
+	if (mem)
+		kobject_put(&mem->dev.kobj);
+
+	return true;
+}
+EXPORT_SYMBOL(is_memblk_offline);
+
 /*
  * register_memory - Setup a sysfs device for a memory block
  */
Index: linux-3.5-rc6/include/linux/memory.h
===================================================================
--- linux-3.5-rc6.orig/include/linux/memory.h	2012-07-17 11:18:00.693477455 +0900
+++ linux-3.5-rc6/include/linux/memory.h	2012-07-17 11:20:54.632302919 +0900
@@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
 {
 	return 0;
 }
+static inline bool is_memblk_offline(unsigned long start, unsigned long size)
+{
+	return false;
+}
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
@@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
 extern struct memory_block *find_memory_block_hinted(struct mem_section *,
 							struct memory_block *);
 extern struct memory_block *find_memory_block(struct mem_section *);
+extern bool is_memblk_offline(unsigned long start, unsigned long size);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
 enum mem_add_context { BOOT, HOTPLUG };
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */

^ permalink raw reply

* [PATCH v4 3/13] memory-hotplug : check whether memory is present or not
From: Yasuaki Ishimatsu @ 2012-07-18 10:07 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linuxppc-dev, linux-acpi
  Cc: len.brown, wency, paulus, minchan.kim, kosaki.motohiro, rientjes,
	cl, akpm, liuj97
In-Reply-To: <50068974.1070409@jp.fujitsu.com>

If system supports memory hot-remove, online_pages() may online removed pages.
So online_pages() need to check whether onlining pages are present or not.

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org> 
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> 
CC: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 include/linux/mmzone.h |   21 +++++++++++++++++++++
 mm/memory_hotplug.c    |   13 +++++++++++++
 2 files changed, 34 insertions(+)

Index: linux-3.5-rc6/include/linux/mmzone.h
===================================================================
--- linux-3.5-rc6.orig/include/linux/mmzone.h	2012-07-08 09:23:56.000000000 +0900
+++ linux-3.5-rc6/include/linux/mmzone.h	2012-07-17 16:10:21.588186145 +0900
@@ -1168,6 +1168,27 @@ void sparse_init(void);
 #define sparse_index_init(_sec, _nid)  do {} while (0)
 #endif /* CONFIG_SPARSEMEM */
 
+#ifdef CONFIG_SPARSEMEM
+static inline int pfns_present(unsigned long pfn, unsigned long nr_pages)
+{
+	int i;
+	for (i = 0; i < nr_pages; i++) {
+		if (pfn_present(pfn + 1))
+			continue;
+		else {
+			unlock_memory_hotplug();
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+#else
+static inline int pfns_present(unsigned long pfn, unsigned long nr_pages)
+{
+	return 0;
+}
+#endif /* CONFIG_SPARSEMEM*/
+
 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
 bool early_pfn_in_nid(unsigned long pfn, int nid);
 #else
Index: linux-3.5-rc6/mm/memory_hotplug.c
===================================================================
--- linux-3.5-rc6.orig/mm/memory_hotplug.c	2012-07-17 14:26:40.000000000 +0900
+++ linux-3.5-rc6/mm/memory_hotplug.c	2012-07-17 16:09:50.070580170 +0900
@@ -467,6 +467,19 @@ int __ref online_pages(unsigned long pfn
 	struct memory_notify arg;
 
 	lock_memory_hotplug();
+	/*
+ 	 * If system supports memory hot-remove, the memory may have been
+ 	 * removed. So we check whether the memory has been removed or not.
+ 	 *
+ 	 * Note: When CONFIG_SPARSEMEM is defined, pfns_present() become
+ 	 *       effective. If CONFIG_SPARSEMEM is not defined, pfns_present()
+ 	 *       always returns 0.
+ 	 */
+	ret = pfns_present(pfn, nr_pages);
+	if (ret) {
+		unlock_memory_hotplug();
+		return ret;
+	}
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
 	arg.status_change_nid = -1;

^ permalink raw reply

* [RFC PATCH v4 4/13] memory-hotplug : remove /sys/firmware/memmap/X sysfs
From: Yasuaki Ishimatsu @ 2012-07-18 10:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linuxppc-dev, linux-acpi
  Cc: len.brown, wency, paulus, minchan.kim, kosaki.motohiro, rientjes,
	cl, akpm, liuj97
In-Reply-To: <50068974.1070409@jp.fujitsu.com>

When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
sysfs files are created. But there is no code to remove these files. The patch
implements the function to remove them.

Note : The code does not free firmware_map_entry since there is no way to free
       memory which is allocated by bootmem.

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org> 
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> 
CC: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 drivers/firmware/memmap.c    |   78 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/firmware-map.h |    6 +++
 mm/memory_hotplug.c          |    9 +++-
 3 files changed, 90 insertions(+), 3 deletions(-)

Index: linux-3.5-rc6/mm/memory_hotplug.c
===================================================================
--- linux-3.5-rc6.orig/mm/memory_hotplug.c	2012-07-18 17:20:05.670024283 +0900
+++ linux-3.5-rc6/mm/memory_hotplug.c	2012-07-18 17:51:03.933189930 +0900
@@ -1012,9 +1012,9 @@ int offline_memory(u64 start, u64 size)
 	return offline_pages(start_pfn, end_pfn, 120 * HZ);
 }
 
-int remove_memory(int nid, u64 start, u64 size)
+int __ref remove_memory(int nid, u64 start, u64 size)
 {
-	int ret = -EBUSY;
+	int ret = 0;
 	lock_memory_hotplug();
 	/*
 	 * The memory might become online by other task, even if you offine it.
@@ -1025,8 +1025,13 @@ int remove_memory(int nid, u64 start, u6
 			"because the memmory range is online\n",
 			start, start + size);
 		ret = -EAGAIN;
+		goto out;
 	}
 
+	/* remove memmap entry */
+	firmware_map_remove(start, start + size, "System RAM");
+
+out:
 	unlock_memory_hotplug();
 	return ret;
 
Index: linux-3.5-rc6/include/linux/firmware-map.h
===================================================================
--- linux-3.5-rc6.orig/include/linux/firmware-map.h	2012-07-18 17:19:37.007382563 +0900
+++ linux-3.5-rc6/include/linux/firmware-map.h	2012-07-18 17:42:20.804730245 +0900
@@ -25,6 +25,7 @@
 
 int firmware_map_add_early(u64 start, u64 end, const char *type);
 int firmware_map_add_hotplug(u64 start, u64 end, const char *type);
+int firmware_map_remove(u64 start, u64 end, const char *type);
 
 #else /* CONFIG_FIRMWARE_MEMMAP */
 
@@ -38,6 +39,11 @@ static inline int firmware_map_add_hotpl
 	return 0;
 }
 
+static inline int firmware_map_remove(u64 start, u64 end, const char *type)
+{
+	return 0;
+}
+
 #endif /* CONFIG_FIRMWARE_MEMMAP */
 
 #endif /* _LINUX_FIRMWARE_MAP_H */
Index: linux-3.5-rc6/drivers/firmware/memmap.c
===================================================================
--- linux-3.5-rc6.orig/drivers/firmware/memmap.c	2012-07-18 17:19:43.618300182 +0900
+++ linux-3.5-rc6/drivers/firmware/memmap.c	2012-07-18 17:42:20.846729721 +0900
@@ -21,6 +21,7 @@
 #include <linux/types.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 
 /*
  * Data types ------------------------------------------------------------------
@@ -79,7 +80,22 @@ static const struct sysfs_ops memmap_att
 	.show = memmap_attr_show,
 };
 
+#define to_memmap_entry(obj) container_of(obj, struct firmware_map_entry, kobj)
+
+static void release_firmware_map_entry(struct kobject *kobj)
+{
+	struct firmware_map_entry *entry = to_memmap_entry(kobj);
+	struct page *page;
+
+	page = virt_to_page(entry);
+	if (PageSlab(page) || PageCompound(page))
+		kfree(entry);
+
+	/* There is no way to free memory allocated from bootmem*/
+}
+
 static struct kobj_type memmap_ktype = {
+	.release	= release_firmware_map_entry,
 	.sysfs_ops	= &memmap_attr_ops,
 	.default_attrs	= def_attrs,
 };
@@ -123,6 +139,16 @@ static int firmware_map_add_entry(u64 st
 	return 0;
 }
 
+/**
+ * firmware_map_remove_entry() - Does the real work to remove a firmware
+ * memmap entry.
+ * @entry: removed entry.
+ **/
+static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)
+{
+	list_del(&entry->list);
+}
+
 /*
  * Add memmap entry on sysfs
  */
@@ -144,6 +170,31 @@ static int add_sysfs_fw_map_entry(struct
 	return 0;
 }
 
+/*
+ * Remove memmap entry on sysfs
+ */
+static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
+{
+	kobject_put(&entry->kobj);
+}
+
+/*
+ * Search memmap entry
+ */
+
+struct firmware_map_entry * __meminit
+find_firmware_map_entry(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	list_for_each_entry(entry, &map_entries, list)
+		if ((entry->start == start) && (entry->end == end) &&
+		    (!strcmp(entry->type, type)))
+			return entry;
+
+	return NULL;
+}
+
 /**
  * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
  * memory hotplug.
@@ -196,6 +247,32 @@ int __init firmware_map_add_early(u64 st
 	return firmware_map_add_entry(start, end, type, entry);
 }
 
+/**
+ * firmware_map_remove() - remove a firmware mapping entry
+ * @start: Start of the memory range.
+ * @end:   End of the memory range.
+ * @type:  Type of the memory range.
+ *
+ * removes a firmware mapping entry.
+ *
+ * Returns 0 on success, or -EINVAL if no entry.
+ **/
+int __meminit firmware_map_remove(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	entry = find_firmware_map_entry(start, end - 1, type);
+	if (!entry)
+		return -EINVAL;
+
+	firmware_map_remove_entry(entry);
+
+	/* remove the memmap entry */
+	remove_sysfs_fw_map_entry(entry);
+
+	return 0;
+}
+
 /*
  * Sysfs functions -------------------------------------------------------------
  */
@@ -218,7 +295,6 @@ static ssize_t type_show(struct firmware
 }
 
 #define to_memmap_attr(_attr) container_of(_attr, struct memmap_attribute, attr)
-#define to_memmap_entry(obj) container_of(obj, struct firmware_map_entry, kobj)
 
 static ssize_t memmap_attr_show(struct kobject *kobj,
 				struct attribute *attr, char *buf)

^ permalink raw reply

* [RFC PATCH v4 5/13] memory-hotplug : does not release memory region in PAGES_PER_SECTION chunks
From: Yasuaki Ishimatsu @ 2012-07-18 10:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linuxppc-dev, linux-acpi
  Cc: len.brown, wency, paulus, minchan.kim, kosaki.motohiro, rientjes,
	cl, akpm, liuj97
In-Reply-To: <50068974.1070409@jp.fujitsu.com>

Since applying a patch(de7f0cba96786c), release_mem_region() has been changed
as called in PAGES_PER_SECTION chunks because register_memory_resource() is
called in PAGES_PER_SECTION chunks by add_memory(). But it seems firmware
dependency. If CRS are written in the PAGES_PER_SECTION chunks in ACPI DSDT
Table, register_memory_resource() is called in PAGES_PER_SECTION chunks.
But if CRS are written in the DIMM unit in ACPI DSDT Table,
register_memory_resource() is called in DIMM unit. So release_mem_region()
should not be called in PAGES_PER_SECTION chunks. The patch fixes it.

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org> 
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> 
CC: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 arch/powerpc/platforms/pseries/hotplug-memory.c |   13 +++++++++----
 mm/memory_hotplug.c                             |    4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-3.5-rc6/mm/memory_hotplug.c
===================================================================
--- linux-3.5-rc6.orig/mm/memory_hotplug.c	2012-07-18 17:51:03.933189930 +0900
+++ linux-3.5-rc6/mm/memory_hotplug.c	2012-07-18 17:51:17.550020005 +0900
@@ -358,11 +358,11 @@ int __remove_pages(struct zone *zone, un
 	BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
 	BUG_ON(nr_pages % PAGES_PER_SECTION);
 
+	release_mem_region(phys_start_pfn << PAGE_SHIFT,  nr_pages * PAGE_SIZE);
+
 	sections_to_remove = nr_pages / PAGES_PER_SECTION;
 	for (i = 0; i < sections_to_remove; i++) {
 		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
-		release_mem_region(pfn << PAGE_SHIFT,
-				   PAGES_PER_SECTION << PAGE_SHIFT);
 		ret = __remove_section(zone, __pfn_to_section(pfn));
 		if (ret)
 			break;
Index: linux-3.5-rc6/arch/powerpc/platforms/pseries/hotplug-memory.c
===================================================================
--- linux-3.5-rc6.orig/arch/powerpc/platforms/pseries/hotplug-memory.c	2012-07-18 17:50:49.893365814 +0900
+++ linux-3.5-rc6/arch/powerpc/platforms/pseries/hotplug-memory.c	2012-07-18 17:51:17.553019968 +0900
@@ -77,7 +77,8 @@ static int pseries_remove_memblock(unsig
 {
 	unsigned long start, start_pfn;
 	struct zone *zone;
-	int ret;
+	int i, ret;
+	int sections_to_remove;
 
 	start_pfn = base >> PAGE_SHIFT;
 
@@ -97,9 +98,13 @@ static int pseries_remove_memblock(unsig
 	 * to sysfs "state" file and we can't remove sysfs entries
 	 * while writing to it. So we have to defer it to here.
 	 */
-	ret = __remove_pages(zone, start_pfn, memblock_size >> PAGE_SHIFT);
-	if (ret)
-		return ret;
+	sections_to_remove = (memblock_size >> PAGE_SHIFT) / PAGES_PER_SECTION;
+	for (i = 0; i < sections_to_remove; i++) {
+		unsigned long pfn = start_pfn + i * PAGES_PER_SECTION;
+		ret = __remove_pages(zone, start_pfn,  PAGES_PER_SECTION);
+		if (ret)
+			return ret;
+	}
 
 	/*
 	 * Update memory regions for memory remove

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox