public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Calvin Owens <calvinowens@fb.com>
To: Sreekanth Reddy <sreekanth.reddy@avagotech.com>
Cc: <martin.petersen@oracle.com>, <linux-scsi@vger.kernel.org>,
	<jejb@kernel.org>, <JBottomley@Parallels.com>,
	<Sathya.Prakash@avagotech.com>, <chaitra.basappa@avagotech.com>,
	<linux-kernel@vger.kernel.org>, <hch@infradead.org>,
	<calvinowens@fb.com>
Subject: Re: [PATCH] mpt2sas: mpt3sas: Fix memory corruption during initialization
Date: Wed, 6 May 2015 11:48:52 -0700	[thread overview]
Message-ID: <20150506184852.GA1007288@mail.thefacebook.com> (raw)
In-Reply-To: <1430751931-14002-1-git-send-email-Sreekanth.Reddy@avagotech.com>

On Monday 05/04 at 20:35 +0530, Sreekanth Reddy wrote:
> I have applied this patch on the latest upstream mpt3sas driver, then
> I have compiled and loaded the driver.  In the driver logs I didn't
> see any attached drives are added to the OS, 'fdisk -l' command also
> doesn't list the drives which are actually attached to the HBA.
>
> When I debug this issue then I see that in '_scsih_target_alloc'
> driver is searching for sas_device from the lists
> 'sas_device_init_list' & 'sas_device_list' based on the device sas
> address using the function
> mpt3sas_scsih_sas_device_find_by_sas_address(), since this device is
> not in the 'sas_device_init_list' (as it is moved it to head list)
> driver exit from this function without updating the required device
> addition information.

Yes, I misunderstood that the initialization depended on the devices
still being on the init_list.

What's interesting about this is that when I tested it, it still worked.
I think the MPT2SAS_PORT_ENABLE_COMPLETE fw_event might zero
ioc->start_scan and allow scsih_scan_finished() to start probing devices
before all the devices are actually on the init_list. It seems to be
very repeatable per-machine whether or not it works.

But in any case, my patch was wrong.

> To solve the original problem (i.e memory corruption), here I have
> attached the patch, in this patch I have added one atomic flag
> is_on_sas_device_init_list in _sas_device_structure and I followed
> below algorithm.

The problem is that this only solves a single case. There isn't anything
to enforce that this or a similar chain of events can't happen elsewhere
in the code.

I think the best general solution would be to add a refcount to these
objects.  They sit on a list that can be concurrently accessed from
multiple threads, so I think a refcount is the best way to ensure that
objects aren't freed out from under other users.

I'm working on a patchset that does this. I'm starting by adding a
refcount to the sas_device object only, and refactoring the code in
mpt2sas_scsih.c to use it. I should be able to send up a first version
of that pretty soon to get some feedback.

Thanks,
Calvin

 
> 1. when ever a device is added to sas_device_init_list then driver
> will set this atomic flag of this device to one.
> 
> 2. And during the addition of this device to SCSI mid layer, if the
> device is successfully added to the OS then driver will move this
> device list in to sas_device_list list from sas_device_init_list list
> and at this time driver will reset this flag to zero.  if device is
> failed to register with SCSI mid layer then also driver will reset
> this flag to zero in function _scsih_sas_device_remove and will remove
> the device entry from sas_device_init_list and will free the device
> structure.
> 
> 3. Now when a device is removed then driver will receive target not responding event and in the function _scsih_device_remove_by_handle,
>          a. driver will check whether addition of discovered devices to SML process is currently running or not,
>                i. if addition (or registration) of discovered devices to SML process is running then driver will check whether device is in sas_device_init_list or not (by reading the atomic flag)?.
>                     if it is in a sas_device_init_list then driver will ignore this device removal event (since device registration with SML will fail and it is removed in function _scsih_sas_device_remove as mentioned in step 2).
>              ii. if the device is not in a sas_device_init_list or addition (or registration) of discovered devices to SML process is already completed then device structure is removed from this function and this device entry is removed from sas_device_list.
> 
> 4. if the device removal event is received after device structure is freed due to failure of device registration with SML them in the function _scsih_device_remove_by_handle driver won't find this device in the sas_device_list or in a sas_device_init_list and so driver will ignore this  device removal event.
> 
> Signed-off-by: Sreekanth Reddy <Sreekanth.Reddy@avagotech.com>
> ---
>  drivers/scsi/mpt2sas/mpt2sas_base.h  |  2 ++
>  drivers/scsi/mpt2sas/mpt2sas_scsih.c | 45 +++++++++++++++++++++++++++---------
>  drivers/scsi/mpt3sas/mpt3sas_base.h  |  2 ++
>  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 43 ++++++++++++++++++++++++++--------
>  4 files changed, 71 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.h b/drivers/scsi/mpt2sas/mpt2sas_base.h
> index caff8d1..1aa10d2 100644
> --- a/drivers/scsi/mpt2sas/mpt2sas_base.h
> +++ b/drivers/scsi/mpt2sas/mpt2sas_base.h
> @@ -376,6 +376,7 @@ struct _sas_device {
>  	u8	phy;
>  	u8	responding;
>  	u8	pfa_led_on;
> +	atomic_t is_on_sas_device_init_list;
>  };
>  
>  /**
> @@ -833,6 +834,7 @@ struct MPT2SAS_ADAPTER {
>  	u8		broadcast_aen_busy;
>  	u16		broadcast_aen_pending;
>  	u8		shost_recovery;
> +	u8		discovered_device_addition_on;
>  
>  	struct mutex	reset_in_progress_mutex;
>  	spinlock_t 	ioc_reset_in_progress_lock;
> diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
> index 3f26147..2a61286 100644
> --- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
> +++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
> @@ -590,13 +590,20 @@ _scsih_sas_device_remove(struct MPT2SAS_ADAPTER *ioc,
>      struct _sas_device *sas_device)
>  {
>  	unsigned long flags;
> +	struct _sas_device *same_sas_device;
>  
>  	if (!sas_device)
>  		return;
>  
>  	spin_lock_irqsave(&ioc->sas_device_lock, flags);
> -	list_del(&sas_device->list);
> -	kfree(sas_device);
> +	same_sas_device = _scsih_sas_device_find_by_handle(ioc,
> +						sas_device->handle);
> +	if (same_sas_device) {
> +		list_del(&same_sas_device->list);
> +		if (atomic_read(&sas_device->is_on_sas_device_init_list))
> +			atomic_set(&sas_device->is_on_sas_device_init_list, 0);
> +		kfree(same_sas_device);
> +	}
>  	spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
>  }
>  
> @@ -658,6 +664,7 @@ _scsih_sas_device_init_add(struct MPT2SAS_ADAPTER *ioc,
>  	    "(0x%04x), sas_addr(0x%016llx)\n", ioc->name, __func__,
>  	    sas_device->handle, (unsigned long long)sas_device->sas_address));
>  
> +	atomic_set(&sas_device->is_on_sas_device_init_list, 1);
>  	spin_lock_irqsave(&ioc->sas_device_lock, flags);
>  	list_add_tail(&sas_device->list, &ioc->sas_device_init_list);
>  	_scsih_determine_boot_device(ioc, sas_device, 0);
> @@ -5364,8 +5371,14 @@ _scsih_device_remove_by_handle(struct MPT2SAS_ADAPTER *ioc, u16 handle)
>  
>  	spin_lock_irqsave(&ioc->sas_device_lock, flags);
>  	sas_device = _scsih_sas_device_find_by_handle(ioc, handle);
> -	if (sas_device)
> -		list_del(&sas_device->list);
> +	if (sas_device) {
> +		if (ioc->discovered_device_addition_on &&
> +		    atomic_read(&sas_device->is_on_sas_device_init_list)) {
> +			spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> +			return;
> +		} else
> +			list_del(&sas_device->list);
> +	}
>  	spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
>  	if (sas_device)
>  		_scsih_remove_device(ioc, sas_device);
> @@ -5391,8 +5404,14 @@ mpt2sas_device_remove_by_sas_address(struct MPT2SAS_ADAPTER *ioc,
>  	spin_lock_irqsave(&ioc->sas_device_lock, flags);
>  	sas_device = mpt2sas_scsih_sas_device_find_by_sas_address(ioc,
>  	    sas_address);
> -	if (sas_device)
> -		list_del(&sas_device->list);
> +	if (sas_device) {
> +		if (ioc->discovered_device_addition_on &&
> +		    atomic_read(&sas_device->is_on_sas_device_init_list)) {
> +			spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> +			return;
> +		} else
> +			list_del(&sas_device->list);
> +	}
>  	spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
>  	if (sas_device)
>  		_scsih_remove_device(ioc, sas_device);
> @@ -7978,32 +7997,36 @@ _scsih_probe_sas(struct MPT2SAS_ADAPTER *ioc)
>  	struct _sas_device *sas_device, *next;
>  	unsigned long flags;
>  
> +	ioc->discovered_device_addition_on = 1;
>  	/* SAS Device List */
>  	list_for_each_entry_safe(sas_device, next, &ioc->sas_device_init_list,
>  	    list) {
>  
>  		if (ioc->hide_drives)
>  			continue;
> -
> +
>  		if (!mpt2sas_transport_port_add(ioc, sas_device->handle,
>  		    sas_device->sas_address_parent)) {
> -			list_del(&sas_device->list);
> -			kfree(sas_device);
> +			mpt2sas_transport_port_remove(ioc,
> +					sas_device->sas_address,
> +					sas_device->sas_address_parent);
> +			_scsih_sas_device_remove(ioc, sas_device);
>  			continue;
>  		} else if (!sas_device->starget) {
>  			if (!ioc->is_driver_loading) {
>  				mpt2sas_transport_port_remove(ioc,
>  					sas_device->sas_address,
>  					sas_device->sas_address_parent);
> -				list_del(&sas_device->list);
> -				kfree(sas_device);
> +				_scsih_sas_device_remove(ioc, sas_device);
>  				continue;
>  			}
>  		}
>  		spin_lock_irqsave(&ioc->sas_device_lock, flags);
>  		list_move_tail(&sas_device->list, &ioc->sas_device_list);
> +		atomic_dec(&sas_device->is_on_sas_device_init_list);
>  		spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
>  	}
> +	ioc->discovered_device_addition_on = 0;
>  }
>  
>  /**
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h b/drivers/scsi/mpt3sas/mpt3sas_base.h
> index afa8816..6188490 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_base.h
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
> @@ -315,6 +315,7 @@ struct _sas_device {
>  	u8	responding;
>  	u8	fast_path;
>  	u8	pfa_led_on;
> +	atomic_t is_on_sas_device_init_list;
>  };
>  
>  /**
> @@ -766,6 +767,7 @@ struct MPT3SAS_ADAPTER {
>  	u8		broadcast_aen_busy;
>  	u16		broadcast_aen_pending;
>  	u8		shost_recovery;
> +	u8		discovered_device_addition_on;
>  
>  	struct mutex	reset_in_progress_mutex;
>  	spinlock_t	ioc_reset_in_progress_lock;
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> index 5a97e32..53cc9ea 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> @@ -582,13 +582,20 @@ _scsih_sas_device_remove(struct MPT3SAS_ADAPTER *ioc,
>  	struct _sas_device *sas_device)
>  {
>  	unsigned long flags;
> +	struct _sas_device *same_sas_device;
>  
>  	if (!sas_device)
>  		return;
>  
>  	spin_lock_irqsave(&ioc->sas_device_lock, flags);
> -	list_del(&sas_device->list);
> -	kfree(sas_device);
> +	same_sas_device = _scsih_sas_device_find_by_handle(ioc,
> +						sas_device->handle);
> +	if (same_sas_device) {
> +		list_del(&same_sas_device->list);
> +		if (atomic_read(&sas_device->is_on_sas_device_init_list))
> +			atomic_set(&sas_device->is_on_sas_device_init_list, 0);
> +		kfree(same_sas_device);
> +	}
>  	spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
>  }
>  
> @@ -610,8 +616,14 @@ _scsih_device_remove_by_handle(struct MPT3SAS_ADAPTER *ioc, u16 handle)
>  
>  	spin_lock_irqsave(&ioc->sas_device_lock, flags);
>  	sas_device = _scsih_sas_device_find_by_handle(ioc, handle);
> -	if (sas_device)
> -		list_del(&sas_device->list);
> +	if (sas_device) {
> +		if (ioc->discovered_device_addition_on &&
> +		    atomic_read(&sas_device->is_on_sas_device_init_list)) {
> +			spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> +			return;
> +		} else
> +			list_del(&sas_device->list);
> +	}
>  	spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
>  	if (sas_device)
>  		_scsih_remove_device(ioc, sas_device);
> @@ -637,8 +649,14 @@ mpt3sas_device_remove_by_sas_address(struct MPT3SAS_ADAPTER *ioc,
>  	spin_lock_irqsave(&ioc->sas_device_lock, flags);
>  	sas_device = mpt3sas_scsih_sas_device_find_by_sas_address(ioc,
>  	    sas_address);
> -	if (sas_device)
> -		list_del(&sas_device->list);
> +	if (sas_device) {
> +		if (ioc->discovered_device_addition_on &&
> +		    atomic_read(&sas_device->is_on_sas_device_init_list)) {
> +			spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> +			return;
> +		} else
> +			list_del(&sas_device->list);
> +	}
>  	spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
>  	if (sas_device)
>  		_scsih_remove_device(ioc, sas_device);
> @@ -663,6 +681,7 @@ _scsih_sas_device_add(struct MPT3SAS_ADAPTER *ioc,
>  		ioc->name, __func__, sas_device->handle,
>  		(unsigned long long)sas_device->sas_address));
>  
> +	atomic_set(&sas_device->is_on_sas_device_init_list, 1);
>  	spin_lock_irqsave(&ioc->sas_device_lock, flags);
>  	list_add_tail(&sas_device->list, &ioc->sas_device_list);
>  	spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> @@ -7610,14 +7629,17 @@ _scsih_probe_sas(struct MPT3SAS_ADAPTER *ioc)
>  	struct _sas_device *sas_device, *next;
>  	unsigned long flags;
>  
> +	ioc->discovered_device_addition_on = 1;
>  	/* SAS Device List */
>  	list_for_each_entry_safe(sas_device, next, &ioc->sas_device_init_list,
>  	    list) {
>  
>  		if (!mpt3sas_transport_port_add(ioc, sas_device->handle,
>  		    sas_device->sas_address_parent)) {
> -			list_del(&sas_device->list);
> -			kfree(sas_device);
> +			mpt3sas_transport_port_remove(ioc,
> +					sas_device->sas_address,
> +					sas_device->sas_address_parent);
> +			_scsih_sas_device_remove(ioc, sas_device);
>  			continue;
>  		} else if (!sas_device->starget) {
>  			/*
> @@ -7630,16 +7652,17 @@ _scsih_probe_sas(struct MPT3SAS_ADAPTER *ioc)
>  				mpt3sas_transport_port_remove(ioc,
>  				    sas_device->sas_address,
>  				    sas_device->sas_address_parent);
> -				list_del(&sas_device->list);
> -				kfree(sas_device);
> +				_scsih_sas_device_remove(ioc, sas_device);
>  				continue;
>  			}
>  		}
>  
>  		spin_lock_irqsave(&ioc->sas_device_lock, flags);
>  		list_move_tail(&sas_device->list, &ioc->sas_device_list);
> +		atomic_dec(&sas_device->is_on_sas_device_init_list);
>  		spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
>  	}
> +	ioc->discovered_device_addition_on = 0;
>  }
>  
>  /**
> -- 
> 2.0.2
> 

  parent reply	other threads:[~2015-05-06 18:49 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-04 15:05 [PATCH] mpt2sas: mpt3sas: Fix memory corruption during initialization Sreekanth Reddy
2015-05-05 15:35 ` Tomas Henzl
2015-05-12  9:38   ` Sreekanth Reddy
2015-05-06 18:48 ` Calvin Owens [this message]
2015-05-15  3:41   ` [PATCH 0/6] Fixes for memory corruption in mpt2sas Calvin Owens
2015-05-15  3:41     ` [PATCH 1/6] Add refcount to sas_device struct Calvin Owens
2015-05-15  3:41     ` [PATCH 2/6] Refactor code to use new sas_device refcount Calvin Owens
2015-05-15  3:41     ` [PATCH 3/6] Fix unsafe sas_device_list usage Calvin Owens
2015-05-15  3:42     ` [PATCH 4/6] Add refcount to fw_event_work struct Calvin Owens
2015-05-15  3:42     ` [PATCH 5/6] Refactor code to use new fw_event refcount Calvin Owens
2015-05-15  3:42     ` [PATCH 6/6] Fix unsafe fw_event_list usage Calvin Owens
2015-06-09  3:50     ` [RESEND][PATCH 0/6] Fixes for memory corruption in mpt2sas Calvin Owens
2015-06-09  3:50       ` [PATCH 1/6] Add refcount to sas_device struct Calvin Owens
2015-07-03 15:24         ` Christoph Hellwig
2015-06-09  3:50       ` [PATCH 2/6] Refactor code to use new sas_device refcount Calvin Owens
2015-07-03 15:38         ` Christoph Hellwig
2015-07-12  4:15           ` Calvin Owens
2015-06-09  3:50       ` [PATCH 3/6] Fix unsafe sas_device_list usage Calvin Owens
2015-07-03 16:03         ` Christoph Hellwig
2015-06-09  3:50       ` [PATCH 4/6] Add refcount to fw_event_work struct Calvin Owens
2015-07-03 15:38         ` Christoph Hellwig
2015-06-09  3:50       ` [PATCH 5/6] Refactor code to use new fw_event refcount Calvin Owens
2015-07-03 16:00         ` Christoph Hellwig
2015-07-12  4:13           ` Calvin Owens
2015-06-09  3:50       ` [PATCH 6/6] Fix unsafe fw_event_list usage Calvin Owens
2015-07-03 16:02         ` Christoph Hellwig
2015-07-12  4:20           ` Calvin Owens
2015-07-02 20:15       ` [RESEND][PATCH 0/6] Fixes for memory corruption in mpt2sas Bart Van Assche
2015-07-12  4:24       ` [PATCH 0/2 v2] " Calvin Owens
2015-07-12  4:24         ` [PATCH 1/2] mpt2sas: Refcount sas_device objects and fix unsafe list usage Calvin Owens
2015-07-13  6:52           ` Christoph Hellwig
2015-07-21  7:06             ` Calvin Owens
2015-07-13 15:05           ` Joe Lawrence
2015-07-21  7:04             ` Calvin Owens
2015-07-16 14:57           ` Sreekanth Reddy
2015-07-21  7:03             ` Calvin Owens
2015-07-12  4:24         ` [PATCH 2/2] mpt2sas: Refcount fw_events " Calvin Owens
2015-07-13  6:52           ` Christoph Hellwig
2015-08-01  5:02         ` [PATCH v3 0/2] Fixes for memory corruption in mpt2sas Calvin Owens
2015-08-01  5:02           ` [PATCH v3 1/2] mpt2sas: Refcount sas_device objects and fix unsafe list usage Calvin Owens
2015-08-10 13:15             ` Sreekanth Reddy
2015-08-14  1:43               ` Calvin Owens
2015-08-01  5:02           ` [PATCH v3 2/2] mpt2sas: Refcount fw_events " Calvin Owens
2015-08-14  1:48           ` [PATCH v4 0/2] Fixes for memory corruption in mpt2sas Calvin Owens
2015-08-14  1:48             ` [PATCH v4 1/2] mpt2sas: Refcount sas_device objects and fix unsafe list usage Calvin Owens
2015-08-14  1:48               ` [PATCH v4 2/2] mpt2sas: Refcount fw_events " Calvin Owens
2015-08-25 21:06                 ` Nicholas A. Bellinger
2015-09-04 14:35                 ` Sreekanth Reddy
2015-08-25 21:03               ` [PATCH v4 1/2] mpt2sas: Refcount sas_device objects " Nicholas A. Bellinger
2015-09-04 14:34               ` Sreekanth Reddy
2015-08-25 21:21             ` [PATCH v4 0/2] Fixes for memory corruption in mpt2sas Nicholas A. Bellinger
2015-07-02 19:22     ` [PATCH 0/6] " Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2015-04-10  7:14 [PATCH] mpt2sas: mpt3sas: Fix memory corruption during initialization Calvin Owens
2015-04-10 14:30 ` James Bottomley
2015-04-10 16:43   ` Sathya Prakash

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150506184852.GA1007288@mail.thefacebook.com \
    --to=calvinowens@fb.com \
    --cc=JBottomley@Parallels.com \
    --cc=Sathya.Prakash@avagotech.com \
    --cc=chaitra.basappa@avagotech.com \
    --cc=hch@infradead.org \
    --cc=jejb@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=sreekanth.reddy@avagotech.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox