* RE: Root device multipathed host freeze with the latest upstream multipath-tools package
@ 2008-01-22 16:26 Sarraf, Ritesh
  2008-01-22 18:14 ` Kiyoshi Ueda
  0 siblings, 1 reply; 5+ messages in thread
From: Sarraf, Ritesh @ 2008-01-22 16:26 UTC (permalink / raw)
  To: George, Martin, k-ueda; +Cc: dm-devel

Now that 'relatime' has been pushed to all major distributions, should we just
document and discourage 'noatime'? (At least for SAN boot.)
Apart from backup software and tmpclean, I can't recollect any other
user of it.
OTOH, LUNs could easily be backed up by users on the target.

Ritesh

________________________________

From: George, Martin 
Sent: Tuesday, January 22, 2008 9:00 PM
To: k-ueda@ct.jp.nec.com
Cc: Sarraf, Ritesh
Subject: Root device multipathed host freeze with the latest upstream
multipath-tools package


Hi Kiyoshi,
 
I took the latest upstream multipath-tools package (Jan 15, 2008) and
installed it on my RHEL 5.1 host to verify the libprio fix. To simulate
the FCP path faults, I ran your script (as attached in the mail) which
alternately offlined/onlined the corresponding SCSI paths of the root dm
device in sysfs. Listing my observations below:
 
1) The freeze was still reproducible. On checking the sysrq dumps (as
attached), I could see that it was the script itself, i.e. test.sh, which
seems to have stalled on the exec() system call, perhaps waiting for the
inode write-out of the updated access time (the script resides on my root
dm device itself). As you suggested in the bugzilla, I remounted the
root device with the noatime option and then reran the script; I have
not hit the freeze yet. Is this the expected behavior?
 
2) With the latest upstream multipath-tools package, "multipath -ll"
displays all paths with the same priority - I am not able to prioritize
paths into primary/secondary despite the normal group_by_prio setting.
Does the libprio fix alter the behavior here?
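
For reference on observation 1, a minimal sketch of checking which
access-time mode the root file system is currently mounted with. The helper
name root_atime_mode is hypothetical; it assumes a Linux-style /proc/mounts
and uses only shell built-ins. The remount itself would be along the lines
of `mount -o remount,noatime /`, run as root.

```shell
# Hypothetical helper: report the atime mode of "/" from a mounts table.
# Takes an optional path to the table; defaults to /proc/mounts (assumption:
# Linux layout "device mountpoint fstype options dump pass").
root_atime_mode() {
    mounts_file=${1:-/proc/mounts}
    mode=unknown
    if [ ! -r "$mounts_file" ]; then
        echo "$mode"
        return 0
    fi
    while read -r _dev mnt _fstype opts _rest; do
        [ "$mnt" = "/" ] || continue
        # Wrap the option list in commas so that "nodiratime" is not
        # mistaken for "noatime". The last entry for "/" wins.
        case ",$opts," in
            *,noatime,*)  mode=noatime ;;
            *,relatime,*) mode=relatime ;;
            *)            mode=atime ;;
        esac
    done < "$mounts_file"
    echo "$mode"
}

root_atime_mode
```

If this prints "atime", every exec of a file on the root device may trigger
an inode update, which is the write the script appeared to be waiting on.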
 
I would appreciate your comments on both points.
 
Thanks a lot,
-Martin


* Re: Root device multipathed host freeze with the latest upstream multipath-tools package
  2008-01-22 16:26 Root device multipathed host freeze with the latest upstream multipath-tools package Sarraf, Ritesh
@ 2008-01-22 18:14 ` Kiyoshi Ueda
  2008-01-23 10:58   ` Martin George
  0 siblings, 1 reply; 5+ messages in thread
From: Kiyoshi Ueda @ 2008-01-22 18:14 UTC (permalink / raw)
  To: marting; +Cc: Ritesh.Sarraf, dm-devel

Hi Martin,

Thank you for your testing.
Please see my comments below.

On Tue, 22 Jan 2008 21:56:13 +0530, "Sarraf, Ritesh" wrote:
> Hi Kiyoshi,
>  
> I took the latest upstream multipath-tools package (Jan 15, 2008) and
> installed it on my RHEL 5.1 host to verify the libprio fix. To simulate
> the FCP path faults, I ran your script (as attached in the mail) which
> alternately offlined/onlined the corresponding SCSI paths of the root dm
> device in sysfs. Listing my observations below:
>  
> 1) The freeze was still reproducible. On checking the sysrq dumps (as
> attached), I could see it was the script itself i.e. test.sh which seems
> to have stalled on the exec () system call perhaps waiting for inode
> write out for updated access time (the script resides on my root dm
> device itself). As suggested by you in the bugzilla, I remounted the
> root device using the noatime option and then reran the script - I have
> not hit the freeze yet. Is this the expected behavior? 

As for your script, it is the expected behavior.
I found that you added some sleep commands to my original script
posted by the following email.

    http://marc.info/?l=dm-devel&m=119465024621783&w=2

sleep is not a shell built-in command, so running it requires access to the
root device. I guess that is the reason for the freeze.
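
Whether a given command is a built-in can be checked from the shell itself.
A minimal sketch, assuming a POSIX shell; the helper name is_external is
hypothetical, and it relies on `command -v` printing an absolute path for
external programs:

```shell
# Hypothetical helper: succeeds if the command resolves to a file on disk,
# i.e. running it needs the file system that holds the binary.
is_external() {
    case "$(command -v "$1")" in
        /*) return 0 ;;   # absolute path: an external program
        *)  return 1 ;;   # built-in, function, alias, or not found
    esac
}

for cmd in echo sleep; do
    if is_external "$cmd"; then
        echo "$cmd: external, needs disk access"
    else
        echo "$cmd: built-in (or not found)"
    fi
done
```

Anything reported as external should be kept out of the window in which
all paths to the root device are offline.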

Please retest using a script that doesn't include any sleep command, or
use your own fault injection method.
If you need to sleep anyway, an empty while loop like the one below can be
used, though you will have to adjust the '1000000' for your system:
    i=0
    while [ $i -lt 1000000 ]; do
        i=$(($i + 1))
    done


> 2) With the latest upstream multipath-tools package, "multipath -ll"
> displays all paths with the same priority - I am not able to prioritize
> paths into primary/secondary despite the normal group_by_prio setting.
> Does the libprio fix alter the behavior here?

The keyword for the libprio setting is "prio", and the name of the netapp
prioritizer is "netapp".
So you need to change your multipath.conf like this:

    From: prio_callout    "/sbin/mpath_prio_netapp /dev/%n"
    To:   prio            "netapp"
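
If several hosts need the same change, the substitution above can be
scripted. A minimal sketch, assuming the old line appears exactly as shown
above; the helper name convert_netapp_prio is hypothetical, and the example
works on a throwaway file rather than the live multipath.conf:

```shell
# Hypothetical helper: print the given file with the old NetApp
# prio_callout line rewritten to the new "prio" keyword.
convert_netapp_prio() {
    sed 's|prio_callout[[:space:]]*"/sbin/mpath_prio_netapp /dev/%n"|prio            "netapp"|' "$1"
}

# Demonstrate on a temporary sample (%%n makes printf emit a literal %n).
sample=$(mktemp)
printf '    prio_callout    "/sbin/mpath_prio_netapp /dev/%%n"\n' > "$sample"
convert_netapp_prio "$sample"
rm -f "$sample"
```

The helper only prints the converted text; review the output before
replacing a real configuration file with it.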

Thanks,
Kiyoshi Ueda


* Re: Root device multipathed host freeze with the latest upstream multipath-tools package
  2008-01-22 18:14 ` Kiyoshi Ueda
@ 2008-01-23 10:58   ` Martin George
  2008-01-23 20:28     ` Kiyoshi Ueda
  0 siblings, 1 reply; 5+ messages in thread
From: Martin George @ 2008-01-23 10:58 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: Sarraf, Ritesh, dm-devel

Kiyoshi Ueda wrote:
> Hi Martin,
> 
> Thank you for your testing.
> Please see my comments below.
> 
> On Tue, 22 Jan 2008 21:56:13 +0530, "Sarraf, Ritesh" wrote:
>  > Hi Kiyoshi,
>  > 
>  > I took the latest upstream multipath-tools package (Jan 15, 2008) and
>  > installed it on my RHEL 5.1 host to verify the libprio fix. To simulate
>  > the FCP path faults, I ran your script (as attached in the mail) which
>  > alternately offlined/onlined the corresponding SCSI paths of the root dm
>  > device in sysfs. Listing my observations below:
>  > 
>  > 1) The freeze was still reproducible. On checking the sysrq dumps (as
>  > attached), I could see it was the script itself i.e. test.sh which seems
>  > to have stalled on the exec () system call perhaps waiting for inode
>  > write out for updated access time (the script resides on my root dm
>  > device itself). As suggested by you in the bugzilla, I remounted the
>  > root device using the noatime option and then reran the script - I have
>  > not hit the freeze yet. Is this the expected behavior?
> 
> As for your script, it is the expected behavior.
> I found that you added some sleep commands to my original script
> posted by the following email.
> 
>     http://marc.info/?l=dm-devel&m=119465024621783&w=2
> 
> sleep is not a shell built-in command, so running it requires access to the
> root device. I guess that is the reason for the freeze.

So does that mean you should never access the root partition in such a 
scenario? What about utilities like syslogd, which may access the root to 
log messages? There could be many such utilities that access the root, 
and all of them would have to be stopped.

Thanks,
-Martin

> 
> Please retest using a script that doesn't include any sleep command, or
> use your own fault injection method.
> If you need to sleep anyway, an empty while loop like the one below can be
> used, though you will have to adjust the '1000000' for your system:
>     i=0
>     while [ $i -lt 1000000 ]; do
>         i=$(($i + 1))
>     done
> 
> 
>  > 2) With the latest upstream multipath-tools package, "multipath -ll"
>  > displays all paths with the same priority - I am not able to prioritize
>  > paths into primary/secondary despite the normal group_by_prio setting.
>  > Does the libprio fix alter the behavior here?
> 
> The keyword for the libprio setting is "prio", and the name of the netapp
> prioritizer is "netapp".
> So you need to change your multipath.conf like this:
> 
>     From: prio_callout    "/sbin/mpath_prio_netapp /dev/%n"
>     To:   prio            "netapp"
> 
> Thanks,
> Kiyoshi Ueda
> 


* Re: Root device multipathed host freeze with the latest upstream multipath-tools package
  2008-01-23 10:58   ` Martin George
@ 2008-01-23 20:28     ` Kiyoshi Ueda
  2008-01-30 15:20       ` Martin George
  0 siblings, 1 reply; 5+ messages in thread
From: Kiyoshi Ueda @ 2008-01-23 20:28 UTC (permalink / raw)
  To: marting; +Cc: Ritesh.Sarraf, dm-devel

Hi Martin,

On Wed, 23 Jan 2008 16:28:16 +0530, Martin George wrote:
> Kiyoshi Ueda wrote:
> > Hi Martin,
> > 
> > Thank you for your testing.
> > Please see my comments below.
> > 
> > On Tue, 22 Jan 2008 21:56:13 +0530, "Sarraf, Ritesh" wrote:
> >  > Hi Kiyoshi,
> >  > 
> >  > I took the latest upstream multipath-tools package (Jan 15, 2008) and
> >  > installed it on my RHEL 5.1 host to verify the libprio fix. To simulate
> >  > the FCP path faults, I ran your script (as attached in the mail) which
> >  > alternately offlined/onlined the corresponding SCSI paths of the root dm
> >  > device in sysfs. Listing my observations below:
> >  > 
> >  > 1) The freeze was still reproducible. On checking the sysrq dumps (as
> >  > attached), I could see it was the script itself i.e. test.sh which seems
> >  > to have stalled on the exec () system call perhaps waiting for inode
> >  > write out for updated access time (the script resides on my root dm
> >  > device itself). As suggested by you in the bugzilla, I remounted the
> >  > root device using the noatime option and then reran the script - I have
> >  > not hit the freeze yet. Is this the expected behavior?
> > 
> > As for your script, it is the expected behavior.
> > I found that you added some sleep commands to my original script
> > posted by the following email.
> > 
> >     http://marc.info/?l=dm-devel&m=119465024621783&w=2
> > 
> > sleep is not a shell built-in command, so running it requires access to the
> > root device. I guess that is the reason for the freeze.
> 
> So does that mean you should never access the root partition in such a 
> scenario? What about utilities like syslogd, which may access the root to 
> log messages? There could be many such utilities that access the root, 
> and all of them would have to be stopped.

No.
Generally you can access the root.
But you can't in your single-threaded test script.

In your test scenario, only your script onlines/offlines the paths
for the root, like this:
    while true; do
        <offline all paths>
        sleep
        <online all paths>
        sleep
    done

So if your script accesses the root after it offlines all paths,
it freezes and nobody will bring the paths back online.
So you must keep your script from freezing.
Other utilities accessing the root don't matter.

I guess that your script froze at the sleep after the offline.
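
A minimal sketch of a fault-injection cycle that stays clear of the root
device while the paths are offline. It assumes the paths are toggled by
writing "offline" and "running" to /sys/block/<dev>/device/state; the
function names, the SYSFS_ROOT variable, and the device names in the usage
comment are hypothetical. Only shell built-ins run between the offline and
the online step:

```shell
# SYSFS_ROOT is a hypothetical override so the sketch can be tried against
# a scratch directory instead of the real /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Write the given state to every listed SCSI device (echo is a built-in,
# and the redirection is done by the shell itself).
set_paths_state() {
    state=$1
    shift
    for dev in "$@"; do
        echo "$state" > "$SYSFS_ROOT/block/$dev/device/state"
    done
}

# Delay without executing any external program such as sleep.
busy_wait() {
    n=0
    while [ "$n" -lt "$1" ]; do
        n=$((n + 1))
    done
}

# One offline/online cycle: first argument is the busy-wait count,
# the remaining arguments are the device names.
fault_cycle() {
    count=$1
    shift
    set_paths_state offline "$@"
    busy_wait "$count"
    set_paths_state running "$@"
    busy_wait "$count"
}

# Usage (as root, on a test host only; sdc and sdd are placeholder names):
#   while :; do fault_cycle 1000000 sdc sdd; done
```

Because the functions are defined before any path goes offline, the shell
has no reason to go back to the root device in the middle of a cycle.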

Thanks,
Kiyoshi Ueda


* Re: Root device multipathed host freeze with the latest upstream multipath-tools package
  2008-01-23 20:28     ` Kiyoshi Ueda
@ 2008-01-30 15:20       ` Martin George
  0 siblings, 0 replies; 5+ messages in thread
From: Martin George @ 2008-01-30 15:20 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: Sarraf, Ritesh, dm-devel

Kiyoshi,

I made the suggested changes to the script (removing 'sleep' & using an 
empty while loop instead) and it worked fine. Preliminary IO runs with 
FCP path faults also look good.

Thanks,
-Martin

