From: Tejas Rao <raot@bnl.gov>
To: NeilBrown <neilb@suse.de>
Cc: Scott Sinno <scott.sinno@nasa.gov>,
	linux-raid@vger.kernel.org, "Knister,
	Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]"
	<aaron.s.knister@nasa.gov>
Subject: Re: clustered MD - beyond RAID1
Date: Mon, 21 Dec 2015 20:36:08 -0500
Message-ID: <5678A908.6070401@bnl.gov>
In-Reply-To: <8737uv4fz6.fsf@notabene.neil.brown.name>

Each GPFS disk (block device) has a list of servers associated with it. 
When the first storage server fails (expired disk lease), the storage 
node is expelled and a different server which also sees the shared 
storage will do I/O.

There is a "leaseRecoveryWait" parameter which tells the filesystem
manager to wait for a few seconds so that the expelled node can complete
any I/O still in flight to the shared storage device. This avoids
out-of-order I/O. After this wait, the filesystem manager completes
recovery for the failed node: replaying journal logs, freeing shared
tokens/locks, etc. Only once recovery is complete will a different
storage node do I/O. There is a concept of primary/secondary servers for
a given block device. The secondary server will only do I/O when the
primary server has failed and this has been confirmed.
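
As a rough illustration of the sequence above, here is a toy model in
Python. It is not GPFS code: the SharedDisk class, its method names and
the 35 second wait are all made up for the example, and only the ordered
server list, the 30 second lease and the idea of waiting before failover
come from this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Illustrative value only; not a claim about the real GPFS default.
LEASE_RECOVERY_WAIT = 35.0  # seconds


@dataclass
class SharedDisk:
    """Toy model of one shared block device and its ordered server list."""
    servers: List[str]                       # primary first, then secondaries
    lease_duration: float = 30.0             # "every 30 secs" per this thread
    last_renewal: Dict[str, float] = field(default_factory=dict)

    def renew(self, server: str, now: float) -> None:
        """Record that `server` renewed its disk lease at time `now`."""
        self.last_renewal[server] = now

    def lease_expiry(self, server: str) -> float:
        # A server that never renewed counts as expired long ago.
        return self.last_renewal.get(server, float("-inf")) + self.lease_duration

    def io_server(self, now: float) -> Optional[str]:
        """Return the server allowed to do I/O at `now`, or None if I/O must wait."""
        for server in self.servers:
            expiry = self.lease_expiry(server)
            if now < expiry:
                return server                # lease still valid
            if now < expiry + LEASE_RECOVERY_WAIT:
                # Expelled, but in-flight I/O may still be draining and
                # recovery (log replay, token release) has not finished.
                return None
            # Recovery for this server is done; consider the next one.
        return None


disk = SharedDisk(servers=["primary", "secondary"])
disk.renew("primary", 0.0)
disk.renew("secondary", 0.0)
print(disk.io_server(10.0))   # primary holds a valid lease
disk.renew("secondary", 25.0)
print(disk.io_server(40.0))   # primary expired at 30, still inside the wait
disk.renew("secondary", 50.0)
print(disk.io_server(70.0))   # wait is over, secondary takes over
```

The point of the middle case is that nobody does I/O between the lease
expiring and the wait elapsing, which is what keeps two servers from
writing to the same device at the same time.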

See "servers=ServerList" in the man page for mmcrnsd. (I don't think I
am allowed to send web links.)

We currently have tens of petabytes in production using Linux md RAID.
We are currently not sharing md devices; only hardware RAID block
devices are shared. In our experience hardware RAID controllers are
expensive. Linux RAID has worked well over the years and performance is
very good, as GPFS coalesces I/O into large filesystem-blocksize blocks
(8MB). If these are aligned properly, every write is a full-stripe
write, which eliminates RMW (read-modify-write) and the need for NVRAM
(unless someone is doing POSIX fsync).
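
To make the alignment condition concrete, here is a small arithmetic
sketch in Python. The array geometry (a 10-disk RAID6 with 512 KiB
chunks) and the function names are assumptions chosen for the example;
only the 8MB filesystem block size comes from the text above.

```python
KiB = 1024
MiB = 1024 * KiB


def full_stripe_bytes(total_disks: int, parity_disks: int, chunk_bytes: int) -> int:
    """Data bytes in one full stripe: data disks times chunk size."""
    data_disks = total_disks - parity_disks
    if data_disks <= 0:
        raise ValueError("array needs at least one data disk")
    return data_disks * chunk_bytes


def is_full_stripe_write(offset: int, length: int, stripe_bytes: int) -> bool:
    """True if the write starts on a stripe boundary and covers whole stripes."""
    return length > 0 and offset % stripe_bytes == 0 and length % stripe_bytes == 0


# Hypothetical geometry: RAID6 over 10 disks (8 data + 2 parity), 512 KiB chunks.
stripe = full_stripe_bytes(total_disks=10, parity_disks=2, chunk_bytes=512 * KiB)
fs_block = 8 * MiB

# An 8 MiB block at an 8 MiB-aligned offset covers exactly two 4 MiB stripes.
print(stripe // MiB, is_full_stripe_write(16 * MiB, fs_block, stripe))

# With 9 data disks the stripe is 4.5 MiB, so an 8 MiB block cannot cover
# whole stripes and parity would have to be updated via read-modify-write.
odd_stripe = full_stripe_bytes(total_disks=11, parity_disks=2, chunk_bytes=512 * KiB)
print(is_full_stripe_write(16 * MiB, fs_block, odd_stripe))
```

In other words, the filesystem block size should be a whole multiple of
(data disks x chunk size), and the data area should start on a stripe
boundary.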

In the future, we would prefer to use Linux RAID (RAID6) in a shared
environment, shielding us against server failures. Unfortunately we can
only do this after Red Hat supports such an environment with Linux RAID.
Currently they do not support this even in an active/passive environment
(only one server can have an md device assembled and active regardless).

Tejas.

On 12/21/2015 17:03, NeilBrown wrote:
> On Tue, Dec 22 2015, Tejas Rao wrote:
>
>> GPFS guarantees that only one node will write to a linux block device
>> using disk leases.
>
> Do you have a reference to documentation explaining that?
> A few moments searching the internet suggests that a "disk lease" is
> much like a heart-beat.  A node uses it to say "I'm still alive, please
> don't ignore me".  I could find no evidence that only one node could
> hold a disk lease at any time.
>
> NeilBrown
>
>
>>                     Only a node with a disk lease has the right to submit
>> I/O and disk leases expire every 30 secs and need to be renewed. Lustre
>> and other distributed file systems have other ways of handling this.
>>
>> Using md devices in a shared/clustered environment is something not
>> supported by Redhat on RHEL6 or RHEL7 kernels, so this is something we
>> would not try in our production environments.
>>
>> Tejas.
>>
>> On 12/21/2015 15:47, NeilBrown wrote:
>>> On Tue, Dec 22 2015, Tejas Rao wrote:
>>>
>>>> What if the application is doing the locking and making sure that only 1
>>>> node writes to a md device at a time? Will this work? How are rebuilds
>>>> handled? This would be helpful with distributed filesystems like
>>>> GPFS/lustre etc.
>>>>
>>> You would also need to make sure that the filesystem only wrote from a
>>> single node at a time (or access the block device directly).  I doubt
>>> GPFS/lustre make any promise like that, but I'm happy to be educated.
>>>
>>> rebuilds are handled by using a cluster-wide lock to block all writes to
>>> a range of addresses while those stripes are repaired.
>>>
>>> NeilBrown
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

