From: Nathaniel Rutman <Nathan.Rutman@Sun.COM>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] Agent/Coordinator RPC mechanisms.
Date: Mon, 03 Nov 2008 12:20:54 -0800
Message-ID: <490F5D26.3000105@sun.com>
In-Reply-To: <490F2F10.1040302@cea.fr>

Aurelien Degremont wrote:
>
> Agent/coordinator mechanisms to discuss at next conf call.
> If you have strong disagreement, do not hesitate to send them now so i
> can modify them before next conf call.
>
>
> A - Coordinator/Agent start
> ---
>
> 1 - MDT starts (Coordinator features are available by default as the
> coordinator reuses MDT threads)
> 2 - Client starts with an agent flag (mount -o agent)
> 3 - Client connects to MDT (piggyback the coordinator registration on
> the MDT connection RPC (with a flag?) ?)
Yes, I think so; just use a connect flag.
> 4 - If no direct registration, Client sends a registration request to the
> coordinator through the MDT connection after it was initiated.
I don't see a need, unless there's some agent data we want to report at
registration.
> 5 - Agent is ready.
>
> B - Request dispatch
> ---
>
> 1 - Coordinator receives a request. It writes in its llog file the
> migration request.
> 2 - Coordinator sends a migration request to one of its registered agents.
On the client's reverse import, presumably.  So we need to add a service
during agent startup, probably mdc startup.  No agents on a liblustre client.
>
> 3 - The agent manages the requests.
> 4 - The agent sends periodically some migration status update to
> coordinator.
We were talking about the copytool sending updates via file ioctls.
> 5 - When coordinator receives status finished, it cleans its llog entry
> for this migration.
This works for copyin/copyout, but not unlink, since there's no file for
an agent to do an update ioctl on.
>
> C - MDT crash
> ---
>
> 1 - MDT crashes.
> 2 - MDT is restarted.
> 3 - The coordinator recreates its migration list, reading its llog.
> 4 - The client, when doing its recovery with the MDT, reconnects to the
> coordinator. It also sends the current status of its migrations.
Status is sent by copytools periodically, asynchronously from reconnect.
As far as the copytools/agent is concerned, the MDT restart is invisible.
> 5 - Thanks to this, the coordinator has rebuilt its migration list and
> agent list.
> (as this is standard mdt recovery, this supports failover also)
The agent list is rebuilt at reconnect time.  The migration list is simply
the list of unfinished migrations; the coordinator reads that from the llog
whenever it wants to and decides to restart stuck/broken migrations as
usual.  (E.g. it could read the log once every minute, checking for any
last_status_update_time older than X.)  I don't see any reason the list
needs to be kept in memory all the time.
So log records should contain fid, request type, agent_id (for aborts),
last_status_update_time, last_status.
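
As a rough sketch of such a record and the periodic scan (Python for
brevity; a real version would be kernel C on top of the llog code). The
five field names come from the list above. Everything else here, including
the class name, the function name, the status strings and the threshold,
is an assumption made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MigrationRecord:
    fid: str                        # identifier of the file being migrated
    request_type: str               # e.g. "copyin", "copyout", "unlink"
    agent_id: int                   # agent handling the request (needed for aborts)
    last_status_update_time: float  # seconds, same clock as `now` below
    last_status: str                # assumed values: "running", "finished"


def find_stale(records, now, max_age):
    """Return unfinished records whose last update is older than max_age seconds."""
    return [
        r for r in records
        if r.last_status != "finished"
        and now - r.last_status_update_time > max_age
    ]


# Example scan: only the record that has been silent for too long is returned.
log = [
    MigrationRecord("fid-1", "copyout", 1, 900.0, "running"),
    MigrationRecord("fid-2", "copyin", 2, 990.0, "running"),
    MigrationRecord("fid-3", "copyout", 1, 100.0, "finished"),
]
stale = find_stale(log, now=1000.0, max_age=60.0)
```

Nothing is cached between scans here, which matches the point that the
list does not need to live in memory.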
>
> E - Client crash
> ---
>
> 1 - Client crashes
> 2 - MDT notices the client node no longer responds. The node is
> evicted, and its migrations are dispatched to other nodes. Node eviction
> (OSSs are supposed to evict it also) prevents the movers on this node from
> continuing their migrations. We could restart them on another agent without
> issue.
2. MDT evicts client
3. Eviction triggers coordinator to immediately re-dispatch all of the
migrations from that agent
4. For copyin, MDT must force any existing agent I/O to stop.  Hmm, but
agents are ignoring the layout lock - how are we going to do this?  Maybe
it's not so bad if two agents are trying to copyin the file at the same
time?  File data is the same...
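
A minimal sketch of step 3 only, in Python for brevity. It assumes the
coordinator can build a fid -> agent_id mapping from its log; the function
name, the mapping and the round-robin choice of replacement agent are all
hypothetical. It does not address the open question in step 4 about
stopping I/O from the old agent.

```python
def redispatch_on_eviction(assignments, agents, evicted):
    """Reassign every migration owned by the evicted agent.

    assignments: dict mapping fid -> agent_id (hypothetical view of the log)
    agents:      list of registered agent ids
    evicted:     id of the agent whose client was just evicted
    Returns the list of (fid, new_agent_id) pairs that were re-dispatched.
    """
    orphaned = sorted(fid for fid, a in assignments.items() if a == evicted)
    remaining = [a for a in agents if a != evicted]
    if orphaned and not remaining:
        raise RuntimeError("no agent left to take over migrations")

    moved = []
    for i, fid in enumerate(orphaned):
        # Round-robin over the surviving agents; an arbitrary policy.
        new_agent = remaining[i % len(remaining)]
        assignments[fid] = new_agent
        moved.append((fid, new_agent))
    return moved


# Example: agent 1 is evicted, its two migrations go to agents 2 and 3.
assignments = {"fid-a": 1, "fid-b": 2, "fid-c": 1}
moved = redispatch_on_eviction(assignments, agents=[1, 2, 3], evicted=1)
```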

F - Copytool crash
Copytool crash is different from a client crash, since the client will not
get evicted.
1. Copytool crashes
2. Coordinator periodically scans the list of open migrations for any old
last_status_update_time
3. Coordinator sends abort signal to old agent
4. Coordinator re-dispatches migration
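
Steps 2-4 could look roughly like the following Python sketch. The
send_abort and dispatch callbacks stand in for whatever RPCs end up being
used; they, the dict-based records and the "pick the first other agent"
policy are assumptions for illustration, not an existing interface.

```python
def recover_stale_migrations(records, agents, now, max_age, send_abort, dispatch):
    """Abort and re-dispatch migrations with no recent status update.

    records: list of dicts with keys fid, agent_id,
             last_status_update_time, last_status
    Returns the fids that were re-dispatched.
    """
    recovered = []
    for rec in records:
        if rec["last_status"] == "finished":
            continue
        if now - rec["last_status_update_time"] <= max_age:
            continue  # copytool reported recently enough

        old_agent = rec["agent_id"]
        send_abort(old_agent, rec["fid"])           # step 3

        # Prefer a different agent; fall back to the old one if it is the
        # only agent registered (its client was not evicted).
        candidates = [a for a in agents if a != old_agent] or [old_agent]
        new_agent = candidates[0]
        dispatch(new_agent, rec["fid"])             # step 4

        rec["agent_id"] = new_agent
        rec["last_status_update_time"] = now        # restart the staleness clock
        recovered.append(rec["fid"])
    return recovered


# Example with callbacks that just record what was sent.
sent = []
records = [
    {"fid": "fid-1", "agent_id": 1, "last_status_update_time": 0.0,
     "last_status": "running"},
    {"fid": "fid-2", "agent_id": 2, "last_status_update_time": 95.0,
     "last_status": "running"},
]
recovered = recover_stale_migrations(
    records, agents=[1, 2], now=100.0, max_age=30.0,
    send_abort=lambda agent, fid: sent.append(("abort", agent, fid)),
    dispatch=lambda agent, fid: sent.append(("dispatch", agent, fid)),
)
```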
