From: Mark Tinguely <tinguely@sgi.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH 59/60] xfs: Add xfs_log_rlimit.c
Date: Wed, 26 Jun 2013 08:48:15 -0500
Message-ID: <51CAF11F.4050905@sgi.com>
In-Reply-To: <20130626040518.GG29376@dastard>

On 06/25/13 23:05, Dave Chinner wrote:
> On Tue, Jun 25, 2013 at 09:06:39AM -0500, Mark Tinguely wrote:
>> On 06/24/13 17:27, Dave Chinner wrote:
>>> On Mon, Jun 24, 2013 at 04:26:28PM -0500, Mark Tinguely wrote:
>>>> On 06/18/13 23:51, Dave Chinner wrote:
>>>>> +	 * 2) If the lsunit option is specified, a transaction requires 2 LSU
>>>>> +	 *    for the reservation because there are two log writes that can
>>>>> +	 *    require padding - the transaction data and the commit record which
>>>>> +	 *    are written separately and both can require padding to the LSU.
>>>>> +	 *    Consider that we can have an active CIL reservation holding 2*LSU,
>>>>> +	 *    but the CIL is not over a push threshold, in this case, if we
>>>>> +	 *    don't have enough log space for at least one new transaction, which
>>>>> +	 *    includes another 2*LSU in the reservation, we will run into dead
>>>>> +	 *    loop situation in log space grant procedure. i.e.
>>>>> +	 *    xlog_grant_head_wait().
>>>>> +	 *
>>>>> +	 *    Hence the log size needs to be able to contain two maximally sized
>>>>> +	 *    and padded transactions, which is (2 * (2 * LSU + maxlres)).
>>>>> +	 *
>>>>
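The sizing rule in the comment above can be sketched in a few lines of C. This is only an illustration of the arithmetic, not the code from the patch: the helper name min_log_bytes() and the choice of byte units are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): smallest log, in bytes, that
 * can hold two maximally sized and padded transactions, following the
 * rule quoted above: 2 * (2 * LSU + maxlres).
 */
static uint64_t min_log_bytes(uint64_t lsu_bytes, uint64_t max_logres_bytes)
{
	/* A zero-sized reservation would make the rule meaningless. */
	assert(max_logres_bytes > 0);

	return 2 * (2 * lsu_bytes + max_logres_bytes);
}
```

For a 256KiB log stripe unit and an assumed 1MiB maximum reservation this gives 2 * (512KiB + 1MiB) = 3MiB.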
>>>> Any thoughts on how we can separate the 2 * log stripe unit from the
>>>> reservation.
>>>
>>> You can't. The reservation, by definition, is the worst case log
>>> space usage for the transaction. Therefore, it has to take into
>>> account the LSU padding that may be necessary if this transaction is
>>> committed by itself to the log (e.g. wsync operation)
>>
>> I am thinking we should separate the extra 2*LSU allocated space
>> from the individual pieces of the transaction
>
> Why? How are you going to prevent log space deadlocks if we don't
> reserve that space for each individual transaction in a permanent
> transaction reservation?

Allocate 2 per transaction, not one per segment of a transaction.
It may be a wash for 2 or 3 segments per transaction, but would help if 
there are 10 segments per transaction.
>
>> (caused by
>> xfs_trans_rolls and xfs_bmap_finish) and associate it with the
>> transaction. Sync writes are for the transaction anyway and not the
>> pieces.
>
> Every transaction commit that is executed can require LSU padding
> because it could be the transaction the CIL steals the required LSU
> reservation from.  When CIL flushes occur - and hence when the LSU
> padding needs to be stolen from a transaction commit - is not under
> the control of the currently executing transaction.

I was thinking we need 2 (2*LSU) per transaction: one for a CIL push,
because the CIL could be too large, and one in case it is a sync
transaction. In the meantime, if another transaction pushes the log for
one of the above reasons, the CIL will steal whatever is left over from
the transaction and one of the (2*LSU) from that ticket.

>> It is the multiplication of the (2*LSU) by each piece of the
>> transaction that is the killer.
>> A mkdir will have 1.5MB of log reserved space just for the possible
>> LSU padding.
>
> A killer for what, exactly? This is the current status quo, and I
> don't see people having performance problems related to it....
>
>> The cil can steal the left over current ticket space and the global
>> LSU space.
>
> The space is available only if the current transaction _unit
> reservation_ has it reserved. Hence every unit reservation needs
> that space to be reserved.
>
> I'm not saying this is optimal - it is an architectural requirement
> that the log has always had and therefore *necessary*.

Killer only in the amount of space as the segments increase.
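
To put numbers on that, here is a rough sketch of the padding arithmetic being argued about. The function names are illustrative and not from the XFS source. The per-segment form assumes every unit reservation carries its own 2*LSU, and the per-transaction form assumes 2 (2*LSU) held once regardless of log count.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL

/*
 * Per-segment scheme as discussed in this thread: each of the log_count
 * unit reservations carries 2 * LSU of possible padding.
 * (Illustrative helper, not an XFS function.)
 */
static uint64_t padding_per_segment(uint64_t lsu_bytes, unsigned int log_count)
{
	assert(log_count > 0);
	return 2 * lsu_bytes * log_count;
}

/*
 * Per-transaction scheme suggested above: 2 (2*LSU) held once for the
 * whole transaction, independent of the log count.
 * (Illustrative helper, not an XFS function.)
 */
static uint64_t padding_per_transaction(uint64_t lsu_bytes)
{
	return 2 * (2 * lsu_bytes);
}
```

With a 256KiB LSU, mkdir (log count 3) carries 3 * 2 * 256KiB = 1.5MiB of padding under the per-segment scheme, versus 1MiB per transaction. At a log count of 2 the two come out equal, and at 10 segments it is 5MiB versus 1MiB.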

>>>> The added extended attribute calls for parent inode pointers
>>>> (especially xfs_rename() where it could add up to one and remove up
>>>> to two attributes) are causing a huge multiplication count for the
>>>> reservation.
>>>
>>> So you are adding attribute reservations to
>>> create/mkdir/rename/link transaction reservations? That doesn't seem
>>> like a good idea, and it's contrary to the direction we want to head
>>> with the transaction subsystem.  Can you post your patches so we
>>> have some context to the question you are asking?
>>>
>>> FWIW, the intended method of linking multiple independent
>>> transactions together for parent pointers into an atomic commit is
>>> this:
>>>
>>> http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Atomic_Multi-Transaction_Operations
>>>
>>> While the text is somewhat out of date (talks about rolling
>>> transactions and being able to replace them), it predated the
>>> delayed logging implementation and hence doesn't take into account
>>> the obvious, simple extension that can be made to the CIL to
>>> implement this. i.e.  being able to hold the current CIL context
>>> open for a number of transactions to take place before releasing it
>>> again.
>>>
>>> The point of doing this is still relevant, however:
>>>
>>> "This may also allow us to split some of the large compound
>>> transactions into smaller, more self contained transactions. This
>>> would reduce reservation pressure on log space in the common case
>>> where all the corner cases in the transactions are not taken."
>>>
>>> IOWs, rather than building new, large "aggregation" transactions, we
>>> should be adding infrastructure to the CIL to allow atomic
>>> transaction aggregation techniques to be used and then just using
>>> the existing transaction infrastructure to implement operations such
>>> as "create with ACL"....
>>
>> I will have to think on this. I don't see how the allocation of log
>> space for a new transaction while holding locks is a good thing.
>
> Who said anything about requiring new locks? :)

I meant we have to hold the inode locks between the directory code and
the attribute code. If we do not, then the attributes quickly get out
of sync with the directory state.
>
> The initial implementation I was thinking of is basically a
> reference counter and a method for delaying CIL pushes until the
> reference count drops to zero.
>
> Indeed, if we push the aggregated space reservations up into this
> aggregated transaction, the individual transactions can just pull
> out of the space we already know is available to use. IOWs, it's
> relatively trivial to do in a way that leaves open many avenues for
> later optimisation....

sweet.
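
For reference, a minimal sketch of the reference-counter idea as described above: pushes requested while the context is held are deferred until the last holder drops its reference. Everything here (struct cil_hold and the three helpers) is hypothetical and single-threaded. It is not the kernel's CIL code, and a real version would need locking or atomics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a CIL context; not the kernel's structure. */
struct cil_hold {
	unsigned int holders;      /* transactions keeping the context open */
	bool         push_pending; /* a push was requested while held */
	unsigned int pushes;       /* pushes actually performed */
};

/* Take a reference: the context stays open until the matching put. */
static void cil_hold_get(struct cil_hold *c)
{
	c->holders++;
}

/* Ask for a push; it is deferred while anyone holds the context open. */
static bool cil_request_push(struct cil_hold *c)
{
	if (c->holders > 0) {
		c->push_pending = true;
		return false;
	}
	c->pushes++;
	return true;
}

/* Drop a reference; the last holder runs any deferred push. */
static void cil_hold_put(struct cil_hold *c)
{
	assert(c->holders > 0);
	if (--c->holders == 0 && c->push_pending) {
		c->push_pending = false;
		c->pushes++;
	}
}
```

A directory operation and its attribute operation would each bracket their commit with cil_hold_get()/cil_hold_put(), so that both land in the same checkpoint.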

>
>>> FYI, we need this for all the operations that add attributes at
>>> create time, such as default ACLs and DMAPI/DMF attributes....
>>>
>>>> Those multiplications would be killers on 256KiB log
>>>> stripe units.
>>>
>>> Numbers, please?
>>>
>>> Cheers,
>>>
>>> Dave.
>>
>> /*
>>   * Various log count values.
>>   */
>> #define XFS_DEFAULT_LOG_COUNT           1
>> #define XFS_DEFAULT_PERM_LOG_COUNT      2
>> #define XFS_ITRUNCATE_LOG_COUNT         2
>> #define XFS_INACTIVE_LOG_COUNT          2
>> #define XFS_CREATE_LOG_COUNT            2
>> #define XFS_MKDIR_LOG_COUNT             3
>> #define XFS_SYMLINK_LOG_COUNT           3
>> #define XFS_REMOVE_LOG_COUNT            2
>> #define XFS_LINK_LOG_COUNT              2
>> #define XFS_RENAME_LOG_COUNT            2
>> #define XFS_WRITE_LOG_COUNT             2
>> #define XFS_ADDAFORK_LOG_COUNT          2
>> #define XFS_ATTRINVAL_LOG_COUNT         1
>> #define XFS_ATTRSET_LOG_COUNT           3
>> #define XFS_ATTRRM_LOG_COUNT            3
>>
>> Even if you are creative, the multipliers add up quickly.
>> xfs_rename will do the directory ops, possibly an attrset and up to
>> 2 attrrm, and we have to hold the src and target inode locks over
>> all the operations.
>
> We don't have to hold them locked across the entire operation
> because the VFS is holding them locked so nothing else can do a
> namespace operation while we are modifying them.

I find that, without holding the lock over both the dir and attr routines,
remove occasionally can't find the parent pointer attribute entry. I
suspect a remove is racing with a rename.

> Anyway, you're worried about multiplying out the existing
> reservations, but that's really not that big of a deal - it's just
> log space, and we've got lots of that to use, and we have plenty of
> avenues to optimise usage in future..
>
> e.g.  mkdir is a compound transaction that is really 2 different
> operations. The first is physical inode allocation, the second is
> allocating a free inode. The first only happens for 1 in every 64
> free inode allocations (rarer if we are also removing inodes, too).
> So most of the time we are reserving far more space in the log than
> we actually need to allocate an inode.
>
> The actual unit reservation for the create is the maximum of the two
> individual component reservations times the number of commits that
> will be required, and the total is the unit reservation times the
> number of commits needed. So for a create transaction, it's an
> overestimation of the worst case.
>
> Is that a problem? No. Can it be improved? Yes - see the comments
> about splitting out physical inode chunk allocation into the
> background here:
>
> http://xfs.org/index.php/Improving_inode_Caching#Contiguous_Inode_Allocation
>
> And suddenly we end up with mkdir/create/etc only requiring a single
> transaction reservation. There goes one of your 2*LSU paddings...
>
> What about attributes? The triple-stage transaction is only required
> when -replacing- an existing attribute. The first adds the new
> attribute, the second flips the complete/incomplete flags, and the
> third removes the old attributes. Just adding a new attribute only
> requires a single transaction and most attributes are never
> overwritten, so again we are almost always reserving far too much
> space for such an operation.
>
> Can we optimise this? Yes, we can. Let's disaggregate it - if we are
> replacing attribute content without changing the size or name, then
> we could just add a new transaction that logs a direct overwrite.
> That's a *tiny* transaction. If we are replacing an attribute of the
> same size with a new name, then it's a little more complex as we
> have to manipulate the hash index entries, but that is still a
> single, simple transaction. If we are adding, knowing we aren't
> doing a replacement, then that is a single transaction, too.
>
> Start to see how this greatly reduces the overall operational
> transaction reservations? We can build bigger and more complex
> transactions as part of the process, but optimisation of the
> reservations comes from separation into smaller, more fine-grained
> transactions and reservations.
>
> Keep in mind that from this perspective, I'm quite happy for an
> initial parent pointer implementation to start with gigantic
> create+attr/rename+attr transaction reservations. I'm also happy for
> it to take us 2-3 years to optimise/disaggregate the transactions to
> reduce the overhead to be smaller and more efficient.
>
> I don't expect the initial implementation to be perfect or even
> optimal - attempting to make it so puts us right into premature
> optimisation territory. Make parent pointers robust first, then we
> can worry about where we can optimise reservations down to the
> smallest possible subset.
>
> Cheers,
>
> Dave.

Thanks.

--Mark.

