Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Dave Chinner <david@fromorbit.com>
To: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>,
	hch@lst.de, darrick.wong@oracle.com, axboe@kernel.dk,
	tytso@mit.edu, adilger.kernel@dilger.ca, ming.lei@redhat.com,
	jthumshirn@suse.de, minwoo.im.dev@gmail.com,
	damien.lemoal@wdc.com, andrea.parri@amarulasolutions.com,
	hare@suse.com, tj@kernel.org, hannes@cmpxchg.org,
	khlebnikov@yandex-team.ru, ajay.joshi@wdc.com,
	bvanassche@acm.org, arnd@arndb.de, houtao1@huawei.com,
	asml.silence@gmail.com, linux-block@vger.kernel.org,
	linux-ext4@vger.kernel.org
Subject: Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
Date: Mon, 20 Apr 2020 08:36:46 +1000	[thread overview]
Message-ID: <20200419223646.GB9765@dread.disaster.area> (raw)
In-Reply-To: <yq1sghe1uu3.fsf@oracle.com>

On Wed, Apr 08, 2020 at 12:10:12AM -0400, Martin K. Petersen wrote:
> 
> Hi Dave!
> 
> >> In the standards space, the allocation concept was mainly aimed at
> >> protecting filesystem internals against out-of-space conditions on
> >> devices that dedup identical blocks and where simply zeroing the blocks
> >> therefore is ineffective.
> 
> > Um, so we're supposed to use space allocation before overwriting
> > existing metadata in the filesystem?
> 
> Not before overwriting, no. Once you have allocated an LBA it remains
> allocated until you discard it.

That is not a consistent argument. If the data has been deduped and
we overwrite, the storage array has to allocate new physical space
for an overwrite to an existing LBA. i.e. deduped data has multiple
LBAs pointing to the same physical storage. Any overwrite of an LBA
that maps to mulitply referenced physical storage requires the
storage array to allocate new physical space for that overwrite.

i.e. allocation is not determined by whether the LBA has been
written to, "pinned" or not - it's whether the act of writing to
that LBA requires the storage to allocate new space to allow the
write to proceed.

That's my point here - one particular shared data overwrite case is
being special cased by preallocation (avoiding dedupe of zero filled
data) to prevent ENOSPC, ignoring all the other cases where we
overwrite shared non-zero data and will also require new physical
space for the new data. In all those cases, the storage has to take
the same action - allocation on overwrite - and so all of them are
susceptible to ENOSPC.

> > So that the underlying storage can reserve space for it before we
> > write it? Which would mean we have to issue a space allocation before
> > we dirty the metadata, which means before we dirty any metadata in a
> > transaction. Which means we'll basically have to redesign the
> > filesystems from the ground up, yes?
> 
> My understanding is that this facility was aimed at filesystems that do
> not dynamically allocate metadata. The intent was that mkfs would
> preallocate the metadata LBA ranges, not the filesystem. For filesystems
> that allocate metadata dynamically, then yes, an additional step is
> required if you want to pin the LBAs.

Ok, so you are confirming what I thought: it's almost completely
useless to us.

i.e. this requires issuing IO to "reserve" space whilst preserving
data before every metadata object goes from clean to dirty in
memory.  But the problem with that is we don't know how much
metadata we are going to dirty in any specific operation. Worse is
that we don't know exactly *what* metadata we will modify until we
walk structures and do lookups, which often happen after we've
dirtied other structures. An ENOSPC from a space reservation at that
point is fatal to the filesystem anyway, so there's no point in even
trying to do this.  Like I said, functionality like this cannot be
retrofitted to existing filesysetms.

IOWs, this is pretty much useless functionality for the filesystem
layer, and if the only use is for some mythical filesystem with
completely static metadata then the standards space really jumped
the shark on this one....

> > You might be talking about filesystem metadata and block devices,
> > but this patchset ends up connecting ext4's user data fallocate() to
> > the block device, thereby allowing users to reserve space directly
> > in the underlying block device and directly exposing this issue to
> > userspace.
> 
> I missed that Chaitanya's repost of this series included the ext4 patch.
> Sorry!
> 
> >> How XFS decides to enforce space allocation policy and potentially
> >> leverage this plumbing is entirely up to you.
> >
> > Do I understand this correctly? i.e. that it is the filesystem's
> > responsibility to prevent users from preallocating more space than
> > exists in an underlying storage pool that has been intentionally
> > hidden from the filesystem so it can be underprovisioned?
> 
> No. But as an administrative policy it is useful to prevent runaway
> applications from writing a petabyte of random garbage to media. My
> point was that it is up to you and the other filesystem developers to
> decide how you want to leverage the low-level allocation capability and
> how you want to provide it to processes. And whether CAP_SYS_ADMIN,
> ulimit, or something else is the appropriate policy interface for this.

My cynical translation: the storage standards space haven't given
any thought to how it can be used and/or administered in the real
world. Pass the buck - let the filesystem people work that out.

What I'm hearing is that this wasn't designed for typical filesystem
use, it wasn't designed for typical user application use, and how to
prevent abuse wasn't thought about at all.

That sounds like a big fat NACK to me....

> In terms of thin provisioning and space management there are various
> thresholds that may be reported by the device. In past discussions there
> haven't been much interest in getting these exposed. It is also unclear
> to me whether it is actually beneficial to send low space warnings to
> hundreds or thousands of hosts attached to an array. In many cases the
> individual server admins are not even the right audience. The most
> common notification mechanism is a message to the storage array admin
> saying "click here to buy more disk".

Notifications are not relevant to preallocation functionality at all.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

next prev parent reply	other threads:[~2020-04-19 22:37 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 1/4] block: create payloadless issue bio helper Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 2/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 3/4] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 4/4] ext4: Notify block device about alloc-assigned blk Chaitanya Kulkarni
2020-04-01  6:22 ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Konstantin Khlebnikov
2020-04-02  2:29   ` Martin K. Petersen
2020-04-02  9:49     ` Konstantin Khlebnikov
2020-04-02 22:41 ` Dave Chinner
2020-04-03  1:34   ` Martin K. Petersen
2020-04-03  2:57     ` Dave Chinner
     [not found]       ` <(Dave>
     [not found] ` <(Chaitanya>
     [not found]   ` <Kulkarni's>
     [not found]     ` <message>
     [not found]       ` <of>
     [not found]         ` <"Fri>
     [not found]           ` <3>
     [not found]         ` <"Mon>
     [not found]           ` <13>
     [not found]         ` <"Tue>
     [not found]           ` <12>
     [not found]         ` <"Sun>
     [not found]           ` <29>
     [not found]             ` <Mar>
     [not found]               ` <2020>
     [not found]                 ` <13:57:57>
     [not found]                   ` <+1100")>
2020-04-03  3:45                     ` Martin K. Petersen
2020-04-07  2:27                       ` Dave Chinner
2020-04-08  4:10                         ` Martin K. Petersen
2020-04-19 22:36                           ` Dave Chinner [this message]
2020-04-23  0:40                             ` Martin K. Petersen
     [not found]                 ` <10:47:10>
     [not found]                   ` <-0700")>
2020-04-01  2:29                     ` Martin K. Petersen
2020-04-01  4:53                       ` Chaitanya Kulkarni
2020-05-12 16:01                     ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Martin K. Petersen
2020-05-12 16:04                       ` Christoph Hellwig
2020-05-12 16:12                         ` Martin K. Petersen
2020-05-12 16:18                           ` Johannes Thumshirn
2020-05-12 16:24                             ` Martin K. Petersen
     [not found]                 ` <20:35:11>
     [not found]                   ` <+0800")>
2020-07-13 16:47                     ` [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard() Martin K. Petersen
2020-07-13 17:50                       ` Coly Li
  -- strict thread matches above, loose matches on Subject: below --
2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 02/10] block: rename __bio_add_pc_page to bio_add_hw_page Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 03/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 04/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 05/10] block: Modify revalidate zones Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 06/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn
2020-05-12 13:17 ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
     [not found]   ` <(Christoph>
2020-05-13  2:37 ` Jens Axboe
2020-07-13 12:35 [PATCH 0/2] two generic block layer fixes for 5.9 Coly Li
2020-07-13 12:35 ` [PATCH 1/2] block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers Coly Li
2020-07-13 23:12   ` Damien Le Moal
2020-07-13 12:35 ` [PATCH 2/2] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
     [not found]   ` <(Coly>

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200419223646.GB9765@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=ajay.joshi@wdc.com \
    --cc=andrea.parri@amarulasolutions.com \
    --cc=arnd@arndb.de \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=chaitanya.kulkarni@wdc.com \
    --cc=damien.lemoal@wdc.com \
    --cc=darrick.wong@oracle.com \
    --cc=hannes@cmpxchg.org \
    --cc=hare@suse.com \
    --cc=hch@lst.de \
    --cc=houtao1@huawei.com \
    --cc=jthumshirn@suse.de \
    --cc=khlebnikov@yandex-team.ru \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ming.lei@redhat.com \
    --cc=minwoo.im.dev@gmail.com \
    --cc=tj@kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).