From: Himanshu Madhani <himanshu.madhani@oracle.com>
To: Damien Le Moal <dlemoal@kernel.org>, linux-fsdevel@vger.kernel.org
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>,
Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Subject: Re: [PATCH] zonefs: Improve error handling
Date: Thu, 15 Feb 2024 17:06:40 -0800 [thread overview]
Message-ID: <b4b29ce3-36c4-4422-8346-842322b47516@oracle.com> (raw)
In-Reply-To: <20240214064526.3433662-1-dlemoal@kernel.org>
On 2/13/24 22:45, Damien Le Moal wrote:
> Write error handling is racy and can sometime lead to the error recovery
> path wrongly changing the inode size of a sequential zone file to an
> incorrect value which results in garbage data being readable at the end
> of a file. There are 2 problems:
>
> 1) zonefs_file_dio_write() updates a zone file write pointer offset
> after issuing a direct IO with iomap_dio_rw(). This update is done
> only if the IO succeed for synchronous direct writes. However, for
> asynchronous direct writes, the update is done without waiting for
> the IO completion so that the next asynchronous IO can be
> immediately issued. However, if an asynchronous IO completes with a
> failure right before the i_truncate_mutex lock protecting the update,
> the update may change the value of the inode write pointer offset
> that was corrected by the error path (zonefs_io_error() function).
>
> 2) zonefs_io_error() is called when a read or write error occurs. This
> function executes a report zone operation using the callback function
> zonefs_io_error_cb(), which does all the error recovery handling
> based on the current zone condition, write pointer position and
> according to the mount options being used. However, depending on the
> zoned device being used, a report zone callback may be executed in a
> context that is different from the context of __zonefs_io_error(). As
> a result, zonefs_io_error_cb() may be executed without the inode
> truncate mutex lock held, which can lead to invalid error processing.
>
> Fix both problems as follows:
> - Problem 1: Perform the inode write pointer offset update before a
> direct write is issued with iomap_dio_rw(). This is safe to do as
> partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
> set) and any failed IO will trigger the execution of zonefs_io_error()
> which will correct the inode write pointer offset to reflect the
> current state of the one on the device.
> - Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
> and call this function directly from __zonefs_io_error() after
> obtaining the zone information using blkdev_report_zones() with a
> simple callback function that copies to a local stack variable the
> struct blk_zone obtained from the device. This ensures that error
> handling is performed holding the inode truncate mutex.
> This change also simplifies error handling for conventional zone files
> by bypassing the execution of report zones entirely. This is safe to
> do because the condition of conventional zones cannot be read-only or
> offline and conventional zone files are always fully mapped with a
> constant file size.
>
> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> fs/zonefs/file.c | 42 +++++++++++++++++++-----------
> fs/zonefs/super.c | 66 +++++++++++++++++++++++++++--------------------
> 2 files changed, 65 insertions(+), 43 deletions(-)
>
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 6ab2318a9c8e..dba5dcb62bef 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -348,7 +348,12 @@ static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
> struct zonefs_inode_info *zi = ZONEFS_I(inode);
>
> if (error) {
> - zonefs_io_error(inode, true);
> + /*
> + * For Sync IOs, error recovery is called from
> + * zonefs_file_dio_write().
> + */
> + if (!is_sync_kiocb(iocb))
> + zonefs_io_error(inode, true);
> return error;
> }
>
> @@ -491,6 +496,14 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> ret = -EINVAL;
> goto inode_unlock;
> }
> + /*
> + * Advance the zone write pointer offset. This assumes that the
> + * IO will succeed, which is OK to do because we do not allow
> + * partial writes (IOMAP_DIO_PARTIAL is not set) and if the IO
> + * fails, the error path will correct the write pointer offset.
> + */
> + z->z_wpoffset += count;
> + zonefs_inode_account_active(inode);
> mutex_unlock(&zi->i_truncate_mutex);
> }
>
> @@ -504,20 +517,19 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> if (ret == -ENOTBLK)
> ret = -EBUSY;
>
> - if (zonefs_zone_is_seq(z) &&
> - (ret > 0 || ret == -EIOCBQUEUED)) {
> - if (ret > 0)
> - count = ret;
> -
> - /*
> - * Update the zone write pointer offset assuming the write
> - * operation succeeded. If it did not, the error recovery path
> - * will correct it. Also do active seq file accounting.
> - */
> - mutex_lock(&zi->i_truncate_mutex);
> - z->z_wpoffset += count;
> - zonefs_inode_account_active(inode);
> - mutex_unlock(&zi->i_truncate_mutex);
> + /*
> + * For a failed IO or partial completion, trigger error recovery
> + * to update the zone write pointer offset to a correct value.
> + * For asynchronous IOs, zonefs_file_write_dio_end_io() may already
> + * have executed error recovery if the IO already completed when we
> + * reach here. However, we cannot know that and execute error recovery
> + * again (that will not change anything).
> + */
> + if (zonefs_zone_is_seq(z)) {
> + if (ret > 0 && ret != count)
> + ret = -EIO;
> + if (ret < 0 && ret != -EIOCBQUEUED)
> + zonefs_io_error(inode, true);
> }
>
> inode_unlock:
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 93971742613a..b6e8e7c96251 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -246,16 +246,18 @@ static void zonefs_inode_update_mode(struct inode *inode)
> z->z_mode = inode->i_mode;
> }
>
> -struct zonefs_ioerr_data {
> - struct inode *inode;
> - bool write;
> -};
> -
> static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
> void *data)
> {
> - struct zonefs_ioerr_data *err = data;
> - struct inode *inode = err->inode;
> + struct blk_zone *z = data;
> +
> + *z = *zone;
> + return 0;
> +}
> +
> +static void zonefs_handle_io_error(struct inode *inode, struct blk_zone *zone,
> + bool write)
> +{
> struct zonefs_zone *z = zonefs_inode_zone(inode);
> struct super_block *sb = inode->i_sb;
> struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> @@ -270,8 +272,8 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
> data_size = zonefs_check_zone_condition(sb, z, zone);
> isize = i_size_read(inode);
> if (!(z->z_flags & (ZONEFS_ZONE_READONLY | ZONEFS_ZONE_OFFLINE)) &&
> - !err->write && isize == data_size)
> - return 0;
> + !write && isize == data_size)
> + return;
>
> /*
> * At this point, we detected either a bad zone or an inconsistency
> @@ -292,7 +294,7 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
> * In all cases, warn about inode size inconsistency and handle the
> * IO error according to the zone condition and to the mount options.
> */
> - if (zonefs_zone_is_seq(z) && isize != data_size)
> + if (isize != data_size)
> zonefs_warn(sb,
> "inode %lu: invalid size %lld (should be %lld)\n",
> inode->i_ino, isize, data_size);
> @@ -352,8 +354,6 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
> zonefs_i_size_write(inode, data_size);
> z->z_wpoffset = data_size;
> zonefs_inode_account_active(inode);
> -
> - return 0;
> }
>
> /*
> @@ -367,23 +367,25 @@ void __zonefs_io_error(struct inode *inode, bool write)
> {
> struct zonefs_zone *z = zonefs_inode_zone(inode);
> struct super_block *sb = inode->i_sb;
> - struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> unsigned int noio_flag;
> - unsigned int nr_zones = 1;
> - struct zonefs_ioerr_data err = {
> - .inode = inode,
> - .write = write,
> - };
> + struct blk_zone zone;
> int ret;
>
> /*
> - * The only files that have more than one zone are conventional zone
> - * files with aggregated conventional zones, for which the inode zone
> - * size is always larger than the device zone size.
> + * Conventional zone have no write pointer and cannot become read-only
> + * or offline. So simply fake a report for a single or aggregated zone
> + * and let zonefs_handle_io_error() correct the zone inode information
> + * according to the mount options.
> */
> - if (z->z_size > bdev_zone_sectors(sb->s_bdev))
> - nr_zones = z->z_size >>
> - (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
> + if (!zonefs_zone_is_seq(z)) {
> + zone.start = z->z_sector;
> + zone.len = z->z_size >> SECTOR_SHIFT;
> + zone.wp = zone.start + zone.len;
> + zone.type = BLK_ZONE_TYPE_CONVENTIONAL;
> + zone.cond = BLK_ZONE_COND_NOT_WP;
> + zone.capacity = zone.len;
> + goto handle_io_error;
> + }
>
> /*
> * Memory allocations in blkdev_report_zones() can trigger a memory
> @@ -394,12 +396,20 @@ void __zonefs_io_error(struct inode *inode, bool write)
> * the GFP_NOIO context avoids both problems.
> */
> noio_flag = memalloc_noio_save();
> - ret = blkdev_report_zones(sb->s_bdev, z->z_sector, nr_zones,
> - zonefs_io_error_cb, &err);
> - if (ret != nr_zones)
> + ret = blkdev_report_zones(sb->s_bdev, z->z_sector, 1,
> + zonefs_io_error_cb, &zone);
> + memalloc_noio_restore(noio_flag);
> +
> + if (ret != 1) {
> zonefs_err(sb, "Get inode %lu zone information failed %d\n",
> inode->i_ino, ret);
> - memalloc_noio_restore(noio_flag);
> + zonefs_warn(sb, "remounting filesystem read-only\n");
> + sb->s_flags |= SB_RDONLY;
> + return;
> + }
> +
> +handle_io_error:
> + zonefs_handle_io_error(inode, &zone, write);
> }
>
> static struct kmem_cache *zonefs_inode_cachep;
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
--
Himanshu Madhani Oracle Linux Engineering
prev parent reply other threads:[~2024-02-16 1:06 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-14 6:45 [PATCH] zonefs: Improve error handling Damien Le Moal
2024-02-15 7:56 ` Shinichiro Kawasaki
2024-02-15 10:25 ` Johannes Thumshirn
2024-02-16 1:06 ` Himanshu Madhani [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b4b29ce3-36c4-4422-8346-842322b47516@oracle.com \
--to=himanshu.madhani@oracle.com \
--cc=dlemoal@kernel.org \
--cc=johannes.thumshirn@wdc.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=shinichiro.kawasaki@wdc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).