linux-fsdevel.vger.kernel.org archive mirror
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>, Theodore Ts'o <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: block allocator issue with ext4+DAX
Date: Wed, 30 Mar 2016 16:01:29 -0600
Message-ID: <20160330220129.GA9101@linux.intel.com>

I've hit an issue in my testing which I believe to be related to the ext4
block allocator when using the DAX mount option.  I originally found this
issue with the generic/102 xfstest, but have reduced it to the minimal
reproducer at the bottom of this email.  I've been able to reproduce this with
both BRD and with PMEM as the underlying block device.

For this test we're running in a very small filesystem, only 512 MiB.  We
fallocate() 400 MiB of that space, unlink the file, then try to rewrite that
400 MiB file one chunk at a time.

What actually happens is that during the rewrite we run out of space and the
DAX call to get_block() in dax_io() fails with -ENOSPC.
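
The reproducer at the bottom reports the short write but not the underlying
error, because a short write() returns a byte count and does not set errno.
A minimal sketch, assuming a hypothetical write_all() helper that is not part
of the test: retrying the unwritten remainder should make a later write() fail
outright, at which point errno (presumably ENOSPC in the DAX case) is available
to perror().

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the reproducer): keep calling write()
 * until all `len` bytes are written or a call fails.
 * Returns 0 on success, -1 with errno set by the failing write().
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, just retry */
			return -1;		/* errno says why, e.g. ENOSPC */
		}
		p += n;			/* skip what was written */
		len -= (size_t)n;	/* retry only the remainder */
	}
	return 0;
}
```

In the write loop this would replace the bare write() call, with perror()
printed when write_all() returns -1.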

Here are the steps to reproduce this issue:

  # fdisk -l /dev/ram0
  Disk /dev/ram0: 1 GiB, 1073741824 bytes, 2097152 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 4096 bytes / 4096 bytes
  
  # mkfs.ext4 /dev/ram0 512M
  
  # mount /dev/ram0 /mnt
  
  # gcc -o test test.c
  
  # ./test	# success!
  
  # umount /mnt
  
  # mount -o dax /dev/ram0 /mnt	# requires CONFIG_BLK_DEV_RAM_DAX
  
  # ./test	# failure
  Partial write - only 577536 written

This test succeeds with xfs, ext2, and with ext4 without the DAX mount option.
I've also tried it with O_DIRECT, and that has the same behavior - we succeed
without DAX and fail with DAX.

Another clue is that a sync() call in the middle of the test between the
unlink and the following writes clears up the issue.
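
Untested assumption: since sync() clears the issue, flushing only the
filesystem under test with syncfs(2) might be enough, which would avoid
syncing every mounted filesystem. A sketch with a hypothetical helper name,
not something that has been verified against this bug:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush only the filesystem that contains `path`,
 * using syncfs(2), instead of every mounted filesystem as sync() does.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int flush_fs_containing(const char *path)
{
	int fd, ret, saved;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = syncfs(fd);
	saved = errno;		/* close() may overwrite errno */
	close(fd);
	errno = saved;
	return ret;
}
```

In the reproducer this would be called as flush_fs_containing("/mnt") right
after the unlink().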

Something that might be related is the output in
/proc/fs/ext4/ram0/mb_groups.  Here is that output when we're in a good
state, and the writes will succeed:

#group: free  frags first [ 2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9   2^10  2^11  2^12  2^13  ]
#0    : 30673 1     2095  [ 1     0     0     0     1     0     1     1     1     1     1     0     1     3     ]
#1    : 32735 1     33    [ 1     1     1     1     1     0     1     1     1     1     1     1     1     3     ]
#2    : 28672 1     4096  [ 0     0     0     0     0     0     0     0     0     0     0     0     1     3     ]
#3    : 32735 1     33    [ 1     1     1     1     1     0     1     1     1     1     1     1     1     3     ]

Here is the output in that file when we're in a bad state, and our writes are
about to fail:

#group: free  frags first [ 2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9   2^10  2^11  2^12  2^13  ]
#0    : 18385 1     14383 [ 1     0     0     0     1     0     1     1     1     1     1     0     0     2     ]
#1    : 2015  1     33    [ 1     1     1     1     1     0     1     1     1     1     1     0     0     0     ]
#2    : 0     0     32768 [ 0     0     0     0     0     0     0     0     0     0     0     0     0     0     ]
#3    : 2015  1     33    [ 1     1     1     1     1     0     1     1     1     1     1     0     0     0     ]

It appears as though we've exhausted group #2.  Interestingly, if I run sync()
at this point it takes us from the bad output to the good, which leads me to
believe the newly unlinked blocks in group #2 are finally being freed back
into that group for reallocation or something. (I've clearly reached the
limits of my ext4-fu. :)  )
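
To spot an exhausted group without reading the table by eye, the leading
columns of each row can be parsed directly. A minimal sketch, assuming the
line format shown above; the struct and function names are hypothetical and
the per-order counts inside the brackets are ignored:

```c
#include <stdio.h>

/* Leading columns of one data row in /proc/fs/ext4/<dev>/mb_groups. */
struct mb_group {
	unsigned int group;		/* "#group" column */
	unsigned int free_blocks;	/* "free" column */
	unsigned int frags;		/* "frags" column */
	unsigned int first;		/* "first" column */
};

/*
 * Hypothetical helper: parse one line in the format shown above.
 * Returns 1 for a data row, 0 for anything else (e.g. the header line).
 */
static int parse_mb_group_line(const char *line, struct mb_group *g)
{
	return sscanf(line, "#%u : %u %u %u",
		      &g->group, &g->free_blocks, &g->frags, &g->first) == 4;
}

/* Print every group with no free blocks; returns how many were found. */
static int report_exhausted_groups(FILE *in)
{
	char line[512];
	struct mb_group g;
	int exhausted = 0;

	while (fgets(line, sizeof(line), in)) {
		if (parse_mb_group_line(line, &g) && g.free_blocks == 0) {
			printf("group %u has no free blocks\n", g.group);
			exhausted++;
		}
	}
	return exhausted;
}
```

A caller would fopen() /proc/fs/ext4/ram0/mb_groups and pass the stream to
report_exhausted_groups(). Fed the bad-state output above, it reports only
group 2.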

I'm happy to help test proposed fixes.

Thanks,
- Ross

---
#define _GNU_SOURCE
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MB(a) ((a)*1024ULL*1024)

int main(int argc, char *argv[])
{
	int i, fd, ret;
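	void *buffer;

	/* zero-filled so the writes below never read uninitialized memory */
	buffer = calloc(1, MB(1));
	if (!buffer) {
		perror("calloc");
		return 1;
	}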

	fd = open("/mnt/file", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	ret = fallocate(fd, 0, 0, MB(400));
	if (ret) {
		perror("fallocate");
		return 1;
	}
	close(fd);

	unlink("/mnt/file");

	/* a sync() call here makes the DAX case of this test pass */
//	sync();

	fd = open("/mnt/file", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 400; i++) {
		ret = write(fd, buffer, MB(1));

		if (ret < 0) {
			perror("write");
			return 1;
		} else if (ret != MB(1)) {
			fprintf(stderr, "Partial write - only %d written\n",
					ret);
			return 1;
		}
	}

	close(fd);
	free(buffer);
	return 0;
}

Thread overview: 3+ messages
2016-03-30 22:01 Ross Zwisler [this message]
2016-03-31  8:59 ` block allocator issue with ext4+DAX Jan Kara
2016-03-31 15:13   ` Ross Zwisler