[PATCH 00/11] reiserfs: xattr rework

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 00/11] reiserfs: xattr rework
@ 2006-02-20 20:14 Jeff Mahoney
  2006-03-01 12:34 ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Mahoney @ 2006-02-20 20:14 UTC (permalink / raw)
  To: ReiserFS List; +Cc: E. Gryaznova

Hello all -

Following this message is a series of 11 patches that rework the reiserfs
xattr code. The current implementation open codes a number of functions that
are well tested and stable elsewhere, but will slight modifications. It also
does a number of things suboptimally, such as operations that affect all of
the xattrs associated with an inode, as well as not handling journalling as
well as can be.

I've run them through a weekend of stress testing successfully, but I'd like
some additional testing before considering them safe.

Here's the run down:

* 01 - Make internal xattr operations use vfs ops
  This eliminates the open coding of the read/write loops in favor of using
  vfs_read and vfs_write. Performance-wise, it's very little difference
  from the open coding, and implementation-wise, it's more tested. Yes, this
  violates the no-file-io-in-the-kernel rules, but this rule has been violated
  since the day the original patches were accepted.

* 02 - Simplify xattr internal file lookups/opens
  This patch uses vfs-level operations where possible, using the i_mutex
  to protect directories rather than a special xattr lock.

* 03 - Eliminate per-super xattr lock
  Since we use i_mutex, we can eliminate the xattr lock, removing the locking
  constraints from namei.c

* 04 - Make per-inode xattr locking more fine grained
  The better locking allows us to tighten the critical sections that must
  be protected against concurrent access. Writers protect against concurrency
  using the owning inode's i_mutex, but readers still need protection against
  writers. This patch pushes the xattr_sem down to surround just the operation
  on the xattr itself.

* 05 - Make i_has_xattr_dir flag more sane
  With the simplification of open/lookups, there is now only one place for
  the flag to be set and one for it to be cleared.

* 06 - Use generic xattr handlers
  This patch moves the reiserfs xattr handlers to the generic VFS supplied
  struct xattr_handler. It uses the generic handlers where possible, but
  due to the generic handlers splitting the prefix from the name, that's not
  always possible since ReiserFS requires that the prefix be preserved.

* 07 - Use O_NOATIME for internal files
  This patch adds O_NOATIME for internal files, since it is never exported and
  it saves us adding another transaction for the dirty inode. It also means
  that on read operations, we don't start a transaction with an xattr sem
  held when the inode is dirtied.

* 08 - Add per-file data=ordered mode and use it for xattrs
  Despite being metadata, extended attributes are written with the same
  data logging policy as the rest of the file system. This patch adds a 
  per-file data=ordered mode and uses it when the file system is using
  data=writeback.

* 09 - Journaled xattrs
  This is the big patch. This eliminates the deadlocks that can still occur
  with xattrs on ReiserFS. The basic rule is that before doing a write
  operation on the xattr, a transaction must be started. In order to do this,
  reiserfs_xattr_set is split into two functions; One wraps the other in a
  transaction for the ->setxattr calls, and the other is used directly by
  internal functions like those which implement ACLs. This also alters the
  return value from reiserfs_cache_default_acl to contain the number of blocks
  required to handle the additional blocks required by inheriting the default
  ACL.

* 10 - Use generic readdir for operations across all xattrs
  One of the most wasteful operations with the existing code is operations
  that traverse all the extended attributes for a given inode. Since they are
  stored as regular files, it makes sense to use ->readdir to enumerate them.
  However, since the operations may modify the file system, the cached path
  must be dropped before the filldir routine is called. This means that
  the path must be re-read for each iteration. This patch leverages the
  generic readdir function to simply fill and return an array of dentries
  that can be operated on after the path has already been freed. ReiserFS's
  hashed directories really come in handy here, since the directory offset
  doesn't change when entries are deleted. Therefore, the old code which was
  quite involved and full of duplication is reduced to a single for_each_xattr
  routine that calls out to individual delete or chown functions as needed.

* 11 - Associate xattr's packing locality w/ owning inode's
  Since there isn't any linkage between an inode and its xattr directory, it's
  possible for the xattrs to be placed quite far away from the inode on disk.
  This patch leverages the dentry's fsdata pointer to allow reiserfs_new_inode
  to use the owning inode's dir id in determining where to place the new
  objects.

  -Jeff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-02-20 20:14 [PATCH 00/11] reiserfs: xattr rework Jeff Mahoney
@ 2006-03-01 12:34 ` Jan Kara
  2006-03-02 21:53   ` Jeff Mahoney
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2006-03-01 12:34 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: ReiserFS List, E. Gryaznova

  Hello,

> Following this message is a series of 11 patches that rework the reiserfs
> xattr code. The current implementation open codes a number of functions that
> are well tested and stable elsewhere, but will slight modifications. It also
> does a number of things suboptimally, such as operations that affect all of
> the xattrs associated with an inode, as well as not handling journalling as
> well as can be.
> 
> I've run them through a weekend of stress testing successfully, but I'd like
> some additional testing before considering them safe.
> 
> Here's the run down:
> 
> * 01 - Make internal xattr operations use vfs ops
>   This eliminates the open coding of the read/write loops in favor of using
>   vfs_read and vfs_write. Performance-wise, it's very little difference
>   from the open coding, and implementation-wise, it's more tested. Yes, this
>   violates the no-file-io-in-the-kernel rules, but this rule has been violated
>   since the day the original patches were accepted.
  I'm not completely sure but from briefly looking at the patches I
think there might be the following problem: you first start a transaction
and then call a VFS write operation. That will lock a page it wants to
write. But if bdflush works on the xattr file and sends some dirty data
to disk, it will first take page lock and then start the transaction.
This introduces essentially a lock inversion (as starting a transaction
behaves like taking a lock) and hence deadlocks... I've been solving
these for the quota code some time ago (quota also has similar needs as
your xattr code) - I really had some deadlock reports for heavily loaded
machines. There the only reasonable solution was an extra write
functions bypassing the page cache. Maybe in your case you can solve the
problems differently as you don't need the working solution for all the
filesystems but just for Reiserfs. 

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-01 12:34 ` Jan Kara
@ 2006-03-02 21:53   ` Jeff Mahoney
  2006-03-03  0:29     ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Mahoney @ 2006-03-02 21:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: ReiserFS List, E. Gryaznova

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jan Kara wrote:
>   Hello,
> 
>> Following this message is a series of 11 patches that rework the reiserfs
>> xattr code. The current implementation open codes a number of functions that
>> are well tested and stable elsewhere, but will slight modifications. It also
>> does a number of things suboptimally, such as operations that affect all of
>> the xattrs associated with an inode, as well as not handling journalling as
>> well as can be.
>>
>> I've run them through a weekend of stress testing successfully, but I'd like
>> some additional testing before considering them safe.
>>
>> Here's the run down:
>>
>> * 01 - Make internal xattr operations use vfs ops
>>   This eliminates the open coding of the read/write loops in favor of using
>>   vfs_read and vfs_write. Performance-wise, it's very little difference
>>   from the open coding, and implementation-wise, it's more tested. Yes, this
>>   violates the no-file-io-in-the-kernel rules, but this rule has been violated
>>   since the day the original patches were accepted.
>   I'm not completely sure but from briefly looking at the patches I
> think there might be the following problem: you first start a transaction
> and then call a VFS write operation. That will lock a page it wants to
> write. But if bdflush works on the xattr file and sends some dirty data
> to disk, it will first take page lock and then start the transaction.
> This introduces essentially a lock inversion (as starting a transaction
> behaves like taking a lock) and hence deadlocks... I've been solving
> these for the quota code some time ago (quota also has similar needs as
> your xattr code) - I really had some deadlock reports for heavily loaded
> machines. There the only reasonable solution was an extra write
> functions bypassing the page cache. Maybe in your case you can solve the
> problems differently as you don't need the working solution for all the
> filesystems but just for Reiserfs. 

Sigh. Ok. The way you describe it definitely makes it a lock inversion
issue. I haven't run into it yet, but as you say, it occurs on heavily
loaded machines. I've done some load testing, but apparently not enough,
since your analysis is sound.

But, I think there is a silver lining after all. It sounds like you've
already worked around these issues for the journaled quota code. What do
you think about turning the quota read/write functions into something
more generic and using that for xattrs as well as quotas?

Ultimately, I think that quota files and xattrs are the same things -
"normal" files read from and written to during the I/O path. The changes
to journaled quotas would be minimal - just turn
reiserfs_quota_{read,write} into small wrappers that make the
type->inode mapping and then call the remainder of the existing code as
reiserfs_internal_{read,write} with the appropriate inode. The xattr
code could juse the reiserfs_internal_{read,write} similarly and get all
the deadlock avoidance work you've already done for free.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEB2l0LPWxlyuTD7IRAsuaAKCHnEKlcgxcIGEV5n5f3m4CY5c/pACeK4CH
h1cj7wLkY+irYbaagmUHtm4=
=o4nE
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-02 21:53   ` Jeff Mahoney
@ 2006-03-03  0:29     ` Jan Kara
  2006-03-03  0:56       ` Jeff Mahoney
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2006-03-03  0:29 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: ReiserFS List, E. Gryaznova

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Jan Kara wrote:
> >   Hello,
> > 
> >> Following this message is a series of 11 patches that rework the reiserfs
> >> xattr code. The current implementation open codes a number of functions that
> >> are well tested and stable elsewhere, but will slight modifications. It also
> >> does a number of things suboptimally, such as operations that affect all of
> >> the xattrs associated with an inode, as well as not handling journalling as
> >> well as can be.
> >>
> >> I've run them through a weekend of stress testing successfully, but I'd like
> >> some additional testing before considering them safe.
> >>
> >> Here's the run down:
> >>
> >> * 01 - Make internal xattr operations use vfs ops
> >>   This eliminates the open coding of the read/write loops in favor of using
> >>   vfs_read and vfs_write. Performance-wise, it's very little difference
> >>   from the open coding, and implementation-wise, it's more tested. Yes, this
> >>   violates the no-file-io-in-the-kernel rules, but this rule has been violated
> >>   since the day the original patches were accepted.
> >   I'm not completely sure but from briefly looking at the patches I
> > think there might be the following problem: you first start a transaction
> > and then call a VFS write operation. That will lock a page it wants to
> > write. But if bdflush works on the xattr file and sends some dirty data
> > to disk, it will first take page lock and then start the transaction.
> > This introduces essentially a lock inversion (as starting a transaction
> > behaves like taking a lock) and hence deadlocks... I've been solving
> > these for the quota code some time ago (quota also has similar needs as
> > your xattr code) - I really had some deadlock reports for heavily loaded
> > machines. There the only reasonable solution was an extra write
> > functions bypassing the page cache. Maybe in your case you can solve the
> > problems differently as you don't need the working solution for all the
> > filesystems but just for Reiserfs. 
> 
> Sigh. Ok. The way you describe it definitely makes it a lock inversion
> issue. I haven't run into it yet, but as you say, it occurs on heavily
> loaded machines. I've done some load testing, but apparently not enough,
> since your analysis is sound.
> 
> But, I think there is a silver lining after all. It sounds like you've
> already worked around these issues for the journaled quota code. What do
> you think about turning the quota read/write functions into something
> more generic and using that for xattrs as well as quotas?
> 
> Ultimately, I think that quota files and xattrs are the same things -
> "normal" files read from and written to during the I/O path. The changes
> to journaled quotas would be minimal - just turn
> reiserfs_quota_{read,write} into small wrappers that make the
> type->inode mapping and then call the remainder of the existing code as
> reiserfs_internal_{read,write} with the appropriate inode. The xattr
> code could juse the reiserfs_internal_{read,write} similarly and get all
> the deadlock avoidance work you've already done for free.
  You are right that the quota code and xattrs need to do the same thing.
We only need to do slight interface changes (currently functions take a
superblock and a type and pick appropriate quota inode themselves) and
function renaming. I would vote for renaming the s_op->quota_{read,write}
to s_op->internal_{read,write} and pass appropriate inode directly from
the quota code. The only thing I'm not sure about is how to deal with the
journaling mode - quota code either uses data journaling or just ordered
mode depending on mount options (journaled / non-journaled quota). So we
probably also need to pass the journaling mode to the write function.
  BTW: note that using these functions bypassing page cache means that
userspace really should not touch these files. It is asking for data
corruption. Quota code does during quotaon sync the quota inode and
set it as immutable to prevent accidents. Also during quotaoff it flushes
the page cache of the inode so that userspace is able to see the changes
made by kernel. I guess something similar will be needed for xattrs too.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-03  0:29     ` Jan Kara
@ 2006-03-03  0:56       ` Jeff Mahoney
  2006-03-03 10:04         ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Mahoney @ 2006-03-03  0:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: ReiserFS List, E. Gryaznova

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jan Kara wrote:
>   You are right that the quota code and xattrs need to do the same thing.
> We only need to do slight interface changes (currently functions take a
> superblock and a type and pick appropriate quota inode themselves) and
> function renaming. I would vote for renaming the s_op->quota_{read,write}
> to s_op->internal_{read,write} and pass appropriate inode directly from
> the quota code. The only thing I'm not sure about is how to deal with the
> journaling mode - quota code either uses data journaling or just ordered
> mode depending on mount options (journaled / non-journaled quota). So we
> probably also need to pass the journaling mode to the write function.
>   BTW: note that using these functions bypassing page cache means that
> userspace really should not touch these files. It is asking for data
> corruption. Quota code does during quotaon sync the quota inode and
> set it as immutable to prevent accidents. Also during quotaoff it flushes
> the page cache of the inode so that userspace is able to see the changes
> made by kernel. I guess something similar will be needed for xattrs too.

If you feel that changing the entire quota system to reflect the change
is a good plan, that's your call. Personally, I'd like to keep the
patches as small as possible, but if you think there is a need for
internal_{read,write} elsewhere, I wouldn't object.

The data journaling mode can be set as a flag associated with the inode.
 Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
i_data_ordered in one of my later patches. They can be tested easily
with reiserfs_file_data_{log,ordered}. There's no reason that one
couldn't be moved up and made a prerequisite for the first patch.

There is one difference between quota files and xattrs: The quota files
are visible to userspace, xattrs are completely hidden. There's nothing
needed to flush anything to the page cache.

I'll work up a patch locally and do some testing.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEB5RPLPWxlyuTD7IRAn94AJwKvmGcT09QtjcBOFJoyd6JrxRTywCeN5nE
giNMMOkNlSsuGCwR6ad9/ao=
=DOMs
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-03  0:56       ` Jeff Mahoney
@ 2006-03-03 10:04         ` Jan Kara
  2006-03-03 22:17           ` Jeff Mahoney
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2006-03-03 10:04 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: Jan Kara, ReiserFS List, E. Gryaznova

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Jan Kara wrote:
> >   You are right that the quota code and xattrs need to do the same thing.
> > We only need to do slight interface changes (currently functions take a
> > superblock and a type and pick appropriate quota inode themselves) and
> > function renaming. I would vote for renaming the s_op->quota_{read,write}
> > to s_op->internal_{read,write} and pass appropriate inode directly from
> > the quota code. The only thing I'm not sure about is how to deal with the
> > journaling mode - quota code either uses data journaling or just ordered
> > mode depending on mount options (journaled / non-journaled quota). So we
> > probably also need to pass the journaling mode to the write function.
> >   BTW: note that using these functions bypassing page cache means that
> > userspace really should not touch these files. It is asking for data
> > corruption. Quota code does during quotaon sync the quota inode and
> > set it as immutable to prevent accidents. Also during quotaoff it flushes
> > the page cache of the inode so that userspace is able to see the changes
> > made by kernel. I guess something similar will be needed for xattrs too.
> 
> If you feel that changing the entire quota system to reflect the change
> is a good plan, that's your call. Personally, I'd like to keep the
> patches as small as possible, but if you think there is a need for
> internal_{read,write} elsewhere, I wouldn't object.
  OK, when I'm thinking about it in the morning, you're probably right.
And I can do the bigger change if I see that more filesystems would use
it too.

> The data journaling mode can be set as a flag associated with the inode.
>  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
> i_data_ordered in one of my later patches. They can be tested easily
> with reiserfs_file_data_{log,ordered}. There's no reason that one
> couldn't be moved up and made a prerequisite for the first patch.
  Fine. So we can just set proper journaling flags in reiserfs_quota_on
and then honor them in the internal writing functions.

								Honza
  
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-03 10:04         ` Jan Kara
@ 2006-03-03 22:17           ` Jeff Mahoney
  2006-03-06 11:59             ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Mahoney @ 2006-03-03 22:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: ReiserFS List, E. Gryaznova

[-- Attachment #1: Type: text/plain, Size: 1322 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jan Kara wrote:
>> The data journaling mode can be set as a flag associated with the inode.
>>  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
>> i_data_ordered in one of my later patches. They can be tested easily
>> with reiserfs_file_data_{log,ordered}. There's no reason that one
>> couldn't be moved up and made a prerequisite for the first patch.
>   Fine. So we can just set proper journaling flags in reiserfs_quota_on
> and then honor them in the internal writing functions.

Ok, how do the attached patches look to you? The internal I/O changes
need to be applied after the journaled xattr patch or we get an Oops
trying to start a transaction without calling reiserfs_write_lock()
first. I've modified the first patch in the xattr series to abstract out
the fp->f_op->{read,write} calls to an xattr_{read,write} pair of
functions. This makes it easier to move to the internal i/o code later.
I've included it for clarity, but that is the only change.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFECMCALPWxlyuTD7IRAlqfAJ0bQnHuNqzEif4hVfvGKI8tR2bUrACfU1Mg
BFbe0xayAJHhvNgtGl7N6Jk=
=P6ED
-----END PGP SIGNATURE-----

[-- Attachment #2: reiserfs-10-internal-file-io.diff --]
[-- Type: text/x-patch, Size: 9368 bytes --]

From: Jeff Mahoney <jeffm@suse.com>
Subject: [PATCH 10/13] reiserfs: make quota file i/o routines generic

 Jan Kara, with his work on journaled quotas, discovered some conditions
 were lock inversions can occur between pdflush and the internal file i/o
 paths when transactions are involved.

 This patch modifies the quota code in a few ways:
 1) Makes the reiserfs_quota_{read,write} generic and adds wrappers to
    map the quota type to the appropriate inode.
 2) Uses the i_data_log flag to mark whether the quota file is journaled
    or not. This is the standard way of doing this, and allows the generic
    code to work without knowledge of quotas.

 fs/reiserfs/super.c         |  212 ++++++++++++++++++++++++--------------------
 include/linux/reiserfs_fs.h |    5 +
 2 files changed, 122 insertions(+), 95 deletions(-)

Signed-off-by: Jeff Mahoney <jeffm@suse.com>

diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c linux-2.6.15-staging2/fs/reiserfs/super.c
--- linux-2.6.15-staging1/fs/reiserfs/super.c	2006-03-03 17:09:01.000000000 -0500
+++ linux-2.6.15-staging2/fs/reiserfs/super.c	2006-03-03 17:09:04.000000000 -0500
@@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_
 	return 0;
 }
 
+#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR)
+/* Read data from quotafile - avoid pagecache and such because we cannot afford
+ * acquiring the locks... As quota files are never truncated and quota code
+ * itself serializes the operations (and noone else should touch the files)
+ * we don't have to be afraid of races */
+ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len,
+                               loff_t off)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned long blk = off >> sb->s_blocksize_bits;
+	int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
+	size_t toread;
+	struct buffer_head tmp_bh, *bh;
+	loff_t i_size = i_size_read(inode);
+
+	if (off > i_size)
+		return 0;
+	if (off + len > i_size)
+		len = i_size - off;
+	toread = len;
+	while (toread > 0) {
+		tocopy =
+		    sb->s_blocksize - offset <
+		    toread ? sb->s_blocksize - offset : toread;
+		tmp_bh.b_state = 0;
+		/* Quota files are without tails so we can safely use this function */
+		reiserfs_write_lock(sb);
+		err = reiserfs_get_block(inode, blk, &tmp_bh, 0);
+		reiserfs_write_unlock(sb);
+		if (err)
+			return err;
+		if (!buffer_mapped(&tmp_bh))	/* A hole? */
+			memset(data, 0, tocopy);
+		else {
+			bh = sb_bread(sb, tmp_bh.b_blocknr);
+			if (!bh)
+				return -EIO;
+			memcpy(data, bh->b_data + offset, tocopy);
+			brelse(bh);
+		}
+		offset = 0;
+		toread -= tocopy;
+		data += tocopy;
+		blk++;
+	}
+	return len;
+}
+
+/* Write to quotafile (we know the transaction is already started and has
+ * enough credits) */
+ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
+                                size_t len, loff_t off)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned long blk = off >> sb->s_blocksize_bits;
+	int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
+	size_t towrite = len;
+	struct buffer_head tmp_bh, *bh;
+
+	mutex_lock(&inode->i_mutex);
+	while (towrite > 0) {
+		tocopy = sb->s_blocksize - offset < towrite ?
+		    sb->s_blocksize - offset : towrite;
+		tmp_bh.b_state = 0;
+		err = reiserfs_get_block(inode, blk, &tmp_bh, GET_BLOCK_CREATE);
+		if (err)
+			goto out;
+		if (offset || tocopy != sb->s_blocksize)
+			bh = sb_bread(sb, tmp_bh.b_blocknr);
+		else
+			bh = sb_getblk(sb, tmp_bh.b_blocknr);
+		if (!bh) {
+			err = -EIO;
+			goto out;
+		}
+		lock_buffer(bh);
+		memcpy(bh->b_data + offset, data, tocopy);
+		flush_dcache_page(bh->b_page);
+		set_buffer_uptodate(bh);
+		unlock_buffer(bh);
+		reiserfs_prepare_for_journal(sb, bh, 1);
+		journal_mark_dirty(current->journal_info, sb, bh);
+		if (!reiserfs_file_data_log(inode))
+			reiserfs_add_ordered_list(inode, bh);
+		brelse(bh);
+		offset = 0;
+		towrite -= tocopy;
+		data += tocopy;
+		blk++;
+	}
+      out:
+	if (len == towrite)
+		return err;
+	if (inode->i_size < off + len - towrite)
+		i_size_write(inode, off + len - towrite);
+	inode->i_version++;
+	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	mark_inode_dirty(inode);
+	mutex_unlock(&inode->i_mutex);
+	return len - towrite;
+}
+#endif
+
 #ifdef CONFIG_QUOTA
 static int reiserfs_dquot_initialize(struct inode *inode, int type)
 {
@@ -2097,8 +2200,12 @@ static int reiserfs_write_info(struct su
  */
 static int reiserfs_quota_on_mount(struct super_block *sb, int type)
 {
-	return vfs_quota_on_mount(sb, REISERFS_SB(sb)->s_qf_names[type],
-				  REISERFS_SB(sb)->s_jquota_fmt, type);
+	int ret = vfs_quota_on_mount(sb, REISERFS_SB(sb)->s_qf_names[type],
+				     REISERFS_SB(sb)->s_jquota_fmt, type);
+	if (ret == 0)
+		REISERFS_I(sb_dqopt(sb)->files[type])->i_flags |= i_data_log;
+
+	return ret;
 }
 
 /*
@@ -2139,109 +2246,24 @@ static int reiserfs_quota_on(struct supe
 				 "reiserfs: Quota file not on filesystem root. "
 				 "Journalled quota will not work.");
 	path_release(&nd);
-	return vfs_quota_on(sb, type, format_id, path);
+	err = vfs_quota_on(sb, type, format_id, path);
+	if (err == 0)
+		REISERFS_I(sb_dqopt(sb)->files[type])->i_flags |= i_data_log;
+	return err;
 }
 
-/* Read data from quotafile - avoid pagecache and such because we cannot afford
- * acquiring the locks... As quota files are never truncated and quota code
- * itself serializes the operations (and noone else should touch the files)
- * we don't have to be afraid of races */
-static ssize_t reiserfs_quota_read(struct super_block *sb, int type, char *data,
-				   size_t len, loff_t off)
+static ssize_t reiserfs_quota_read(struct super_block *sb, int type,
+                                   char *data, size_t len, loff_t off)
 {
 	struct inode *inode = sb_dqopt(sb)->files[type];
-	unsigned long blk = off >> sb->s_blocksize_bits;
-	int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
-	size_t toread;
-	struct buffer_head tmp_bh, *bh;
-	loff_t i_size = i_size_read(inode);
-
-	if (off > i_size)
-		return 0;
-	if (off + len > i_size)
-		len = i_size - off;
-	toread = len;
-	while (toread > 0) {
-		tocopy =
-		    sb->s_blocksize - offset <
-		    toread ? sb->s_blocksize - offset : toread;
-		tmp_bh.b_state = 0;
-		/* Quota files are without tails so we can safely use this function */
-		reiserfs_write_lock(sb);
-		err = reiserfs_get_block(inode, blk, &tmp_bh, 0);
-		reiserfs_write_unlock(sb);
-		if (err)
-			return err;
-		if (!buffer_mapped(&tmp_bh))	/* A hole? */
-			memset(data, 0, tocopy);
-		else {
-			bh = sb_bread(sb, tmp_bh.b_blocknr);
-			if (!bh)
-				return -EIO;
-			memcpy(data, bh->b_data + offset, tocopy);
-			brelse(bh);
-		}
-		offset = 0;
-		toread -= tocopy;
-		data += tocopy;
-		blk++;
-	}
-	return len;
+	return reiserfs_internal_read(inode, data, len, off);
 }
 
-/* Write to quotafile (we know the transaction is already started and has
- * enough credits) */
 static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
 				    const char *data, size_t len, loff_t off)
 {
 	struct inode *inode = sb_dqopt(sb)->files[type];
-	unsigned long blk = off >> sb->s_blocksize_bits;
-	int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
-	int journal_quota = REISERFS_SB(sb)->s_qf_names[type] != NULL;
-	size_t towrite = len;
-	struct buffer_head tmp_bh, *bh;
-
-	mutex_lock(&inode->i_mutex);
-	while (towrite > 0) {
-		tocopy = sb->s_blocksize - offset < towrite ?
-		    sb->s_blocksize - offset : towrite;
-		tmp_bh.b_state = 0;
-		err = reiserfs_get_block(inode, blk, &tmp_bh, GET_BLOCK_CREATE);
-		if (err)
-			goto out;
-		if (offset || tocopy != sb->s_blocksize)
-			bh = sb_bread(sb, tmp_bh.b_blocknr);
-		else
-			bh = sb_getblk(sb, tmp_bh.b_blocknr);
-		if (!bh) {
-			err = -EIO;
-			goto out;
-		}
-		lock_buffer(bh);
-		memcpy(bh->b_data + offset, data, tocopy);
-		flush_dcache_page(bh->b_page);
-		set_buffer_uptodate(bh);
-		unlock_buffer(bh);
-		reiserfs_prepare_for_journal(sb, bh, 1);
-		journal_mark_dirty(current->journal_info, sb, bh);
-		if (!journal_quota)
-			reiserfs_add_ordered_list(inode, bh);
-		brelse(bh);
-		offset = 0;
-		towrite -= tocopy;
-		data += tocopy;
-		blk++;
-	}
-      out:
-	if (len == towrite)
-		return err;
-	if (inode->i_size < off + len - towrite)
-		i_size_write(inode, off + len - towrite);
-	inode->i_version++;
-	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-	mark_inode_dirty(inode);
-	mutex_unlock(&inode->i_mutex);
-	return len - towrite;
+	return reiserfs_internal_write(inode, data, len, off);
 }
 
 #endif
diff -ruNpX ../dontdiff linux-2.6.15-staging1/include/linux/reiserfs_fs.h linux-2.6.15-staging2/include/linux/reiserfs_fs.h
--- linux-2.6.15-staging1/include/linux/reiserfs_fs.h	2006-03-03 17:09:03.000000000 -0500
+++ linux-2.6.15-staging2/include/linux/reiserfs_fs.h	2006-03-03 17:09:04.000000000 -0500
@@ -2188,4 +2188,9 @@ int reiserfs_ioctl(struct inode *inode, 
 #define reiserfs_write_lock( sb ) lock_kernel()
 #define reiserfs_write_unlock( sb ) unlock_kernel()
 
+ssize_t reiserfs_internal_read(struct inode *inode, char *data,
+                                size_t len, loff_t off);
+ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
+                                size_t len, loff_t off);
+
 #endif				/* _LINUX_REISER_FS_H */

[-- Attachment #3: reiserfs-11-use-internal-file-io.diff --]
[-- Type: text/x-patch, Size: 1537 bytes --]

From: Jeff Mahoney <jeffm@suse.com>
Subject: [PATCH 11/13] reiserfs: use generic internal i/o methods for xattrs

 This patch modifies the xattr code to use the generic internal file i/o
 methods from the previous patch.

 fs/reiserfs/xattr.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

Signed-off-by: Jeff Mahoney <jeffm@suse.com>

diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/xattr.c linux-2.6.15-staging2/fs/reiserfs/xattr.c
--- linux-2.6.15-staging1/fs/reiserfs/xattr.c	2006-03-03 17:09:04.000000000 -0500
+++ linux-2.6.15-staging2/fs/reiserfs/xattr.c	2006-03-03 17:09:05.000000000 -0500
@@ -415,26 +415,26 @@ static inline __u32 xattr_hash(const cha
 static ssize_t xattr_read(struct file *file, char *data, size_t count,
                           loff_t *pos)
 {
-	mm_segment_t old_fs = get_fs();
 	ssize_t err;
+	struct inode *inode = file->f_dentry->d_inode;
 
-	set_fs(KERNEL_DS);
-	err = file->f_op->read(file, data, count, pos);
-	set_fs(old_fs);
+	err = reiserfs_internal_read(file, data, count, *pos);
 
+	if (err >= 0)
+		*pos += err;
 	return err;
 }
 
 static ssize_t xattr_write(struct file *file, const char *data, size_t count,
                            loff_t *pos)
 {
-	mm_segment_t old_fs = get_fs();
 	ssize_t err;
+	struct inode *inode = file->f_dentry->d_inode;
 
-	set_fs(KERNEL_DS);
-	err = file->f_op->write(file, data, count, pos);
-	set_fs(old_fs);
+	err = reiserfs_internal_write(file, data, count, *pos);
 
+	if (err >= 0)
+		*pos += err;
 	return err;
 }
 

[-- Attachment #4: reiserfs-01-make-internal-xattr-operations-use-vfs-ops.diff --]
[-- Type: text/x-patch, Size: 9139 bytes --]

From: Jeff Mahoney <jeffm@suse.com>
Subject: [PATCH 01/13] reiserfs: make internal xattr operations use vfs ops

 The reiserfs xattr code makes use of hidden files to contain the xattrs.
 The writing of these files is needlessly complex and can be replaced by
 vfs-level operations.

 fs/reiserfs/xattr.c |  234 +++++++++++++++++++++-------------------------------
 1 files changed, 95 insertions(+), 139 deletions(-)

Signed-off-by: Jeff Mahoney <jeffm@suse.com>

diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/xattr.c linux-2.6.15-staging2/fs/reiserfs/xattr.c
--- linux-2.6.15-staging1/fs/reiserfs/xattr.c	2006-03-02 12:09:57.000000000 -0500
+++ linux-2.6.15-staging2/fs/reiserfs/xattr.c	2006-03-03 17:08:53.000000000 -0500
@@ -438,44 +438,38 @@ int xattr_readdir(struct file *file, fil
 	return res;
 }
 
-/* Internal operations on file data */
-static inline void reiserfs_put_page(struct page *page)
+static inline __u32 xattr_hash(const char *msg, int len)
 {
-	kunmap(page);
-	page_cache_release(page);
+	return csum_partial(msg, len, 0);
 }
 
-static struct page *reiserfs_get_page(struct inode *dir, unsigned long n)
+/* Generic extended attribute operations that can be used by xa plugins */
+
+static ssize_t xattr_read(struct file *file, char *data, size_t count,
+                          loff_t *pos)
 {
-	struct address_space *mapping = dir->i_mapping;
-	struct page *page;
-	/* We can deadlock if we try to free dentries,
-	   and an unlink/rmdir has just occured - GFP_NOFS avoids this */
-	mapping_set_gfp_mask(mapping, GFP_NOFS);
-	page = read_cache_page(mapping, n,
-			       (filler_t *) mapping->a_ops->readpage, NULL);
-	if (!IS_ERR(page)) {
-		wait_on_page_locked(page);
-		kmap(page);
-		if (!PageUptodate(page))
-			goto fail;
-
-		if (PageError(page))
-			goto fail;
-	}
-	return page;
-
-      fail:
-	reiserfs_put_page(page);
-	return ERR_PTR(-EIO);
+	mm_segment_t old_fs = get_fs();
+	ssize_t err;
+
+	set_fs(KERNEL_DS);
+	err = file->f_op->read(file, data, count, pos);
+	set_fs(old_fs);
+
+	return err;
 }
 
-static inline __u32 xattr_hash(const char *msg, int len)
+static ssize_t xattr_write(struct file *file, const char *data, size_t count,
+                           loff_t *pos)
 {
-	return csum_partial(msg, len, 0);
-}
+	mm_segment_t old_fs = get_fs();
+	ssize_t err;
 
-/* Generic extended attribute operations that can be used by xa plugins */
+	set_fs(KERNEL_DS);
+	err = file->f_op->write(file, data, count, pos);
+	set_fs(old_fs);
+
+	return err;
+}
 
 /*
  * inode->i_mutex: down
@@ -486,21 +480,27 @@ reiserfs_xattr_set(struct inode *inode, 
 {
 	int err = 0;
 	struct file *fp;
-	struct page *page;
-	char *data;
-	struct address_space *mapping;
-	size_t file_pos = 0;
-	size_t buffer_pos = 0;
 	struct inode *xinode;
-	struct iattr newattrs;
-	__u32 xahash = 0;
+	loff_t pos = 0;
+	struct reiserfs_xattr_header rxh = {
+		.h_magic = __constant_cpu_to_le32(REISERFS_XATTR_MAGIC),
+	};
+	gfp_t gfp_mask;
 
 	if (get_inode_sd_version(inode) == STAT_DATA_V1)
 		return -EOPNOTSUPP;
 
-	/* Empty xattrs are ok, they're just empty files, no hash */
-	if (buffer && buffer_size)
-		xahash = xattr_hash(buffer, buffer_size);
+	if (!buffer) {
+		struct reiserfs_xattr_handler *xah =
+					find_xattr_handler_prefix(name);
+		/* Deletion pre-operation */
+		if (xah->del) {
+			err = xah->del(inode, name);
+			if (err)
+				goto out;
+		}
+		return reiserfs_xattr_del(inode, name);
+	}
 
       open_file:
 	fp = open_xa_file(inode, name, flags);
@@ -524,62 +524,47 @@ reiserfs_xattr_set(struct inode *inode, 
 		goto open_file;
 	}
 
-	/* Resize it so we're ok to write there */
-	newattrs.ia_size = buffer_size;
-	newattrs.ia_valid = ATTR_SIZE | ATTR_CTIME;
-	mutex_lock(&xinode->i_mutex);
-	err = notify_change(fp->f_dentry, &newattrs);
-	if (err)
+	rxh.h_hash = le32_to_cpu (xattr_hash(buffer, buffer_size));
+
+	/* We've got a transaction open - we don't want to be dropping down
+	 * into other file system code. Not the nicest thing, but it's not
+	 * like this code is poetry to begin with. */
+	gfp_mask = mapping_gfp_mask(xinode->i_mapping);
+	mapping_set_gfp_mask(xinode->i_mapping, gfp_mask & ~__GFP_FS);
+
+	err = xattr_write(fp, (char *)&rxh, sizeof (rxh), &pos);
+
+	if (err != sizeof (rxh)) {
+		if (err > 0)
+			err = -EIO;
 		goto out_filp;
+	}
 
-	mapping = xinode->i_mapping;
-	while (buffer_pos < buffer_size || buffer_pos == 0) {
-		size_t chunk;
-		size_t skip = 0;
-		size_t page_offset = (file_pos & (PAGE_CACHE_SIZE - 1));
-		if (buffer_size - buffer_pos > PAGE_CACHE_SIZE)
-			chunk = PAGE_CACHE_SIZE;
-		else
-			chunk = buffer_size - buffer_pos;
-
-		page = reiserfs_get_page(xinode, file_pos >> PAGE_CACHE_SHIFT);
-		if (IS_ERR(page)) {
-			err = PTR_ERR(page);
-			goto out_filp;
-		}
 
-		lock_page(page);
-		data = page_address(page);
+	if (buffer_size) {
+		err = xattr_write(fp, buffer, buffer_size, &pos);
 
-		if (file_pos == 0) {
-			struct reiserfs_xattr_header *rxh;
-			skip = file_pos = sizeof(struct reiserfs_xattr_header);
-			if (chunk + skip > PAGE_CACHE_SIZE)
-				chunk = PAGE_CACHE_SIZE - skip;
-			rxh = (struct reiserfs_xattr_header *)data;
-			rxh->h_magic = cpu_to_le32(REISERFS_XATTR_MAGIC);
-			rxh->h_hash = cpu_to_le32(xahash);
+		if (err != buffer_size) {
+			if (err > 0)
+				err = -EIO;
+			goto out_filp;
 		}
+	}
 
-		err = mapping->a_ops->prepare_write(fp, page, page_offset,
-						    page_offset + chunk + skip);
-		if (!err) {
-			if (buffer)
-				memcpy(data + skip, buffer + buffer_pos, chunk);
-			err =
-			    mapping->a_ops->commit_write(fp, page, page_offset,
-							 page_offset + chunk +
-							 skip);
-		}
-		unlock_page(page);
-		reiserfs_put_page(page);
-		buffer_pos += chunk;
-		file_pos += chunk;
-		skip = 0;
-		if (err || buffer_size == 0 || !buffer)
-			break;
+	if (buffer_size + sizeof (rxh) < i_size_read(xinode)) {
+		struct iattr newattrs = {
+			.ia_size = buffer_size + sizeof (rxh),
+			.ia_valid = ATTR_SIZE | ATTR_CTIME,
+		};
+
+		mutex_lock(&xinode->i_mutex);
+		err = notify_change(fp->f_dentry, &newattrs);
+		mutex_unlock(&xinode->i_mutex);
+		goto out_filp;
 	}
 
+	err = 0;
+
 	/* We can't mark the inode dirty if it's not hashed. This is the case
 	 * when we're inheriting the default ACL. If we dirty it, the inode
 	 * gets marked dirty, but won't (ever) make it onto the dirty list until
@@ -590,7 +575,7 @@ reiserfs_xattr_set(struct inode *inode, 
 	}
 
       out_filp:
-	mutex_unlock(&xinode->i_mutex);
+	mapping_set_gfp_mask(xinode->i_mapping, gfp_mask);
 	fput(fp);
 
       out:
@@ -607,11 +592,9 @@ reiserfs_xattr_get(const struct inode *i
 	ssize_t err = 0;
 	struct file *fp;
 	size_t isize;
-	size_t file_pos = 0;
-	size_t buffer_pos = 0;
-	struct page *page;
 	struct inode *xinode;
-	__u32 hash = 0;
+	loff_t pos = 0;
+	struct reiserfs_xattr_header rxh;
 
 	if (name == NULL)
 		return -EINVAL;
@@ -628,66 +611,39 @@ reiserfs_xattr_get(const struct inode *i
 	}
 
 	xinode = fp->f_dentry->d_inode;
-	isize = xinode->i_size;
+	isize = i_size_read(xinode) - sizeof (rxh);
 	REISERFS_I(inode)->i_flags |= i_has_xattr_dir;
 
 	/* Just return the size needed */
 	if (buffer == NULL) {
-		err = isize - sizeof(struct reiserfs_xattr_header);
+		err = isize;
 		goto out_dput;
 	}
 
-	if (buffer_size < isize - sizeof(struct reiserfs_xattr_header)) {
+	if (buffer_size < isize) {
 		err = -ERANGE;
 		goto out_dput;
 	}
 
-	while (file_pos < isize) {
-		size_t chunk;
-		char *data;
-		size_t skip = 0;
-		if (isize - file_pos > PAGE_CACHE_SIZE)
-			chunk = PAGE_CACHE_SIZE;
-		else
-			chunk = isize - file_pos;
-
-		page = reiserfs_get_page(xinode, file_pos >> PAGE_CACHE_SHIFT);
-		if (IS_ERR(page)) {
-			err = PTR_ERR(page);
-			goto out_dput;
-		}
+	buffer_size = isize;
 
-		lock_page(page);
-		data = page_address(page);
-		if (file_pos == 0) {
-			struct reiserfs_xattr_header *rxh =
-			    (struct reiserfs_xattr_header *)data;
-			skip = file_pos = sizeof(struct reiserfs_xattr_header);
-			chunk -= skip;
-			/* Magic doesn't match up.. */
-			if (rxh->h_magic != cpu_to_le32(REISERFS_XATTR_MAGIC)) {
-				unlock_page(page);
-				reiserfs_put_page(page);
-				reiserfs_warning(inode->i_sb,
-						 "Invalid magic for xattr (%s) "
-						 "associated with %k", name,
-						 INODE_PKEY(inode));
-				err = -EIO;
-				goto out_dput;
-			}
-			hash = le32_to_cpu(rxh->h_hash);
-		}
-		memcpy(buffer + buffer_pos, data + skip, chunk);
-		unlock_page(page);
-		reiserfs_put_page(page);
-		file_pos += chunk;
-		buffer_pos += chunk;
-		skip = 0;
+	err = xattr_read(fp, (char *)&rxh, sizeof (rxh), &pos);
+
+	if (err != sizeof (rxh)) {
+		if (err > 0)
+			err = -EIO;
+		goto out_dput;
+	}
+
+	err = xattr_read(fp, buffer, buffer_size, &pos);
+
+	if (err != buffer_size) {
+		if (err > 0)
+			err = -EIO;
+		goto out_dput;
 	}
-	err = isize - sizeof(struct reiserfs_xattr_header);
 
-	if (xattr_hash(buffer, isize - sizeof(struct reiserfs_xattr_header)) !=
-	    hash) {
+	if (xattr_hash(buffer, buffer_size) != le32_to_cpu(rxh.h_hash)) {
 		reiserfs_warning(inode->i_sb,
 				 "Invalid hash for xattr (%s) associated "
 				 "with %k", name, INODE_PKEY(inode));

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-03 22:17           ` Jeff Mahoney
@ 2006-03-06 11:59             ` Jan Kara
  2006-03-07 21:39               ` Jeff Mahoney
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2006-03-06 11:59 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: ReiserFS List, E. Gryaznova

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Jan Kara wrote:
> >> The data journaling mode can be set as a flag associated with the inode.
> >>  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
> >> i_data_ordered in one of my later patches. They can be tested easily
> >> with reiserfs_file_data_{log,ordered}. There's no reason that one
> >> couldn't be moved up and made a prerequisite for the first patch.
> >   Fine. So we can just set proper journaling flags in reiserfs_quota_on
> > and then honor them in the internal writing functions.
> 
> Ok, how do the attached patches look to you? The internal I/O changes
> need to be applied after the journaled xattr patch or we get an Oops
> trying to start a transaction without calling reiserfs_write_lock()
> first. I've modified the first patch in the xattr series to abstract out
> the fp->f_op->{read,write} calls to an xattr_{read,write} pair of
> functions. This makes it easier to move to the internal i/o code later.
> I've included it for clarity, but that is the only change.
  The patch looks fine. Just two minor comments:

<snip>
> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c linux-2.6.15-staging2/fs/reiserfs/super.c
> --- linux-2.6.15-staging1/fs/reiserfs/super.c	2006-03-03 17:09:01.000000000 -0500
> +++ linux-2.6.15-staging2/fs/reiserfs/super.c	2006-03-03 17:09:04.000000000 -0500
> @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_
>  	return 0;
>  }
>  
> +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR)
> +/* Read data from quotafile - avoid pagecache and such because we cannot afford
> + * acquiring the locks... As quota files are never truncated and quota code
> + * itself serializes the operations (and noone else should touch the files)
> + * we don't have to be afraid of races */
 Update here the comment to reflect that we use this function also for
xattrs now - I suppose those files cannot be truncated either and that
xattr code serializes the operations there.

> +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len,
> +                               loff_t off)
  <snip>

> +/* Write to quotafile (we know the transaction is already started and has
> + * enough credits) */
  Here again update the comment...

> +ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
> +                                size_t len, loff_t off)

									Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-06 11:59             ` Jan Kara
@ 2006-03-07 21:39               ` Jeff Mahoney
  2006-03-08 18:20                 ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Mahoney @ 2006-03-07 21:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: ReiserFS List, E. Gryaznova

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jan Kara wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Jan Kara wrote:
>>>> The data journaling mode can be set as a flag associated with the inode.
>>>>  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
>>>> i_data_ordered in one of my later patches. They can be tested easily
>>>> with reiserfs_file_data_{log,ordered}. There's no reason that one
>>>> couldn't be moved up and made a prerequisite for the first patch.
>>>   Fine. So we can just set proper journaling flags in reiserfs_quota_on
>>> and then honor them in the internal writing functions.
>> Ok, how do the attached patches look to you? The internal I/O changes
>> need to be applied after the journaled xattr patch or we get an Oops
>> trying to start a transaction without calling reiserfs_write_lock()
>> first. I've modified the first patch in the xattr series to abstract out
>> the fp->f_op->{read,write} calls to an xattr_{read,write} pair of
>> functions. This makes it easier to move to the internal i/o code later.
>> I've included it for clarity, but that is the only change.
>   The patch looks fine. Just two minor comments:
> 
> <snip>
>> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c linux-2.6.15-staging2/fs/reiserfs/super.c
>> --- linux-2.6.15-staging1/fs/reiserfs/super.c	2006-03-03 17:09:01.000000000 -0500
>> +++ linux-2.6.15-staging2/fs/reiserfs/super.c	2006-03-03 17:09:04.000000000 -0500
>> @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_
>>  	return 0;
>>  }
>>  
>> +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR)
>> +/* Read data from quotafile - avoid pagecache and such because we cannot afford
>> + * acquiring the locks... As quota files are never truncated and quota code
>> + * itself serializes the operations (and noone else should touch the files)
>> + * we don't have to be afraid of races */
>  Update here the comment to reflect that we use this function also for
> xattrs now - I suppose those files cannot be truncated either and that
> xattr code serializes the operations there.
> 
>> +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len,
>> +                               loff_t off)
>   <snip>
> 
>> +/* Write to quotafile (we know the transaction is already started and has
>> + * enough credits) */
>   Here again update the comment...
> 
>> +ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
>> +                                size_t len, loff_t off)
> 
> 									Honza
> 

I've updated the patches with the comment changes, though I did run into
a much bigger snag.

The internal i/o patches don't support tails, and that's a silver bullet
against this working for xattrs. Most xattrs, such as ACLs, are likley
to be only a few tens of bytes long and allocating an entire block is
extremely wasteful.

I've managed to alter internal read to handle tails by allocating an
anonymous page and using it with the temporary buffer head to get the
tail data from reiserfs_get_block back. But the rest of the tail packing
code very much needs the page cache. Is there going to be any way this
can be managed without reintroducing deadlocks?

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEDf16LPWxlyuTD7IRApj7AJ4jymXed/tOb5tpVar1SZ6ZnrYl0ACfUuxk
eC6kOQW5zNuUIaoLrV0gIf8=
=u6L1
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-07 21:39               ` Jeff Mahoney
@ 2006-03-08 18:20                 ` Jan Kara
  2006-03-08 19:12                   ` Jeff Mahoney
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2006-03-08 18:20 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: ReiserFS List, E. Gryaznova

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Jan Kara wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> Jan Kara wrote:
> >>>> The data journaling mode can be set as a flag associated with the inode.
> >>>>  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
> >>>> i_data_ordered in one of my later patches. They can be tested easily
> >>>> with reiserfs_file_data_{log,ordered}. There's no reason that one
> >>>> couldn't be moved up and made a prerequisite for the first patch.
> >>>   Fine. So we can just set proper journaling flags in reiserfs_quota_on
> >>> and then honor them in the internal writing functions.
> >> Ok, how do the attached patches look to you? The internal I/O changes
> >> need to be applied after the journaled xattr patch or we get an Oops
> >> trying to start a transaction without calling reiserfs_write_lock()
> >> first. I've modified the first patch in the xattr series to abstract out
> >> the fp->f_op->{read,write} calls to an xattr_{read,write} pair of
> >> functions. This makes it easier to move to the internal i/o code later.
> >> I've included it for clarity, but that is the only change.
> >   The patch looks fine. Just two minor comments:
> > 
> > <snip>
> >> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c linux-2.6.15-staging2/fs/reiserfs/super.c
> >> --- linux-2.6.15-staging1/fs/reiserfs/super.c	2006-03-03 17:09:01.000000000 -0500
> >> +++ linux-2.6.15-staging2/fs/reiserfs/super.c	2006-03-03 17:09:04.000000000 -0500
> >> @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_
> >>  	return 0;
> >>  }
> >>  
> >> +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR)
> >> +/* Read data from quotafile - avoid pagecache and such because we cannot afford
> >> + * acquiring the locks... As quota files are never truncated and quota code
> >> + * itself serializes the operations (and noone else should touch the files)
> >> + * we don't have to be afraid of races */
> >  Update here the comment to reflect that we use this function also for
> > xattrs now - I suppose those files cannot be truncated either and that
> > xattr code serializes the operations there.
> > 
> >> +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len,
> >> +                               loff_t off)
> >   <snip>
> > 
> >> +/* Write to quotafile (we know the transaction is already started and has
> >> + * enough credits) */
> >   Here again update the comment...
> > 
> >> +ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
> >> +                                size_t len, loff_t off)
> > 
> > 									Honza
> > 
> 
> I've updated the patches with the comment changes, though I did run into
> a much bigger snag.
> 
> The internal i/o patches don't support tails, and that's a silver bullet
> against this working for xattrs. Most xattrs, such as ACLs, are likley
> to be only a few tens of bytes long and allocating an entire block is
> extremely wasteful.
  Umm, that is really nasty. Ext3 solves this by sharing a block among
several inodes but that's far to much work to fix this bug...

> I've managed to alter internal read to handle tails by allocating an
> anonymous page and using it with the temporary buffer head to get the
> tail data from reiserfs_get_block back. But the rest of the tail packing
> code very much needs the page cache. Is there going to be any way this
> can be managed without reintroducing deadlocks?
  I've been trying to find some other way when solving problems for
quotas but find none. If you want xattr changes to be journaled with
other data changes, you have to first start a transaction and then issue
a write that will consequently need PageLock. So do you really need
a trasaction started before a write starts? For journaled quota this was
must but for xattrs it might not be necessary. Then we would still need
to sort out the problems with xattr lock but that might be easier to
deal with.

							Bye
								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-08 18:20                 ` Jan Kara
@ 2006-03-08 19:12                   ` Jeff Mahoney
  2006-03-08 23:14                     ` Andreas Dilger
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Mahoney @ 2006-03-08 19:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: ReiserFS List, E. Gryaznova

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jan Kara wrote:
>> The internal i/o patches don't support tails, and that's a silver bullet
>> against this working for xattrs. Most xattrs, such as ACLs, are likley
>> to be only a few tens of bytes long and allocating an entire block is
>> extremely wasteful.
>   Umm, that is really nasty. Ext3 solves this by sharing a block among
> several inodes but that's far to much work to fix this bug...

I had considered sharing files, and the code knows to drop a link to a
shared file when it's changed. That's one of the features I had wanted
from the beginning but never got around to implementing.

>> I've managed to alter internal read to handle tails by allocating an
>> anonymous page and using it with the temporary buffer head to get the
>> tail data from reiserfs_get_block back. But the rest of the tail packing
>> code very much needs the page cache. Is there going to be any way this
>> can be managed without reintroducing deadlocks?
>   I've been trying to find some other way when solving problems for
> quotas but find none. If you want xattr changes to be journaled with
> other data changes, you have to first start a transaction and then issue
> a write that will consequently need PageLock. So do you really need
> a trasaction started before a write starts? For journaled quota this was
> must but for xattrs it might not be necessary. Then we would still need
> to sort out the problems with xattr lock but that might be easier to
> deal with.

The locking constraints for xattrs have changed somewhat. Have you
looked at the rest of the xattr patches, or am I being dumb and missing
your point?

Do the quota races occur when they are completely journaled or only when
ordered writes are used? I'm wondering if perhaps we could alter
reiserfs_file_write a bit to not mark the buffers dirty yet for internal
files, and have the ordered writeout do it then. Then, we wouldn't race
against pdflush.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEDyyALPWxlyuTD7IRAiCuAJ4o8/SpIk8VBAl8/98kppRB9o7VcACfSGnM
pTqsa6fCMSDNBZoketOyx0I=
=ZKxx
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-08 19:12                   ` Jeff Mahoney
@ 2006-03-08 23:14                     ` Andreas Dilger
  2006-03-09 18:26                       ` Hans Reiser
  0 siblings, 1 reply; 15+ messages in thread
From: Andreas Dilger @ 2006-03-08 23:14 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: Jan Kara, ReiserFS List, E. Gryaznova

On Mar 08, 2006  14:12 -0500, Jeff Mahoney wrote:
> Jan Kara wrote:
> >> The internal i/o patches don't support tails, and that's a silver bullet
> >> against this working for xattrs. Most xattrs, such as ACLs, are likley
> >> to be only a few tens of bytes long and allocating an entire block is
> >> extremely wasteful.
> >
> >   Umm, that is really nasty. Ext3 solves this by sharing a block among
> > several inodes but that's far to much work to fix this bug...
> 
> I had considered sharing files, and the code knows to drop a link to a
> shared file when it's changed. That's one of the features I had wanted
> from the beginning but never got around to implementing.

Just FYI, ext3 has recently implemented support for larger inodes exactly
to store small EAs with the inode instead of an external block.  For ACLs
there is some benefit to sharing the block, because the overhead is
amortized over many inodes.  However, virtually all other EA data is
unique per inode and the ext3 EA block sharing only works if ALL the EAs
for an inode are identical, so that isn't very useful if you have anything
other than ACLs to store.

The performance of in-inode EAs is vastly better than external blocks
because of seeks and not wasting 4kB of disk/RAM for 10-100 bytes of EA.

Not sure if this is useful (haven't been following discussion too closely),
but thought I would steer you away from the shared-block idea early.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-08 23:14                     ` Andreas Dilger
@ 2006-03-09 18:26                       ` Hans Reiser
  2006-03-09 19:11                         ` Jeff Mahoney
  0 siblings, 1 reply; 15+ messages in thread
From: Hans Reiser @ 2006-03-09 18:26 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Jeff Mahoney, Jan Kara, ReiserFS List, E. Gryaznova

All of this is why xattrs should not be implemented without first
implementing plugins.:-/

Really guys, there was a reason for my not wanting xattrs to go into
V3.  Do you see it now?  This is what happens when marketing determines
feature ship schedules.  You could implement this xattr stuff in V4 in
1/5th the time, 5 times the performance, and twice the elegance.   Of
course, implementing them as pseudo files would be far more elegant
still....

Hans

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-09 18:26                       ` Hans Reiser
@ 2006-03-09 19:11                         ` Jeff Mahoney
  2006-03-09 19:52                           ` Hans Reiser
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Mahoney @ 2006-03-09 19:11 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Andreas Dilger, Jan Kara, ReiserFS List, E. Gryaznova

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hans Reiser wrote:
> All of this is why xattrs should not be implemented without first
> implementing plugins.:-/
> 
> Really guys, there was a reason for my not wanting xattrs to go into
> V3.  Do you see it now?  This is what happens when marketing determines
> feature ship schedules.  You could implement this xattr stuff in V4 in
> 1/5th the time, 5 times the performance, and twice the elegance.   Of
> course, implementing them as pseudo files would be far more elegant
> still....

And as I've said before, Hans, if the original code base was capable of
supporting the plethora of items the white paper hyped, we wouldn't have
run into this problem either.

I would have loved to have implemented xattrs as another item type, but
as soon as I did that, the kernel crashed almost instantly on not
recognizing the new item type in the balance code. While it was
certainly fixable in that version, properly fixing it would have
required a ReiserFS 3.7 with capability bits similar to ext[23]. Looking
back, maybe that wouldn't have been such a bad thing.

As for waiting for v4, we've been through this before. Users wanted ACLs
on ReiserFS yesterday, and I'd hardly brush aside features that users
have been demanding as marketing. There's no denying that a
reiser4-based solution would have been cleaner, but sometimes we just
have to make do with what we've got.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEEH3lLPWxlyuTD7IRAgz6AJ4l4/f92LJAKs65OqAbl8cIIL1PRACdE5Yb
MmQcruV2Nmd2l3RS+V0TYrI=
=ujKr
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] reiserfs: xattr rework
  2006-03-09 19:11                         ` Jeff Mahoney
@ 2006-03-09 19:52                           ` Hans Reiser
  0 siblings, 0 replies; 15+ messages in thread
From: Hans Reiser @ 2006-03-09 19:52 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: Andreas Dilger, Jan Kara, ReiserFS List, E. Gryaznova

Jeff Mahoney wrote:

>
>
> And as I've said before, Hans, if the original code base was capable of
> supporting the plethora of items the white paper hyped, we wouldn't have
> run into this problem either.

This is an odd criticism.  V3 never pretended to have item plugins.  V4
does.  If you add a new item plugin to V4, it will work quite well.

>
> I would have loved to have implemented xattrs as another item type, but
> as soon as I did that, the kernel crashed almost instantly on not
> recognizing the new item type in the balance code. While it was
> certainly fixable in that version, properly fixing it would have
> required a ReiserFS 3.7 with capability bits similar to ext[23]. Looking
> back, maybe that wouldn't have been such a bad thing.

All of the performance issues and stable code destabilization issues you
are experiencing now are what I expected, and what motivated my wanting
to defer acls for V4, and by so doing (and thus concentrating our
development resources on V4) advance the schedule for V4.

>
> As for waiting for v4, we've been through this before. Users wanted ACLs
> on ReiserFS yesterday, and I'd hardly brush aside features that users
> have been demanding as marketing.

There is a difference between brushing them aside, and choosing to put
them into the next major release because proper solution of them
requires deeper work than stuffing them into a file.  

Also, marketing is important.  The problem is when marketing motivates
wishful thinking about the shape of the code.  Architecture is like
sculpting, you must sense what the wood or marble wants to be shaped
into as clearly as you see the vision in your head of what you want it
to be.   When it won't support the shape you want, then in sculpting you
need another piece of wood, and in programming you need another major
release.

That said, I understand you are seeking to provide the users with what
they want, and that while we disagree on the best strategy for that, we
both try to do the best we can for the users.

> There's no denying that a
> reiser4-based solution would have been cleaner, but sometimes we just
> have to make do with what we've got.
>
> -Jeff
>
> --
> Jeff Mahoney
> SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2006-03-09 19:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-02-20 20:14 [PATCH 00/11] reiserfs: xattr rework Jeff Mahoney
2006-03-01 12:34 ` Jan Kara
2006-03-02 21:53   ` Jeff Mahoney
2006-03-03  0:29     ` Jan Kara
2006-03-03  0:56       ` Jeff Mahoney
2006-03-03 10:04         ` Jan Kara
2006-03-03 22:17           ` Jeff Mahoney
2006-03-06 11:59             ` Jan Kara
2006-03-07 21:39               ` Jeff Mahoney
2006-03-08 18:20                 ` Jan Kara
2006-03-08 19:12                   ` Jeff Mahoney
2006-03-08 23:14                     ` Andreas Dilger
2006-03-09 18:26                       ` Hans Reiser
2006-03-09 19:11                         ` Jeff Mahoney
2006-03-09 19:52                           ` Hans Reiser

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.