linux-fsdevel.vger.kernel.org archive mirror
* hfsplus BUG(), kmap and journalling.
@ 2012-10-18 16:55 Hin-Tak Leung
  2012-10-19 12:45 ` Vyacheslav Dubeyko
  0 siblings, 1 reply; 7+ messages in thread
From: Hin-Tak Leung @ 2012-10-18 16:55 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Vyacheslav Dubeyko, Till Kamppeter, Naohiro Aota

Hi,

While looking at a few of the older BUG() traces I get consistently when running du on a somewhat large directory with lots of small files and small directories, I noticed that they tend to have two sleeping "? hfs_bnode_read()" entries towards the top. As it is a very small and simple function which just reads a b-tree node record - sometimes only a few bytes between a kmap()/kunmap() pair - I suspect the cause might just be the number of simultaneous kmap()s being run. So I put a mutex around it just to make sure only one copy of hfs_bnode_read() runs at a time.

This seems to make it much harder to get a BUG() - I needed to run du a few times, over and over, to get it again. Of course it might just be the mutex slowing the driver down and making it less likely to get confused, but as I have read that the number of simultaneous kmap()s in the kernel is limited, I think I might be on to something.

Also, this shifts the problem onto multiple copies of "? hfsplus_bmap()" (which also kmap()/kunmap()s, but is much more complicated).

I thought of adding hfsplus_kmap()/etc. (which seem to have existed a long time ago but were removed!), but this might cause deadlocks, since some of the hfsplus code is kmapping/kunmapping all the time, and recursively. So a better way might be just to make sure that only one instance of some of the routines runs at a time, i.e. multiple mutexes.

This is both ugly and sounds like voodoo, though. Also I am not sure why the existing mutexes, which protect some of the internal structures, don't protect against too many kmap()s (maybe they protect writes, but not against too many simultaneous reads).

So does anybody have an idea how many kmaps are allowed, and how to tell that I am close to my machine's limit?
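
For what it's worth, my own reading of the highmem code - to be confirmed by someone who knows it better - suggests the limit is small and fixed:

    /*
     * From arch/x86/include/asm/pgtable_32_types.h and mm/highmem.c,
     * as I read them - treat this as my understanding, not gospel:
     *
     *   #define LAST_PKMAP 1024    (512 with PAE)
     *
     * kmap() -> kmap_high() -> map_new_virtual() sleeps waiting for a
     * free pkmap slot once all LAST_PKMAP entries are taken, which
     * would match the sleeping hfs_bnode_read() entries in my traces.
     */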

Also, a side note on the Netgear journalling code: I see that it journals the volume header and some of the special files (the catalog, allocation bitmap, etc.), but (1) it has some code to journal the attributes file, which was actually non-functional, since without Vyacheslav's recent patches the Linux kernel doesn't even read/write that file correctly, let alone do *journalled* reads/writes correctly, and (2) there is a part which tries to do data-page journalling, but it seems to be wrong - or at least, not quite working. (This I found while looking at some curious warning messages and how they come about.)

Luckily that code just bails out when it gets confused - i.e. it does non-journalled writes rather than writing a wrong journal to disk. So it doesn't harm data under routine normal use (i.e. mounting/unmounting cleanly).

But that got me worrying a bit about inter-operability: it is probably unsafe to use Linux to replay the journal written by Mac OS X, and vice versa. I.e. if you have a dual-boot machine, or a portable disk that you use between two OSes, and it disconnects/unplugs/crashes under one OS, it is better to plug it right back in and let the same OS replay the journal, then unmount cleanly, before using it under the other OS.

I'll be interested in hearing any tips on finding out kmap's limit at run time, if anybody has any ideas...

Hin-Tak


* Re: hfsplus BUG(), kmap and journalling.
  2012-10-18 16:55 hfsplus BUG(), kmap and journalling Hin-Tak Leung
@ 2012-10-19 12:45 ` Vyacheslav Dubeyko
  2012-10-20  6:24   ` Hin-Tak Leung
  0 siblings, 1 reply; 7+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-19 12:45 UTC (permalink / raw)
  To: htl10; +Cc: linux-fsdevel, Till Kamppeter, Naohiro Aota

Hi Hin-Tak,

On Thu, 2012-10-18 at 17:55 +0100, Hin-Tak Leung wrote:
> Hi,
>
> While looking at a few of the older BUG() traces I get consistently
> when running du on a somewhat large directory with lots of small files
> and small directories, I noticed that they tend to have two sleeping
> "? hfs_bnode_read()" entries towards the top. As it is a very small
> and simple function which just reads a b-tree node record - sometimes
> only a few bytes between a kmap()/kunmap() pair - I suspect the cause
> might just be the number of simultaneous kmap()s being run. So I put a
> mutex around it just to make sure only one copy of hfs_bnode_read()
> runs at a time.

Yes, you have touched on a very important problem. The hfsplus driver
needs to be reworked away from kmap()/kunmap(), because kmap() is slow,
theoretically deadlock-prone, and deprecated. The alternative is
kmap_atomic(), but that requires diving more deeply into every use of
kmap() in the hfsplus driver.
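
For hfs_bnode_read() itself the conversion looks straightforward - a
rough, untested sketch of what I mean (the function only does memcpy()
between map and unmap, and memcpy() does not sleep, so an atomic mapping
should be legal there; note that kunmap_atomic() takes the mapped
address, not the page):

    /* Untested sketch: hfs_bnode_read() from fs/hfsplus/bnode.c converted
     * from kmap()/kunmap() to kmap_atomic()/kunmap_atomic(). */
    void hfs_bnode_read(struct hfs_bnode *node, void *buf, int off, int len)
    {
            struct page **pagep;
            void *kaddr;
            int l;

            off += node->page_offset;
            pagep = node->page + (off >> PAGE_CACHE_SHIFT);
            off &= ~PAGE_CACHE_MASK;

            l = min(len, (int)PAGE_CACHE_SIZE - off);
            kaddr = kmap_atomic(*pagep);
            memcpy(buf, kaddr + off, l);
            kunmap_atomic(kaddr);           /* the address, not the page */

            while ((len -= l) != 0) {
                    buf += l;
                    l = min(len, (int)PAGE_CACHE_SIZE);
                    kaddr = kmap_atomic(*++pagep);
                    memcpy(buf, kaddr, l);
                    kunmap_atomic(kaddr);
            }
    }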

The mutex is useless. It simply hides the issue.

> This seems to make it much harder to get a BUG() - I needed to run du
> a few times, over and over, to get it again. Of course it might just
> be the mutex slowing the driver down and making it less likely to get
> confused, but as I have read that the number of simultaneous kmap()s
> in the kernel is limited, I think I might be on to something.
> Also, this shifts the problem onto multiple copies of
> "? hfsplus_bmap()" (which also kmap()/kunmap()s, but is much more
> complicated).

Again: the mutex merely hides the issue.

> I thought of adding hfsplus_kmap()/etc. (which seem to have existed a
> long time ago but were removed!), but this might cause deadlocks,
> since some of the hfsplus code is kmapping/kunmapping all the time,
> and recursively. So a better way might be just to make sure that only
> one instance of some of the routines runs at a time, i.e. multiple
> mutexes.
> This is both ugly and sounds like voodoo, though. Also I am not sure
> why the existing mutexes, which protect some of the internal
> structures, don't protect against too many kmap()s (maybe they protect
> writes, but not against too many simultaneous reads).
> So does anybody have an idea how many kmaps are allowed, and how to
> tell that I am close to my machine's limit?

As I understand it, hfsplus_kmap() didn't do anything useful. The
kmap()/kunmap() usage really needs to be reworked, rather than wrapped
in mutexes.

Could you try to fix this issue? :-)

> Also, a side note on the Netgear journalling code: I see that it
> journals the volume header and some of the special files (the catalog,
> allocation bitmap, etc.), but (1) it has some code to journal the
> attributes file, which was actually non-functional, since without
> Vyacheslav's recent patches the Linux kernel doesn't even read/write
> that file correctly, let alone do *journalled* reads/writes correctly,
> and (2) there is a part which tries to do data-page journalling, but
> it seems to be wrong - or at least, not quite working. (This I found
> while looking at some curious warning messages and how they come
> about.)
> Luckily that code just bails out when it gets confused - i.e. it does
> non-journalled writes rather than writing a wrong journal to disk. So
> it doesn't harm data under routine normal use (i.e.
> mounting/unmounting cleanly).
> But that got me worrying a bit about inter-operability: it is probably
> unsafe to use Linux to replay the journal written by Mac OS X, and
> vice versa. I.e. if you have a dual-boot machine, or a portable disk
> that you use between two OSes, and it disconnects/unplugs/crashes
> under one OS, it is better to plug it right back in and let the same
> OS replay the journal, then unmount cleanly, before using it under the
> other OS.

The journal should be replayed during every mount if valid transactions
are present. An HFS+ volume shouldn't be mounted without replaying the
journal; otherwise you can end up with a corrupted partition. Just
imagine: you mount an HFS+ partition with a non-empty journal and then
add some data to the volume, which means you modify metadata. If you
then mount that HFS+ volume under Mac OS X, the journal will be
replayed and the metadata will be corrupted.

With the best regards,
Vyacheslav Dubeyko.

> I'll be interested in hearing any tips on finding out kmap's limit at
> run time, if anybody has any ideas...
> 
> Hin-Tak




* Re: hfsplus BUG(), kmap and journalling.
  2012-10-19 12:45 ` Vyacheslav Dubeyko
@ 2012-10-20  6:24   ` Hin-Tak Leung
  2012-10-22  9:02     ` Vyacheslav Dubeyko
  0 siblings, 1 reply; 7+ messages in thread
From: Hin-Tak Leung @ 2012-10-20  6:24 UTC (permalink / raw)
  To: Vyacheslav Dubeyko; +Cc: linux-fsdevel, Till Kamppeter, Naohiro Aota

--- On Fri, 19/10/12, Vyacheslav Dubeyko <slava@dubeyko.com> wrote:

> Hi Hin-Tak,
> 
> On Thu, 2012-10-18 at 17:55 +0100, Hin-Tak Leung wrote:
> > Hi,
> >
> > While looking at a few of the older BUG() traces I get consistently
> > when running du on a somewhat large directory with lots of small
> > files and small directories, I noticed that they tend to have two
> > sleeping "? hfs_bnode_read()" entries towards the top. As it is a
> > very small and simple function which just reads a b-tree node record
> > - sometimes only a few bytes between a kmap()/kunmap() pair - I
> > suspect the cause might just be the number of simultaneous kmap()s
> > being run. So I put a mutex around it just to make sure only one
> > copy of hfs_bnode_read() runs at a time.
> 
> Yes, you have touched on a very important problem. The hfsplus driver
> needs to be reworked away from kmap()/kunmap(), because kmap() is
> slow, theoretically deadlock-prone, and deprecated. The alternative is
> kmap_atomic(), but that requires diving more deeply into every use of
> kmap() in the hfsplus driver.
> 
> The mutex is useless. It simply hides the issue.

Yes, I am aware of that - putting mutexes in just means fewer simultaneous kmap() calls, but the limit of simultaneous kmap()s can still be reached - and reasonably easily: just run 'du' a few more times, as I wrote below.

I tried swapping those for kmap_atomic()/kunmap_atomic() (beware: the arguments are different for the unmap) - but the kernel immediately warned that *_atomic() code was being used where code can sleep.
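
To spell out the argument difference:

    void *kaddr;

    kaddr = kmap(page);
    /* ... */
    kunmap(page);          /* kunmap() takes the struct page * again */

    kaddr = kmap_atomic(page);
    /* ... */
    kunmap_atomic(kaddr);  /* kunmap_atomic() takes the mapped address */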

> > This seems to make it much harder to get a BUG() - I needed to run
> > du a few times, over and over, to get it again. Of course it might
> > just be the mutex slowing the driver down and making it less likely
> > to get confused, but as I have read that the number of simultaneous
> > kmap()s in the kernel is limited, I think I might be on to
> > something.
> > Also, this shifts the problem onto multiple copies of
> > "? hfsplus_bmap()" (which also kmap()/kunmap()s, but is much more
> > complicated).
> 
> Again: the mutex merely hides the issue.
> 
> > I thought of adding hfsplus_kmap()/etc. (which seem to have existed
> > a long time ago but were removed!), but this might cause deadlocks,
> > since some of the hfsplus code is kmapping/kunmapping all the time,
> > and recursively. So a better way might be just to make sure that
> > only one instance of some of the routines runs at a time, i.e.
> > multiple mutexes.
> > This is both ugly and sounds like voodoo, though. Also I am not
> > sure why the existing mutexes, which protect some of the internal
> > structures, don't protect against too many kmap()s (maybe they
> > protect writes, but not against too many simultaneous reads).
> > So does anybody have an idea how many kmaps are allowed, and how to
> > tell that I am close to my machine's limit?
> 
> As I understand it, hfsplus_kmap() didn't do anything useful. The
> kmap()/kunmap() usage really needs to be reworked, rather than wrapped
> in mutexes.
> 
> Could you try to fix this issue? :-)

Am *trying* :-). Hence this request for discussion & help. I do think the hfsplus driver is kmapping/kunmapping too often - and doing so for very small pieces of data, which hardly justifies mapping a whole page each time. I think one possible improvement is to reorganize the internal representation of the b-tree - translating it into a more page-filling structure, if that makes sense - rather than mapping/unmapping pages all the time to read very small pieces off each page.

I still cannot quite get my head around how (1) essentially read-only operations can get worse and worse if you run them a few more times, and (2) it seems that it is just the kernel's internal representation of the filesystem that gets more and more confused - there does not seem to be any write on unmount, and if you unmount and run fsck it finds nothing to do, and you can re-mount and play with 'du' again.

> > Also, a side note on the Netgear journalling code: I see that it
> > journals the volume header and some of the special files (the
> > catalog, allocation bitmap, etc.), but (1) it has some code to
> > journal the attributes file, which was actually non-functional,
> > since without Vyacheslav's recent patches the Linux kernel doesn't
> > even read/write that file correctly, let alone do *journalled*
> > reads/writes correctly, and (2) there is a part which tries to do
> > data-page journalling, but it seems to be wrong - or at least, not
> > quite working. (This I found while looking at some curious warning
> > messages and how they come about.)
> > Luckily that code just bails out when it gets confused - i.e. it
> > does non-journalled writes rather than writing a wrong journal to
> > disk. So it doesn't harm data under routine normal use (i.e.
> > mounting/unmounting cleanly).
> > But that got me worrying a bit about inter-operability: it is
> > probably unsafe to use Linux to replay the journal written by Mac
> > OS X, and vice versa. I.e. if you have a dual-boot machine, or a
> > portable disk that you use between two OSes, and it
> > disconnects/unplugs/crashes under one OS, it is better to plug it
> > right back in and let the same OS replay the journal, then unmount
> > cleanly, before using it under the other OS.
> 
> The journal should be replayed during every mount if valid
> transactions are present. An HFS+ volume shouldn't be mounted without
> replaying the journal; otherwise you can end up with a corrupted
> partition. Just imagine: you mount an HFS+ partition with a non-empty
> journal and then add some data to the volume, which means you modify
> metadata. If you then mount that HFS+ volume under Mac OS X, the
> journal will be replayed and the metadata will be corrupted.

Both OSes try to replay on first mount - but I doubt that they create/use the journal the same way, so inter-operability is not guaranteed - i.e. it is not recommended to reboot into Mac OS X from an unclean shutdown of Linux, or vice versa. Although the Netgear code seems to be self-consistent - i.e. it replays journals created by itself okay, I should hope.

What is clearly a problem is that the Netgear code bails out too often and does *not* write a journal entry (and therefore does not close the transaction after the data write) for some data writes - it basically just writes data without an accompanying journal entry and its finishing transaction.

Hin-Tak

> With the best regards,
> Vyacheslav Dubeyko.
> 
> > I'll be interested on hearing any tips on finding out
> kmap's limit at
> > run time, if anybody has any idea...
> > 
> > Hin-Tak
> 
> 
> 


* Re: hfsplus BUG(), kmap and journalling.
  2012-10-20  6:24   ` Hin-Tak Leung
@ 2012-10-22  9:02     ` Vyacheslav Dubeyko
  2012-10-30  8:24       ` Hin-Tak Leung
  0 siblings, 1 reply; 7+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-22  9:02 UTC (permalink / raw)
  To: htl10; +Cc: linux-fsdevel, Till Kamppeter, Naohiro Aota

On Sat, 2012-10-20 at 07:24 +0100, Hin-Tak Leung wrote:
> --- On Fri, 19/10/12, Vyacheslav Dubeyko <slava@dubeyko.com> wrote:
> 
> > Hi Hin-Tak,
> > 
> > On Thu, 2012-10-18 at 17:55 +0100, Hin-Tak Leung wrote:
> > > Hi,
> > >
> > > While looking at a few of the older BUG() traces I get
> > > consistently when running du on a somewhat large directory with
> > > lots of small files and small directories, I noticed that they
> > > tend to have two sleeping "? hfs_bnode_read()" entries towards
> > > the top. As it is a very small and simple function which just
> > > reads a b-tree node record - sometimes only a few bytes between a
> > > kmap()/kunmap() pair - I suspect the cause might just be the
> > > number of simultaneous kmap()s being run. So I put a mutex around
> > > it just to make sure only one copy of hfs_bnode_read() runs at a
> > > time.
> > 
> > Yes, you have touched on a very important problem. The hfsplus
> > driver needs to be reworked away from kmap()/kunmap(), because
> > kmap() is slow, theoretically deadlock-prone, and deprecated. The
> > alternative is kmap_atomic(), but that requires diving more deeply
> > into every use of kmap() in the hfsplus driver.
> > 
> > The mutex is useless. It simply hides the issue.
> 
> Yes, I am aware of that - putting mutexes in just means fewer
> simultaneous kmap() calls, but the limit of simultaneous kmap()s can
> still be reached - and reasonably easily: just run 'du' a few more
> times, as I wrote below.
> 

Usually, hfs_bnode_read() is called after searching for some object in
the b-tree. Struct hfs_find_data has to be initialized by means of
hfs_find_init() before any search or operation inside a b-tree node,
and hfs_find_exit() has to be called afterwards. hfs_find_init() takes
a mutex that locks the whole b-tree, and hfs_find_exit() releases it.
And, usually, hfs_bnode_read() is placed between hfs_find_init()/
hfs_find_exit() calls. So, as I understand it, your mutex inside
hfs_bnode_read() is useless. But maybe not all hfs_bnode_read() calls
are placed inside hfs_find_init()/hfs_find_exit() pairs; that needs
checking.
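
Schematically, a typical call site looks like this (a from-memory
sketch modelled on hfsplus_lookup(), not a copy of any real function;
the example_ name is mine):

    /* Schematic only; error handling trimmed.  The b-tree mutex taken
     * in hfs_find_init() serializes every user of the tree. */
    static int example_cat_lookup(struct super_block *sb, u32 parent_cnid,
                                  struct qstr *name, hfsplus_cat_entry *entry)
    {
            struct hfs_find_data fd;
            int err;

            err = hfs_find_init(HFSPLUS_SB(sb)->cat_tree, &fd); /* locks tree */
            if (err)
                    return err;
            hfsplus_cat_build_key(sb, fd.search_key, parent_cnid, name);
            err = hfs_brec_read(&fd, entry, sizeof(*entry));
                                            /* hfs_bnode_read() happens here */
            hfs_find_exit(&fd);             /* unlocks tree */
            return err;
    }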

I can't clearly understand which simultaneous kmap()s you are talking
about. As far as I can see, usually (especially in the case of
hfs_bnode_read()) the kmap()/kunmap() pairs are localized in small
parts of the code, and I expect them to execute quickly. Am I
misunderstanding something?


> I tried swapping those for kmap_atomic()/kunmap_atomic() (beware: the arguments are different for the unmap) - but the kernel immediately warned that *_atomic() code was being used where code can sleep.
> 

Could you describe the warning in more detail?

I thought that in hfs_bnode_read() the kmap()/kunmap() pairs are
localized in small parts of the code and could be changed to
kmap_atomic()/kunmap_atomic().

> > > This seems to make it much harder to get a BUG() - I needed to
> > > run du a few times, over and over, to get it again. Of course it
> > > might just be the mutex slowing the driver down and making it
> > > less likely to get confused, but as I have read that the number
> > > of simultaneous kmap()s in the kernel is limited, I think I might
> > > be on to something.
> > > Also, this shifts the problem onto multiple copies of
> > > "? hfsplus_bmap()" (which also kmap()/kunmap()s, but is much more
> > > complicated).
> > 
> > Again: the mutex merely hides the issue.
> > 
> > > I thought of adding hfsplus_kmap()/etc. (which seem to have
> > > existed a long time ago but were removed!), but this might cause
> > > deadlocks, since some of the hfsplus code is kmapping/kunmapping
> > > all the time, and recursively. So a better way might be just to
> > > make sure that only one instance of some of the routines runs at
> > > a time, i.e. multiple mutexes.
> > > This is both ugly and sounds like voodoo, though. Also I am not
> > > sure why the existing mutexes, which protect some of the internal
> > > structures, don't protect against too many kmap()s (maybe they
> > > protect writes, but not against too many simultaneous reads).
> > > So does anybody have an idea how many kmaps are allowed, and how
> > > to tell that I am close to my machine's limit?
> > 
> > As I understand it, hfsplus_kmap() didn't do anything useful. The
> > kmap()/kunmap() usage really needs to be reworked, rather than
> > wrapped in mutexes.
> > 
> > Could you try to fix this issue? :-)
> 
> Am *trying* :-). Hence this request for discussion & help. I do think the hfsplus driver is kmapping/kunmapping too often - and doing so for very small pieces of data, which hardly justifies mapping a whole page each time. I think one possible improvement is to reorganize the internal representation of the b-tree - translating it into a more page-filling structure, if that makes sense - rather than mapping/unmapping pages all the time to read very small pieces off each page.
> 

Sorry, I don't fully understand your description. Currently I think the
existing implementation of the b-tree functionality looks good, and I
can't see any real necessity to change it.

Could you describe what you are trying in more detail?

> I still cannot quite get my head around how (1) essentially read-only operations can get worse and worse if you run them a few more times, and (2) it seems that it is just the kernel's internal representation of the filesystem that gets more and more confused - there does not seem to be any write on unmount, and if you unmount and run fsck it finds nothing to do, and you can re-mount and play with 'du' again.
> 
> > > Also, a side note on the Netgear journalling code: I see that it
> > > journals the volume header and some of the special files (the
> > > catalog, allocation bitmap, etc.), but (1) it has some code to
> > > journal the attributes file, which was actually non-functional,
> > > since without Vyacheslav's recent patches the Linux kernel
> > > doesn't even read/write that file correctly, let alone do
> > > *journalled* reads/writes correctly, and (2) there is a part
> > > which tries to do data-page journalling, but it seems to be wrong
> > > - or at least, not quite working. (This I found while looking at
> > > some curious warning messages and how they come about.)
> > > Luckily that code just bails out when it gets confused - i.e. it
> > > does non-journalled writes rather than writing a wrong journal to
> > > disk. So it doesn't harm data under routine normal use (i.e.
> > > mounting/unmounting cleanly).
> > > But that got me worrying a bit about inter-operability: it is
> > > probably unsafe to use Linux to replay the journal written by Mac
> > > OS X, and vice versa. I.e. if you have a dual-boot machine, or a
> > > portable disk that you use between two OSes, and it
> > > disconnects/unplugs/crashes under one OS, it is better to plug it
> > > right back in and let the same OS replay the journal, then
> > > unmount cleanly, before using it under the other OS.
> > 
> > The journal should be replayed during every mount if valid
> > transactions are present. An HFS+ volume shouldn't be mounted
> > without replaying the journal; otherwise you can end up with a
> > corrupted partition. Just imagine: you mount an HFS+ partition with
> > a non-empty journal and then add some data to the volume, which
> > means you modify metadata. If you then mount that HFS+ volume under
> > Mac OS X, the journal will be replayed and the metadata will be
> > corrupted.
> 
> Both OSes try to replay on first mount - but I doubt that they
> create/use the journal the same way, so inter-operability is not
> guaranteed - i.e. it is not recommended to reboot into Mac OS X from
> an unclean shutdown of Linux, or vice versa. Although the Netgear
> code seems to be self-consistent - i.e. it replays journals created
> by itself okay, I should hope.
> 
> What is clearly a problem is that the Netgear code bails out too
> often and does *not* write a journal entry (and therefore does not
> close the transaction after the data write) for some data writes - it
> basically just writes data without an accompanying journal entry and
> its finishing transaction.
> 

I think the situation is simple. :-) We can claim HFS+ journaling support under Linux once the Linux code can correctly replay the journal (filled with transactions under Mac OS X as well as under Linux) and can operate on a journalled HFS+ partition without any complaints from fsck, while the volume remains mountable under Mac OS X.

As I understand it (maybe I am wrong), the current Linux implementation of HFS+ journaling support only tries to replay the journal during HFS+ volume mount, and after that it doesn't work with the journal at all. It is possible to work with a journalled HFS+ volume that way - the journal after unmount under Linux can be empty, and Mac OS X can mount such a volume without any trouble. But proper HFS+ journaling support should keep working with the journal while the volume is mounted under Linux.
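
Schematically, "working with the journal" means wrapping every metadata
update in a transaction, the way Apple's journal layer does. The shape
below is only illustrative - the names come from my reading of xnu's
bsd/vfs/vfs_journal.c, and nothing like it exists in the Linux driver:

    /* Illustrative shape of a journalled metadata update (xnu-style
     * API, from my reading of bsd/vfs/vfs_journal.c - not Linux code) */
    journal_start_transaction(jnl);
    /* for each metadata buffer the operation dirties: */
    journal_modify_block_start(jnl, bp);
    /* ... modify the catalog/extents/allocation blocks in bp ... */
    journal_modify_block_end(jnl, bp, NULL, NULL);
    journal_end_transaction(jnl);   /* the transaction becomes replayable */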

With the best regards,
Vyacheslav Dubeyko.

> Hin-Tak
> 
> > With the best regards,
> > Vyacheslav Dubeyko.
> > 
> > > I'll be interested on hearing any tips on finding out
> > kmap's limit at
> > > run time, if anybody has any idea...
> > > 
> > > Hin-Tak
> > 
> > 
> > 





* Re: hfsplus BUG(), kmap and journalling.
  2012-10-22  9:02     ` Vyacheslav Dubeyko
@ 2012-10-30  8:24       ` Hin-Tak Leung
  2012-10-30 11:45         ` Vyacheslav Dubeyko
  0 siblings, 1 reply; 7+ messages in thread
From: Hin-Tak Leung @ 2012-10-30  8:24 UTC (permalink / raw)
  To: Vyacheslav Dubeyko; +Cc: linux-fsdevel, Till Kamppeter, Naohiro Aota

[-- Attachment #1: Type: text/plain, Size: 4664 bytes --]



Vyacheslav Dubeyko wrote:
> On Sat, 2012-10-20 at 07:24 +0100, Hin-Tak Leung wrote:
>> --- On Fri, 19/10/12, Vyacheslav Dubeyko <slava@dubeyko.com> wrote:
>>
>>> Hi Hin-Tak,
>>>
>>> On Thu, 2012-10-18 at 17:55 +0100, Hin-Tak Leung wrote:
>>>> Hi,
>>>>
>>>> While looking at a few of the older BUG() traces I get consistently
>>>> when running du on a somewhat large directory with lots of small
>>>> files and small directories, I noticed that they tend to have two
>>>> sleeping "? hfs_bnode_read()" entries towards the top. As it is a
>>>> very small and simple function which just reads a b-tree node
>>>> record - sometimes only a few bytes between a kmap()/kunmap() pair
>>>> - I suspect the cause might just be the number of simultaneous
>>>> kmap()s being run. So I put a mutex around it just to make sure
>>>> only one copy of hfs_bnode_read() runs at a time.
>>>
>>> Yes, you have touched on a very important problem. The hfsplus
>>> driver needs to be reworked away from kmap()/kunmap(), because
>>> kmap() is slow, theoretically deadlock-prone, and deprecated. The
>>> alternative is kmap_atomic(), but that requires diving more deeply
>>> into every use of kmap() in the hfsplus driver.
>>>
>>> The mutex is useless. It simply hides the issue.
>>
>> Yes, I am aware of that - putting mutexes in just means fewer simultaneous kmap() calls, but the limit of simultaneous kmap()s can still be reached - and reasonably easily: just run 'du' a few more times, as I wrote below.
>>
>
> Usually, hfs_bnode_read() is called after searching for some object in
> the b-tree. Struct hfs_find_data has to be initialized by means of
> hfs_find_init() before any search or operation inside a b-tree node,
> and hfs_find_exit() has to be called afterwards. hfs_find_init() takes
> a mutex that locks the whole b-tree, and hfs_find_exit() releases it.
> And, usually, hfs_bnode_read() is placed between hfs_find_init()/
> hfs_find_exit() calls. So, as I understand it, your mutex inside
> hfs_bnode_read() is useless. But maybe not all hfs_bnode_read() calls
> are placed inside hfs_find_init()/hfs_find_exit() pairs; that needs
> checking.

> I can't clearly understand which simultaneous kmap()s you are talking
> about. As far as I can see, usually (especially in the case of
> hfs_bnode_read()) the kmap()/kunmap() pairs are localized in small
> parts of the code, and I expect them to execute quickly. Am I
> misunderstanding something?

Perhaps it is easier to show some code for the discussion. The attached 
patch serializes the kmap in hfs_bnode_read(). This makes the driver 
work more reliably, somehow. In real usage, multiple instances of 
hfs_bnode_read() do get invoked in parallel. I assume that is because 
both SMP and the file system access itself are parallelized at various 
levels - e.g. a recursive walk like running du probably invokes a few 
instances of readdir/getdents?

>> I tried swapping those for kmap_atomic()/kunmap_atomic() (beware: the arguments are different for the unmap) - but the kernel immediately warned that *_atomic() code was being used where code can sleep.
>>
>
> Could you describe the warning in more detail?

In hfs_bnode_read(), I tried changing kmap()/kunmap() to 
kmap_atomic()/kunmap_atomic(). (Sorry, I deleted the change since it 
does not work - but it shouldn't be too difficult to redo, since it is 
just changing two lines in hfs_bnode_read().) Then I get many:

BUG: scheduling while atomic: ...

<a lot of stuff snipped>

Netgear's hfs+ journalling code does try to replay journals, etc. But 
(1) it does not *create* journal entries correctly for the attributes 
file, since the implementation of the attributes file itself was 
incomplete until recently, and (2) I suspect it does not create/update 
journals in the *exact same* way as Mac OS X - which means it is unsafe 
to do cross-OS unclean mounts, even when unclean mounting works 
perfectly by itself under Linux.

BTW, I think I see another bug - unrelated to journalling - in the 
current hfs+ code: folder counts are not updated correctly. It seems 
that hfs+ folders have recursive folder counts (i.e. a parent dir knows 
how many sub-dirs it has, without needing to recurse down), and these 
are currently not updated correctly. One way of demoing this is to:

    (make sure fs is fsck-clean ; mount)

    cp -ipr somedir_with_some_files/ some_hfs+_place/someplace_inside/

    (umount; run fsck_hfs again, now it complains about folder counts)

The actual message looks a bit like this:

** Checking catalog hierarchy.
    HasFolderCount flag needs to be set (id = 393055)
    (It should be 0x10 instead of 0)
    Incorrect folder count in a directory (id = 205)
    (It should be 18 instead of 17)

[-- Attachment #2: 0001-serialize-hfs_bnode_read_kmap.patch --]
[-- Type: text/x-patch, Size: 1520 bytes --]

From 3b69f7f0b2827d00daf20aa75991e8d184be5ad0 Mon Sep 17 00:00:00 2001
From: Hin-Tak Leung <htl10@users.sourceforge.net>
Date: Thu, 11 Oct 2012 01:46:49 +0100
Subject: [PATCH] serialize hfs_bnode_read_kmap


Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
---
 fs/hfsplus/bnode.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/hfsplus/bnode.c b/fs/hfsplus/bnode.c
index 5c125ce..7c19ad9 100644
--- a/fs/hfsplus/bnode.c
+++ b/fs/hfsplus/bnode.c
@@ -17,6 +17,20 @@
 #include "hfsplus_fs.h"
 #include "hfsplus_raw.h"
 
+static DEFINE_MUTEX(hfs_bnode_read_kmap_mutex_mutex);
+
+static inline void
+hfs_bnode_read_kmap_mutex_lock(void)
+{
+	mutex_lock(&hfs_bnode_read_kmap_mutex_mutex);
+}
+
+static inline void
+hfs_bnode_read_kmap_mutex_unlock(void)
+{
+	mutex_unlock(&hfs_bnode_read_kmap_mutex_mutex);
+}
+
 /* Copy a specified range of bytes from the raw data of a node */
 void hfs_bnode_read(struct hfs_bnode *node, void *buf, int off, int len)
 {
@@ -28,14 +42,18 @@ void hfs_bnode_read(struct hfs_bnode *node, void *buf, int off, int len)
 	off &= ~PAGE_CACHE_MASK;
 
 	l = min(len, (int)PAGE_CACHE_SIZE - off);
+	hfs_bnode_read_kmap_mutex_lock();
 	memcpy(buf, kmap(*pagep) + off, l);
 	kunmap(*pagep);
+	hfs_bnode_read_kmap_mutex_unlock();
 
 	while ((len -= l) != 0) {
 		buf += l;
 		l = min(len, (int)PAGE_CACHE_SIZE);
+		hfs_bnode_read_kmap_mutex_lock();
 		memcpy(buf, kmap(*++pagep), l);
 		kunmap(*pagep);
+		hfs_bnode_read_kmap_mutex_unlock();
 	}
 }
 
-- 
1.7.11.7



* Re: hfsplus BUG(), kmap and journalling.
  2012-10-30  8:24       ` Hin-Tak Leung
@ 2012-10-30 11:45         ` Vyacheslav Dubeyko
  2012-11-02  5:43           ` hfsplus foldercount (Re: hfsplus BUG(), kmap and journalling.) Hin-Tak Leung
  0 siblings, 1 reply; 7+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-30 11:45 UTC (permalink / raw)
  To: Hin-Tak Leung; +Cc: linux-fsdevel, Till Kamppeter, Naohiro Aota

Hi Hin-Tak,

On Tue, 2012-10-30 at 08:24 +0000, Hin-Tak Leung wrote:
> 

[snip]
> >
> > Usually, hfs_bnode_read() is called after searching for some object
> > in the b-tree. Struct hfs_find_data has to be initialized by means
> > of hfs_find_init() before any search or operation inside a b-tree
> > node, and hfs_find_exit() has to be called afterwards.
> > hfs_find_init() takes a mutex that locks the whole b-tree, and
> > hfs_find_exit() releases it. And, usually, hfs_bnode_read() is
> > placed between hfs_find_init()/hfs_find_exit() calls. So, as I
> > understand it, your mutex inside hfs_bnode_read() is useless. But
> > maybe not all hfs_bnode_read() calls are placed inside
> > hfs_find_init()/hfs_find_exit() pairs; that needs checking.
> 
> > I can't clearly understand which simultaneous kmap()s you are
> > talking about. As far as I can see, usually (especially in the case
> > of hfs_bnode_read()) the kmap()/kunmap() pairs are localized in
> > small parts of the code, and I expect them to execute quickly. Am I
> > misunderstanding something?
> 
> Perhaps it is easier to show some code for the discussion. The
> attached patch serializes the kmap in hfs_bnode_read(). This makes
> the driver work more reliably, somehow. In real usage, multiple
> instances of hfs_bnode_read() do get invoked in parallel. I assume
> that is because both SMP and the file system access itself are
> parallelized at various levels - e.g. a recursive walk like running
> du probably invokes a few instances of readdir/getdents?
> 

As I understand it, the readdir/getdents syscalls end up calling the
hfsplus_readdir() method. hfs_find_init() is called at the beginning of
hfsplus_readdir() (as in hfsplus_lookup() also), and it locks the mutex
on the whole catalog tree. Then hfs_bnode_read() and the other methods
that operate on b-tree content are called in the environment of the
locked b-tree. Finally, hfs_find_exit() is called at the end of
hfsplus_readdir(), and it unlocks the mutex. So even if you have
several threads making readdir/getdents calls, only one thread at a
time will operate on the b-tree inside a hfsplus_readdir() operation.

What does the serialized kmap in hfs_bnode_read() achieve, then? How is
it possible to have such a situation while the b-tree is locked?


> >> I tried swapping those for kmap_atomic()/kunmap_atomic() (beware: the arguments are different for the unmap) - but the kernel immediately warned that *_atomic() code was being used where code can sleep.
> >>
> >
> > Could you describe the warning in more detail?
> 
> In hfs_bnode_read(), I tried changing kmap()/kunmap() to
> kmap_atomic()/kunmap_atomic(). (Sorry, I deleted the change since it
> does not work - but it shouldn't be too difficult to redo, since it is
> just changing two lines in hfs_bnode_read().) Then I get many:
> 
> BUG: scheduling while atomic: ...
> 

That is strange. I need to check it.

[snip]
> BTW, I think I see another bug - unrelated to journalling - in the
> current hfs+ code: folder counts are not updated correctly. It seems
> that hfs+ folders have recursive folder counts (i.e. a parent dir
> knows how many sub-dirs it has, without needing to recurse down), and
> these are currently not updated correctly. One way of demoing this is
> to:
> 
>     (make sure fs is fsck-clean ; mount)
> 
>     cp -ipr somedir_with_some_files/ some_hfs+_place/someplace_inside/
> 
>     (umount; run fsck_hfs again, now it complains about folder counts)
> 
> The actual message looks a bit like this:
> 
> ** Checking catalog hierarchy.
>     HasFolderCount flag needs to be set (id = 393055)
>     (It should be 0x10 instead of 0)
>     Incorrect folder count in a directory (id = 205)
>     (It should be 18 instead of 17)

Sorry, I can't reproduce this issue. I tried to reproduce it as you
described. Maybe you are missing something?

With the best regards,
Vyacheslav Dubeyko.




* hfsplus foldercount (Re: hfsplus BUG(), kmap and journalling.)
  2012-10-30 11:45         ` Vyacheslav Dubeyko
@ 2012-11-02  5:43           ` Hin-Tak Leung
  0 siblings, 0 replies; 7+ messages in thread
From: Hin-Tak Leung @ 2012-11-02  5:43 UTC (permalink / raw)
  To: Vyacheslav Dubeyko; +Cc: linux-fsdevel, Till Kamppeter, Naohiro Aota

--- On Tue, 30/10/12, Vyacheslav Dubeyko <slava@dubeyko.com> wrote:

<snipped>
> [snip]
> > BTW, I think I see another bug - unrelated to journalling - in the
> > current hfs+ code: folder counts are not updated correctly. It
> > seems that hfs+ folders have recursive folder counts (i.e. a parent
> > dir knows how many sub-dirs it has, without needing to recurse
> > down), and these are currently not updated correctly. One way of
> > demoing this is to:
> > 
> >     (make sure fs is fsck-clean ; mount)
> > 
> >     cp -ipr somedir_with_some_files/ some_hfs+_place/someplace_inside/
> > 
> >     (umount; run fsck_hfs again, now it complains about folder counts)
> > 
> > The actual message looks a bit like this:
> > 
> > ** Checking catalog hierarchy.
> >     HasFolderCount flag needs to be set (id = 393055)
> >     (It should be 0x10 instead of 0)
> >     Incorrect folder count in a directory (id = 205)
> >     (It should be 18 instead of 17)
> 
> Sorry, I can't reproduce this issue. I tried to reproduce it as you
> described. Maybe you are missing something?

Since it is easily reproducible for me, I thought I'd have a look one day... I got side-tracked into looking at the journal code in the file system driver of Apple's xnu kernel, which turned out to be interesting. Just running grep -R -i foldercount shows, among other things, a very important comment:

bsd/hfs/hfs_vfsutils.c:	/* kHFSHasFolderCount is only supported/updated on HFSX volumes */

So the folder count feature is "optional" and only switched on for the case-sensitive variant of HFS+. Yes, while my primary interest is in journalling, for inter-operation with Linux I somewhat naively always chose to format my test volumes as HFSX, rather than the plain case-preserving-but-not-sensitive variant, not knowing that there are important differences.
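
A fix would presumably, on HFSX only, bump the parent's count and set that flag whenever a sub-folder is created or removed - something like the hypothetical sketch below. The field and constant names are my guesses, not the driver's; 0x0010 is the HasFolderCount flag value from the fsck_hfs message earlier in the thread:

    /* Hypothetical sketch - field/constant names are guesses. */
    #define HFSPLUS_HAS_FOLDER_COUNT        cpu_to_be16(0x0010)

    static void hfsplus_inc_folder_count(struct hfsplus_cat_folder *parent)
    {
            be32_add_cpu(&parent->subfolders, 1);
            parent->flags |= HFSPLUS_HAS_FOLDER_COUNT;
    }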

I'll find some time to have a look at what else is hiding in Apple's xnu kernel code... there are probably more interesting "optional" but important differences.

Hin-Tak


end of thread, other threads:[~2012-11-02  5:43 UTC | newest]

Thread overview: 7+ messages; links below jump to the message on this page:
2012-10-18 16:55 hfsplus BUG(), kmap and journalling Hin-Tak Leung
2012-10-19 12:45 ` Vyacheslav Dubeyko
2012-10-20  6:24   ` Hin-Tak Leung
2012-10-22  9:02     ` Vyacheslav Dubeyko
2012-10-30  8:24       ` Hin-Tak Leung
2012-10-30 11:45         ` Vyacheslav Dubeyko
2012-11-02  5:43           ` hfsplus foldercount (Re: hfsplus BUG(), kmap and journalling.) Hin-Tak Leung
