* Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Cuong Tran @ 2013-09-12  4:17 UTC
  To: linux-ext4, linux-fsdevel

We have seen GC stalls that are NOT due to memory usage of the applications.

The GC log reports the CPU user and system time of the GC threads, which are
almost zero, and the stop-the-world time, which can be several seconds. This
indicates that the GC threads are waiting for IO, even though GC threads
should be CPU-bound in user mode.

We can reproduce the problem with a simple Java program that just appends to
a log file via log4j. If the test runs by itself, it does not incur any GC
stalls. However, if we also run a script that loops creating several large
files via fallocate() and then deleting them, GC stalls of 1+ seconds happen
fairly predictably.

We can also reproduce the problem by periodically rotating the log and
gzipping the older log. The IO device, a single disk drive, is overloaded by
the FS flush when this happens.

Our guess is that the GC has to quiesce its threads, and if one of them is
stuck in the kernel (say, in uninterruptible sleep), the GC has to wait until
that thread unblocks -- and in the meantime it has already stopped the world.

Another test that shows a similar problem does deferred (buffered) writes to
append a file. The latency of the deferred writes is normally very low, but
once in a while a write can take more than 1 second.

We would really appreciate it if you could shed some light on possible
causes. (Threads blocked on a journal checkpoint? Delayed allocation that
cannot proceed?) We can alleviate the problem by lowering
dirty_expire_centisecs and dirty_writeback_centisecs so that flushing happens
more often, which evens out the workload to the disk drive. But we would like
to know whether there is a methodology for modelling the rate of flushing vs.
the rate of change and the IO throughput of the drive (SAS, 15K RPM).

Many thanks.
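For concreteness, a rough sketch of the reproduction described above (the
file count, file size, and mount point are made-up examples, not taken from
the report):

    cd /mnt/test                         # hypothetical scratch file system
    for i in $(seq 1 10); do
        fallocate -l 2G big.$i           # create several large files...
    done
    rm -f big.*                          # ...then delete them while the Java/log4j appender runs
    # writeback knobs mentioned at the end of the message (kernel defaults: 3000 / 500)
    sudo sysctl vm.dirty_expire_centisecs=1000 vm.dirty_writeback_centisecs=100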
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Sidorov, Andrei @ 2013-09-12  5:32 UTC
  To: Cuong Tran; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

Hi,

Large file deletions are likely to lock a CPU up for seconds if you're
running a non-preemptible kernel older than 3.10.
Make sure you have this change:
http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
remember it right).
Turning on preemption may be a good idea as well.

Regards,
Andrei.

On 12.09.2013 00:18, Cuong Tran wrote:
> We have seen GC stalls that are NOT due to memory usage of the applications.
> [...]
> However, if we also run a script that loops creating several large files
> via fallocate() and then deleting them, GC stalls of 1+ seconds happen
> fairly predictably.
> [...]
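Two quick checks related to the advice above, assuming a distro kernel whose
build config is installed under /boot:

    uname -r                                  # older than 3.10?
    grep PREEMPT /boot/config-$(uname -r)     # CONFIG_PREEMPT_NONE / _VOLUNTARY / CONFIG_PREEMPT=y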
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Cuong Tran @ 2013-09-12  5:45 UTC
  To: Sidorov, Andrei; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

Awesome fix, and thanks for the very speedy response. I have some questions.
We delete the files one at a time; would that lock up one core or all cores?

And in our test we fallocate() the files without ever writing to them. Would
deleting those still free the blocks block-by-block?
--Cuong

On Wed, Sep 11, 2013 at 10:32 PM, Sidorov, Andrei
<Andrei.Sidorov@arrisi.com> wrote:
> Large file deletions are likely to lock a CPU up for seconds if you're
> running a non-preemptible kernel older than 3.10.
> Make sure you have this change:
> http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
> remember it right).
> Turning on preemption may be a good idea as well.
> [...]
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Sidorov, Andrei @ 2013-09-12  6:01 UTC
  To: Cuong Tran; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

It would lock up one core -- whichever core the jbd2/sdaX thread runs on.
This will usually happen at commit time, which runs every N seconds, 5 by
default (see the "commit" mount option for ext4). I.e. deleting 5 files one
by one with a 1-second interval in between is basically the same as deleting
all of them at once.

Yes, fallocated files behave the same with respect to releasing blocks.

Regards,
Andrei.

On 12.09.2013 01:45, Cuong Tran wrote:
> Awesome fix, and thanks for the very speedy response. I have some questions.
> We delete the files one at a time; would that lock up one core or all cores?
>
> And in our test we fallocate() the files without ever writing to them. Would
> deleting those still free the blocks block-by-block?
> [...]
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Sidorov, Andrei @ 2013-09-12  6:02 UTC
  To: Cuong Tran; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

It would lock up one core -- whichever core the jbd2/sdaX thread runs on.
This will usually happen at commit time, which runs every N seconds, 5 by
default (see the "commit" mount option for ext4). I.e. deleting 5 files one
by one with a 1-second interval in between is basically the same as deleting
all of them at once.

Yes, fallocated files behave the same with respect to releasing blocks.

Regards,
Andrei.

On 12.09.2013 01:45, Cuong Tran wrote:
> Awesome fix, and thanks for the very speedy response. I have some questions.
> We delete the files one at a time; would that lock up one core or all cores?
>
> And in our test we fallocate() the files without ever writing to them. Would
> deleting those still free the blocks block-by-block?
> [...]
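To illustrate the "commit" mount option mentioned above (the device and the
1-second interval are examples only, not recommendations):

    grep sda1 /proc/mounts              # look for commit=N among the ext4 options
    sudo mount -o remount,commit=1 /    # commit the journal every 1s instead of the default 5s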
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Cuong Tran @ 2013-09-12  6:08 UTC
  To: Sidorov, Andrei; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

My desktop has 8 cores, including hyperthreading. So deleting files would
lock up one core, but shouldn't the GC threads be unaffected if a single
locked-up core is the only issue?

Would the number of journal records be proportional to the number of blocks
deleted? In other words, would deleting N blocks one block at a time create
N times more journal records than deleting all N blocks in one shot?

--Cuong

On Wed, Sep 11, 2013 at 11:02 PM, Sidorov, Andrei
<Andrei.Sidorov@arrisi.com> wrote:
> It would lock up one core -- whichever core the jbd2/sdaX thread runs on.
> This will usually happen at commit time, which runs every N seconds, 5 by
> default (see the "commit" mount option for ext4).
>
> Yes, fallocated files behave the same with respect to releasing blocks.
> [...]
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Sidorov, Andrei @ 2013-09-12 15:47 UTC
  To: Cuong Tran; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

If there are threads bound to that core, they probably won't be able to make
progress for the duration of the commit. I don't think the scheduler would
migrate them to a different core immediately. So if the GC wants to talk to
those threads, it would block as well.

Of course deleting many small files is much slower than deleting a single
file of the same total size. The number of records is roughly the number of
files deleted, split at allocation-group boundaries. The group size is 128M
unless you enabled the bigalloc feature. That is, in the simplest case the
release-blocks entry for one 256M file contains a single record covering 2
groups to free, while the release-blocks entry for 256 1M files contains 256
records that have to be processed separately. Before that change, the commit
time for removing one 256M file was the same as for removing 256 1M files
(with respect to the time taken to release blocks). After the change,
releasing the blocks of the 256M file takes 2 "iterations" as opposed to 256.

Basically, there is no such thing as "deleting N blocks", unless you delete a
file by progressively truncating it towards zero.

Regards,
Andrei.

On 12.09.2013 02:08, Cuong Tran wrote:
> My desktop has 8 cores, including hyperthreading. So deleting files would
> lock up one core, but shouldn't the GC threads be unaffected if a single
> locked-up core is the only issue?
>
> Would the number of journal records be proportional to the number of blocks
> deleted? In other words, would deleting N blocks one block at a time create
> N times more journal records than deleting all N blocks in one shot?
> [...]
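The allocation-group geometry referred to above can be read from the
superblock; with the values reported elsewhere in this thread (4096-byte
blocks, 32768 blocks per group), one group spans 4096 * 32768 bytes = 128 MiB,
so freeing one 256 MiB extent touches about 2 groups, while freeing 256
separate 1 MiB files produces on the order of 256 release-blocks records
(the device name is an example):

    sudo dumpe2fs -h /dev/sda1 | grep -E 'Block size|Blocks per group'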
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Cuong Tran @ 2013-09-12 16:58 UTC
  To: Sidorov, Andrei; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

Andrei, regarding core binding: our test program has only one thread, which
appends to the log, and I did not explicitly bind the test to any core. Even
without core binding, would a thread already scheduled on that core (say, due
to NUMA) be stuck until the core becomes available, or does the kernel
migrate the poor thread?

Regarding journaling, your explanation is very clear. I read that a record
takes up one or more buffers of 4 KB each, and that the default journal size
is 128 MB regardless of partition size. So the journal can fill up with
roughly 32,000 records before it has to checkpoint. Is this correct?

Thanks again for your help.
--Cuong

On Thu, Sep 12, 2013 at 8:47 AM, Sidorov, Andrei
<Andrei.Sidorov@arrisi.com> wrote:
> If there are threads bound to that core, they probably won't be able to make
> progress for the duration of the commit. I don't think the scheduler would
> migrate them to a different core immediately. So if the GC wants to talk to
> those threads, it would block as well.
> [...]
> Basically, there is no such thing as "deleting N blocks", unless you delete a
> file by progressively truncating it towards zero.
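A back-of-the-envelope check of the figure in the question: assuming the
128 MB default journal and 4 KB journal blocks mentioned there, 128 MiB /
4 KiB = 32768 blocks, which is in the ballpark of the "32,000" estimate. The
journal itself lives in reserved inode 8 (see "Journal inode: 8" in the
tune2fs output elsewhere in this thread), so its actual size can be checked
with debugfs (the device name is an example):

    sudo debugfs -R 'stat <8>' /dev/sda1 | grep -i size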
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Zheng Liu @ 2013-09-12  7:46 UTC
  To: Cuong Tran; +Cc: linux-ext4, linux-fsdevel

Hello Cuong,

Could you please tell us the kernel version? Meanwhile it would be better if
you could paste the result of the following commands:
 * sudo tune2fs -l ${DEV}
 * cat /proc/mounts | grep ${DEV}

I want to know which features are enabled in your file system. From your
description, I guess delayed allocation is enabled, so I suggest that you try
disabling it. You can disable it with the following command:
 * sudo mount -t ext4 -o remount,nodelalloc ${DEV} ${MNT}

Regards,
                                                - Zheng

On Wed, Sep 11, 2013 at 09:17:26PM -0700, Cuong Tran wrote:
> We have seen GC stalls that are NOT due to memory usage of the applications.
> [...]
> Another test that shows a similar problem does deferred (buffered) writes to
> append a file. The latency of the deferred writes is normally very low, but
> once in a while a write can take more than 1 second.
> [...]
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Cuong Tran @ 2013-09-12 11:46 UTC
  To: Cuong Tran, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

Hi Zheng, this is the output of tune2fs:

tune2fs 1.41.12 (17-May-2010)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          358707f1-8997-4e96-8110-d9f929073e68
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    journal_data_writeback user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              23281664
Block count:              93110272
Reserved block count:     4655513
Free blocks:              44878839
Free inodes:              22074662
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1001
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Fri Oct  7 05:01:07 2011
Last mount time:          Mon Sep  9 18:49:39 2013
Last write time:          Wed Aug 21 10:48:03 2013
Mount count:              23
Maximum mount count:      -1
Last checked:             Fri Oct  7 05:01:07 2011
Check interval:           0 (<none>)
Lifetime writes:          1795 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       4207998
Default directory hash:   half_md4
Directory Hash Seed:      4de319c2-fc2d-4771-b4de-b63acd610ba0
Journal backup:           inode blocks

And these are the mount options (the fstab entry):

UUID=358707f1-8997-4e96-8110-d9f929073e68  /  ext4  defaults  1 1

Yes, I forgot to mention that in the deferred-write test, turning off delayed
allocation or using preallocation makes the problem largely go away. I say
"largely" because it still happens, just much less frequently.

Thanks.
--Cuong

On Thu, Sep 12, 2013 at 12:46 AM, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> Could you please tell us the kernel version? Meanwhile it would be better if
> you could paste the result of the following commands:
>  * sudo tune2fs -l ${DEV}
>  * cat /proc/mounts | grep ${DEV}
>
> I want to know which features are enabled in your file system. From your
> description, I guess delayed allocation is enabled, so I suggest that you
> try disabling it:
>  * sudo mount -t ext4 -o remount,nodelalloc ${DEV} ${MNT}
> [...]
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Theodore Ts'o @ 2013-09-12 19:02 UTC
  To: Cuong Tran; +Cc: linux-ext4, linux-fsdevel

Are you absolutely certain your JVM isn't attempting to write to any files
from its GC thread -- say, to do some kind of logging? It might be worth
stracing the JVM and correlating the GC stalls with any syscalls that might
have been issued from the JVM GC thread.

Especially in the case of the FS flush, the writeback thread isn't CPU bound.
It will wait for the writeback to complete, but while it's waiting, other
processes or threads will be allowed to run on the CPU.

Now, if the GC thread tries to do some kind of fs operation which requires
writing to the file system, and the file system is trying to start a jbd
transaction commit, file system operations can block until all of the jbd
handles associated with the previous commit have completed. If your storage
devices are slow, or you are using a block cgroup to control how much I/O
bandwidth a particular cgroup can use, this can end up causing a priority
inversion: I/O issued by a low-priority cgroup takes a while to complete,
which stalls the jbd commit completion, which in turn causes new ext4
operations to stall waiting to start a new jbd handle.

So you could have a stall happening if it's taking a long time for commits to
complete, but it might be completely unrelated to a GC stall.

If you enable the jbd2_run_stats tracepoint, you can get some interesting
numbers about how long the various phases of the jbd2 commit are taking.

						- Ted
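For anyone following along, the jbd2 tracepoints mentioned here can be
enabled through ftrace (paths assume debugfs/tracefs is mounted at
/sys/kernel/debug); the next message shows sample output:

    cd /sys/kernel/debug/tracing
    echo 1 > events/jbd2/jbd2_run_stats/enable
    echo 1 > events/jbd2/jbd2_checkpoint_stats/enable
    cat trace_pipe        # stream the events; Ctrl-C to stop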
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Cuong Tran @ 2013-09-13 15:25 UTC
  To: Theodore Ts'o; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

Thanks Ted, that's great info. I did strace, and some stalls can be
correlated with the GC thread writing to the log, but others cannot.

I turned on the jbd2_run_stats trace point, and this is a sample trace from
creating and deleting 300 100MB files in a loop. (My desktop is running low
on disk space!) No Java test is running. I also turned on
jbd2_checkpoint_stats. I'm not sure whether the "running" count of around
5000, or the frequency of jbd2_checkpoint_stats, has any relevance.

<...>-534 [003] 64044.506552: jbd2_checkpoint_stats: dev sda1 tid 5433920 chp_time 0 forced_to_close 0 written 0 dropped 6
<...>-534 [003] 64044.506555: jbd2_run_stats: dev sda1 tid 5433921 wait 0 running 5003 locked 0 flushing 0 logging 13 handle_count 1587 blocks 6 blocks_logged 7
<...>-534 [003] 64049.505782: jbd2_checkpoint_stats: dev sda1 tid 5433921 chp_time 0 forced_to_close 0 written 0 dropped 6
<...>-534 [003] 64049.505784: jbd2_run_stats: dev sda1 tid 5433922 wait 0 running 5003 locked 0 flushing 0 logging 15 handle_count 1585 blocks 6 blocks_logged 7
<...>-534 [003] 64149.417563: jbd2_checkpoint_stats: dev sda1 tid 5433941 chp_time 0 forced_to_close 0 written 0 dropped 6
<...>-534 [003] 64149.417565: jbd2_run_stats: dev sda1 tid 5433942 wait 0 running 5003 locked 0 flushing 0 logging 14 handle_count 1600 blocks 6 blocks_logged 7
<...>-534 [003] 64234.335940: jbd2_checkpoint_stats: dev sda1 tid 5433958 chp_time 0 forced_to_close 0 written 0 dropped 6
<...>-534 [003] 64234.335942: jbd2_run_stats: dev sda1 tid 5433959 wait 0 running 5000 locked 0 flushing 0 logging 13 handle_count 1574 blocks 6 blocks_logged 7
<...>-534 [004] 64658.799432: jbd2_run_stats: dev sda1 tid 5434029 wait 0 running 5004 locked 0 flushing 0 logging 66 handle_count 442 blocks 1139 blocks_logged 1143
<...>-534 [000] 64707.948801: jbd2_run_stats: dev sda1 tid 5434034 wait 0 running 5002 locked 0 flushing 0 logging 73 handle_count 8416 blocks 1475 blocks_logged 1480
<...>-534 [000] 64722.940015: jbd2_run_stats: dev sda1 tid 5434037 wait 0 running 5008 locked 0 flushing 0 logging 72 handle_count 8442 blocks 1479 blocks_logged 1484
<...>-534 [001] 64727.933030: jbd2_run_stats: dev sda1 tid 5434038 wait 0 running 5001 locked 0 flushing 0 logging 73 handle_count 19208 blocks 1389 blocks_logged 1394
<...>-534 [001] 64737.922818: jbd2_run_stats: dev sda1 tid 5434040 wait 0 running 5002 locked 0 flushing 0 logging 77 handle_count 10147 blocks 1405 blocks_logged 1410
<...>-534 [001] 64742.908500: jbd2_run_stats: dev sda1 tid 5434041 wait 0 running 5003 locked 0 flushing 0 logging 69 handle_count 8474 blocks 1219 blocks_logged 1223
<...>-534 [001] 64747.901419: jbd2_run_stats: dev sda1 tid 5434042 wait 0 running 5003 locked 0 flushing 0 logging 66 handle_count 8452 blocks 1206 blocks_logged 1210
<...>-534 [001] 64752.898315: jbd2_checkpoint_stats: dev sda1 tid 5434041 chp_time 0 forced_to_close 0 written 0 dropped 1219

And this is a sample of the traces from creating and deleting one 20 GB file
in a loop:

<...>-534 [001] 64511.057009: jbd2_run_stats: dev sda1 tid 5434012 wait 0 running 5460 locked 0 flushing 0 logging 47 handle_count 2569 blocks 299 blocks_logged 300
<...>-534 [001] 64517.051769: jbd2_run_stats: dev sda1 tid 5434013 wait 0 running 5592 locked 0 flushing 0 logging 52 handle_count 3847 blocks 583 blocks_logged 585
<...>-534 [000] 64522.958310: jbd2_run_stats: dev sda1 tid 5434014 wait 0 running 5905 locked 0 flushing 0 logging 54 handle_count 2570 blocks 572 blocks_logged 574
<...>-534 [001] 64528.027403: jbd2_run_stats: dev sda1 tid 5434015 wait 0 running 5085 locked 0 flushing 0 logging 47 handle_count 3845 blocks 281 blocks_logged 282
<...>-534 [000] 64534.019586: jbd2_checkpoint_stats: dev sda1 tid 5434012 chp_time 0 forced_to_close 0 written 0 dropped 299
<...>-534 [000] 64534.019610: jbd2_run_stats: dev sda1 tid 5434016 wait 0 running 5490 locked 0 flushing 0 logging 48 handle_count 2568 blocks 298 blocks_logged 299
<...>-534 [001] 64540.255202: jbd2_checkpoint_stats: dev sda1 tid 5434013 chp_time 0 forced_to_close 0 written 0 dropped 583
<...>-534 [001] 64540.255206: jbd2_run_stats: dev sda1 tid 5434017 wait 0 running 5581 locked 283 flushing 0 logging 54 handle_count 3849 blocks 692 blocks_logged 695
<...>-534 [001] 64547.069532: jbd2_checkpoint_stats: dev sda1 tid 5434016 chp_time 0 forced_to_close 0 written 0 dropped 8
<...>-534 [001] 64547.114704: jbd2_checkpoint_stats: dev sda1 tid 5434015 chp_time 0 forced_to_close 0 written 0 dropped 260
<...>-534 [001] 64547.114724: jbd2_checkpoint_stats: dev sda1 tid 5434014 chp_time 0 forced_to_close 0 written 0 dropped 571
<...>-534 [001] 64547.114729: jbd2_checkpoint_stats: dev sda1 tid 5434017 chp_time 0 forced_to_close 0 written 0 dropped 8
<...>-534 [001] 64547.114731: jbd2_run_stats: dev sda1 tid 5434018 wait 0 running 5603 locked 276 flushing 0 logging 45 handle_count 3851 blocks 281 blocks_logged 282
<...>-534 [001] 64553.974915: jbd2_run_stats: dev sda1 tid 5434019 wait 0 running 5593 locked 274 flushing 0 logging 48 handle_count 3849 blocks 302 blocks_logged 303
<...>-534 [001] 64559.979010: jbd2_run_stats: dev sda1 tid 5434020 wait 0 running 5005 locked 0 flushing 0 logging 56 handle_count 3847 blocks 669 blocks_logged 671
<...>-534 [001] 64565.952850: jbd2_run_stats: dev sda1 tid 5434021 wait 0 running 5400 locked 0 flushing 0 logging 52 handle_count 12933 blocks 559 blocks_logged 561

It would be awesome if you could shed some light on the counters and their
interactions. Thanks again.
--Cuong

On Thu, Sep 12, 2013 at 12:02 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Are you absolutely certain your JVM isn't attempting to write to any files
> from its GC thread -- say, to do some kind of logging? It might be worth
> stracing the JVM and correlating the GC stalls with any syscalls that might
> have been issued from the JVM GC thread.
> [...]
> If you enable the jbd2_run_stats tracepoint, you can get some interesting
> numbers about how long the various phases of the jbd2 commit are taking.
>
>						- Ted
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Theodore Ts'o @ 2013-09-13 18:36 UTC
  To: Cuong Tran; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Fri, Sep 13, 2013 at 08:25:30AM -0700, Cuong Tran wrote:
> I did strace, and some stalls can be correlated with the GC thread writing
> to the log, but others cannot.

Something else which just came to mind --- depending on the kernel version
you are using, you may be getting hit by stable page writeback. This is where
a process tries to modify an mmap'ed page while it is being written back to
disk. With stable page writeback, an attempt to modify a page which is being
written back to disk causes a page fault, and the process blocks until the
page has been written back.

This is a feature which is required for certain block devices such as iSCSI,
SAS drives with DIF, and RAID 5, since if the page changes out from under the
writeback, the checksum (or parity strip in the case of RAID 5) of the page
will be broken. However, it can cause random process stalls.

Some of us (Google and Taobao) have hacked kernels which forcibly disable
stable page writeback, because it causes latency problems. (Since we weren't
using any of the block devices that require this feature, it was OK.) In the
most recent kernels, there are changes which automatically disable stable
page writeback for those block devices that don't need it.

> I turned on the jbd2_run_stats trace point, and this is a sample trace from
> creating and deleting 300 100MB files in a loop. (My desktop is running low
> on disk space!) No Java test is running. I also turned on
> jbd2_checkpoint_stats.
>
> <...>-534 [003] 64044.506552: jbd2_checkpoint_stats: dev sda1 tid 5433920 chp_time 0 forced_to_close 0 written 0 dropped 6

The checkpoint time is zero, so we don't have any problems here. That's fine.

> <...>-534 [003] 64044.506555: jbd2_run_stats: dev sda1 tid 5433921 wait 0 running 5003 locked 0 flushing 0 logging 13 handle_count 1587 blocks 6 blocks_logged 7

The "run time" is the time between the previous transaction and when the
current transaction starts committing. Five seconds (all of the times in
jbd2_run_stats are in milliseconds) is the default commit interval. This
implies that the transaction was committed because of the commit timeout, and
not because either (a) an fsync() or fdatasync() forced it, or (b) the number
of changed blocks was large enough to force the commit to start.

So for this test, the jbd2_run_stats don't show anything interesting. We are
spending 13-15ms writing a half-dozen or so blocks to the journal. The handle
count is the number of individual file system operations in the commit, but
all of these handles are touching only 6 metadata blocks. I'm guessing this
means you are writing those 100MB files in relatively small chunks, which
explains why there are 1500+ handles started but so few metadata blocks
(primarily bitmap allocation blocks and inode table blocks) being modified in
those 5 seconds.

> <...>-534 [000] 64707.948801: jbd2_run_stats: dev sda1 tid 5434034 wait 0 running 5002 locked 0 flushing 0 logging 73 handle_count 8416 blocks 1475 blocks_logged 1480

OK, this is a bit more typical; in this five-second range, we modified 1475
metadata blocks across 8416 file system operations. Note that the commit
operation only took a total of 73 milliseconds, though. (That's because the
write to the journal is a sequential write, which tends to be fairly
efficient for HDDs.)

The bottom line is that it doesn't look like the journal commits are actually
taking that long. I certainly don't see anything to explain GC stalls in the
one-second range.

Regards,

						- Ted
* Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions
From: Cuong Tran @ 2013-09-14 18:47 UTC
  To: Theodore Ts'o; +Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org

Sorry for the belated response; I am packing for vacation. The point about
stable page writeback is very interesting. I will keep it in mind, although
other apps use mmap and this one does not.

Oh, I was not very clear in my earlier e-mails. The jbd2_run_stats traces are
from the tests that fallocate() large files (hence you did not see many
writes) and then delete them. The Java app was not running in these traces.
I will repeat the run with the Java app and let you know.

I hope you don't mind the following questions:

1. We observed that ext4 mounted with data=writeback seems less likely to
   stall than with data=ordered. Is this because ordered mode flushes the
   data blocks of an inode before committing the journal buffers, so threads
   writing to the same inode are more likely to block (and hence stall)
   during the journal commit? In our test, all threads write to the same
   inode.

2. We also observed that the threads are more likely to stall once writeback
   kicks in (dirty_background_ratio exceeded). Is this due to block
   allocations when the FS has to flush buffers that have delayed allocation?

3. Once the number of dirty buffers approaches the dirty_ratio threshold, the
   threads stall even more frequently than in case 2 above -- presumably
   because they are drafted into flushing the dirty buffers themselves?

4. I am very puzzled because I can correlate some, but not all, stalls with
   GC writes to the gc.log file, and I can't correlate the stalls with the
   deferred writes issued by the app threads. Those writes are append-only
   and can take up to 700 msec to return, but the strace timestamps of such
   writes do not line up with when the GC stalls happen. It is also unclear
   to me when the GC "parks" or quiesces the threads. AFAIK it is a
   cooperative model: application threads check (probably via futex) whether
   they need to park. What if one thread is stuck in the kernel while Eden
   needs compaction? The other threads cannot proceed until the stuck thread
   checks back in and parks. (I would have thought the JVM would signal the
   threads to recall them, but that does not seem to be the case.)

Thanks again, and have a good weekend.
--Cuong

On Fri, Sep 13, 2013 at 11:36 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> Something else which just came to mind --- depending on the kernel version
> you are using, you may be getting hit by stable page writeback. This is
> where a process tries to modify an mmap'ed page while it is being written
> back to disk.
> [...]
> The bottom line is that it doesn't look like the journal commits are
> actually taking that long. I certainly don't see anything to explain GC
> stalls in the one-second range.
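For reference, the knobs referred to in questions 1-3 above (the values are
purely illustrative, and note that ext4's data journalling mode normally has
to be chosen at mount time or in fstab rather than changed on a live
remount):

    cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
    sudo sysctl vm.dirty_background_ratio=5 vm.dirty_ratio=10
    sudo mount -o data=writeback /dev/sdb1 /mnt/test   # hypothetical test file system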
end of thread, other threads: [~2013-09-14 18:47 UTC | newest]

Thread overview: 14+ messages
2013-09-12  4:17 Java Stop-the-World GC stall induced by FS flush or many large file deletions Cuong Tran
2013-09-12  5:32 ` Sidorov, Andrei
2013-09-12  5:45   ` Cuong Tran
2013-09-12  6:01     ` Sidorov, Andrei
2013-09-12  6:02     ` Sidorov, Andrei
2013-09-12  6:08       ` Cuong Tran
2013-09-12 15:47         ` Sidorov, Andrei
2013-09-12 16:58           ` Cuong Tran
2013-09-12  7:46 ` Zheng Liu
2013-09-12 11:46   ` Cuong Tran
2013-09-12 19:02 ` Theodore Ts'o
2013-09-13 15:25   ` Cuong Tran
2013-09-13 18:36     ` Theodore Ts'o
2013-09-14 18:47       ` Cuong Tran