domenica 15 giugno 2014

BTRFS and systemd-journal


The BTRFS filesystem and Systemd are two news projects in the Linux eco-system. Systemd is the new init system, which is becoming default for the most distributions; in fact even Debian and Ubuntu have program to switch to it.
The same is true for the BTRFS filesystem, which is considered the next gen linux filesystem.
In this post I will analyse a performance problem which happens when the Systemd log system (journald) is used on a Btrfs filesystem.


Recently I switched from Debian to Fedora, because I was interested in Systemd, and the one shipped by Debian was quite old (204 vs 208). From the beginning I noticed that the Fedora boot time was longer than the Debian one. Initially I thought the better Debian performances were due to the fact that Debian is more light than Fedora (less number of service enabled). But after some tailoring the boot time still was longer.
However I didn't care it enough, because the boot is performed only one or two times per day. So this didn't hurt me too much
But recently I looked at this bug Bug 1006386 - Journal flushing often slow, can prevent system booting correctly[1]. Here was reported a performance problem due to the flushing of the journal on the permanent storage. So I decided to investigate a bit more deeply this issue.

Systemd journal

Systemd introduced a new log daemon called journald. It has some nice properties, one of the most important is that it is started from the beginning (it is even present in the initramfs). Due to the fact that during the early stages of the boot, there is no availability of a persistent storage, all the log were stored in a tmpfs filesystem, and only when a permanent storage is available all the information are flushed to the disk.
It must be noticed that other option are available, but this is the Fedora default.
It seems that during this flushing, all daemons which are trying to log something are blocked by journald.
My tests revealed that this flushing may requires a long time. This is due to the BTRFS COW filesystem nature which doesn't behave well with the log file structure of Journald.
However defragmenting the log file of Journald reduces the boot time.
# btrfs filesystemd defrag /var/log/journal/189323cd4cc348159b9fd5b32b566b05/system.journal
leaded a boot time reduction by 20 second on three different machines. Is an huge value, which suggested me to perform further tests. (note 189323cd4cc348159b9fd5b32b566b05 is the machine-id and it is likely different on each machine).
Of course I have to point out that this results was due to the bad interaction between BTRFS and the Journald log file. Other setup might lead to different results.

My tests

I took an old machine (a P4 2.5GHz with 512MB of ram) where was present a fresh installation of a Fedora 20 and I measured the boot time during several reboots (up to 70). The results were very impressive. I tested the following scenarios
  1. standard (without defragmenting any file, without readahead)
  2. defragment the journal file at the end of the boot
  3. defragment the journal file before the flushing
  4. mark as NOCOW the journald log file
  5. enable the systemd-readahead
  6. remove the fsync(2) call from journald
  7. remove the posix_fallocate(3) call from journald
  8. do a defrag when posix_fallocate(3) is called
  9. do a defrag when the journal log file is opened
Each batch of tests started with an empty log file. The time measured was the boot time as reported by systemd-analyze (the userspace time). On each chart was also reported the number of extents as reported by the filefrag command

1) Standard
This test was performed without any strategies to mitigate the performance problem. I repeated the reboot about 70 times, doing two defrag in the middle (at test #37 and #53). At the beginning the boot time was less than 20 seconds, then it increases up to 55-60 seconds. Defragmenting the journal file helped, reducing the boot time of about 20 seconds. What was impressive is the number of extents: at the end of test these were near 8000. The journald log file size was 64MB.

2) Defragment the journal at the end of the boot
The test was performed doing a defrag at the end of each boot (after 30 seconds). The number of reboots were 56. The boots times were always between 15s and 20s. The number of extents were between 2000 and 3000 extents.

3) Defragment the journal before flushing the data to the disk
The test was performed doing a defrag before flushing the data to the disk. The number of reboots were 60. The boots times were between 15s and 20s. The number of extents were between 2000 and 3000.

4) Mark NOCOW the journald log file
This test was performed marking the journal file NOCOW. This disable the COW behaviour increasing the speed. Unfortunately this also removes the checksum of the file. The boots times were a bit less than the previous tests. The extents were two orders of magnitudes lower (30 vs 3000).

5) Enable the systemd-readahead
All test were performed with systemd-readahead disabled. Then Kai Krakow [2] pointed me that the systemd readahead service is able to defrag. So I tested also this configuration. The result was very bad like the first one. The reason is that systemd readahead doesn't take care of the Journald log file for some reason. Further investigation is needed.
Update 2014-06-17: systemd-readahead doesn't consider file bigger than 10M; this is the reason because the system.journal is never defragmented.

6) Remove the fsync(2) system call from journald
I tweaked the source of systemd-journald removing the fsync(2) call. But after 36 reboot I didn't notice any improvement, with the exception of a smaller number of extents.

7) Remove the posix_fallocate(3) system call from journald

I tweaked the source of systemd-journald removing the posix_fallocate(3) call. But after 34 reboots I didn't notice any improvement: the boot time was up to 40s and the number of extents was greater than the previous test.

8) Do a defrag before calling the posix_fallocate(3) system call

I tweaked the source of systemd-journald, so before the posix_fallocate(3) call the file is defragmented. The chart shows the decrement of the number of extents when the posix_fallocate(3) is called. However the boot time reached a (bad) value of 50 seconds.

9) Do a defrag when the journal file is opened

I tweaked the source of systemd-journald in order to defrag the journal file each time it is opened. This test reached the same (good) result of the test #3; the boots times is between 15 and 30 seconds.


My tests confirmed the bad interaction between the Systemd log daemon with the BTRFS filesystem. The log file fragments quickly and the performance decrease (see test #1).
Doing a periodic defragmentation, the boots times don't increase too much (See test #2, #3 and #9), both if the defragmentation is performed before the journal flushing and if it is performed at the end of the boot.
Another good strategy is to mark the file NOCOW [3] (test #4); however it must be pointed out that also the checksumming protection is lost. This could be a limit in a multi-disk (RAID) btrfs filesystem scenario, because the checksum is used to discard a corrupted sector.
Systemd implements, in its readahead-* helpers, a defrag strategy which would alleviate this kind of problem. But I was not able to get it working properly (test #5). Further investigation is needed. However I discovered systemd-readahead ignores files greater than 10M. To understand that I had look at the code.
I also tried to change the source of systemd-journald removing the fsync() or the posix_fallocate() call, to verify if these are a cause of the problem. But the tests #6 and #7 seem to suggests that the problem is elsewhere.
In the last tests (#8 and #9) I tweaked the source in order to do the defrag from Journald daemon. Doing a defrag during the open of the journal file seems to have the same (good) results got in the test #3.

I decided to adopt a strategy like the test #2: I added a new job which defrags all the files greather than 10MB under /var each day. So I covered other cases were the files are higly fragmented. Below my .service and .timer systemd unit. Be aware that I am not a systemd expert, so I am open to suggestion on how improve these units.

# file defrag-var.service
Description=Defrag the /var subdirs

ExecStart=/bin/bash -c 'find /var/ -xdev -size +5M | xargs -l btrfs fi defrag'

# file /etc/systemd/system/defrag-var.timer 
Description=Defrag the /var subdirs




Nessun commento:

Posta un commento

My little patches...

Below a list of my patches spread on different projects: Linux kernel [all] 2018-02-01 iversion: Rename make inode_cmp_iversion{+raw}...