So I’m working on a project that has a number of components to it. In particular the data is all stored in a MySQL database. For various reasons we wanted to convert this to a PostgreSQL database. So off I went working on a branch to make the changes and test the system. Meanwhile other parts of the system are changing as well, in particular the size of the input data. When it comes time to merge I get everything setup and then merge the changes in and all the tests pass, so I commit.
Then we notice that the nightly performance run on the continuous integration server is really slow, taking 2 hours instead of 15 minutes. We had noticed there being some slowness on loginsĀ before, but now the logins were slow and the software being tested was really slow. So we go about testing the CI server and find that openSUSE has been kind enough to keep track of the MAC address from the original system (we had installed on a different drive and chassis and then moved to this identical chassis). This caused the IPv6 link local address of this machine to match that of the previous system, which happened to be on the same network. This is bad, so we changed the settings back to the right MAC address and things were better, but still slow. So we decided to reinstall the OS, there are only a couple of directories of data to save, so no big deal.
After the reinstall, logins are faster, but the performance test is still slow. So we blame this on the changes to the input data. So over the next month or so we optimize the handling of the input data and find ways to reduce the input data some, but not back to the original size. The performance test is better, but still not back where it should be, we’re down to about 1.5 hours now.
Other tasks kept me busy, so I didn’t get back to this for another couple of weeks. At that point I’m running the performance test on another development machine and it’s running in 15 minutes! This is great! But it’s still running slow on the CI server. What’s different? So I start checking versions of software and all and everything matches up. What I do notice is differences in the drives. So I get another drive installed in the CI server so I can test different configurations of drives for the PostgreSQL data directory. Here are the results:
XFS (logbufs=8): ~4 hours to finish ext4: ~1 hour 50 minutes to finish ext3: 15 minutes to finish ext3 on LVM: 15 minutes to finish reiserfs: ~1 hour 50 minutes ext3 barrier=1: ~15 minutes ext4 nobarrier: ~15 minutes jfs: ~15 minutes
So as you can see, the filesystem really makes a difference. Turns out the development machine was using ext3 on LVM and the CI server was running ext4. After posting to the postgres-performance mailing list about this, it turns out that I either get speed or safety. With ext3, if the power goes out, I could have a corrupted database, with ext4 this isn’t likely to happen. Given that I’m doing research here and if the power goes out during a test we have a lot bigger problems, I switched to ext3 and left it at that.
Now if I had just changed one thing (the database), rather than the database, the input data and the CI server setup around the same time, I probably would have caught this much sooner. It also would help to have my development systems setup with not only the same software, but the same filesystems too.