Apple recently released High Sierra, the new version of macOS. While it is billed as one of those “stability releases” with few user-facing changes, it introduces a new file system, APFS, which has already been rolled out on iOS devices. The file system is not something users interact with directly, but Apple lists a number of features (such as on-disk system snapshots) that are enabled by APFS.
Coincidentally, I have been working on a method to randomly sample records from files. High Sierra became available in the middle of the project, while I was repeatedly benchmarking my code. I will report the details of the method soon (watch this space!). To test a baseline method, I wrote a program (in C++, compiled with Apple’s llvm with -O3 optimization) that reads a line of the target file and then probabilistically decides whether to save it to an output file. The files can be text or binary. Given that the sampling is reasonably sparse, the execution is dominated by file reading operations. Binary files are read with the read() method of ifstream, while text files are processed with an overload of the getline() function. I then use the clock() function to time execution. I vary the number of records sampled, and perform 15 replicates to estimate execution time variability, which can be due to any number of factors. For example, since I execute the program on my laptop, other processes running at the same time can interfere by commandeering file I/O facilities.
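To make the baseline scheme concrete, here is a minimal sketch of what such a program could look like. It is not the actual benchmark code. The function names (sampleTextLines, sampleBinaryRecords, timeMs), the fixed record size for binary files, and the use of a single keep-probability p are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: read a text file line by line with getline() and copy
// each line to the output with probability p. Returns the number of lines kept.
std::size_t sampleTextLines(const std::string &inName, const std::string &outName,
                            double p, std::mt19937 &rng) {
    assert(p >= 0.0 && p <= 1.0);
    std::ifstream in(inName);
    std::ofstream out(outName);
    std::bernoulli_distribution keep(p);
    std::string line;
    std::size_t nKept = 0;
    while (std::getline(in, line)) {
        if (keep(rng)) {
            out << line << '\n';
            ++nKept;
        }
    }
    return nKept;
}

// Hypothetical helper: the same idea for a binary file, assuming fixed-size
// records that are read with ifstream::read(). Returns the number kept.
std::size_t sampleBinaryRecords(const std::string &inName, const std::string &outName,
                                std::size_t recordSize, double p, std::mt19937 &rng) {
    assert(recordSize > 0 && p >= 0.0 && p <= 1.0);
    std::ifstream in(inName, std::ios::binary);
    std::ofstream out(outName, std::ios::binary);
    std::bernoulli_distribution keep(p);
    std::vector<char> record(recordSize);
    std::size_t nKept = 0;
    // read() puts the stream in a failed state on a short read, ending the loop.
    while (in.read(record.data(), static_cast<std::streamsize>(recordSize))) {
        if (keep(rng)) {
            out.write(record.data(), static_cast<std::streamsize>(recordSize));
            ++nKept;
        }
    }
    return nKept;
}

// Time one call with clock() and return the elapsed processor time in
// milliseconds. Note that clock() reports CPU time, not wall-clock time.
template <typename F>
double timeMs(F &&run) {
    const std::clock_t start = std::clock();
    run();
    const std::clock_t end = std::clock();
    return 1000.0 * static_cast<double>(end - start) / CLOCKS_PER_SEC;
}
```

A replicate would then be one call such as timeMs([&] { sampleTextLines(inName, outName, p, rng); }), repeated 15 times for each sampling rate. Every record is still read regardless of p, which is why sparse sampling leaves the run time dominated by reads.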
I ran my program on my MacBook Pro 15-inch (mid-2015) laptop with an SSD. Re-running it recently, after I updated to High Sierra and APFS, I noticed about a two-fold speed-up on a binary input file. This is shown below in two plots: the one on the left was generated when I was still on Sierra and therefore HFS+, while the plot on the right was generated on the same computer after updating to High Sierra with APFS. The amount of free space on the drive was comparable, and the disk was encrypted with FileVault under both systems.
The x-axes indicate the number of samples taken from a file of fixed size. The y-axes are the time it took to perform each operation (in milliseconds). Note that the average time taken, as well as that for each sample size, was reduced by half after updating to APFS.
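For readers who want to reduce their own replicate timings to the kind of summary described here, the following is a small sketch. The names TimingSummary, summarize, and speedup are hypothetical, and this is not the code used to produce the plots.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the summary of one set of replicate timings.
struct TimingSummary {
    double mean;  // average time in milliseconds
    double sd;    // sample standard deviation (n - 1 denominator)
};

// Summarize replicate timings (in milliseconds) for one sample size.
TimingSummary summarize(const std::vector<double> &timesMs) {
    const std::size_t n = timesMs.size();
    if (n == 0) return {0.0, 0.0};
    double sum = 0.0;
    for (double t : timesMs) sum += t;
    const double mean = sum / static_cast<double>(n);
    if (n < 2) return {mean, 0.0};
    double ss = 0.0;
    for (double t : timesMs) ss += (t - mean) * (t - mean);
    return {mean, std::sqrt(ss / static_cast<double>(n - 1))};
}

// Ratio of mean times before and after an update; a value near 2 means
// the mean execution time was roughly halved.
double speedup(const std::vector<double> &before, const std::vector<double> &after) {
    const double afterMean = summarize(after).mean;
    assert(afterMean > 0.0);
    return summarize(before).mean / afterMean;
}
```

Calling speedup separately for each sample size shows whether the ratio is consistent across sampling rates or driven by only a few of them.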
Execution time for the same scheme on a text file did not decrease, however. The following pair of plots is organized the same way as before, and the y-axis values are the same before and after the update.
The outlier observations that pop up on the new OS are not related to the system update. They occurred (fairly inconsistently) under HFS+, too.
Note that all operations were done on an SSD. APFS apparently does not support spinning drives yet. I saw another set of file system benchmarks that also showed some speed-ups, but my results may provide a useful extra data point for people running file I/O-intensive applications. I was unable to find any other comparisons that separate binary and text file operations. I would love to hear about other users’ experiences. Please contact me with questions or comments. I will release the source code once I complete the project that was actually the point of this exercise.