Scientific computing increasingly involves handling large amounts of raw data, and this is particularly true for neuroimaging. With increases in processing speed outpacing storage speed, disk I/O has become a limiting factor in many computing applications, especially on multi-user systems.

For my consulting work with GreenAnt Networks, I was asked to build a high-speed disk array for scientific computing and virtual machine storage. This is a brief build log and performance test of the outcome.

Parts

  • Intel Server System R2224WFTZS
    • S2600WFT motherboard
    • 24x 2.5” bays with NVMe support
    • 2x 8-port NVMe adapters (AXXP3SWX08080), PCIe x8
  • 2x Intel Xeon Scalable Silver 4114 (10-core) CPUs
  • 12x 32GB (384GB total) ECC DDR4 RAM, 2400MHz
  • 8x Micron 9100 2.5” (U.2) NVMe SSD 1.2TB (MTFDHAL1T2MCF)

[Image: Micron 9100 U.2 NVMe SSD]

Configuration

To maximise utilisation of the drives, they were installed across the two NVMe controllers (4 drives per controller). The drive array needs to be reliable as well as fast, so the ZFS filesystem was used. ZFS is a modern filesystem that originated on the Solaris operating system but is now ported to and actively developed on Linux and FreeBSD. It offers in-built checksumming of data for consistency and, being a copy-on-write filesystem, it makes snapshotting the filesystem very simple. If you’re interested in reading more about ZFS, Aaron Toponce’s articles are a great starting point.
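
As a quick illustration of how lightweight snapshots are to work with (the dataset name here is hypothetical):

	# a snapshot is created almost instantly and, being copy-on-write, initially uses no extra space
	zfs snapshot zpool1/data@before-upgrade
	# list existing snapshots
	zfs list -t snapshot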

Software

ZFS configuration

The 8 SSDs were aggregated into a single pool, and the pool was configured with the RAIDZ2 redundancy mode. This means that any two drives in the pool can fail and it will continue to work normally.

  pool: zpool1
 state: ONLINE
  scan: none requested
config:

	NAME                                           STATE     READ WRITE CKSUM
	zpool1                                         ONLINE       0     0     0
	  raidz2-0                                     ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0

There are some costs to running in a redundant mode like this: parity has to be calculated and written across all of the drives, so write speeds are not as good as with mirroring. In our case the individual drives are very fast, so a modest loss in write speed is acceptable.
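
For reference, a pool like the one shown above can be created with something along these lines (the device paths are placeholders for the drive IDs truncated above; ashift=13, i.e. 8K sectors, matches the value shown in the test headings below):

	# sketch only: RAIDZ2 across the eight NVMe drives, forcing 8K sectors
	zpool create -o ashift=13 zpool1 raidz2 \
	    /dev/disk/by-id/nvme-MTFDHAL1T2MCF-1AN1ZABYY_<serial1> \
	    /dev/disk/by-id/nvme-MTFDHAL1T2MCF-1AN1ZABYY_<serial2> \
	    ...
	    /dev/disk/by-id/nvme-MTFDHAL1T2MCF-1AN1ZABYY_<serial8>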

Speed Testing

Now for the interesting part: I used iozone to test the array’s speed. The way ZFS works means that it relies heavily upon RAM and caching to speed up disk access, so to get an accurate estimate of the disks’ own speed we need to starve the server of memory and run tests larger than the available cache. There are a few ways to do this, but I chose a relatively crude method of creating a ramdisk and filling it with blank data. This left only 8GB of RAM free, and we ran tests using 16GB of test data.
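
A rough sketch of the memory-starving step, assuming a tmpfs ramdisk (the sizes here are illustrative rather than the exact values used):

	# reserve most of the RAM in a tmpfs so the ZFS ARC cannot cache the whole test set
	mkdir -p /mnt/ramdisk
	mount -t tmpfs -o size=370G tmpfs /mnt/ramdisk
	# fill it with blank data until only ~8GB of RAM remains free
	dd if=/dev/zero of=/mnt/ramdisk/filler bs=1M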

When testing storage, it’s important to also test how the array performs with multiple processes hitting it at once, so iozone was run with a single worker and with 20 workers. We also tried two record sizes, 16k and 256k. The commands used for the 20-worker tests were:

iozone -i 0 -i 1 -s 16g -r 256k -t 20
iozone -i 0 -i 1 -s 16g -r 16k -t 20

The tests can vary a bit from run to run, but here are some representative numbers:

Memory Constrained

RAIDZ2 ashift=13 (1 / 20 thread, 16k blocks)

test       1 thread      20 threads
write      1.22GB/sec    5.25GB/sec
rewrite    1.27GB/sec    3.73GB/sec
read       2.82GB/sec    17.6GB/sec
reread     2.97GB/sec    17.8GB/sec

RAIDZ2 ashift=13 (1 / 20 thread, 256k blocks)

test       1 thread      20 threads
write      2.48GB/sec    8.90GB/sec
rewrite    2.84GB/sec    6.46GB/sec
read       4.42GB/sec    16.6GB/sec
reread     4.59GB/sec    17.2GB/sec

Not Memory Constrained

RAIDZ2 ashift=13 (1 / 20 thread, 16k blocks)

test       1 thread      20 threads
write      1.09GB/sec    5.41GB/sec
rewrite    1.25GB/sec    5.45GB/sec
read       2.76GB/sec    22.1GB/sec
reread     2.78GB/sec    22.3GB/sec

RAIDZ2 ashift=13 (1 / 20 thread, 256k blocks)

test       1 thread      20 threads
write      2.81GB/sec    9.79GB/sec
rewrite    2.91GB/sec    9.44GB/sec
read       3.86GB/sec    39.4GB/sec
reread     3.63GB/sec    39.0GB/sec

Conclusion

I was very impressed with the speed of the array, especially the high read speeds when memory constrained. At around 17GB/sec, these speeds are close to the theoretical maximum throughput of the drives in parallel, suggesting that we’re not losing much to PCIe bandwidth limits or processor overhead.

It’s also interesting to see just how much of an effect RAM caching can have, effectively doubling the read speeds for our larger 256k block test.

I’ll report back with some more results once we test the array under load.


Follow-up

Well, we’ve been running the array for 10 months now and there have been a few questions about it, so here’s a bit of follow-up.

The array has been running wonderfully in a production server with 20 KVM VMs using it for storage. We’ve upgraded ZFS twice over that period without issue. It’s nice to have the server not constrained by I/O at all (in fact it’s probably overkill for anything we’ve thrown at it so far). We ended up using the 8 NVMe drives in a raidz2 array as it was fast enough and gave good redundancy across the 8 drives (it can survive 2 drive failures). Most of the time we’re using layered filesystems on top of ZFS (ext4 on top of zvols and 9pfs on top of ZFS datasets), which taxes the drives even more. Each VM consistently gets over 1GB/sec on reads and writes even with the filesystem overheads. The really impressive thing about the array is how the performance scales by thread: it scales almost linearly up to 20 simultaneous threads hitting the array.
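
As an example of the zvol layering mentioned above (the dataset name and size are hypothetical):

	# create a 100GB block device backed by the pool, then format it for a VM
	zfs create -V 100G zpool1/vm-disk1
	mkfs.ext4 /dev/zvol/zpool1/vm-disk1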

One thing to consider with this sort of array is that to get the most out of it, you need to preserve as much bandwidth as possible to the drives, which means you need lots of PCIe lanes. In our setup, we’re using an Intel dual-Xeon server with Intel NVMe switches to get a full four PCIe lanes (x4) to each drive. We went with the Intel server as it’s a reference server, has lots of 2.5” slots and doesn’t have any silly vendor lock-in (beware of Dell/HP etc., as they have chips on their motherboards that will only accept Dell/HP labelled drives, for which they charge ridiculous amounts!). I was disappointed with the Meltdown/Spectre debacle, which Intel clearly knew about while they were selling us the server; it does have an impact on maximum I/O when the patches are applied.
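
If you want to check that each drive has actually negotiated a full x4 link, something along these lines works (the PCI address is a placeholder for whichever controller you want to inspect):

	# find the NVMe controllers, then check the negotiated link width of one of them
	lspci | grep -i "non-volatile"
	lspci -s 5e:00.0 -vv | grep LnkSta
	# expect something like "Speed 8GT/s, Width x4"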

What you definitely need if you’re going to use ZFS are the following:

  • You must have ECC RAM (don’t believe the uninformed people on the web who say you don’t need it; ECC RAM is essential for ZFS because the checksumming happens in RAM, and uncaught memory bit flips will corrupt data, and they happen often enough to be an issue). This means the motherboard and CPU also need ECC support.
  • Lots of RAM (1GB per TB is considered the absolute minimum but 5GB/TB is better). The ZFS ARC cache can also be capped if RAM needs to be shared with VMs - see the sketch after this list.
  • SSD drives that will survive heavy writes and also have onboard capacitors to flush their buffers in case of power failure
    • it’s possible to mitigate some of the cost by using cheaper SSD drives and then having an enterprise-level drive (or two) for the ZIL
  • for sustained write performance, the enterprise drives are much better (but about 4x more expensive than consumer-level drives)
  • an OS that supports ZFS well (we use ZFS on Linux on Debian 9 without systemd and on Devuan ASCII, but FreeBSD would also be fine)
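
On ZFS on Linux, the ARC cap mentioned above is controlled by a module parameter; a minimal sketch (the 64GB value is just an example):

	# /etc/modprobe.d/zfs.conf - cap the ARC at 64GB (value is in bytes)
	options zfs zfs_arc_max=68719476736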

One interesting option for a lower-price setup is the new AMD chips. They all support ECC RAM and actually have more PCIe lanes per CPU than Intel. I haven’t tested these setups myself, but they should work, and they allow the use of non-enterprise CPUs, which is a huge cost saving. ZFS is not particularly demanding on the CPU unless you use deduplication (don’t!) or encryption (not in the stable release yet). In general, using ZFS compression is recommended because it has low overhead (we’re using zfs set compression=lz4 <zpool>).
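
For example (pool name as used above; child datasets inherit the property):

	# enable lz4 compression pool-wide and check how much it is actually saving
	zfs set compression=lz4 zpool1
	zfs get compression,compressratio zpool1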

Also, if you’re really pushing drives, thermal throttling becomes an issue, which is why it’s probably better to go with the 2.5” U.2 form factor rather than M.2; U.2 drives are much easier to cool in a proper enclosure. ZFS performance also drops as a pool fills up, so keeping the array less than 70% full helps.
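
Both of these are easy to keep an eye on (the device name is a placeholder):

	# the CAP column shows how full the pool is
	zpool list zpool1
	# drive temperature via nvme-cli
	nvme smart-log /dev/nvme0 | grep -i temperature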

We’ve also been running a secondary array of spinning-rust drives and have been really impressed with it too. It’s made up of 4x HGST C10K1800 HUC101818CS4200 1.8TB 10K SAS 2.5” drives with 2x Toshiba 512GB M.2 2280 PCIe NVMe SSDs (THNSN5512GPUK) for the log/cache. The drives are connected to an LSI 9300-8i SAS controller in IT mode (not using the controller’s RAID features, as we want ZFS to have direct drive access).

This array gives a sustained 600MB/sec for reads and writes and could be pushed a lot further with more drives (and they’re much cheaper). You’ll notice in the HDD array listed below that we’re using the same SSDs for both the logs and the cache. This works fine, but make sure that the SSD is an enterprise one with onboard capacitors (it’s also recommended to have a mirrored pair for the logs in case one fails).

	NAME                            STATE     READ WRITE CKSUM
	zpool2                          ONLINE       0     0     0
	  raidz1-0                      ONLINE       0     0     0
	    scsi-C10K1800               ONLINE       0     0     0
	    scsi-C10K1800               ONLINE       0     0     0
	    scsi-C10K1800               ONLINE       0     0     0
	    scsi-C10K1800               ONLINE       0     0     0
	logs
	  mirror-1                      ONLINE       0     0     0
	    scsi-THNSN5512GPUK-part1    ONLINE       0     0     0
	    scsi-THNSN5512GPUK-part1    ONLINE       0     0     0
	cache
	  scsi-THNSN5512GPUK-part2      ONLINE       0     0     0
	  scsi-THNSN5512GPUK-part2      ONLINE       0     0     0
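
A layout like the one above can be assembled roughly as follows (the device paths are placeholders):

	# sketch only: raidz1 across the four SAS drives, then a mirrored SLOG and a
	# striped L2ARC on partitions of the two NVMe SSDs
	zpool create zpool2 raidz1 <sas1> <sas2> <sas3> <sas4>
	zpool add zpool2 log mirror <nvme1-part1> <nvme2-part1>
	zpool add zpool2 cache <nvme1-part2> <nvme2-part2>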

ZFS has so much to offer as a filesystem. It allows us to run frequent snapshots of data with almost no overhead and to easily replicate them across servers. We’re using sanoid and syncoid to manage those tasks.
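
A minimal sketch of that setup (the dataset and host names are hypothetical; see the sanoid documentation for the full set of template options):

	# /etc/sanoid/sanoid.conf - keep rolling hourly/daily/monthly snapshots of a dataset
	[zpool1/vms]
	        use_template = production

	[template_production]
	        hourly = 36
	        daily = 30
	        monthly = 3
	        autosnap = yes
	        autoprune = yes

	# syncoid then replicates the dataset and its snapshots to another server
	syncoid zpool1/vms root@backup-host:backup/vms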

As a side note, we’re using KVM with the p9fs (9p over virtio) virtualised filesystem, which allows sharing of filesystems between virtual machines. It took a while to find optimal parameters for mounting the filesystems.
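
For reference, on the host the share is defined in the libvirt domain XML with something like the following (the source directory is a placeholder; the target tag is what appears as the device name, “filesystem”, in the fstab lines below):

	<filesystem type='mount' accessmode='passthrough'>
	  <source dir='/zpool1/shared'/>
	  <target dir='filesystem'/>
	</filesystem>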

It’s important to pick the correct filesystem caching mode for your purposes. I have tried a few of the available modes and found that the best caching strategy is very dependent on the load on the virtual machine.

For instance, if you have a very read-heavy environment that is walking the file tree, then cache=fscache can speed up p9fs access by about 20x. I tend to use fscache for virtual machines which make heavy use of caching (PHP CMS setups are a good example). However, beware that fscache can get confused if files are modified outside that VM, as it will serve from the local cache without realising a file has changed.

filesystem        /mnt/filesystem/  9p  trans=virtio,msize=16384,cache=fscache  0  0

In some cases (especially when writes have happened from a non-local machine), it may be necessary to force a filesystem cache flush. This can be accomplished by the following command:

echo 3 > /proc/sys/vm/drop_caches

Caution: with further testing, I have found that the fscache option can also lead to some instabilities under heavy load, so I tend to use it less now.

Another option, which is safer than fscache, is cache=mmap. This results in a significant speed increase and also improves compatibility with the memory mapping required by applications like Redis and MySQL.

filesystem        /mnt/filesystem/  9p  trans=virtio,msize=16384,cache=mmap  0  0

Tuning the msize is also important; keep it close to the size of the files that are most frequently accessed (16384 bytes, i.e. 16k, in the examples above).

p9fs caching has some trade-offs; here’s an outline of the different modes on the client:

cache=mode specifies a caching policy. By default, no caches are used.

  • none = default: no cache policy; metadata and data alike are synchronous.
  • loose = no attempts are made at consistency; intended for exclusive, read-only mounts.
  • fscache = use FS-Cache for a persistent, read-only cache backend.
  • mmap = minimal cache that is only used for read-write mmap. Nothing else is cached, as with cache=none.