Scientific computing increasingly involves handling large amounts of raw data, and this is particularly true of neuroimaging. With increases in processing speed outpacing storage speed, disk I/O has become a limiting factor in many computing applications, especially on multi-user systems.

For my consulting work with GreenAnt Networks, I was asked to build a high-speed disk array for scientific computing and virtual machine storage. This is a brief build log and performance test of the outcome.

Parts

  • Intel Server System R2224WFTZS
    • S2600WFT motherboard
    • 24x 2.5” bays with NVMe support
    • 2x 8-port NVMe adapters (AXXP3SWX08080), PCIe x8
  • 2x Intel Xeon Scalable Silver 4114 (10-core) CPUs
  • 12x 32GB (384GB) ECC DDR4 RAM - 2400MHz
  • 8x Micron 9100 2.5” (U.2) NVMe SSD 1.2TB (MTFDHAL1T2MCF)

(Photo: Micron 9100 U.2 NVMe SSD)

Configuration

To maximise utilisation of the drives, they were installed across the two NVMe controllers (four drives per controller). The drive array needs to be reliable as well as fast, so the ZFS filesystem was used. ZFS is a modern filesystem that originated on the Solaris operating system and has since been ported to Linux and FreeBSD as OpenZFS. It offers built-in checksumming of data for consistency and, being a copy-on-write filesystem, it makes snapshotting the filesystem very simple. If you’re interested in reading more about ZFS, Aaron Toponce’s articles are a great starting point.
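As a quick illustration of how lightweight snapshots are, the commands below show the general idea; the dataset name is a placeholder rather than the layout used on this server.

# take a point-in-time snapshot of a dataset (near-instant thanks to copy-on-write)
zfs snapshot zpool1/data@before-analysis

# list existing snapshots, and roll back if something goes wrong
zfs list -t snapshot
zfs rollback zpool1/data@before-analysis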

Software

ZFS configuration

The 8 SSDs were aggregated into a single pool with RAIDZ2 redundancy. This means that any two drives in the pool can fail and the pool will continue to work normally.
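The pool was created along these lines; this is a sketch rather than the exact command, with the drive serial numbers elided in the same way as in the status output. The ashift=13 setting forces 8K sector alignment, which is where the ashift=13 label in the benchmark headings below comes from.

# sketch only: create a RAIDZ2 pool from the NVMe drives, referenced by id
# (eight /dev/disk/by-id paths in total on the real command line)
zpool create -o ashift=13 zpool1 raidz2 \
    /dev/disk/by-id/nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx \
    /dev/disk/by-id/nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx

Running zpool status then reports the layout: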

  pool: zpool1
 state: ONLINE
  scan: none requested
config:

	NAME                                           STATE     READ WRITE CKSUM
	zpool1                                         ONLINE       0     0     0
	  raidz2-0                                     ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
	    nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0

Running in a parity-based redundancy mode like this has a cost: parity has to be computed and written alongside the data, spread across all of the drives, so write speeds are not as good as they would be with striped mirrors. In our case the individual drives are very fast, so a bit of a loss in write speed is an acceptable trade for the extra resilience.
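For comparison, the striped-mirror layout mentioned above would have looked roughly like this; the pool and device names here are hypothetical, and this is not the layout we used.

# hypothetical alternative: four two-way mirrors striped together
# (faster small writes, but only half the raw capacity is usable,
#  roughly 4.8TB versus about 7.2TB for RAIDZ2, and each mirror pair
#  only tolerates a single drive failure)
zpool create -o ashift=13 fastpool \
    mirror nvme0n1 nvme1n1 \
    mirror nvme2n1 nvme3n1 \
    mirror nvme4n1 nvme5n1 \
    mirror nvme6n1 nvme7n1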

Speed Testing

Now the interesting part: I used iozone to test the array’s speed. The way ZFS works means that it relies heavily upon RAM and caching to speed up disk access, so to get an accurate estimate of the underlying disk speed we need to starve the server of memory and run tests larger than the available cache. There are a few ways to do this, but I chose a relatively crude method: creating a ramdisk and filling it with blank data. This left only 8GB of RAM free, and we ran tests using 16GB of test data.
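The ramdisk trick was along these lines; the mount point and fill size here are illustrative rather than the exact values used.

# create a tmpfs ramdisk sized to leave only ~8GB of RAM free,
# then fill it so the pages are actually allocated
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=376G tmpfs /mnt/ramdisk
dd if=/dev/zero of=/mnt/ramdisk/fill bs=1M    # dd stops when the tmpfs is full

Unmounting the ramdisk afterwards (umount /mnt/ramdisk) releases the memory again.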

When testing storage, it’s important to also test how the array performs with multiple processes hitting it at once, so we ran iozone with both a single worker and 20 workers. We also tried two block sizes, 16k and 256k. The commands used for testing were:

iozone -i 0 -i 1 -t 1 -s16g -r256k  -t 20
iozone -i 0 -i 1 -t 1 -s16g -r16k  -t 20
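For anyone unfamiliar with iozone, the flags break down as follows; the single-thread numbers in the tables below come from equivalent runs with a single worker.

# -i 0     include the write/rewrite test
# -i 1     include the read/re-read test
# -s 16g   size of the test file for each worker
# -r 256k  record (block) size used for the I/O
# -t 20    throughput mode with 20 parallel workers
iozone -i 0 -i 1 -s 16g -r 256k -t 20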

The results can vary a bit from run to run, but here are some representative numbers:

Memory Constrained

RAIDZ2 ashift=13 (1 / 20 thread, 16k blocks)

test      1 thread     20 threads
write     1.22GB/sec   5.25GB/sec
rewrite   1.27GB/sec   3.73GB/sec
read      2.82GB/sec   17.6GB/sec
reread    2.97GB/sec   17.8GB/sec

RAIDZ2 ashift=13 (1 / 20 thread, 256k blocks)

test      1 thread     20 threads
write     2.48GB/sec   8.90GB/sec
rewrite   2.84GB/sec   6.46GB/sec
read      4.42GB/sec   16.6GB/sec
reread    4.59GB/sec   17.2GB/sec

Not Memory Constrained

RAIDZ2 ashift=13 (1 / 20 thread, 16k blocks)

test      1 thread     20 threads
write     1.09GB/sec   5.41GB/sec
rewrite   1.25GB/sec   5.45GB/sec
read      2.76GB/sec   22.1GB/sec
reread    2.78GB/sec   22.3GB/sec

RAIDZ2 ashift=13 (1 / 20 thread, 256k blocks)

test      1 thread     20 threads
write     2.81GB/sec   9.79GB/sec
rewrite   2.91GB/sec   9.44GB/sec
read      3.86GB/sec   39.4GB/sec
reread    3.63GB/sec   39.0GB/sec

Conclusion

I was very impressed with the speed of the array, especially the high read speeds when memory constrained. At around 17GB/sec, these speeds are close to the theoretical maximum throughput of the drives in parallel, suggesting that we’re not losing much speed to PCIe bandwidth limits or processor overhead.

It’s also interesting to see just how much of an effect RAM caching can have, effectively doubling the read speeds in the larger 256k block test.
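If you want to see how much the ARC (ZFS’s RAM cache) is contributing on a ZFS-on-Linux system, the kernel module exposes its statistics under /proc; the one-liner below is just one way of pulling out the headline numbers.

# overall ARC size and hit/miss counters
grep -wE 'size|hits|misses' /proc/spl/kstat/zfs/arcstats

On many systems the arc_summary tool gives a more readable report of the same counters.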

I’ll report back with some more results once we test the array under load.