Scientific computing increasingly involves handling large amounts of raw data; this is particularly true of neuroimaging. With increases in processing speed outpacing improvements in storage speed, disk I/O has become a limiting factor in many computing applications, especially on multi-user systems.

For my consulting work with GreenAnt Networks, I was asked to build a high-speed disk array for scientific computing and virtual machine storage. This is a brief build log and performance test of the outcome.

# Parts

• Intel Server System R2224WFTZS
• S2600WFT motherboard
• 24x 2.5” bays with NVMe support
• 2x 8-port NVMe adapters (AXXP3SWX08080), PCIe x8
• 2x Intel Xeon Silver 4114 CPUs (10 cores each)
• 12x 32GB ECC DDR4-2400 RAM (384GB total)
• 8x Micron 9100 2.5” (U.2) NVMe SSD 1.2TB (MTFDHAL1T2MCF)

# Configuration

To maximise utilisation of the drives, they were split across the two NVMe controllers (4 drives per controller). The array needs to be reliable as well as fast, so the ZFS filesystem was used. ZFS is a modern filesystem originally developed for the Solaris operating system, but now largely rewritten for Linux and FreeBSD. It offers built-in checksumming of data for consistency and, being a copy-on-write filesystem, it makes snapshotting very simple. If you’re interested in reading more about ZFS, Aaron Toponce’s articles are a great starting point.
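For example, copy-on-write makes snapshots a one-liner. A quick sketch (the `zpool1/data` dataset name is hypothetical):

```shell
# Take a near-instant snapshot; copy-on-write means it consumes
# extra space only as blocks are subsequently changed.
zfs snapshot zpool1/data@pre-analysis

# List snapshots, and roll back if something goes wrong.
zfs list -t snapshot
zfs rollback zpool1/data@pre-analysis
```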

## Software

• Debian 9.1 (Stretch)
• ZFS on Linux 0.7.6 (compiled from source)

## ZFS configuration

The 8 SSDs were aggregated into a single pool configured with RAIDZ2 redundancy. This means that any two drives in the pool can fail and it will continue to work normally.
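Creating such a pool is a single command. A sketch, assuming the drives’ persistent /dev/disk/by-id names (serials masked, as in the status output below; ashift=13 matches the results reported later):

```shell
# Create a RAIDZ2 pool across all eight U.2 drives, addressed by their
# persistent /dev/disk/by-id names (serials masked here; in reality each
# path carries a distinct serial). ashift=13 sets the sector alignment
# to 2^13 = 8192 bytes.
zpool create -o ashift=13 zpool1 raidz2 \
    /dev/disk/by-id/nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx \
    /dev/disk/by-id/nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx
# ...plus the six remaining drives' by-id paths on the command line.
```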

      pool: zpool1
     state: ONLINE
      scan: none requested
    config:

        NAME                                           STATE     READ WRITE CKSUM
        zpool1                                         ONLINE       0     0     0
          raidz2-0                                     ONLINE       0     0     0
            nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
            nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
            nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
            nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
            nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
            nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
            nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0
            nvme-MTFDHAL1T2MCF-1AN1ZABYY_P606060xxxxx  ONLINE       0     0     0


Redundancy like this has a cost: every block must be written across all the drives along with its parity, so write speeds are not as good as mirroring. In our case, the individual drives are fast enough that some loss in write speed is acceptable.
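Redundancy also costs capacity: two drives’ worth of space in the pool holds parity. A quick back-of-envelope check for this array:

```shell
# 8 drives of 1.2TB each in RAIDZ2: two drives' worth of parity,
# leaving roughly six drives' worth of usable space (before ZFS overhead).
awk 'BEGIN { drives = 8; parity = 2; tb = 1.2
             printf "%.1f TB usable\n", (drives - parity) * tb }'
# → 7.2 TB usable
```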

# Speed Testing

Now for the interesting part: I used iozone to test the array’s speed. The way ZFS works means it relies heavily upon RAM and caching to speed up disk access. Therefore, to get an accurate estimate of speed we need to starve the server of memory and run tests larger than the available cache. There are a few ways to do this, but I chose a relatively crude method: creating a ramdisk and filling it with blank data. This left only 8GB of RAM free, and we ran tests using 16GB of test data.
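One way to do the starvation is with tmpfs; a sketch, where the mount point and size are illustrative for this 384GB machine:

```shell
# Mount a tmpfs sized to consume most of RAM, then fill it with
# blank data so the pages are actually allocated.
mkdir -p /mnt/ramfill
mount -t tmpfs -o size=376g tmpfs /mnt/ramfill
dd if=/dev/zero of=/mnt/ramfill/fill bs=1M   # runs until the tmpfs is full

# ...run the iozone tests while memory is scarce...

# Release the memory afterwards.
umount /mnt/ramfill
```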

When testing storage, it’s important to also test how the array performs with multiple processes hitting it at once. So we ran iozone with 20 workers. We also tried two block sizes, 16k and 256k. The commands used for testing were:

    iozone -i 0 -i 1 -s 16g -r 256k -t 1
    iozone -i 0 -i 1 -s 16g -r 256k -t 20
    iozone -i 0 -i 1 -s 16g -r 16k -t 1
    iozone -i 0 -i 1 -s 16g -r 16k -t 20

The numbers can vary a bit from run to run, but here are some representative results:

### Memory Constrained

RAIDZ2 ashift=13, 16k blocks:

|         | 1 thread  | 20 threads |
|---------|-----------|------------|
| write   | 1.22GB/s  | 5.25GB/s   |
| rewrite | 1.27GB/s  | 3.73GB/s   |

RAIDZ2 ashift=13, 256k blocks:

|         | 1 thread  | 20 threads |
|---------|-----------|------------|
| write   | 2.48GB/s  | 8.90GB/s   |
| rewrite | 2.84GB/s  | 6.46GB/s   |

### Not RAM Constrained

RAIDZ2 ashift=13, 16k blocks:

|         | 1 thread  | 20 threads |
|---------|-----------|------------|
| write   | 1.09GB/s  | 5.41GB/s   |
| rewrite | 1.25GB/s  | 5.45GB/s   |

RAIDZ2 ashift=13, 256k blocks: