Playing with Parallel I/O - Part 2

I'm back for Part 2 in my series on parallel I/O. A quick recap: in Part 1 we focused on copying a file in a single-threaded and a multithreaded program. Even with our worker thread model and bigger block sizes, the single thread won each round. Our two remaining bottlenecks appear to be context switching and disk seeking. Disk seeking is certainly the lowest-hanging fruit, so let's see what happens if we attempt to eliminate it.

The simplest way to reduce disk seeking is to spread the work across different disks/devices. So I wrote two test programs. The first creates a single thread that iterates over the devices, writing a file to each one. The second creates a thread for each device, with each thread writing a file to its own device. To save time, I'm starting with the worker thread model from Part 1 right from the beginning. For file size and block size, I'm using the same rules as Part 1: a 250MB file with 4K, 64K, and 256K blocks.
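To make the two setups concrete, here's a stripped-down sketch of roughly what they look like (this is not the actual test code). It assumes four placeholder mount points under /mnt, writes zero-filled blocks instead of copying a source file, and leaves out the worker-queue machinery from Part 1; the single-threaded driver and the thread-per-device driver are selected with a command-line flag.

/*
 * Simplified sketch of the two test drivers, not the original test code.
 * The mount points, file name, and sizes below are placeholders, and the
 * Part 1 worker-queue machinery and the read side are left out.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE   (256L * 1024 * 1024)  /* the ~250MB test file     */
#define BLOCK_SIZE  (64 * 1024)           /* 4K, 64K, or 256K per run */
#define NUM_DEVICES 4

/* One mount point per device; placeholder paths. */
static const char *devices[NUM_DEVICES] = {
    "/mnt/dev0", "/mnt/dev1", "/mnt/dev2", "/mnt/dev3"
};

/* Write one test file into the given directory, one block at a time. */
static void *write_file(void *arg)
{
    const char *dir = arg;
    char path[256];
    long remaining = FILE_SIZE;

    snprintf(path, sizeof(path), "%s/testfile", dir);
    char *block = calloc(1, BLOCK_SIZE);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (block == NULL || fd < 0) {
        perror(path);
        exit(1);
    }
    while (remaining > 0) {
        ssize_t n = write(fd, block, BLOCK_SIZE);
        if (n < 0) { perror("write"); exit(1); }
        remaining -= n;
    }
    close(fd);
    free(block);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tid[NUM_DEVICES];
    int i;

    if (argc > 1 && strcmp(argv[1], "single") == 0) {
        /* Test 1: one thread visits each device in turn. */
        for (i = 0; i < NUM_DEVICES; i++)
            write_file((void *)devices[i]);
    } else {
        /* Test 2: one thread per device, all writing at the same time. */
        for (i = 0; i < NUM_DEVICES; i++)
            pthread_create(&tid[i], NULL, write_file, (void *)devices[i]);
        for (i = 0; i < NUM_DEVICES; i++)
            pthread_join(tid[i], NULL);
    }
    return 0;
}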

Now that the tests were ready to go, I started setting up my test environment. I decided four was a reasonable number of devices for testing, so on my VM I added 3 additional virtual disks and mounted each one to a unique directory. Before starting my 100-iteration test, I did a quick single-iteration run to verify the setup. As soon as I hit enter I realized this wasn't going to work, and sure enough the single thread finished first. Even though my virtual machine is writing to separate virtual disks, those disks all live on the same physical disk in my host machine. There is a positive to this mishap: it let me satisfy an additional test case. After I finished Part 1, I was curious whether writing multiple files to the same disk would show a multithreaded benefit even though the single-file test didn't. Here are the results.

4KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        4096        65536   1109
4        4096        65536   3938

64KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        65536       4096    813
4        65536       4096    4064

256KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        262144      1024    805
4        262144      1024    4921

So even having each thread write its own file is still slower than one thread writing all the files. What if we increase the number of files and/or threads? For this test I want to keep the work even across the devices, so the thread count needs to be a multiple of 4, since that is my device count. With 8 threads and 12 files per device, each device gets 2 threads, each writing 6 files to that device. With 12 threads and 12 files per device, that becomes 3 threads per device, each writing 4 files. The sketch below shows how the work splits up; the results follow.
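To sanity-check that arithmetic, here's a tiny sketch of the split, again assuming 4 devices; the thread and file counts are just illustrative, and it only prints the assignments rather than doing any I/O.

#include <stdio.h>

#define NUM_DEVICES 4

int main(void)
{
    int nthreads = 8;          /* must be a multiple of NUM_DEVICES */
    int files_per_device = 12; /* files written to each device      */

    int threads_per_device = nthreads / NUM_DEVICES;
    int files_per_thread   = files_per_device / threads_per_device;

    for (int t = 0; t < nthreads; t++) {
        int device = t / threads_per_device;   /* which mount point */
        printf("thread %2d -> device %d, writes %d files\n",
               t, device, files_per_thread);
    }
    return 0;
}

With 8 threads this prints 2 threads per device at 6 files apiece; with 12 threads it works out to 3 threads per device at 4 files apiece, matching the description above.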

4KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        4096        65536   1436
4        4096        65536   24411
8        4096        65536   76599
12       4096        65536   82198

64KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        65536       4096    9432
4        65536       4096    25164
8        65536       4096    72946
12       65536       4096    83906

256KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        262144      1024    9415
4        262144      1024    2114
8        262144      1024    77281
12       262144      1024    86382

I think it's now safe to conclude that when using a single disk, it's going to be very difficult to improve performance with a multithreaded I/O approach.

I now have a test environment that uses 3 physical machines, and I'm still using 4 total devices. On the actual test machine, I'm using the local disk and a mounted tmpfs instance. tmpfs is a filesystem that lives entirely in RAM, which lets me simulate 2 devices on the same machine with only 1 physical disk. The 2 remaining devices are an SMB share and an NFS share, each backed by one of the remaining physical machines. I think we are now ready to run some tests. Just a heads up: since we are introducing network latency with SMB and NFS, our times are going to increase significantly. Luckily our focus is on the comparison between thread counts. For our first test, we will copy 1 file to each device using 1 thread and then 4 threads.
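As an aside, the Time(ms) values throughout these tables are presumably the elapsed wall-clock time for a whole run; a minimal sketch of how that kind of measurement can be taken is below, where run_test() is a hypothetical stand-in for whichever driver is being timed, not part of the actual tests.

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the driver being measured. */
static void run_test(void)
{
    struct timespec pause = { 0, 50 * 1000 * 1000 };  /* pretend work: 50ms */
    nanosleep(&pause, NULL);
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    run_test();
    clock_gettime(CLOCK_MONOTONIC, &end);

    long ms = (end.tv_sec - start.tv_sec) * 1000
            + (end.tv_nsec - start.tv_nsec) / 1000000;
    printf("elapsed: %ld ms\n", ms);
    return 0;
}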

4KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        4096        65536   52812
4        4096        65536   46763

64KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        65536       4096    53082
4        65536       4096    47035

256KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        262144      1024    53389
4        262144      1024    47351

Finally! The multithreaded test was about 12% faster. Let's see how it scales. I'll increase the number of files to 12 and try again with 1, 4, 8, and 12 threads.

4KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        4096        65536   631845
4        4096        65536   552663
8        4096        65536   551202
12       4096        65536   551391

64KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        65536       4096    636366
4        65536       4096    559615
8        65536       4096    551014
12       65536       4096    550098

256KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        262144      1024    632776
4        262144      1024    552279
8        262144      1024    550813
12       262144      1024    551547

So again, the multithreaded tests were about 12% faster in each case, which came out to roughly a minute per test. It's worth mentioning that the total data set for this test was 3GB. What if it was 30GB or 300GB? If that 12% scales linearly, the time saved becomes massive. Also note that the times kept dropping as threads were added beyond four; the improvement is nowhere near as significant as the jump from one thread to many, but it's enough to make the additional thread management worth contemplating. I wish I had quick access to a physical machine with four local disks, because I suspect the reason we still see gains beyond 4 threads is that the network latency is hiding the disk seek and context-switching time. Think Big O.

So what have we learned from these tests? It really comes down to what you are trying to accomplish. If you are handling I/O across multiple disks and/or machines, it's worth spending the additional time to implement a multithreaded approach. If you are writing to a single local drive, stick to the KISS method.

Want to Learn More?

This is just a sample of what we can do. We have 15 years of experience working in nearly every technology and industry. Whatever you are doing, we've done it and are prepared to tackle your project. Reach out and we will discuss it with you.