Playing with Parallel I/O - Part 2

I'm back for Part 2 in my series on parallel I/O. A quick recap: in Part 1 we focused on copying a file in a single-threaded and a multithreaded program. Even with our worker thread model and bigger block sizes, the single thread won each round. Our two remaining bottlenecks appear to be context switching and disk seeking. Disk seeking is certainly the lowest-hanging fruit, so let's see what happens if we attempt to eliminate it.

The simplest way to reduce disk seeking is to spread the work across different disks/devices. So I wrote two test programs. The first creates a single thread that iterates over the devices, writing a file to each one. The second creates a thread for each device, with each thread writing a file to its own device. To save time, I'm starting with the worker thread model from Part 1 right from the beginning. For file size and block size, I'm using the same rules as Part 1: a 250MB file with 4K, 64K, and 256K blocks.
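To make the two setups concrete, here's a stripped-down sketch of roughly what they look like (this is not the actual test code). It assumes four placeholder mount points under /mnt, writes zero-filled blocks instead of copying a source file, and leaves out the worker-queue machinery from Part 1; the single-threaded driver and the thread-per-device driver are selected with a command-line flag.

/*
 * Simplified sketch of the two test drivers, not the original test code.
 * The mount points, file name, and sizes below are placeholders, and the
 * Part 1 worker-queue machinery and the read side are left out.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE   (256L * 1024 * 1024)  /* the ~250MB test file     */
#define BLOCK_SIZE  (64 * 1024)           /* 4K, 64K, or 256K per run */
#define NUM_DEVICES 4

/* One mount point per device; placeholder paths. */
static const char *devices[NUM_DEVICES] = {
    "/mnt/dev0", "/mnt/dev1", "/mnt/dev2", "/mnt/dev3"
};

/* Write one test file into the given directory, one block at a time. */
static void *write_file(void *arg)
{
    const char *dir = arg;
    char path[256];
    long remaining = FILE_SIZE;

    snprintf(path, sizeof(path), "%s/testfile", dir);
    char *block = calloc(1, BLOCK_SIZE);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (block == NULL || fd < 0) {
        perror(path);
        exit(1);
    }
    while (remaining > 0) {
        ssize_t n = write(fd, block, BLOCK_SIZE);
        if (n < 0) { perror("write"); exit(1); }
        remaining -= n;
    }
    close(fd);
    free(block);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tid[NUM_DEVICES];
    int i;

    if (argc > 1 && strcmp(argv[1], "single") == 0) {
        /* Test 1: one thread visits each device in turn. */
        for (i = 0; i < NUM_DEVICES; i++)
            write_file((void *)devices[i]);
    } else {
        /* Test 2: one thread per device, all writing at the same time. */
        for (i = 0; i < NUM_DEVICES; i++)
            pthread_create(&tid[i], NULL, write_file, (void *)devices[i]);
        for (i = 0; i < NUM_DEVICES; i++)
            pthread_join(tid[i], NULL);
    }
    return 0;
}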

Now that the tests were ready to go, I started setting up my test environment. I decided four was a reasonable number of devices for testing, so on my VM I added 3 additional virtual disks and mounted each one to a unique directory. Before starting my 100-iteration test, I did a quick single-iteration run to verify the setup. As soon as I hit enter I realized this wasn't going to work, and sure enough the single thread finished first. Even though my virtual machine is writing to separate virtual disks, those disks all live on the same physical disk in my host machine. There is a positive to this mishap: it let me satisfy an additional test case. After I finished Part 1, I was curious whether writing multiple files to the same disk would show a multithreaded benefit even though the single-file test didn't. Here are the results.

4KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        4096        65536   1109
4        4096        65536   3938

64KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        65536       4096    813
4        65536       4096    4064

256KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        262144      1024    805
4        262144      1024    4921

So even having each thread write its own file is still slower than one thread writing all the files. What if we increase the number of files and/or threads? For this test I want to keep the work even across the devices, so the thread count needs to be a multiple of 4, since that is my device count. With 8 threads and 12 files per device, each device gets 2 threads, each writing 6 files to that device. With 12 threads and 12 files per device, that becomes 3 threads per device, each writing 4 files. The sketch below shows how the work splits up; the results follow.
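To sanity-check that arithmetic, here's a tiny sketch of the split, again assuming 4 devices; the thread and file counts are just illustrative, and it only prints the assignments rather than doing any I/O.

#include <stdio.h>

#define NUM_DEVICES 4

int main(void)
{
    int nthreads = 8;          /* must be a multiple of NUM_DEVICES */
    int files_per_device = 12; /* files written to each device      */

    int threads_per_device = nthreads / NUM_DEVICES;
    int files_per_thread   = files_per_device / threads_per_device;

    for (int t = 0; t < nthreads; t++) {
        int device = t / threads_per_device;   /* which mount point */
        printf("thread %2d -> device %d, writes %d files\n",
               t, device, files_per_thread);
    }
    return 0;
}

With 8 threads this prints 2 threads per device at 6 files apiece; with 12 threads it works out to 3 threads per device at 4 files apiece, matching the description above.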

4KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        4096        65536   1436
4        4096        65536   24411
8        4096        65536   76599
12       4096        65536   82198

64KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        65536       4096    9432
4        65536       4096    25164
8        65536       4096    72946
12       65536       4096    83906

256KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        262144      1024    9415
4        262144      1024    2114
8        262144      1024    77281
12       262144      1024    86382

I think it's now safe to conclude that when using a single disk, it's going to be very difficult to improve performance with a multithreaded I/O approach.

I now have a test environment that uses 3 physical machines, and I'm still using 4 total devices. On the actual test machine, I'm using the local disk and a mounted tmpfs instance. tmpfs is a filesystem that lives entirely in RAM, which lets me simulate 2 devices on the same machine with only 1 physical disk. The 2 remaining devices are an SMB share and an NFS share, each backed by one of the remaining physical machines. I think we are now ready to run some tests. Just a heads up: since we are introducing network latency with SMB and NFS, our times are going to increase significantly. Luckily our focus is on the comparison between thread counts. For our first test, we will copy 1 file to each device using 1 thread and then 4 threads.
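As an aside, the Time(ms) values throughout these tables are presumably the elapsed wall-clock time for a whole run; a minimal sketch of how that kind of measurement can be taken is below, where run_test() is a hypothetical stand-in for whichever driver is being timed, not part of the actual tests.

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the driver being measured. */
static void run_test(void)
{
    struct timespec pause = { 0, 50 * 1000 * 1000 };  /* pretend work: 50ms */
    nanosleep(&pause, NULL);
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    run_test();
    clock_gettime(CLOCK_MONOTONIC, &end);

    long ms = (end.tv_sec - start.tv_sec) * 1000
            + (end.tv_nsec - start.tv_nsec) / 1000000;
    printf("elapsed: %ld ms\n", ms);
    return 0;
}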

4KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        4096        65536   52812
4        4096        65536   46763

64KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        65536       4096    53082
4        65536       4096    47035

256KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        262144      1024    53389
4        262144      1024    47351

Finally! The multithreaded test was about 12% faster. Let's see how it scales. I'll increase the number of files to 12 and try again with 1, 4, 8, and 12 threads.

4KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        4096        65536   631845
4        4096        65536   552663
8        4096        65536   551202
12       4096        65536   551391

64KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        65536       4096    636366
4        65536       4096    559615
8        65536       4096    551014
12       65536       4096    550098

256KB Blocks
Threads  Block Size  Blocks  Time(ms)
1        262144      1024    632776
4        262144      1024    552279
8        262144      1024    550813
12       262144      1024    551547

So again, the multithreaded tests were about 12% faster in each case, which came out to roughly a minute per test. It's worth mentioning that the total data set for this test was 3GB. What if it was 30GB or 300GB? If that 12% scales linearly, the time saved becomes massive. Also note that the times kept dropping as threads were added beyond four; the improvement is nowhere near as significant as the jump from one thread to many, but it's enough to make the additional thread management worth contemplating. I wish I had quick access to a physical machine with four local disks, because I suspect the reason we still see gains beyond 4 threads is that the network latency is hiding the disk seek and context-switching time. Think Big O.

So what have we learned from these tests? It really comes down to what you are trying to accomplish. If you are handling I/O across multiple disks and/or machines, it's worth spending the additional time to implement a multithreaded approach. If you are writing to a single local drive, stick to the KISS method.

Want to Learn More?

This is just a sample of what we can do. We have 15 years of experience working in nearly every technology and industry. Whatever you are doing, we've done it and are prepared to tackle your project. Reach out and we will discuss it with you.