In 2002, I purchased a top-of-the-line computer with an Intel Pentium 4 3.06 GHz processor. I marveled at its speed and power. I remember thinking how incredible it would be in a few years once clock speeds hit 5 or 6 GHz. That day still hasn't arrived. Instead, we saw the industry shift to adding cores and processors. I had always been so focused on clock cycles and the number of instructions that could be processed that I hadn't yet grasped the potential of parallel processing. It wasn't until I started writing system applications that I realized the advantages, and the necessity, of parallel computing.
So I thought it would be fun to do a blog series on parallel computing. I have no agenda or claims that I want to validate. My goal is to write some test programs that will do parallel and non-parallel tasks and measure the differences. The first topic I'd like to spend some time on is file I/O, and in this article specifically, writing to a file.
First off, I think it's important to mention that various factors can affect parallel computing performance. So the numbers that I show on these tests could be completely different on a different setup. Nevertheless, I think these tests will still be plenty informative. I'll be running these tests on a CentOS 6.6 VM, with 4 cores and 4 GB of RAM.
Our first test will be a simple write test. We'll focus on writing a single 250MB file. The parameters for the test will be the number of threads, the block size, and the total number of blocks. The block size multiplied by the number of blocks always equals the file size (strictly speaking, 4,096-byte blocks times 65,536 blocks comes to 256MiB, but I'll round and call it 250MB throughout). The reason I use a fixed block size across the thread counts is to eliminate the latency that more system calls would introduce into some thread tests than others. A system call is where a program issues a request to the OS that triggers an interrupt; the OS has to enter kernel space to satisfy the request, such as a file write. By using a fixed block size, the total number of system calls stays the same no matter how many threads we allocate. Alright, so here is how the test will work. Once the test starts, it will spawn the requested number of threads. Each thread will open its own handle and write one range of the file, where each range covers the total number of blocks divided by the number of threads. The ranges are handed out sequentially as the threads are created. With a 250MB file and 4 threads, each thread writes 62.5MB: thread 1 writes 0 to 62.5MB, thread 2 writes 62.5MB to 125MB, and so on.
I'm going to break this test up into 3 parts. The first part will use 4KB blocks with 1, 2, 4, 8, and 16 threads. I will then increase the block size to 65KB (65,536 bytes) and run it with the same 5 thread counts, and finally we'll up the block size to 250KB (262,144 bytes). I'm running each part of the test 100 times and calculating the average time in milliseconds. Here we go!
**4KB Blocks**

| Threads | Block Size (bytes) | Blocks | Time (ms) |
|---------|--------------------|--------|-----------|
| 1       | 4096               | 65536  | 238       |
| 2       | 4096               | 65536  | 445       |
| 4       | 4096               | 65536  | 1031      |
| 8       | 4096               | 65536  | 1004      |
| 16      | 4096               | 65536  | 1027      |

**65KB Blocks**

| Threads | Block Size (bytes) | Blocks | Time (ms) |
|---------|--------------------|--------|-----------|
| 1       | 65536              | 4096   | 153       |
| 2       | 65536              | 4096   | 283       |
| 4       | 65536              | 4096   | 871       |
| 8       | 65536              | 4096   | 858       |
| 16      | 65536              | 4096   | 842       |

**250KB Blocks**

| Threads | Block Size (bytes) | Blocks | Time (ms) |
|---------|--------------------|--------|-----------|
| 1       | 262144             | 1024   | 153       |
| 2       | 262144             | 1024   | 276       |
| 4       | 262144             | 1024   | 843       |
| 8       | 262144             | 1024   | 854       |
| 16      | 262144             | 1024   | 832       |
Alright, so the results are in. Any surprises? I guess it depends on how much parallel programming you've done before. The single thread won easily at all three block sizes.

Before I get into why, notice how the times went down across the board as the block size increased. This goes back to what I said earlier about system calls being expensive: bigger blocks mean fewer write calls for the same amount of data. If you are relegated to a fixed block size, there is still help. Vectored I/O lets you bundle several similar read/write requests into one system call, but that's a whole other topic that maybe I can get to some other time.

OK, so is that it? Does single-threaded I/O trump parallel I/O when writing to a single file? Not so fast. Now that we are at this point, we can focus on a design pattern that makes better use of parallel programming. To an OS, a thread and a process are very similar. The scheduler treats them essentially identically; the key difference is the address space. The virtual memory manager gives each process its own address space, while a thread shares its parent process's address space with all of its sibling threads. The point being, creating threads is still an expensive operation.

This is where a technique used extensively by both the Linux and Windows kernels comes in handy: worker threads. Implementing your own worker threads definitely adds complexity to your code, but as you'll hopefully see in our next test, it can vastly improve performance. To implement them properly you'll need some atomic primitives, and the specifics depend on your platform. Since this is just a simulation, I will use a somewhat makeshift approach. Since I'm on Linux, I will use a barrier. A barrier lets each thread spawn and then wait until I release them all at once from my main function. That way we can have all our threads started up and ready to go before we begin the actual test. But why stop at thread creation?
Since we now see how expensive system calls are, we'll also have each thread open its file handle and seek to the start of its range before it waits on the barrier.
OK, those changes are in. We'll run this test with the same exact parameters as the first test and see what we get.
**4KB Blocks**

| Threads | Block Size (bytes) | Blocks | Time (ms) |
|---------|--------------------|--------|-----------|
| 1       | 4096               | 65536  | 231       |
| 2       | 4096               | 65536  | 442       |
| 4       | 4096               | 65536  | 411       |
| 8       | 4096               | 65536  | 333       |
| 16      | 4096               | 65536  | 326       |

**65KB Blocks**

| Threads | Block Size (bytes) | Blocks | Time (ms) |
|---------|--------------------|--------|-----------|
| 1       | 65536              | 4096   | 153       |
| 2       | 65536              | 4096   | 285       |
| 4       | 65536              | 4096   | 262       |
| 8       | 65536              | 4096   | 214       |
| 16      | 65536              | 4096   | 212       |

**250KB Blocks**

| Threads | Block Size (bytes) | Blocks | Time (ms) |
|---------|--------------------|--------|-----------|
| 1       | 262144             | 1024   | 149       |
| 2       | 262144             | 1024   | 274       |
| 4       | 262144             | 1024   | 233       |
| 8       | 262144             | 1024   | 195       |
| 16      | 262144             | 1024   | 202       |
That looks a lot better than the first test. The single thread still won out, but as you can clearly see, the worker-thread model made a significant performance improvement. Now, in our simple program the actual start-to-finish execution time is the same, since the setup cost just moved outside the timed region. But in a service or an OS, where a pool of worker threads is already up and waiting, you can see how beneficial this would be.
So why is our multithreaded solution still slower? I think there are 2 remaining significant bottlenecks.

The first is context switching, which occurs every time the scheduler switches to a different thread or process. For example, a thread's memory could have been paged out to make room for newly needed memory; when that thread gets scheduled again, those pages need to be paged back in before it can run.

The second potential bottleneck is disk access. If a disk has been used for a while it will become fragmented, so I/O can get scattered all over the disk. In our case, with multiple threads each writing at different positions within a single file, the disk could be introducing a significant amount of latency jumping around.

As of right now, there isn't much we can do about these 2 issues. Context switching will hopefully get better as operating systems continue to evolve to run better in multiprocessor environments. As for the disk-access bottleneck, in my next article I'll set up and run a test using separate files, with each file written to a network share or an independent disk. Then maybe we can see whether context switching is the only cost we have to live with when processing parallel write requests.
Want to Learn More?
This is just a sample of what we can do. We have 15 years of experience working in nearly every technology and industry. Whatever you are doing, we've done it and are prepared to tackle your project. Reach out and we will discuss it with you.