Parallel Programming: Multiprocessing in Python






The hunger for computation power among programmers, gamers, scientists, software developers and computer users in general is immense. We are always looking for applications that are less compute-intensive and more efficient, so that we can make better use of the hardware we already have.

However, many of us do not fully utilize the computation power already available on our machines. Tapping into it can lead to significantly better performance: with a few changes to your code, a program can often run two to three times faster. How do we do this, you ask? Well, let's dive in.

This blog focuses on parallel programming, i.e., running a program on multiple processors simultaneously. When you run a program, it usually uses one of the cores in your computer. However, most computers have multiple cores: depending on your processor, it may be dual-core, quad-core, octa-core or have even more. If, say, you have a quad-core processor and run a program on only one core, you are essentially giving up the other three cores, and with them three times the computation power you are using. Using all four cores can theoretically speed up your task by a factor of four.


However, it is not that simple; otherwise software companies would be using all the cores all the time for better performance. To increase the performance of a program this way, you first need to make sure that it can be parallelized, i.e., that it can run on different cores simultaneously.

Let's take a simple example. Say you have to build a product and you divide its manufacturing into four stages. If you can move from stage 1 to stage 2 only when stage 1 is complete, from stage 2 to stage 3 only when stage 2 is complete, and so on, then the process is sequential: you have to follow a fixed order of instructions. If, however, you can break the manufacturing into four independent parts and assign them to different workers, then the process can be parallelized, e.g., building four components of a toy that are joined together once they are all finished.

Once you have established that your program can be parallelized, the next step is to write the code. I will not walk through the code line by line in this blog; it can be found at this github repo. I chose Python and used the multiprocessing module to run the program on multiple processors.

In this program we will see two applications of parallel programming. The first is matrix multiplication, which is easy to parallelize. The second is the prefix sum scan, which at first glance looks like an inherently sequential problem but can also be made to run on multiple processors.


Matrix Multiplication

Multiplying two matrices is fairly simple and is part of most introductory programming courses: you select a row from the first matrix and a column from the second, multiply the corresponding elements and add them up to get the first element of the result, then move on to the next column and do the same to get the next element, and so on.
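The row-by-column procedure above can be sketched in a few lines of plain Python (a minimal sequential version, before any parallelism):

```python
def matmul(A, B):
    """Multiply an m x n matrix A by an n x r matrix B (lists of lists)."""
    m, n, r = len(A), len(B), len(B[0])
    C = [[0] * r for _ in range(m)]
    for i in range(m):                  # pick a row of A
        for j in range(r):              # pick a column of B
            # multiply corresponding elements and add them up
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C
```

Notice that the body of the inner loop touches only row i of A and column j of B, which is exactly the observation the next paragraph builds on.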

Here, one might notice that given two matrices A(m×n) and B(n×r) and their product C(m×r), computing the element C(i,j) requires only the ith row of A and the jth column of B. Each element can therefore be computed in isolation from the rest: it does not matter whether we have already calculated C(1,1) when we calculate C(1,2).

Hence, one can divide the program into different processes and run them on different processors. This can lead to significant speedups, and since matrix multiplication is such an important operation in machine learning, scientific simulation, games and many other applications, the rewards are immense.

Even though it sounds simple enough, we still need to be careful about a few things when implementing this in code. The first is deciding which calculations run on which core. You can divide the matrices row-wise or column-wise; here we will do it row-wise. That is, we take a fixed number of rows of the first matrix, multiply them with all the columns of the second matrix, and do this simultaneously for different groups of rows on different cores.

For example, let A(4×3) and B(3×5) be two matrices. We can assign the calculation of the first row of A to the first processor, the second row to the second processor, and so on. But what if there are more rows than processors, e.g., A(7×4) and B(4×5)? For that case we write a function that divides the matrix for us, i.e., it decides which rows run on which processor.

One way to implement this function is as follows. The first division runs from 0 to r, where r = (total rows) integer-divided by (number of processors). The next division runs from r to r plus the same integer division, where "total rows" is now the previous total minus the rows already assigned, and we have one processor fewer, and so on.

Let's take an example to understand this:

number of processors, N = 4
total rows, R = 7
Then the first division will be from 0 to (7/4), i.e., 0 to 1
now R = R - 1 = 6 and N = N - 1 = 3

Hence, the second division will be from 1 to 1+(6/3), i.e., 1 to 3
now R = R - 2 = 4, N = N - 1 = 2

Hence, the third division will be from 3 to 3+(4/2), i.e., 3 to 5
now R = R - 2 = 2, N = N - 1 = 1

Hence, the fourth division will be from 5 to 5+(2/1), i.e., 5 to 7

So the divisions are: (0 to 1), (1 to 3), (3 to 5) and (5 to 7)

Note: each range (a to b) is half-open. With rows numbered from 1, the start a is excluded and the end b is included, so (0 to 1) covers row 1, (1 to 3) covers rows 2 and 3, and so on. Hence the first row executes on the first processor, not the second.
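A sketch of this division function, together with worker code that uses it, might look like the following (the function names here are my own, and the repo's implementation may differ in its details):

```python
from multiprocessing import Pool

def divide_rows(total_rows, n_procs):
    """Split row indices into half-open (start, end) ranges, one per processor."""
    bounds, start = [], 0
    remaining, procs_left = total_rows, n_procs
    while procs_left > 0 and remaining > 0:
        size = remaining // procs_left      # integer division, as described above
        end = start + size
        bounds.append((start, end))
        remaining -= size
        procs_left -= 1
        start = end
    return bounds

def multiply_rows(A_rows, B):
    """Multiply a block of rows of A with all the columns of B."""
    n, r = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(n)) for j in range(r)]
            for row in A_rows]

def parallel_matmul(A, B, n_procs=4):
    bounds = divide_rows(len(A), n_procs)
    with Pool(n_procs) as pool:
        # each worker process multiplies its own block of rows
        blocks = pool.starmap(multiply_rows, [(A[s:e], B) for s, e in bounds])
    return [row for block in blocks for row in block]
```

With 7 rows and 4 processors, `divide_rows(7, 4)` reproduces the divisions worked out above, here written 0-indexed as (0,1), (1,3), (3,5), (5,7).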

Now we can divide our matrix into the required rows and execute them on different cores. I ran the above program on two randomly generated 100×100 matrices and measured a speedup of almost 2.75 times. Well, what is speedup anyway?

Speedup is a measure of how fast your parallel algorithm is compared to an equivalent sequential one. To compute it, we measure the time taken to execute the sequential program and the time taken to execute the parallel program, then divide the first by the second. The result tells us how many times faster our program is than the sequential version.
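The measurement itself is a one-liner around each run; a small helper (my own sketch, illustrated here on two ways of summing a list rather than on the matrix code) shows the idea:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def slow_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(1_000_000))
_, t_slow = timed(slow_sum, data)
_, t_fast = timed(sum, data)
print(f"speedup: {t_slow / t_fast:.2f}x")
```

In the matrix experiment, the two timed calls would be the sequential and the parallel multiplication, and the printed ratio is the speedup.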

Theoretically, four cores should speed up the algorithm by a factor of four, but there are additional overheads when running processes in parallel. We have to spawn the processes and do extra work, such as dividing the matrix rows among processors. Since the workers write their results to a common memory array, write conflicts can also cause slowdown. Another reason is that the processes take different times to finish: in the example above, the divisions do not all contain the same number of rows, so they take different amounts of time to execute.

I suggest you head over to my github repo, download the folder and run the program following the instructions. While you are at it, open the System Monitor on Ubuntu (or the Task Manager on Windows) and go to the tab that shows the usage of the different processors. When the sequential program is running, this is how it should look.




Only one core is executing at 100% while the others are not being utilized by the program


This shows that only one of the cores is used when the sequential multiplication runs. Running the parallel version, on the other hand, uses all four cores, as you can see in the screenshot below.



All the cores are being utilized by the program

We have now executed the matrix multiplication program and confirmed that it actually runs on multiple processors. Next we will focus on a more counter-intuitive problem: the prefix sum.


Prefix Sum

Scanning is perhaps one of the most important operations to understand in parallel programming. It is simple to understand what a scan is; it is much harder to come up with a way to parallelize it, since it looks inherently sequential.

Given a list of n elements, the scan at the ith position is defined as the sum of the elements up to that position, either including the ith element itself (an inclusive scan) or excluding it (an exclusive scan). In this blog we will look at the inclusive scan. For example, the scan of the list [1,2,3,4,5,6,7,8] is [1,3,6,10,15,21,28,36]. The operation may look very simple, but it is extremely common, and it is not limited to addition: you can apply a large number of operators this way. What you are doing, in essence, is applying some operation cumulatively over a list of numbers, but in parallel instead of sequentially.
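Sequentially, the inclusive scan is just a running sum (the standard library's `itertools.accumulate` does the same thing); a minimal version:

```python
def inclusive_scan(xs):
    """Inclusive prefix sum: out[i] = xs[0] + xs[1] + ... + xs[i]."""
    out, total = [], 0
    for x in xs:
        total += x          # fold the next element into the running sum
        out.append(total)
    return out
```

This is the baseline the parallel algorithm below has to match.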

Calculating a scan sequentially is easy enough: maintain a running-sum variable, add the elements to it one by one, and write it out at each step. However, no immediate solution comes to mind for parallelizing the problem. We are going to look at an algorithm that helps us think about it, called the Hillis-Steele algorithm after the people who first proposed it.

To start, add every element at position i to the element at position i+1, and store the result at position i+1. Any element that receives no update (here, the first one) is copied as-is. In the next stage, add the element at position i to the element at position i+2, updating position i+2; again, elements that receive no update are copied. Then do the same with positions i and i+4, and so on.

At stage r (starting from r = 0), each element is combined with the element 2^r positions away. We stop when 2^r is greater than or equal to the number of elements, since no pairs remain. Hence there are log2(N) stages in total, where N is the number of elements, with r ranging from 0 to log2(N) - 1 inclusive.
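The stages can be sketched in a single-process version first (a minimal sketch of the Hillis-Steele update rule, not yet parallel):

```python
def hillis_steele_scan(xs):
    """Inclusive prefix sum via Hillis-Steele doubling stages."""
    a = list(xs)
    d = 1                                   # d = 2**r at stage r
    while d < len(a):
        # every index i >= d gets a[i] + a[i - d]; the rest are copied as-is
        a = [a[i] + a[i - d] if i >= d else a[i] for i in range(len(a))]
        d *= 2
    return a
```

Running it on [1,2,3,4,5,6,7,8] takes three stages (d = 1, 2, 4) and produces [1,3,6,10,15,21,28,36], matching the sequential scan.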

Why does this algorithm work? Let's look at an example taken from this amazing video from Udacity's course on parallel programming. Take the 5th (1-indexed) element, i.e., 5. Its final value in the inclusive sum scan will be 15 (i.e., 1 + 2 + 3 + 4 + 5).

In the first step we add 5 and 4, getting 9. In the second step, having already added the element immediately before ours, we need the elements before that, i.e., 3, 2, 1. But in the previous step 3 and 2 were already added together, giving 5, and that partial sum now sits two places before our element. Adding it to 9 gives 14, i.e., 5 + 4 + 3 + 2. Having accounted for 5, 4, 3 and 2, we go one step further and add the element four places before, i.e., 1, giving the final sum of 15.

But what if there were more elements before the 1? In that case, the position where 1 sits would by then contain the sum of itself and the 3 elements before it. In general, at the start of stage r every position holds the sum of the 2^r elements ending at it (or of the whole prefix, if it is shorter), so each stage doubles the span of elements accounted for and folds in exactly those that were previously missing.





How does this algorithm help with parallelization? At each stage, every position only has to add in one other element; it does not need to look at all the elements before it, because the element it adds already summarizes some of its predecessors. At each subsequent stage the added element summarizes more of them, until the whole prefix is covered. Hence, at each stage we can work on one group of elements while ignoring the others, and these groups can be processed on different cores.

To parallelize this algorithm we follow a similar approach to matrix multiplication: first we write a function that divides the list into chunks to distribute across cores (we can reuse essentially the same division algorithm as before). Then, at stage j (with j running from 0 to log2(N)), each worker adds the element at position i - 2^j to the element at position i. This implementation, however, only works for an even number of elements, as you can verify by trying an odd-length list. The fix is to pad the list with one extra element at the end when its length is odd, and remove it once the calculation is done.
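A sketch of this parallel version, splitting each stage's updates across a pool of worker processes (again an illustration with helper names of my own, not necessarily the repo's exact code; it sidesteps the shared-array details by handing each worker a snapshot of the list and rebuilding it after every stage):

```python
from multiprocessing import Pool

def scan_stage(a, start, end, d):
    """Compute the stage-d updates for indices in [start, end)."""
    return [a[i] + a[i - d] if i >= d else a[i] for i in range(start, end)]

def parallel_scan(xs, n_procs=4):
    a = list(xs)
    n = len(a)
    if n == 0:
        return a
    step = -(-n // n_procs)                 # ceiling division: chunk size per worker
    bounds = [(s, min(s + step, n)) for s in range(0, n, step)]
    d = 1                                   # d = 2**j at stage j
    with Pool(n_procs) as pool:
        while d < n:
            chunks = pool.starmap(scan_stage, [(a, s, e, d) for s, e in bounds])
            a = [x for chunk in chunks for x in chunk]   # stitch the stage back together
            d *= 2
    return a
```

Because each worker reads its own copy of the previous stage's list, there are no read/write conflicts within a stage. On platforms that spawn rather than fork processes (Windows, recent macOS), remember to guard the top-level call with `if __name__ == "__main__":`.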

To see the actual implementation details, please visit the github repo containing the code and follow the README.md to run the program.

For a more visual understanding of this algorithm, please take a look at the following video; the algorithm itself can be found here.
