Parallel processing in PHP

Since PHP does not offer native threads, we have to get creative to do parallel processing. I will introduce three fundamentally different concepts to emulate multithreading as well as possible.

Using system calls

If you have some basic Linux knowledge, you will know that a background process can be started by appending an ampersand to the command (on Windows, it's the start command):

dav@david:/var/www$ php index.php &
[1] 3229

The PHP script now runs silently in the background. The number printed to the shell (3229) is the process ID, so we are able to kill the process using

kill 3229 

A problem with this approach is that any output of the script is lost, so we have to redirect the output stream to a file, like this:

php index.php > output.txt 2>&1 & 

The purpose of the scary 2>&1 is to redirect stderr to stdout, so when your script produces any kind of PHP error, it also ends up in the output file. Putting everything together, we get

$cmd = "php script.php";  $outputfile = "/var/www/files/out."; $pidfile = "/var/www/files/pid.";  for ($i = 0; $i < $process_count; $i++)     exec(sprintf("%s > %s 2>&1 & echo $! >> %s", $cmd, $outputfile.$i, $pidfile.$i)); 

Looks confusing, right? We've added echo $! >> %s to the command so that the process ID of the background script gets written to a file. This proves useful for keeping track of all running processes.
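For instance, a small cleanup routine can read those pid files back and kill each worker individually. This is only a sketch and assumes the /var/www/files/pid.* naming from the snippet above:

// Sketch: read back the pid files written by the snippet above and
// kill any workers that are still around.
foreach (glob("/var/www/files/pid.*") as $pidfile)
{
    $pid = (int) trim(file_get_contents($pidfile));

    if ($pid > 0)
    {
        exec("kill " . $pid); // same as typing "kill 3229" in the shell
    }

    unlink($pidfile);
}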

If you want to kill all php-processes, the following command will do:

killall php 

Needless to say, if you add the PHP shebang #!/usr/bin/php to the top of your script and make it executable using chmod +x script.php, the command changes to ./script.php instead of php script.php.
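For completeness, such a self-executing script looks roughly like this (the echo line is just a placeholder for your actual work):

#!/usr/bin/php
<?php
// script.php - make it executable once with: chmod +x script.php
// start it in the background with:           ./script.php > output.txt 2>&1 &
echo "worker started, pid: " . getmypid() . "\n";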

To check if a process is still running, you might use some variation of the ps command as done here (stolen from Steffen):

function is_running($pid)
{
    $c = "ps -A -o pid,s | grep " . escapeshellarg($pid);
    exec($c, $output);

    if (count($output) && preg_match("~(\d+)\s+(\w+)$~", trim($output[0]), $m))
    {
        $status = trim($m[2]);
        if (in_array($status, array("D", "R", "S")))
        {
            return true;
        }
    }

    return false;
}
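Combined with the pid files from the exec() snippet, is_running() lets the launching script wait until every background worker has finished. A minimal sketch, again assuming the /var/www/files/pid.* naming:

// Sketch: poll the pid files until every background worker has exited.
do
{
    $busy = false;

    foreach (glob("/var/www/files/pid.*") as $pidfile)
    {
        $pid = (int) trim(file_get_contents($pidfile));

        if (is_running($pid))
        {
            $busy = true;
            break;
        }
    }

    sleep(1); // don't hammer ps/grep
}
while ($busy);

echo "all workers finished\n";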

Using fork()

Using the pcntl functions of PHP, you get the ability to fork a process (pcntl_fork, not available on Windows). Before you get too excited, read the following quote from a comment on php.net that exactly reflects my experience with forking in PHP:

You should be _very_ careful with using fork in scripts beyond academic examples, or rather just avoid it altogether, unless you are very aware of its limitations.
The problem is that it just forks the whole php process, including not only the state of the script, but also the internal state of any extensions loaded.
This means that all memory is copied, but all file descriptors are shared among the parent and child processes.
And that can cause major havoc if some extension internally maintains file descriptors.
The primary example is of course mysql, but this could be any extension that maintains open files or network sockets.

You have been warned! Look at the following example:

for ($i = 0; $i < 4; $i++) {     pcntl_fork(); }  echo "hi there! pid: " . getmypid() . "\n"; 

Output:

dav@david:/var/www$ php script.php
hi there! pid: 3534
hi there! pid: 3536
hi there! pid: 3538
hi there! pid: 3539
hi there! pid: 3540
hi there! pid: 3541
hi there! pid: 3542
hi there! pid: 3537
hi there! pid: 3543
dav@david:/var/www$ hi there! pid: 3544
hi there! pid: 3545
hi there! pid: 3546
hi there! pid: 3548
hi there! pid: 3547
hi there! pid: 3549
hi there! pid: 3550

As you can see, we get 2^(fork count) processes. Somewhere in the middle of the output, the original script has finished, but some forks are still running. It is even possible to communicate with processes that you forked. Forking is a very interesting area of computer science; nevertheless, I don't recommend using fork in real-world PHP applications.
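If you play with it anyway, at least check the return value of pcntl_fork() (0 in the child, the child's pid in the parent) and reap the children with pcntl_waitpid(), otherwise you accumulate zombie processes. A minimal sketch that starts exactly four children instead of 2^4 processes:

$children = array();

for ($i = 0; $i < 4; $i++)
{
    $pid = pcntl_fork();

    if ($pid == -1)
    {
        die("fork failed\n");
    }
    elseif ($pid == 0)
    {
        // child: do the work, then exit so it doesn't fork further
        echo "child " . getmypid() . " working\n";
        exit(0);
    }
    else
    {
        // parent: remember the child pid
        $children[] = $pid;
    }
}

// parent: wait for all children to finish
foreach ($children as $pid)
{
    pcntl_waitpid($pid, $status);
}

echo "parent " . getmypid() . " done\n";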

Using curl

The last way to process multiple scripts in parallel is to abuse the webserver and curl. With curl, we are able to execute multiple requests in parallel (inspired by Gonzalo Ayuso).

$url = "http://localhost/calc.php"; $mh = curl_multi_init(); $handles = array(); $process_count = 15;  while ($process_count--) {     $ch = curl_init();     curl_setopt($ch, CURLOPT_URL, $url);     curl_setopt($ch, CURLOPT_HEADER, 0);     curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     curl_setopt($ch, CURLOPT_TIMEOUT, 30);     curl_multi_add_handle($mh, $ch);     $handles[] = $ch; }  $running=null;  do  {     curl_multi_exec($mh, $running); }  while ($running > 0);  for($i = 0; $i < count($handles); $i++)  {     $out = curl_multi_getcontent($handles[$i]);     print $out . "\r\n";     curl_multi_remove_handle($mh, $handles[$i]); }  curl_multi_close($mh); 

Here, we call the script calc.php 15 times. The content of calc.php is:

<?php echo "my pid: " . getmypid(); ?> 

The output is as follows:

dav@david:/var/www$ php script.php
my pid: 1401
my pid: 1399
my pid: 1399
my pid: 1403
my pid: 1403
my pid: 1398
my pid: 1398
my pid: 1402
my pid: 3767
my pid: 3768
my pid: 3769
my pid: 3772
my pid: 3771
my pid: 3773
my pid: 3770

It is interesting to see the same process ID appear a few times. Keep in mind that you are triggering HTTP requests, so you lose some performance because the web server has to do extra work. Furthermore, the called script runs with the ordinary php.ini, not php-cli.ini.
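One more remark on the do/while loop in the curl snippet above: calling curl_multi_exec() in a tight loop keeps a CPU core busy while the requests are in flight. If that bothers you, curl_multi_select() can block until one of the handles has activity; a possible variant of the loop looks like this:

// Variant of the loop above that waits for socket activity instead of spinning.
$running = null;

do
{
    curl_multi_exec($mh, $running);

    if ($running > 0)
    {
        // block for up to one second until at least one handle has activity
        curl_multi_select($mh, 1.0);
    }
}
while ($running > 0);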

What about the speed? Benchmarks!

What would you take away from this post if you didn't know which parallel processing method is the fastest? I've written a little benchmark script using the three methods described above, did three runs and calculated the average. Basically, this is my benchmark script calc.php:

$starttime = time();
$duration = 10;

$filename = "/var/www/results/" . getmypid() . ".out";

$loops = 0;

while (true)
{
    for ($i = 0; $i < 10000; $i++)
    {
        sqrt($i);
    }

    $loops++;

    if ($starttime + $duration <= time())
        break;
}

file_put_contents($filename, $loops);

My system:

Ubuntu 10.10 (Kernel 2.6.35-28)
4 GB RAM
Intel Core 2 Duo T7500 (2 × 2.2 GHz)

I'm fully aware that this benchmark is in no way representative: writing the result files to the hard disk might influence other processes that are still running, and my time comparison may also be slightly inaccurate. Ah, before you ask: I haven't used set_time_limit because it sucks. So bring on the results!

Method  Proc.  Iterations
exec      1    2183
exec      2    3953
exec      4    4283
exec      8    4378
exec     16    4586
exec     32    4868
curl      1    2203
curl      2    2843
curl      4    3029
curl      8    3556
curl     16    3986
curl     32    4373
fork      1    2274
fork      2    4299
fork      4    4245
fork      8    4309
fork     16    4177
fork     32    4577

As you can see, the more parallel processes, the more iterations in total. I haven't tested 64 or more processes because my system almost froze (memory usage and CPU utilization). Feel free to interpret the results any way you want, but in the end it boils down to the exec method, because fork is evil and curl is not a serious alternative.

Finally, if you want to do some testing on your own, here is my benchmark file. Place it in the same folder as the calc.php from above, give the file execute rights and create a results folder. The file is invoked using ./bench.php method processcount, so possible calls are

./bench.php exec 16
./bench.php curl 8
./bench.php fork 32
./bench.php            -> no parameter to display results

The file itself:

#!/usr/bin/php
<?php
$mode = isset($argv[1]) ? $argv[1] : "results";
$process_count = isset($argv[2]) ? $argv[2] : 1;

// cleanup
if ($mode != "results" && count(glob("/var/www/results/*")))
{
    exec("rm /var/www/results/*");
}

if ($mode == "exec")
{
    $cmd = "php calc.php";

    $outputfile = "/var/www/results/out.";
    $pidfile = "/var/www/results/pid.";

    for ($i = 0; $i < $process_count; $i++)
        exec(sprintf("%s > %s 2>&1 & echo $! >> %s", $cmd, $outputfile.$i, $pidfile.$i));
}
elseif ($mode == "curl")
{
    $url = "http://localhost/calc.php";
    $mh = curl_multi_init();

    while ($process_count--)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, false);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
    }

    $running = null;

    do
    {
        curl_multi_exec($mh, $running);
    }
    while ($running > 0);
}
elseif ($mode == "fork")
{
    for ($i = 0; $i < log($process_count, 2); $i++)
    {
        pcntl_fork();
    }

    include "calc.php";
}
else
{
    $total = 0;

    foreach (glob("/var/www/results/*.out") as $f)
    {
        $runtime = file_get_contents($f);
        $total += $runtime;
        echo $runtime . "\r\n";
    }

    echo "Total: " . $total . "\r\n";
}