Tuesday, August 20, 2013

Divide and Conquer.. with bash and friends!

Another day, another script in need of a huge performance boost. The scenario is a common one for me: datasets in the form of files are transferred by bash scripts (glorified rsync wrappers with some additional error checking), and after the transfer the same bash process spawns a php script to do the actual processing of the records (in this case, some transformations followed by DB inserts).

The problem was that the (single-threaded) php step could not keep up with the rate at which massive files (>3 GB) were being sent its way.

Instead of trying to optimize the php processor itself, I decided to wrap it with some job control logic and divide the huge files into smaller chunks. Finally, a real use case for `split` (man 1 split).
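
As a quick refresher on what `split` does (the file name, prefix and line count here are just illustrative, not the real values from the job):

 # cut records.csv into pieces of at most 1000 lines each, producing
 # chunk_aa, chunk_ab, chunk_ac, ... in the current directory:
 split --lines=1000 records.csv chunk_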

The idea was to cut the big file into lots of smaller pieces and then spawn X php processes to consume them in parallel. Since each line holds a complete record, splitting by line count was the natural choice, with one php process spawned per chunk. It worked like a charm, dividing the work into smaller, easier-to-digest pieces fed to a pool of php workers:

#
# divide_and_conquer( file_name, max_number_of_workers )
#
function divide_and_conquer {
 local _big_file="${1}"; shift;
 local _parallel_count="${1}"; shift;

 # where to place the file chunks:
 local _prefix="/var/tmp/$(date +%s)_";

 # cut the big file into chunks of at most 10000 lines each:
 split --lines=10000 "${_big_file}" "${_prefix}";

 for f in "${_prefix}"*; do
  # spawn off a php worker for this file chunk and, if the php script exits
  # successfully (zero exit code), delete the processed chunk:
  ( php /var/script.php "${f}" && rm "${f}" ) &

  # limit the total number of worker processes:
  while [[ $(jobs -p | wc -l) -ge ${_parallel_count} ]]; do
   sleep 0.1;
  done
 done

 # wait for the last of the children to finish:
 while [[ $(jobs -p | wc -l) -ne 0 ]]; do
  sleep 0.1;
 done
}

# and let's use it:
divide_and_conquer "/var/lib/huge_file.csv" "8"
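
For completeness: if GNU xargs happens to be on the box, roughly the same worker pool can be expressed without the hand-rolled job polling. This is just a sketch; the chunk prefix (/var/tmp/chunks_) and the worker count (8) are placeholders standing in for the values computed inside the function above:

 # one php worker per chunk, at most 8 running in parallel; each chunk is
 # removed only after its worker exits successfully:
 printf '%s\n' /var/tmp/chunks_* | \
  xargs -P 8 -I {} sh -c 'php /var/script.php "$1" && rm "$1"' _ {}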


Cheers -