Tuesday, August 20, 2013

Divide and Conquer.. with bash and friends!

Another day, another script in need of a huge performance boost. The scenario is somewhat common to me: datasets in the form of files are being transferred using bash scripts (glorified rsync wrappers with some additional error checking), and after the transfer, the same bash process spawns a php script for the actual processing of the records (in this case, some transformations followed by DB inserts).

The problem was that the (single threaded) php step was unable to keep up with the high rate of massive files (>3 GB) being sent its way.

Instead of trying to optimize the php processor, I decided to wrap it with some job control logic and divide the huge files into smaller chunks. Finally, a use case for `split` (man 1 split).
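As a quick refresher, split cuts a file into pieces and names them with generated suffixes (aa, ab, ac, ...) appended to an optional prefix; a tiny illustration with made-up file names:

split --lines=10000 huge_file.csv /var/tmp/chunk_
ls -1 /var/tmp/chunk_*
# /var/tmp/chunk_aa
# /var/tmp/chunk_ab
# /var/tmp/chunk_ac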

The idea was to cut the big file into lots of smaller pieces and then spawn X php processes to consume the files in parallel. For this problem I split the file by line count, since each line contains a full record, and then spawned one php process per chunk. It worked like a charm, dividing the work into smaller, easier-to-digest pieces fed to a pool of php workers:

#
# divide_and_conquer( file_name, max_number_of_workers )
#
function divide_and_conquer {
 local _big_file="${1}"; shift;
 local _parallel_count="${1}"; shift;

 # where to place the file chunks:
 local _prefix="/var/tmp/$(date +%s)_";

 # cut the big file into chunks of 10000 lines each, named ${_prefix}aa, ${_prefix}ab, ...:
 split --lines=10000 "${_big_file}" "${_prefix}";

 # iterate over the chunks with a glob instead of parsing ls output:
 for f in "${_prefix}"*; do
  # spawn off a php worker for this file chunk and, if the php script exits cleanly,
  # delete the processed chunk:
  ( php /var/script.php "${f}" && rm "${f}" ) &

  # limit the total number of worker processes:
  while [[ $(jobs -p | wc -l) -ge ${_parallel_count} ]]; do
   sleep 0.1;
  done
 done

 # wait for the last of the children:
 while [[ $(jobs -p | wc -l) -ne 0 ]]; do
  sleep 0.1;
 done
}

# and let's use it:
divide_and_conquer "/var/lib/huge_file.csv" "8"


Cheers -

Tuesday, July 9, 2013

php syntax checking with vim

Showing off your cowboy skills by modifying PHP code on a production server with vim? Here's a neat trick to at least check the syntax of the modified buffer before writing it to disk:
:w ! php -l
Or even save the above as a binding (ctrl-b) in your vimrc:
map <C-B> :w ! php -l<CR>
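The same linter also works in bulk from the shell; a quick sketch for checking a whole tree of files before rolling it out:

# run php -l on every .php file under the current directory, one file at a time:
find . -name '*.php' -print0 | xargs -0 -n 1 php -l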
And a big thanks to http://vim.wikia.com/wiki/Runtime_syntax_check_for_php for making this so clear.

Thursday, June 6, 2013

ZeroMQ, HWM, and INPROC

I have been banging my head pretty hard for the past 2 days using ZeroMQ with a combination of inproc transports and HWM (high-water mark) limits. In my scenario I have an inproc ZMQ_PUSH socket pushing and an inproc ZMQ_PULL socket reading from the pipe. The client (pusher) blocked somewhere between 1k and 2k messages, and no matter what I set its ZMQ_HWM to, it just kept blocking. Since the project I'm working on requires something like a dynamic HWM, I wrote a custom HWM implementation handled at the application level and just wanted to disable the built-in version. Before filing a bug report I decided to take a glance at the github repo.. and that's where it all made sense. Here's a snippet from this source:
        // The total HWM for an inproc connection should be the sum of
        // the binder's HWM and the connector's HWM.
        int sndhwm = 0;
        if (options.sndhwm != 0 && peer.options.rcvhwm != 0)
            sndhwm = options.sndhwm + peer.options.rcvhwm;
        int rcvhwm = 0;
        if (options.rcvhwm != 0 && peer.options.sndhwm != 0)
            rcvhwm = options.rcvhwm + peer.options.sndhwm;
And it's even clearly stated in the comment just above: I need to set the HWM to 0 on both the sender AND the receiver :) Moral of the story: if you need to set HWM limits on inproc sockets, set them on both sides, because the effective HWM for the connection is the sum of both ends!

Wednesday, April 24, 2013

Locking processes with flock

Got a cronjob that might overlap with itself when a run takes longer than usual, and need to avoid multiple instances running at once?
Here's how to lock them using flock in bash:
function delicate_process() {
  # the 'locked down' code
  return 0;
}

function main() {
  (
    # attempt an exclusive, non-blocking lock on FD 200;
    # give up immediately if another instance already holds it:
    if ! flock -x --nonblock 200; then
      return 1;
    fi

    delicate_process;

  # FD 200 is opened on the lock file for the duration of the subshell:
  ) 200>/var/lock/.my.lock
}
}

main "${@}";
This essentially opens the file /var/lock/.my.lock on FD 200 for the duration of the subshell; inside it, flock attempts a non-blocking exclusive lock on FD 200, and main returns `1` if the lock can't be acquired.
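For simple cases, flock can also take the lock file and the command directly; a sketch of the same guard as a crontab one-liner (the schedule and script path are just placeholders):

# every 5 minutes; skip this run entirely if the previous one still holds the lock:
*/5 * * * * flock --nonblock /var/lock/.my.lock -c '/usr/local/bin/delicate_process.sh'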

Monday, April 22, 2013

Dynamic /etc/hosts with a simple template engine

This is a pretty niche script that probably won't do much good for anybody else.. but just in case, I'm pushing it down the intertubes. I find myself editing blocks of domains in my hosts file regularly, so I finally decided to save a few seconds every day and build a little template engine for my hosts file: I can now use a script to change the IP for a whole block of virtual hosts in a single command. The syntax is short and simple:
# <pool_a>
127.0.0.1 vhost-1.service_a.com service_a
127.0.0.1 vhost-2.service_a.com
# </pool_a>

# <pool_b>
192.168.0.1 vhost-1.service_b.com service_b
192.168.0.1 vhost-2.service_b.com
# </pool_b>
It uses an html-like tag inside a bash comment to open and close the "blocks" of virtual hosts. Then comes the script, which reads and modifies my hosts file using these "templates":
#!/bin/bash

HOSTS_FILE='/etc/hosts';

function update_dyn_block() {
  local block_name="${1}";
  local ip_address="${2}";

  # sanity checks, only one named template block allowed:
  if [ $(grep -c "# <${block_name}>" "${HOSTS_FILE}" 2>/dev/null) -ne 1 ]; then
    echo -n "missing or duplicate template opening tag in ";
    echo "${HOSTS_FILE}: <${block_name}>";
    return 1;
  fi

  if [ $(grep -c "# </${block_name}>" "${HOSTS_FILE}" 2>/dev/null) -ne 1 ]; then
    echo -n "missing or duplicate template closing tag in ";
    echo "${HOSTS_FILE}: </${block_name}>";
    return 1;
  fi

  # get the line numbers of the template opening and closing tags:
  local opening_tag_at=$(grep -n "# <${block_name}>" "${HOSTS_FILE}" | cut -d: -f1);
  local closing_tag_at=$(grep -n "# </${block_name}>" "${HOSTS_FILE}" | cut -d: -f1);

  # echo "template block found between lines: ${opening_tag_at} and ${closing_tag_at}";

  local temp_file=$(mktemp);

  # rewrite every entry inside the block with the new IP address, leaving
  # everything else (including the tags themselves) untouched:
  awk -v ip="${ip_address}" -v start="${opening_tag_at}" -v end="${closing_tag_at}" '
  {
    if ( NR > start && NR < end && NF >= 2 ) {
      printf "%s", ip;
      for ( i = 2; i <= NF; i++ ) {
        printf " %s", $i;
      }
      printf "\n";
    } else {
      print;
    }
  }' "${HOSTS_FILE}" > "${temp_file}";

  # show what changed, then swap in the new file:
  diff "${HOSTS_FILE}" "${temp_file}";

  mv "${temp_file}" "${HOSTS_FILE}";
}

if [ "${1}" = "" -o "${2}" = "" ]; then
  echo "usage: $0 <block_name> <new_ip_address>";
  echo;
  echo "inside of your hosts file, use 'tags' to define a block of entries with this syntax:";
  echo;
  echo "# <tag-name>";
  echo "127.0.0.1 host-1.hostname.com";
  echo "127.0.0.1 host-2.hostname.com";
  echo "# </tag-name>";
  echo;
  echo -n "note: the script is picky about the space between the hashtag and the '<' ";
  echo "character of the tag -- dont forget the space";
  exit 1;
else
  update_dyn_block "${1}" "${2}";
fi

And voilà, now I can just call the script with a block name and a new IP address; the hosts file is updated in place using the same template and a diff of the changes is printed. Here's a complete demo with HOSTS_FILE=/tmp/hosts set in the change_hosts script:
[root@localhost requester]# cat /tmp/hosts
# <pool_a>
127.0.0.1 vhost-1.service_a.com service_a
127.0.0.1 vhost-2.service_a.com
# </pool_a>

# <pool_b>
192.168.0.1 vhost-1.service_b.com service_b
192.168.0.1 vhost-2.service_b.com
# </pool_b>
[root@localhost requester]# ~/bin/change_hosts pool_a 127.0.0.2
2,3c2,3
< 127.0.0.1 vhost-1.service_a.com service_a
< 127.0.0.1 vhost-2.service_a.com
---
> 127.0.0.2 vhost-1.service_a.com service_a
> 127.0.0.2 vhost-2.service_a.com
[root@localhost requester]# ~/bin/change_hosts pool_b 198.168.0.2
7,8c7,8
< 192.168.0.1 vhost-1.service_b.com service_b
< 192.168.0.1 vhost-2.service_b.com
---
> 198.168.0.2 vhost-1.service_b.com service_b
> 198.168.0.2 vhost-2.service_b.com
[root@localhost requester]# cat /tmp/hosts
# <pool_a>
127.0.0.2 vhost-1.service_a.com service_a
127.0.0.2 vhost-2.service_a.com
# </pool_a>

# <pool_b>
198.168.0.2 vhost-1.service_b.com service_b
198.168.0.2 vhost-2.service_b.com
# </pool_b>
Cheers