Wednesday, April 24, 2013

Locking processes with flock

Got a cronjob that might overlap if it runs slower than usual and you need to avoid multiple instances running?
Here's how to lock them using flock in bash:
function delicate_process() {
  # the 'locked down' code
  return 0;
}

function main() {
  (
    if ! flock -x --nonblock 200; then
      return 1;
    fi

    delicate_process;

  ) 200>/var/lock/.my.lock
}

main "${@}";
This opens the file /var/lock/.my.lock on FD 200 for the duration of the subshell. Inside the subshell, flock attempts a non-blocking exclusive lock on FD 200; if another instance already holds the lock, the subshell exits with status `1` without running delicate_process.
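To watch the lock do its job, here is a small sketch that runs the same pattern against itself. The `run_locked` helper and the temporary lock file are made up for the demo (a real cron job would keep a fixed path such as the one above), and it assumes the util-linux `flock` binary is installed.

```bash
#!/bin/bash
# Demo lock file (hypothetical); a real cron job would use a fixed path.
LOCK_FILE="$(mktemp)"

function run_locked() {
  (
    # Take an exclusive lock on FD 200, or leave the subshell right away.
    flock -x --nonblock 200 || exit 1
    "$@"
  ) 200>"${LOCK_FILE}"
}

run_locked sleep 3 &   # first instance holds the lock for about 3 seconds
sleep 1                # give it time to acquire the lock
if run_locked true; then second="acquired"; else second="skipped"; fi
wait                   # first instance finishes and releases the lock
if run_locked true; then third="acquired"; else third="skipped"; fi

echo "second=${second} third=${third}"
rm -f "${LOCK_FILE}"
```

The second attempt overlaps with the first and should be skipped, while the third one runs after the lock has been released and should go through.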

Monday, April 22, 2013

Dynamic /etc/hosts with a simple template engine

This is a pretty niche script that probably won't do much good to anybody else.. but just in case, I'm pushing it down the intertubes. I find myself editing blocks of domains in my hosts file regularly, so I finally decided to save a few seconds every day and create a template engine for my hosts file: I can now use a script to change the IP for a whole block of virtual hosts in a single command. The syntax is short and simple:
# <pool_a>
127.0.0.1 vhost-1.service_a.com service_a
127.0.0.1 vhost-2.service_a.com
# </pool_a>

# <pool_b>
192.168.0.1 vhost-1.service_b.com service_b
192.168.0.1 vhost-2.service_b.com
# </pool_b>
It uses an html-like tag inside of a bash comment for opening and closing the "blocks" of virtual hosts. Then comes the script which now reads and modifies my hosts files using these "templates":
#!/bin/bash

HOSTS_FILE='/etc/hosts';

function update_dyn_block() {
  local block_name="${1}";
  local ip_address="${2}";

  # sanity checks, only one named template block allowed:
  if [ "$(grep "# <${block_name}>" "${HOSTS_FILE}" 2>/dev/null | wc -l)" -ne 1 ]; then
    echo -n "missing or duplicate named template opening-tag found in ";
    echo "${HOSTS_FILE}: <${block_name}>";
    return 1;
  fi

  if [ "$(grep "# </${block_name}>" "${HOSTS_FILE}" 2>/dev/null | wc -l)" -ne 1 ]; then
    echo -n "missing or duplicate named template closing-tag found in ";
    echo "${HOSTS_FILE}: </${block_name}>";
    return 1;
  fi

  # get the line numbers of the template opening and closing tags:
  local opening_tag_at=$(grep -n "# <${block_name}>" "${HOSTS_FILE}" | cut -d: -f1);
  local closing_tag_at=$(grep -n "# </${block_name}>" "${HOSTS_FILE}" | cut -d: -f1);

  # echo "template block found between lines: ${opening_tag_at} and ${closing_tag_at}";

  local temp_file=$(mktemp);

  # replace the first field (the IP) of every entry between the two tags:
  cat ${HOSTS_FILE} | awk "
  {
    if ( NR > ${opening_tag_at} && NR < ${closing_tag_at} && NF >= 2 ) {
      printf \"%s\", \"${ip_address}\";
      for ( i = 2; i <= NF; i++ ) {
        printf \" %s\", \$i;
      }
      printf \"\n\";
    } else {
      print;
    }
  }" > ${temp_file};

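  diff "${HOSTS_FILE}" "${temp_file}";

  # overwrite in place so the hosts file keeps its owner and permissions
  # (mktemp creates the temp file with mode 0600):
  cat "${temp_file}" > "${HOSTS_FILE}" && rm -f "${temp_file}";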
}

if [ "${1}" = "" -o "${2}" = "" ]; then
  echo "usage: $0 <block_name> <new_ip_address>";
  echo;
  echo "inside of your hosts file, use 'tags' to define a block of entries with this syntax:";
  echo;
  echo "# <tag-name>";
  echo "127.0.0.1 host-1.hostname.com";
  echo "127.0.0.1 host-2.hostname.com";
  echo "# </tag-name>";
  echo;
  echo -n "note: the script is picky about the space between the hashtag and the '<' ";
  echo "character of the tag -- dont forget the space";
  exit 1;
else
  update_dyn_block "${1}" "${2}";
fi

And voila, now I can just call the script with the parameters "block name" and "new ip address" and my hosts file will be updated using the same template and print a diff showing the changes that were made to the hosts file. Here's a complete demo with the HOSTS_FILE=/tmp/hosts set in the change_hosts script:
[root@localhost requester]# cat /tmp/hosts
# <pool_a>
127.0.0.1 vhost-1.service_a.com service_a
127.0.0.1 vhost-2.service_a.com
# </pool_a>

# <pool_b>
192.168.0.1 vhost-1.service_b.com service_b
192.168.0.1 vhost-2.service_b.com
# </pool_b>
[root@localhost requester]# ~/bin/change_hosts pool_a 127.0.0.2
2,3c2,3
< 127.0.0.1 vhost-1.service_a.com service_a
< 127.0.0.1 vhost-2.service_a.com
---
> 127.0.0.2 vhost-1.service_a.com service_a
> 127.0.0.2 vhost-2.service_a.com
[root@localhost requester]# ~/bin/change_hosts pool_b 198.168.0.2
7,8c7,8
< 192.168.0.1 vhost-1.service_b.com service_b
< 192.168.0.1 vhost-2.service_b.com
---
> 198.168.0.2 vhost-1.service_b.com service_b
> 198.168.0.2 vhost-2.service_b.com
[root@localhost requester]# cat /tmp/hosts
# <pool_a>
127.0.0.2 vhost-1.service_a.com service_a
127.0.0.2 vhost-2.service_a.com
# </pool_a>

# <pool_b>
198.168.0.2 vhost-1.service_b.com service_b
198.168.0.2 vhost-2.service_b.com
# </pool_b>
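For comparison, the rewriting step can also be sketched with awk tracking the block on its own, passing the values in with `-v` instead of interpolating them into the awk program. The `rewrite_block_ip` function and the demo file are hypothetical, and unlike the script above this version prints to stdout and leaves the file untouched.

```bash
#!/bin/bash
# Sketch: same rewrite as update_dyn_block, but awk finds the block itself.
function rewrite_block_ip() {
  local file="${1}" block_name="${2}" ip_address="${3}"
  awk -v open_tag="# <${block_name}>" \
      -v close_tag="# </${block_name}>" \
      -v ip="${ip_address}" '
    $0 == close_tag   { inside = 0 }   # stop before printing the closing tag
    inside && NF >= 2 { $1 = ip }      # swap the address, keep the host names
                      { print }
    $0 == open_tag    { inside = 1 }   # start after printing the opening tag
  ' "${file}"
}

# Throwaway hosts file for the demo.
demo_hosts="$(mktemp)"
printf '%s\n' \
  '# <pool_a>' \
  '127.0.0.1 vhost-1.service_a.com service_a' \
  '# </pool_a>' \
  '10.0.0.1 other.example.com' > "${demo_hosts}"

result="$(rewrite_block_ip "${demo_hosts}" pool_a 127.0.0.2)"
echo "${result}"
rm -f "${demo_hosts}"
```

Only the entry between the pool_a tags should come back with the new address; the line outside the block is printed as it was.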
Cheers

Thursday, December 6, 2012

Simple process management in bash

Too many times I've found myself writing bash scripts responsible for spawning child processes which in turn work on a set of files or other items. They usually look cryptic.. Then I discovered 2 much nicer solutions: the built-in bash command `jobs` and the ever present `xargs`.

Running `jobs -p` in a bash shell will give you a newline-separated list of the PIDs of the background jobs started by the current shell.
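A quick way to see this in action is to start a few background jobs and count what `jobs -p` reports. This is only a sketch, with `sleep` standing in for real work:

```bash
#!/bin/bash
# Three background jobs; sleep stands in for real work.
sleep 2 &
sleep 2 &
sleep 2 &

# One PID per line, so counting lines counts jobs.
running="$(jobs -p | wc -l)"
echo "background jobs: $((running))"

wait   # block until every background job has finished

# -r restricts the listing to jobs that are still running.
after="$(jobs -pr | wc -l)"
echo "still running after wait: $((after))"
```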

Even better: let's avoid overloading the CPU of our dev machines with too many processes by limiting the number of children processes which may run in parallel dynamically based on the actual number of cores available.

#!/bin/bash

function get_cpu_count {
  # count the "processor" entries the kernel exposes, one per logical core:
  grep -c "^processor" /proc/cpuinfo;
}

#!/bin/bash

function process_file {
  local FILE="${1}";
  # ...
}

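function process_files_in_batches {
  local CPU_COUNT=$(get_cpu_count);
  local FILE;
  # iterate over every argument, not just the first one:
  for FILE in "${@}"; do
    process_file "${FILE}" &
    # throttle: pause while as many jobs as cores are still running
    while [[ $(jobs -pr | wc -l) -ge $CPU_COUNT ]]; do
      sleep 0.5;
    done
  done
  wait; # let the last batch finish before returning
}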

# using it becomes as easy as:
process_files_in_batches $(find /var/lib/data/ -type f);

And then there is xargs, another simple solution for making your scripts run in parallel, using the -P parameter:
function process_files_in_batches_2 {
  find /var/lib/data/ -type f -print0 | xargs --null -P $(get_cpu_count) -I{} cmd {};
}

The -print0 of `find` will make the list of matching files be separated by null bytes instead of the default NL characters. The --null of `xargs` tells `xargs` that the data is null byte separated. The -I{} defines {} as the replacement token inside of the `xargs` command to run (noted as cmd here). Finally, the -P $(get_cpu_count) defines how many processes xargs will allow to run in parallel at any given time.
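Here is a self-contained sketch of the same pipeline that can be run anywhere. The `fake_worker` function stands in for cmd and the temporary directory for /var/lib/data/, both invented for the demo. It uses `-0`, the short form of `--null`, and `-n 1` with a small `bash -c` wrapper instead of `-I{}` so the file name reaches the worker as a proper argument.

```bash
#!/bin/bash
# Hypothetical worker standing in for "cmd"; it only reports what it received.
function fake_worker() {
  echo "processed: ${1}"
}
export -f fake_worker   # make the function visible to the bash started by xargs

# Throwaway directory with three files, one of them with a space in its name.
work_dir="$(mktemp -d)"
touch "${work_dir}/a.dat" "${work_dir}/b c.dat" "${work_dir}/d.dat"

# At most 2 workers in parallel, one file per worker invocation.
results="$(find "${work_dir}" -type f -print0 \
  | xargs -0 -P 2 -n 1 bash -c 'fake_worker "$1"' _ \
  | sort)"

echo "${results}"
rm -rf "${work_dir}"
```

The output is sorted because parallel workers finish in no particular order. The file with a space in its name should come through as a single argument, which is the whole point of the null byte separator.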

That's all there is to it: very simple parallel bash scripting. When I first discovered this I immediately re-used it to refactor some dirty multiprocessing attempts out of various php workers. In the end I feel it brought the code closer to that unix saying "do one thing and do it well", by letting the php scripts handle only their processing and not the system processes.. and all that in just a few generic shell functions ^^

Tuesday, November 20, 2012

How to throttle rsync

I was facing an issue where I needed to rsync a large set of small files (about half a million) from backend servers to the web fronts. That seemed simple, but after a few test runs I noticed just how much load the rsync put on the front-ends: the load average climbed to 40 (on quad core VMs), which was obviously bad. The apache instances suffered as a result, connections piled up, and things quickly got out of hand.

Solution? `ionice`!

Where `nice` is used for CPU scheduling, `ionice` is used for I/O scheduling, and together they can tame a massive rsync.

Here's what the final command line looked like (initiated from a backend server):
rsync --timeout=480 -z --compress-level=9 --rsync-path="nice -n19 ionice -c3 rsync" --recursive --delete-during --delete-excluded /local/path $REMOTE_SERVER:/remote/path
The magic here is in the --rsync-path parameter, which defines the rsync command to run on the remote server. Instead of plain rsync we're launching a nice'd and ionice'd rsync. Finally, the -c3 parameter puts rsync in ionice's idle scheduling class, so it only gets disk time when no other process is asking for it, which avoids blocking the others (especially important for the apache processes which are serving from disk!).
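The same wrapper can be tried locally before pointing it at a production front-end. This is a rough sketch: `low_priority` is a made-up helper name, and it falls back to `nice` alone when `ionice` is missing or not permitted on the machine.

```bash
#!/bin/bash
# Run a command at the lowest CPU priority, and in the idle I/O class
# when ionice is installed and allowed on this machine.
function low_priority() {
  if command -v ionice >/dev/null 2>&1 && ionice -c3 true 2>/dev/null; then
    nice -n19 ionice -c3 "$@"
  else
    nice -n19 "$@"
  fi
}

# "nice" without arguments prints the niceness it is running with.
niceness="$(low_priority nice)"
echo "niceness inside the wrapper: ${niceness}"
```

Started from a shell with the default niceness of 0, the wrapper should report 19, the lowest CPU priority.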

See more about ionice and nice.

Friday, October 19, 2012

Adventures of a backend developer

Recently I was given a rather interesting task which consisted of loading a CSV formatted file into a memcached bucket. Our original loader, which was a rough cli script written in (my beloved) PHP, just wasn't cutting it any longer. Error rates were rather high and the loading speed *suboptimal* to say the least, and let's not even talk about CPU/memory usage. Back to the drawing board, I decided to get away from PHP and look at what the unix toolbox proposed as solutions. If you're not familiar with memcached, please check it out: it's a blazing fast, high-performance key-value store with a very simple ASCII protocol. I won't go into the details here but you can learn more about memcached's merits here and the protocol specification here. Out of that spec, the only piece which I was interested in was the "set" command, the command used for inserting/updating a key/value pair.
set <key> <flags> <exptime> <bytes> [noreply]\r\n
<value>\r\n
So what I have to work with is a linux server, the default toolbox, and a set of CSV files where the 3 columns represent: key, value, expiration time. Parsing CSV files with linux? That's a job for awk! Awk is amazing for parsing CSV files: it reads line by line and allows you to manipulate/conditionally compute on the values of the columns. So I decided to let awk transform these CSV files into the memcached ASCII protocol.

Half the problem is solved, I can transform a CSV file into a suite of memcached commands.. Now for the second half: how to pipe this into memcached? As soon as I hear network pipe, I think of another awesome linux tool: netcat. Netcat is a very simple tool which allows you to connect a socket to a remote machine and pipe whatever you like over the socket.

All that said, here is a shortened demo of the final working solution, looking at the different steps I had taken. A snippet of an example flat data file:
key_00001###value_00001###60
key_00002###value_00002###64
key_00003###value_00003###69
key_00004###value_00004_different_length_value###30
And the bash script using awk/netcat magic:
#!/bin/bash

function warmup_memcached_from_csv {
 local FILE="${1}";
 local HOST="${2}";
 local PORT="${3}";

 awk 'BEGIN { FS="###" } {
  printf "set %s 0 %i %i\r\n%s\r\n", $1, $3, length($2), $2
 } END { printf "quit\r\n" }' $FILE | netcat $HOST $PORT
}
So for each line in the csv file, 2 lines of output are generated, the memcached set commands. The first line of the example csv file would become:
set key_00001 0 60 11\r\n
value_00001\r\n
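One thing worth checking here: the `<bytes>` field must be the length of the value in bytes, while some awk implementations (gawk, for instance) count characters in a UTF-8 locale. The example values are plain ASCII so it makes no difference for them, but as a sketch of one way to stay safe with other data, running awk under `LC_ALL=C` makes `length()` count bytes. The sample line below is made up, and the `\r` characters are left out to keep the output readable.

```bash
#!/bin/bash
# "caf\303\251" is "café" in UTF-8: 4 characters but 5 bytes.
line="$(printf 'key_00001###caf\303\251###60')"

# LC_ALL=C makes awk treat every byte as one character,
# so length() returns the byte count memcached expects.
set_header="$(printf '%s\n' "${line}" | LC_ALL=C awk 'BEGIN { FS="###" } {
  printf "set %s 0 %d %d\n", $1, $3, length($2)
}')"

echo "${set_header}"
```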
To execute the function on a set of csv files to warmup a bucket running on localhost:11211 would look something like this
for F in $(find /var/lib/csv_files/ -name '*.csv' -type f); do
 warmup_memcached_from_csv $F 127.0.0.1 11211
done
That worked like a charm.. but there was more to the requirements. I needed to know the number of errors, if any, that occurred during the sets. The memcached host will respond to the command with "ERROR\r\n" in the case of an error on a set command, so all I needed was to count them. I know the tools for the job: grep and wc. Grep is a regex pattern matching filter which can work on streams, and wc is the "word counter" which can also count lines.. So putting the 2 together, I'll filter out everything except error messages and then count them. Because netcat is a bidirectional network pipe, this was very simple to tack on to the last implementation:
#!/bin/bash

function warmup_memcached_from_csv {
 local FILE="${1}";
 local HOST="${2}";
 local PORT="${3}";

 awk 'BEGIN { FS="###" } {
  printf "set %s 0 %i %i\r\n%s\r\n", $1, $3, length($2), $2
 } END { printf "quit\r\n" }' $FILE | netcat $HOST $PORT | grep 'ERROR' | wc -l
}
Now this function no longer spews a long list of memcached commands, it only returns a number, the error count. Using the function changes slightly to actually make use of the error counter:
for F in $(find /var/lib/csv_files/ -name '*.csv' -type f); do
 ERROR_COUNT=$(warmup_memcached_from_csv $F 127.0.0.1 11211);
 if [ "$ERROR_COUNT" -gt 0 ]; then
  # .. error reporting! retry loops! everything is possible ..
  echo "${F}: ${ERROR_COUNT} failed set commands" >&2;
 fi
done
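The counting stage can be tried without a memcached server by feeding it canned replies. In this sketch `fake_memcached_replies` is a made-up stand-in for what netcat would relay back, and `grep -c` does the job of `grep | wc -l` in one step:

```bash
#!/bin/bash
# Canned replies standing in for what netcat would relay back from memcached.
function fake_memcached_replies() {
  printf 'STORED\r\nERROR\r\nSTORED\r\nCLIENT_ERROR bad data chunk\r\n'
}

# -c prints the number of matching lines instead of the lines themselves.
error_count="$(fake_memcached_replies | grep -c 'ERROR')"
echo "errors: ${error_count}"
```

Matching on the bare word ERROR also catches the CLIENT_ERROR and SERVER_ERROR replies, so both failing lines in the canned input are counted.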
Up till now everything is nice and simple.. The last requirement was a bit of a mystery to me at first: the value columns of these files weren't actually plain text, but serialized php objects. The webservers reading them were using the php-memcache extension. This is where the second parameter of the memcached set command comes into play: the record's "flags". The memcached server stores an extra integer flags field with the set command and returns it with the get command, but doesn't actually use it itself; its only purpose is to carry "metadata" for the memcached clients. Using the php-memcache extension it's possible to do something like so:
$m = new memcache();
$m->connect('127.0.0.1', 11211);
$m->set('key_00001', array('this', 'is', 'a', 'php', 'array'));

echo var_export( $m->get('key_00001'), true );

# which will display:
#
# array (
#   0 => 'this',
#   1 => 'is',
#   2 => 'a',
#   3 => 'php',
#   4 => 'array',
# )
So if the CSV file contained:
key_00001###a:5:{i:0;s:4:"this";i:1;s:2:"is";i:2;s:1:"a";i:3;s:3:"php";i:4;s:5:"array";}###60
After loading the file the frontend servers should be able to:
echo var_export( $m->get('key_00001'), true );
# and get the same result:
# array (
#   0 => 'this',
#   1 => 'is',
#   2 => 'a',
#   3 => 'php',
#   4 => 'array',
# )

So somehow, the php-memcache extension must be using this extra flag field internally to know what's to be unserialized after a get() and what's to be considered non-serialized data. A quick look into the inside of the extension told me exactly what I expected:
$ php --re memcache # this is an EXTREMELY useful parameter ( --re ) of php for viewing which ini options/methods/functions/classes/interfaces/constants any extension installed exposes
And sure enough, in the constant definitions I found the one I was looking for: MEMCACHE_HAVE_SESSION. Admittedly the constant's name wasn't very clear to me at first, but then I realized that the method php-memcache uses for serializing/unserializing is the same that's defined for session storage .. so maybe not un/serialize(), it could just as well be defined as json encoding or maybe even some exotic XML format.. joy.
$ php --re memcache
... snippet ...
    Constant [ integer MEMCACHE_COMPRESSED ] { 2 }
    Constant [ integer MEMCACHE_HAVE_SESSION ] { 1 }
... snippet ...
Now I was ready to make the final change to the loader, simply setting the key's flag to 1 so that the php-memcache extension would automatically unserialize the value after issuing a get():
#!/bin/bash

function warmup_memcached_from_csv {
 local FILE="${1}";
 local HOST="${2}";
 local PORT="${3}";

 awk 'BEGIN { FS="###" } {
  printf "set %s 1 %i %i\r\n%s\r\n", $1, $3, length($2), $2
 } END { printf "quit\r\n" } ' $FILE | netcat $HOST $PORT | grep 'ERROR' | wc -l
}
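To check that the flag really comes back with the value, the reply to a `get` can be inspected: per the protocol spec it starts with a `VALUE <key> <flags> <bytes>` line, followed by the data and `END`. The sketch below parses a canned reply (`fake_get_reply` is made up for the demo); against a real server the same filter could sit behind netcat.

```bash
#!/bin/bash
# Canned "get" reply in the documented shape: VALUE line, data block, END.
function fake_get_reply() {
  printf 'VALUE key_00001 1 11\r\nvalue_00001\r\nEND\r\n'
}

# Strip the carriage returns, then read the flags field from the VALUE line.
flags="$(fake_get_reply | tr -d '\r' | awk '$1 == "VALUE" { print $3 }')"
echo "flags stored with key_00001: ${flags}"
```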
Problem solved. We can now load the CSV files using only the linux toolbox and still get the values from memcached from php without having to make any change to the frontend webservers. Comments? Bug reports? Criticism? They're all welcome!