Data Science Tools/Tips

I was recently talking data science with a friend who wanted some good links and ideas on how to train non-programmers to do data analysis.
The first thing is to determine what type of analysis is needed: visualization, correlation, causation, or classification. Then you just start exploring. A few links came to mind:

Posted in Uncategorized

Piwik and Load Balancers

Ran into an issue with Piwik not handling load balancers and the X-Forwarded-For header. The official documentation for this relatively common setup is a little lacking, and I found it's a bit prickly. I finally found the instructions here, but they neglect to mention that it needs to be exactly this way. To summarize, the steps are as follows:
1. Add the line proxy_client_headers[] = "HTTP_X_FORWARDED_FOR" to Piwik's configuration (note the HTTP prefix, all caps, and underscores).
2. On your LB (or proxy), make sure the IP header is X-Forwarded-For (note the proper caps and the dashes).
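If it helps, here is what step 1 ends up looking like in Piwik's config/config.ini.php; the [General] section placement matches my install, but double-check yours:

```ini
; config/config.ini.php
[General]
; trust the client IP the load balancer passes along
proxy_client_headers[] = "HTTP_X_FORWARDED_FOR"
```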

Posted in Uncategorized

Logstash and WSO2 Carbon Logs, dealing with Java Stack Traces

So, now that I've got my WSO2 cluster set up, I get to diagnose issues. The biggest problem is that when trying to work through a cause, I've got to look at half a dozen log files spread across half a dozen machines. Centralized logging is the solution, of course.

I prefer logstash and kibana because they're free and very configurable. They are surprisingly easy to set up, and once you find a few tools (like the grok debugger), easy to configure. The biggest problem is that most logs are only a single line. However, WSO2 has the nasty habit of dumping Java stack traces in its log all the time. Luckily, Logstash has the multiline filter to help with that. Configuring multiline is a bit of a pain, so here is the config I'm using.

input {
    syslog {
        port => 514
        type => "syslog"
    }
}
To make life easier, I just use rsyslog for everything. One thing I didn't realize is that the syslog input automatically applies a syslog grok pattern and truncates the message.

filter {
    if "_grokparsefailure" in [tags] {
        grok {
            type => "syslog"
            match => ["message", "%{SYSLOG5424PRI}%{TIMESTAMP_ISO8601} +(?:%{HOSTNAME:syslog5424_host}|-) %{SYSLOGPROG}%{GREEDYDATA:messagebodysyslog}"]
            match => ["message", "%{SYSLOG5424PRI}%{SYSLOGTIMESTAMP} +(?:%{HOSTNAME:syslog5424_host}|-) %{SYSLOGPROG}%{GREEDYDATA:messagebodysyslog}"]
            remove_tag => ["_grokparsefailure"]
        }
        if "_grokparsefailure" not in [tags] {
            mutate {
                replace => ["message","%{messagebodysyslog}"]
                remove_field => ["messagebodysyslog"]
            }
        }
    }
}

This section just parses the syslog portion out of anything not caught by the input's syslog filter. Notice the two matches: for some reason I have some rsyslog messages coming in with one time format and others with a different one, sometimes even from the same machine.

filter {
    if "wso" in [program] and "multiline" not in [tags] {
        grok {
            match => [ "message", "TID\: \[%{INT}\] \[%{WORD:product}\] \[%{TIMESTAMP_ISO8601:logdate}\] +%{LOGLEVEL:level} \{%{DATA:classname}\} - %{GREEDYDATA:messagebody22}"]
            remove_tag => ["_grokparsefailure"]
        }
        if "_grokparsefailure" not in [tags] {
            mutate {
                replace => ["message","%{messagebody22}"]
                remove_field => ["messagebody22"]
            }
        }
    }
}


Here is the WSO2 parser; it gets almost all versions of WSO2 messages, except stack traces of course.

filter {
    if "wso2" in [program] {
        multiline {
            pattern => "(Uncaught exception.+)|(([^\s]+)Exception.+)|(([\s]+)at.+\))|(.+\.Exception)"
            stream_identity => "%{logsource}.%{@type}"
            what => "previous"
        }
    }
}

The stack trace handler: it only fires if it's a WSO2 message, and any line that matches the pattern regex gets stored and merged into the previous event.

filter {
    if "wso2" in [program] {
        mutate {
            replace => ["type","wso2_carbon"]
            add_field => ["logsource","%{syslog5424_host}"]
            remove_field => ["@originalmessage"]
            remove_tag => ["_grokparsefailure"]
        }
    }
}

Clean up the messages

output {
    elasticsearch_http {
        host => ""
    }
}

Output to Elasticsearch.

Posted in Technical

Important NFS setting when using OSX Clients

I've been getting permission-denied errors when connecting to NFS shares from my OS X system. According to the logs on the server, it was authenticating fine, and no errors were showing up. It turns out OS X's NFS client connects from a nonstandard (non-privileged) port, so you need to add insecure to the export options.
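For example, on a Linux NFS server the export line might look like this (the path and client network here are placeholders):

```
# /etc/exports -- share path and client network are examples
/export/data  192.168.1.0/24(rw,sync,insecure)
```

The insecure option tells the server to accept requests that originate from source ports above 1024, which is what the OS X client uses by default.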

Posted in Technical

Friday’s Cool Tools May 23rd

In an attempt to post more regularly, I'm going to start a new entry going over some of the neat tools I've found over the week.

I'm a huge fan of Notational Velocity-style apps; when I can use a Mac, one of the first things I set up is NVAlt, which is a fantastic program in and of itself. However, I'm also a huge fan of vi/vim, as well as stuck on a Windows system at work. I was thrilled to discover nvim, which brings a similar environment to vi and the command line.

[nvim screenshot]

I have also been using AsciiFlow quite a bit recently to do quick diagrams for emails. It's an awesome web tool that lets you create ASCII diagrams.

[asciiflow screenshot]

Posted in Uncategorized

WSO2 Init Script (with status)

I've been stuck working with WSO2 products lately; can't say I'm terribly fond of them, but what can you do.

It turns out that there are no good init scripts for the WSO2 servers, and none of the ones I found had a status action, so I whipped one up. You can use the same script for all of them; just change the product home to whatever product you're using.

#!/bin/sh
# wso2AS WSO2 App Server
# chkconfig: 345 70 30
# description: WSO2 Application Server
# processname: wso2as

# Source function library.
. /etc/init.d/functions

# Set to the home directory of the product you're running (example path)
ASHOME=/opt/wso2as

killproc() { # kill the named process(es)
 pid=`/bin/ps -ef | /bin/grep $1 |
 /bin/grep -v grep | /bin/grep -v /bin/sh |
 /bin/awk '{print $2}'`
 if [ "${pid}" != "" ] ; then
  kill -TERM ${pid} > /dev/null 2>&1
 fi
}

start() {
 echo -n "Starting WSO2 App Server: "
 # launch the product's startup script from $ASHOME/bin
 su - wso2usr -c "$ASHOME/bin/ > /tmp/wso2appsvr.bootlog 2>&1 &"
 RCODE=$?
 touch /var/lock/subsys/wso2as
 return $RCODE
}

stop() {
 echo -n "Shutting down WSO2 App Server: "
 killproc "$ASHOME"
 rm -f /var/lock/subsys/wso2as
 return 0
}

status() {
 pid=`/bin/ps -ef | /bin/grep "$ASHOME" |
 /bin/grep -v grep | /bin/grep -v /bin/sh |
 /bin/awk '{print $2}'`
 if [ "${pid}" != "" ] ; then
  echo "AS is running"
  return 0
 fi
 return 1
}

case "$1" in
 start)
  start
  ;;
 stop)
  stop
  ;;
 restart)
  stop
  start
  ;;
 status)
  status
  ;;
 *)
  echo "Usage: <servicename> {start|stop|restart|status}"
  exit 1
  ;;
esac
exit $?
Posted in Technical

Solaris NFS Troubleshooting Checklist

Ran into a recent problem at work today which shouldn't have been an issue; something that normally takes a few minutes ended up taking all morning to resolve, and shed light on a few other oddities. I got a ticket early in the morning that a user no longer had access to a directory on a server that has had numerous restores done to it over the last few weeks while the developers try to get it working again. So I logged on, checked the permissions, discovered the folders were owned by root, and quickly realized they were NFS mounts.

No matter what I did, though, I couldn't get access to those mounts from the remote system. Other systems mounted that server fine, and that system mounted other NFS shares fine too. It turned out that the server had decided to start sending traffic to the NFS server in question over a VIP instead of its primary link. This VIP, of course, wasn't in DNS, or in the hosts file, or in the vfstab file. A quick addition to /etc/hosts and it started working again. However, here is a list of steps for troubleshooting NFS in a Solaris environment.
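The actual fix was one line in the client's /etc/hosts (the address and name here are made up):

```
# /etc/hosts -- map the NFS server's VIP back to its hostname (example values)
10.10.20.15   nfsserver01
```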

The Basics

  1. Can the client ping the NFS server?
  2. Can the server ping the client?
  3. Can the server resolve the IP of the client to a name?
  4. Are the NFS service and its associated daemons (rpcbind, portmap, etc.) running?
  5. Are they running on the client too?

A little more in-depth

  1. Does the share show up as an export via share?
  2. Is the client an allowed client?
  3. What happens when you run showmount -e from the client?


  1. Are the permissions valid?
  2. If the permissions are for NIS users/groups, are both systems seeing the same NIS server?
  3. Do the ACLs make sense?
  • Don't forget: getfacl on Solaris <= 9, ls -v on Solaris >= 10


  1. Run snoop and watch the NFS traffic

Posted in Technical

Compiling Ledger 3.0

Ledger is a pretty sweet finance system that fits my style of all text, all the time. The only problem is that installing Ledger 3.0 is a bit of a pain on RedHat 6. It doesn't have any native packages, and the 2.x branch is pretty much dead. To install it you need to compile it yourself, and of course it uses newer versions of some tools, so it's a multi-step process. Here is how I did it.

Install Yum prereqs

Some of the prerequisites are provided by yum; just in case, run the following:

yum groupinstall "Development Tools" 
yum install cmake cmake28 
yum install mpfr*

Install Boost

Download boost 1.46.1; I had issues with 1.52, and while those may have been caused by other things, 1.46.1 works. Extract it:

tar -zxf boost_1_46_1.tar.gz

Now it’s time to compile and install

cd boost_1_46_1
./ --prefix=/usr/include/boost_1_46_1
./bjam install

Install Ledger

Now it's time to install Ledger; the instructions are almost the same as the ones on the GitHub page:

git clone git:// 
cd ledger 
git checkout -b master origin/master 
export BOOST_ROOT=/usr/include/boost_1_46_1/ 
export BOOST_INCLUDEDIR=/usr/include/boost_1_46_1/include/ 
./acprep dependencies 
./acprep update
make install

You should now be able to use Ledger 3.0. If you have any trouble, feel free to leave a comment or contact me on IRC.

Posted in Uncategorized

Graphite Tips

Recently I came across a few things that make using graphite even easier, namely the Graphlot view. It's a much nicer view for single graphs than the composer, as it lets you zoom in easily, and unlike rrd, you can zoom in both horizontally and vertically. The problem is that it's poor for building graphs, and I've got lots of complicated graphs I want to look at. So I took a hint from the Graphite Composer bookmarklet at obfuscurity and created a bookmarklet for Graphlot. Just drag this link into your bookmarks bar; it will prompt for a graphite URL and then take you to the Graphlot page for it. Open In Graphlot
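The heart of the bookmarklet is just a URL rewrite. A rough sketch of the idea in Python (assuming Graphlot accepts the same query string as the render/composer URL, which held for the Graphite version I was using; the hostname is a placeholder):

```python
from urllib.parse import urlsplit

def to_graphlot(render_url):
    # Keep the scheme, host, and query string; swap the path for /graphlot/
    parts = urlsplit(render_url)
    return f"{parts.scheme}://{parts.netloc}/graphlot/?{parts.query}"

url = to_graphlot("http://graphite.example.com/render?target=carbon.agents.a.cpuUsage&from=-2h")
print(url)
```

The bookmarklet does the same thing in javascript, prompting for the URL at click time.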

Additionally, there is a problem with Graphlot in that it only displays times in UTC. That's a problem if you live anywhere else, and it gets confusing when trying to correlate with events outside of graphite. So after digging through the Graphlot code, which is mostly javascript/jquery stuff that is Greek to me, I tracked down the time-display piece and added a variable set to the UTC offset in milliseconds, which gets subtracted from the dates before any of the processing is done.

Here's the diff:

File: /opt/graphite/webapp/content/js/jquery.flot.js

< // Added UTC offset to correct date; set to millisecond difference between local timezone and UTC
< UTCOFFSET=25200000;
<             // Added UTCOFFSET to correct date
<             var d = new Date(v - UTCOFFSET);
---
>             var d = new Date(v);

Hopefully these make it a little easier and nicer to use.


Posted in Uncategorized

SNMP and Monitoring Rant

This is a rant; it won't be too long, but…

So I just started playing with graphite, which is just all sorts of awesome. However, most of the data I care about needs to come from SNMP, because it's either (a) a SAN where I can't just install some agents, or (b) a locked-down Solaris box that's got ancient versions of everything and won't work with any of the new hotness. So I looked at collection agents that worked with graphite, and… yeah, nothing except collectd. Collectd sucks; I had nothing but issues getting it working, and it doesn't play nice with graphite. I like graphite because it's simple: a name, a value, and a timestamp over a TCP port. That's the sort of stuff that is awesome. Collectd tries to do all this other stuff, and while I'm sure it's great once it's set up, I don't like it. I don't use agents, because my servers have more important things to do than hang on a crappy plugin that doesn't handle edge cases well.
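That protocol really is as simple as it sounds. A sketch of building one datapoint for carbon's plaintext line receiver in Python (the host, port, and metric name are placeholders):

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    # Graphite's plaintext protocol: "<metric.path> <value> <unix-timestamp>\n"
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"

line = graphite_line("servers.web01.load", 0.42, 1400000000)
print(line, end="")

# To actually ship it, open a TCP connection to carbon's line receiver,
# which listens on port 2003 by default (carbon.example.com is made up):
# with socket.create_connection(("carbon.example.com", 2003)) as s:
#     s.sendall(line.encode())
```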

Enough bitching though; I want to go back to the unix philosophy: do one thing, do it well. Graphite does one thing well (graphing), carbon does one thing well (getting data). SNMP does one thing… poorly… very poorly. Unfortunately, I have yet to find a decent system that can poll SNMP data and present it in a useful fashion without having to go through a million steps configuring it. I don't want my SNMP poller to autodiscover my network and try to be smarter than me (OpenNMS, I'm looking at you). I just want to say: these MIBs, these hosts, this interval, go.

As a pet project I'm going to start working on one in python; until it's ready for primetime I'll deal with collectd and its screwing up my nice graphite naming scheme.

Posted in Uncategorized