Monitoring your Software for more than just Uptime and Bugs

Posted Feb 15, 2015 by Scott Ringwelski

I’m not really sure what this post will be about yet, other than how effective and interesting I’m finding monitoring beyond what one might think of as the core tools. I’ll mostly be talking about SaaS, but without a doubt it applies to all kinds of software. You may be thinking that this is obvious and of course you should be monitoring your software, but I’m talking about monitoring that is unique to your app.

To start, there are what I’d consider the ‘core tools’, which we currently use: raw logging (Papertrail), performance monitoring (New Relic, Skylight), downtime alerts (New Relic Synthetics + PagerDuty), and bug tracking (we love Bugsnag).


But there’s much more than just bugs, uptime and logs. Your application is unique, and it has unique measurements that are worth keeping a close eye on.

For this sort of stuff we’ve started using Librato. It’s awesome. I’m sure there are other tools out there like it, but we’ve been very happy with what Librato has to offer so far.

Maybe not-so-unique measurements

Here are a few of the measurements we’ve been tracking that aren’t unique to our app, and that other startups should probably be measuring too:

Login Rates: login rates for various types of systems

Notifications Count: notifications sent to users by type (email vs. in-app)

Feed Items: feed items generated for user news feeds

Looks like we have a few activities with very large fan-outs and the rest tend to be low fan-out.

Enqueued Jobs Count: enqueued jobs by queue (we use Sidekiq for background workers)

Our enqueued jobs spike during some heavy data syncs, but we’re making progress.
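To make the idea of counts "by type" and "by queue" concrete, here is a minimal Python sketch. It is not Librato’s API and not our actual code: `InMemoryMetrics`, `report_queue_depths`, the metric names and the queue names are all hypothetical stand-ins. A real setup would forward these values to whatever metrics service you use.

```python
from collections import defaultdict


class InMemoryMetrics:
    """Hypothetical stand-in for a metrics client (not a real library API)."""

    def __init__(self):
        self.counters = defaultdict(int)  # (name, source) -> running total
        self.gauges = {}                  # (name, source) -> latest value

    def increment(self, name, source=None, by=1):
        # Counters only go up: one tick per event (login, notification, ...).
        self.counters[(name, source)] += by

    def gauge(self, name, value, source=None):
        # Gauges are point-in-time readings, so the latest value wins.
        self.gauges[(name, source)] = value


def report_queue_depths(metrics, queue_sizes):
    """Record one gauge per queue, using the queue name as the source."""
    for queue, size in queue_sizes.items():
        metrics.gauge("jobs.enqueued", size, source=queue)


metrics = InMemoryMetrics()
metrics.increment("notifications.sent", source="email")
metrics.increment("notifications.sent", source="in_app", by=3)
# Assumed sample readings; in practice these would come from the job system.
report_queue_depths(metrics, {"default": 12, "data_sync": 4200})
```

Tagging each reading with a source (the notification type or the queue name) is what makes it possible to chart one line per type instead of a single total.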

Critical Jobs: how long are critical jobs taking to run?

It’s important that our critical jobs run quickly.
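One simple way to capture job run times is to wrap the job in a timer. The following is a rough Python sketch under stated assumptions: the `timed` helper, the `durations` store, the metric name and the `send_password_reset` job are all hypothetical, and a real version would report each duration to a metrics service rather than keep it in memory.

```python
import time
from contextlib import contextmanager

durations = {}  # job name -> list of run times in seconds


@contextmanager
def timed(job_name, clock=time.perf_counter):
    """Record how long the wrapped block takes, even if it raises."""
    start = clock()
    try:
        yield
    finally:
        durations.setdefault(job_name, []).append(clock() - start)


def send_password_reset():
    """Hypothetical critical job; the sleep stands in for real work."""
    time.sleep(0.01)


with timed("critical.password_reset"):
    send_password_reset()
```

Recording the duration in a `finally` block means slow failures show up in the chart too, not just slow successes.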

Unique to You

I’d say the data above is important to most startups, but where Librato and similar services get interesting is when you start measuring anything and everything that is unique to your business. We’re just getting started on this at Handshake, but here are a few examples we have so far:

Pending Employers

A simple one is the number of pending duplicate employers in the system. We always want to make sure we are on top of these and merging duplicate employer accounts (looks like we’re a little behind).

User Reindex

We use Elasticsearch for search, and we give our users very “wide” options when searching over users. This means that reindexing users in Elasticsearch can be expensive, so we keep an eye on how long it takes.

Resume Parsing

Handshake lets users parse their resumes to quickly build their profile. Keeping track of success rates is important (the spike is from a large hackathon we helped run, in which we parsed users’ resumes for them before they logged in).
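A success rate is just successes divided by attempts, but the empty case is worth handling explicitly. Here is a small Python sketch; the `success_rate` function and the sample counts are made up for illustration and are not our real numbers.

```python
def success_rate(succeeded, failed):
    """Return the fraction of attempts that succeeded, or None if there were none."""
    total = succeeded + failed
    if total == 0:
        # No attempts in this window: report "no data" rather than 0% or 100%.
        return None
    return succeeded / total


# Assumed example window: 45 resumes parsed fine, 5 failed.
rate = success_rate(45, 5)
```

Returning `None` for an empty window avoids a quiet period looking like a total failure on the chart.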

Alerts

Last but not least: alerting. Librato lets you set up alerts based on absolute thresholds or relative change.
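The difference between the two kinds of alert can be sketched in a few lines of Python. This only illustrates the general idea and is not how Librato evaluates alerts internally; the function names, parameters and sample values are assumptions.

```python
def breaches_absolute(value, threshold):
    """Absolute alert: fire when the reading itself is above a fixed limit."""
    return value > threshold


def breaches_relative(previous, current, max_change):
    """Relative alert: fire when the reading moved by more than max_change
    (a fraction, e.g. 0.5 for 50%) compared with the previous reading."""
    if previous == 0:
        # Any movement away from zero is an unbounded relative change.
        return current != 0
    return abs(current - previous) / abs(previous) > max_change


# Assumed sample readings for illustration.
too_many_jobs = breaches_absolute(1500, threshold=1000)
sudden_jump = breaches_relative(previous=100, current=160, max_change=0.5)
```

Absolute thresholds suit metrics with a known healthy ceiling, such as queue depth. Relative change suits metrics whose normal level drifts over time, such as login counts.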

We have alerts for:

A simple heartbeat that is sent every 30 seconds. If we stop receiving the heartbeat, background jobs aren’t being run. We also have alerts for a large number of enqueued jobs, a large number of failed resume parses, a high login failure rate, and slow user reindexing. We certainly plan to add more.
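The heartbeat check boils down to asking how long it has been since the last beat arrived. Here is a minimal Python sketch of that logic. The 30-second interval comes from our setup, but the `heartbeat_missing` function and the choice to tolerate two missed beats before alerting are assumptions for illustration.

```python
HEARTBEAT_INTERVAL = 30  # seconds between heartbeats


def heartbeat_missing(last_seen, now, interval=HEARTBEAT_INTERVAL, missed_allowed=2):
    """Return True when no heartbeat has arrived for longer than
    missed_allowed intervals (timestamps are in seconds)."""
    return now - last_seen > interval * missed_allowed


# Assumed timestamps: last beat at t=0, checked at t=45 and again at t=61.
still_ok = heartbeat_missing(last_seen=0, now=45)
gone_quiet = heartbeat_missing(last_seen=0, now=61)
```

Allowing a missed beat or two keeps a single delayed job from paging someone, while a genuinely stalled worker still triggers the alert within about a minute.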

Originally posted on Medium
