How to collect aggregated statistics in a Ruby on Rails application?
It is important to collect aggregated statistics so that management can analyze the data and make wellinformed decisions. Sphere was retained by a client in the recruiting industry who, among other things, needed to collect the following data:
 Total shifts posted
 Total hours posted
 Total shifts worked
 Total hours worked
 Average length of shifts
 Average shifts per job
In addition, Sphere had to provide the possibility of “spoofing” the statistics to a certain point while the production database was being tested. Up to that point, the statistics should have been based not on the actual values from the database but on some customarilyentered data.
Since our client was using a Ruby on Rails application, we decided to write a statistics module in Ruby as well in order to leverage existing code and to simplify maintenance. We considered three implementation options:
Option  Advantages  Disadvantages 









After presenting these three options to our client, we agreed to proceed with the third option.
Calculating & Storing Statistics
The Statistics::Employer model is used for calculating and storing statistical data. In its table, we store the date, employer’s foreign key, and all other values needed to calculate the statistics (total hours posted, total hours worked, number of applications, and average number of applications).
class CreateStatisticsEmployers < ActiveRecord::Migration def change create_table :statistics_employers do t t.date :date, index: true, null: false t.references :employer_profile, index: true, foreign_key: true, null: false t.integer :jobs_count, default: 0, null: false t.integer :shifts_posted_count, default: 0, null: false t.decimal :hours_posted_count, default: 0, null: false t.integer :shifts_worked_count, default: 0, null: false t.decimal :hours_worked_count, default: 0, null: false end end end
All formulas are contained in the model code:
module Statistics class Employer < ActiveRecord::Base belongs_to :employer_profile class << self def total_jobs sum(:jobs_count) end def total_shifts_posted sum(:shifts_posted_count) end def total_hours_posted sum(:hours_posted_count) end def average_length_of_shift_posted total_shifts_posted.zero? ? 0 : total_hours_posted / total_shifts_posted end def average_shifts_per_job total_jobs.zero? ? 0 : total_shifts_posted.to_f / total_jobs end def total_shifts_worked sum(:shifts_worked_count) end def total_hours_worked sum(:hours_worked_count) end def average_length_of_shift_worked total_shifts_worked.zero? ? 0 : total_hours_worked / total_shifts_worked end end end end
Methods are using ActiveRecord::Calculations, so they can be called up on any scope, which is useful for filtering by date/employer.
Collecting Statistics
The collection of statistics can be divided into three subtasks:
 What time to start daily statistics collection.
 What dates to collect statistics.
 How to collect statistics.
We have already answered the first question by choosing an embodiment (implementation variation). After analyzing the operation in the application, we found that the majority of shifts end before 2 a.m., so the statistics will be collected by schedule at 3 a.m.
Cron can be used to perform this task, but we decided to use clockwork gem:
# clock.rb require 'clockwork' require './config/boot' require './config/environment' module Clockwork every(1.day, 'statistics.collect', at: '3:00') { Statistics::CollectJob.perform_later } end Statistics::CollectJob is background job, consistently resolving the remaining two subtasks: # app/jobs/statistics/collect_job.rb module Statistics class CollectJob < ::BaseJob def perform Statistics::UntrackedDatesService.new.execute Statistics::UpdateUntrackedService.new.execute end end end
Statistics::UntrackedDatesService – detects which dates are untracked and creates UntrackedDate for them. It always counts yesterday as untracked, as well as dates on models with updated_at after midnight the previous day.
UntrackedDate is a very simple active record model that contains only date attribute with unique index.
As we collect statistics for jobs and shifts, we need to track Job and Shift model updates. Also, as we count jobs and posted shifts on each job’s creation date, and worked shifts at the end time of each shift, we assume Job#created_at‘s and JobShift#end_time‘s dates are untracked if those jobs/shifts changed from the time of the last statistics update.
So the full code of UntrackedDatesService is:
# app/services/statistics/untracked_dates_service.rb module Statistics class UntrackedDatesService attr_reader :working_date def initialize(current_date = Date.current) @working_date = current_date  1.day end def execute untracked_dates.each do date Rails.logger.info "Marked #{date} as untracked" Statistics::UntrackedDate.find_or_create_by date: date end end private def untracked_dates [ working_date, *untracked_past_jobs_dates, *untracked_past_shifts_posted_dates, *untracked_past_shifts_worked_dates ].uniq end def untracked_past_jobs_dates Job.where('updated_at >= ?', working_date.beginning_of_day) .where('created_at < ?', working_date.beginning_of_day) .pluck(:created_at).map(&:to_date) end def untracked_past_shifts_posted_dates # similar logic end def untracked_past_shifts_worked_dates # similar logic end end end
The last subtask is performed by Statistics::UpdateUntrackedService. It takes each untracked date, deletes all statistics for that day, and calculates new statistics. (Calculation is incapsulated in yet another service, UpdateService.) We need to delete all previous statistics to keep the process simple. UpdateService does not know why we mark this date as untracked. It just does what it is supposed to do.
In UpdateService, we create groupings by employer and calculate aggregated stats. Then we bulk insert all stats into the Statistics::Employer model:
module Statistics class UpdateService attr_reader :date def initialize(date) @date = date end def execute return if date < KEEP_LIVE_STATISTICS_FROM Rails.logger.info "Updating statistics for #{date}" Statistics::Employer.where(date: date).delete_all Statistics::Employer.create employers_statistics end private def employers_statistics # Here we have a lot of ruby/rails/sql magic # and return array of hashes for each statistics entry # (i.e. grouped by date/employer_profile_id) end end end
This is all we need to collect and calculate statistics, but we have one more step to cover.
Callbacks
Sometimes a model’s time attributes can be changed. In that case, we can only track that statistics were changed in the new date, but not in the old one (because we can’t know what the previous time was). So we have to use callbacks to track previous dates of previous timestamps.
Here is a Tracking module that could be required by any tracked model:
module Statistics module Tracking extend ActiveSupport::Concern included do cattr_accessor(:statistics_tracked_attributes) { Set.new } after_update :check_statistics_tracked_attributes_have_changed end class_methods do def track_attributes_for_statistics(*attributes) statistics_tracked_attributes.merge attributes.map(&:to_s) end end def check_statistics_tracked_attributes_have_changed (statistics_tracked_attributes & changed).each do attr before, after = changes[attr] next unless date_changed?(before, after) Statistics::UntrackedDate.mark before.to_date end end def date_changed?(before, after) before && (!after  before.to_date != after.to_date) end end end and it is included into Job include Statistics::Tracking track_attributes_for_statistics :created_at and JobShift include Statistics::Tracking track_attributes_for_statistics :clocked_out_at
Now we have implemented full, easily expandable business logic to collect and output application statistics!
Output
Finally, we need all collected data to output. Since we use ActiveAdmin, I will show ARB code snippets and the screenshots it outputs.
First, we need filter form:
form_for search, url: admin_statistics_employer_path, method: 'post' do f f.text_field :employer_profile_id f.text_field :from , class: 'datepicker', 'datadatepickeroptions' => '{"maxDate": "1d"}' span '' f.text_field :to, class: 'datepicker' f.submit 'Filter' end search here is a form object that takes params[:search] and returns scoped Statistics::Employer.where(date: from..to). We can output total statistics by the period: table do thead do tr do th :stat th :value, style: 'textalign: right' end end tbody do %w(total_jobs total_shifts_posted total_hours_posted average_length_of_shift_posted average_shifts_per_job total_shifts_worked total_hours_worked average_length_of_shift_worked).each do stat_name tr do td stat_name.titleize td number_with_delimiter(stats.public_send(stat_name).round(1)), style: 'textalign: right' end end end end
Here is a form in table view:
We can output monthly breakdown of all these stats, using chartkick gem:
h3 'Shifts' div line_chart( [ { name: 'posted', data: stats.group_by_month(:date).sum(:shifts_posted_count) }, { name: 'worked', data: stats.group_by_month(:date).sum(:shifts_worked_count) } ] ) h3 'Hours' div line_chart( [ { name: 'posted', data: stats.group_by_month(:date).sum(:hours_posted_count) }, { name: 'worked', data: stats.group_by_month(:date).sum(:hours_worked_count) } ] )
Here is a chart view:
Summary
We would like to emphasize the following:
 The “spoofing” requirement is implemented using a constant Statistics::KEEP_LIVE_STATISTICS_FROM. (Did you noticed it in the code above?) The process of forming and loading madeup statistics prior to this date is beyond the scope of this article.
 Prepopulating the statistics with the existing data is performed with a straightforward rake task – just take each date application worked and pass it to UpdateService.
 In the real statistics, there are some more complex metrics, like breakdown of job roles. We used Postgresql hstore columns for storing it, but this topic is also beyond the scope of this article.