
Logging for Data Scientists

16 Sep 2022

Logging is important: it’s how your application talks to you, shares its feelings, and tells you how it’s doing. Getting it right isn’t trivial, and getting it wrong is expensive.

Here are 5 tips on writing better logs as a Data Scientist.

Tip No. 1: Don’t log in a tight loop

Avoid logging in a tight loop. A hundred lines that read “processing record” add little value but come at the cost of noise and bigger infrastructure bills.
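
A minimal sketch of the idea in Python, using only the standard library: report progress every N records instead of on every iteration. The threshold and the per-record work are stand-ins.

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

REPORT_EVERY = 500  # hypothetical threshold; tune it to your volume

def process_records(records):
    total = len(records)
    for i, record in enumerate(records, start=1):
        ...  # per-record work goes here
        if i % REPORT_EVERY == 0 or i == total:
            # One progress line per batch instead of one per record.
            logger.info("processed %d of %d records", i, total)

process_records(list(range(1730)))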

Tip No. 2: Use Structured Logging

Structure your log entries as JSON. On most platforms, the entry will magically turn into metadata you can query, filter, and derive metrics from.
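
A minimal sketch of one way to do this with Python’s standard logging module; the JsonFormatter class and the "context" field are my own naming, not any particular library’s API.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON line."""
    def format(self, record):
        entry = {"level": record.levelname, "summary": record.getMessage()}
        # Fold in any extra fields passed via logger.info(..., extra={"context": {...}}).
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("etl")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Calculating distance", extra={"context": {"customer_id": "32"}})

Libraries such as python-json-logger provide a ready-made formatter if you’d rather not roll your own.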

If you’re using a terminal to read logs, jq provides some of that magic locally.

Tip No. 3: Context is Crucial

Knowing your app is doing “something” isn’t enough. The more details the entry has, the more helpful it is in tracing and debugging. Print the name of the action performed, the object it was performed upon, and, whenever possible, the object’s position in the queue.

{
   "summary": "Calculating distance",
   "customer_id": "32",
   "order_id": "5184",
   "current_record": 351,
   "total_records": 1730
}
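
Reusing the JsonFormatter logger from the Tip No. 2 sketch, an entry like the one above could be produced like this (the field names simply mirror the example):

logger.info(
    "Calculating distance",
    extra={"context": {
        "customer_id": "32",
        "order_id": "5184",
        "current_record": 351,
        "total_records": 1730,
    }},
)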

Tip No. 4: Use Correlation IDs

It’s not uncommon to have multiple copies of your app running at the same time, each working on a different subset of the data. Matching up interleaved log entries from different copies of your app is nigh impossible. Before it becomes a problem, use job IDs.

It’s as simple as generating a “short uuid” on start-up and printing it with every subsequent log entry.
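
A minimal sketch using the standard library; uuid4().hex[:8] stands in for the “short uuid” here, though the shortuuid library linked below does the same job.

import logging
import uuid

# Generate one correlation ID when the job starts...
JOB_ID = uuid.uuid4().hex[:8]

logging.basicConfig(
    level=logging.INFO,
    # ...and stamp it onto every subsequent log entry.
    format=f"%(asctime)s [job {JOB_ID}] %(levelname)s %(message)s",
)
logger = logging.getLogger("etl")

logger.info("starting up")
logger.info("processing out of market orders")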

Tip No. 5: Log on start-up and shutdown

For applications that get kicked off by a timer, process records for a bit, and then exit quietly (i.e., jobs): write an entry on start-up that includes the arguments used, and another on shutdown that includes the running time along with the same arguments.

It’s much easier to track running time this way than to derive the value from separate start and stop events. Tracking duration comes in handy when overlapping jobs or reporting on stale data are a concern.

{
   "job_id": "a4739f8d58e148ba6503c",
   "description": "Extracted Out of Market Orders",
   "duration_ms": 306139,
   "args": {
      "radius": "57 km"
   },
   "summary": {
      "customers_processed": 41,
      "orders_processed": 1730
   }
}
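
A minimal sketch of the pattern; the field names follow the example above, and the body of the job is elided.

import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("etl")

def run_job(args):
    started = time.monotonic()
    logger.info(json.dumps({"summary": "starting up", "args": args}))

    ...  # process records

    duration_ms = int((time.monotonic() - started) * 1000)
    logger.info(json.dumps({
        "summary": "shutting down",
        "duration_ms": duration_ms,
        "args": args,  # repeat the arguments so the entry stands on its own
    }))

run_job({"radius": "57 km"})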

Bonus tip No. 6: Read your logs

Read your logs: build a mental model for what normal is so you can identify abnormal more easily and, generally speaking, look like a genius.

👍🏽

Also see:

  1. Parsing GitLab logs with jq
  2. jc - CLI tool and python library that converts the output of popular command-line tools, file-types, and common strings to JSON.
  3. shortuuid - A generator library for concise, unambiguous and URL-safe UUIDs.

ETL pipelines clogged? Reach out, say hi 👋🏽! I’m available for hire.