Every SaaS service needs reliability and scale, but log management presents unique challenges:

  • A massive stream of incoming events with bursts reaching over 100,000 events per second
  • The need for a "no log left behind" policy -- any log can be critical
  • Operational troubleshooting use cases that demand near real-time indexing and time-series index management

These needs dictate that we be able to collect logs in real time regardless of what happens in downstream processes (parsing, indexing and so on). Enter Apache Kafka.

Here's what attracted us to Kafka:

  • Reliability. Every day, we move terabytes of data through our Kafka cluster without losing a single event.
  • Low latency. 99.99999 percent of the time, our data is served from disk cache and RAM.
  • Performance. It's crazy good!
  • Scalability. By increasing a topic's partition count and the number of downstream consumer threads, you can raise throughput whenever you need to (see the consumer sketch after this list).

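To make that scalability point concrete, here is a minimal sketch of a consumer-group worker using the standard Java client. The topic name, group id, broker address and process() hook are placeholders rather than Loggly's actual pipeline, and this newer Java consumer API may differ from the client we ran at the time. The idea is simply that every worker sharing a group id is assigned a slice of the topic's partitions, so adding partitions and workers adds parallel throughput.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

// Hypothetical worker: one instance per thread, all sharing the same group id.
public class LogConsumerWorker implements Runnable {
    private final KafkaConsumer<String, String> consumer;

    public LogConsumerWorker(String bootstrapServers, String topic, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId); // workers in the same group split the topic's partitions
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        this.consumer = new KafkaConsumer<>(props);
        this.consumer.subscribe(Collections.singletonList(topic));
    }

    @Override
    public void run() {
        while (true) {
            // Each worker polls only the partitions assigned to it, so raising the
            // partition count and adding workers (up to that count) raises throughput.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record.value());
            }
        }
    }

    private void process(String event) {
        // Placeholder for the downstream stage (parsing, indexing and so on).
    }
}
```

Running one LogConsumerWorker per thread, up to one per partition, is how throughput tracks the partition count.
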
In Loggly's architecture, Kafka supports distributed log collection, efficient DevOps and control over resource utilization throughout our processing pipeline. Because Kafka topics are amazingly cheap from a performance and overhead standpoint, we can create as many queues as we want, scale each one to the performance we need, and optimize resource utilization across the system. As we apply Kafka to new use cases, our love grows stronger. Read more about our Kafka use and results.
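
As an illustration of how cheap it is to stand up new queues, the sketch below creates a few topics with the Kafka admin client. The topic names, partition counts and replication factors are invented for the example, and the AdminClient API may postdate the Kafka version in use when this was written; the pattern, though, is a topic per pipeline stage, each sized to the throughput that stage needs.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Arrays;
import java.util.Properties;

public class PipelineTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // One topic per stage; partition counts are illustrative, sized to each
            // stage's expected load, with 3-way replication for durability.
            admin.createTopics(Arrays.asList(
                    new NewTopic("raw-events", 64, (short) 3),
                    new NewTopic("parsed-events", 32, (short) 3),
                    new NewTopic("index-retries", 4, (short) 3)
            )).all().get();
        }
    }
}
```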

Do you have homegrown log management?
Believe us, it's hard to do well and a big challenge at scale. Take a test drive to see how we stack up: https://www.loggly.com/lp-loggly-general/.

--Michael Goldsby, engineer; Vinh Nguyen, senior developer; and Suyog Rao, developer, Loggly