Anthony Johnson has a long and distinguished career spent building event data clouds using Elasticsearch. He’s built these systems at three different companies and learned a lot along the way. He was most recently cloud services architect and principal engineer at Ellie Mae, a fintech company that helps people finance homes. He joined Scalyr as its field CTO a few months ago to help other companies achieve the vision of an integrated event data cloud. While at Ellie Mae, Anthony built its event data cloud to power log analytics use cases for incident management, problem isolation, alerting, monitoring and dashboards — and he combined the log data with other event and business data for product and business insights. This is the story of what he learned and would do differently.
Anthony is deservedly proud of what he accomplished, but also introspective about its long-term value. “I built a system that was extremely scalable, and something I thought would be best for the business. And then I looked at what I built, and I wonder if I really did the business a great service.”
When I asked him why he questions the value in hindsight, Anthony explained the journey and its challenges. He actually built two systems. The first used Elasticsearch on AWS for logging.
The system worked great, until it didn’t work great.
“We had scaling challenges with AWS’s Elasticsearch service. We could only keep three days of data. So we pivoted and built our own Elasticsearch cluster on prem. It was months and months and months of work. The business kept asking ‘When is it going to be delivered? When is it going to be delivered?’ It was pretty tedious. But we built it, and it was successful.” Reality set in as they grew.
And it ran great until it didn’t run great.
“We were at around two terabytes a day, and probably about 50 to 60 data nodes, and we started running into a lot of problems with cluster management and data latency.” This is when he began to question the value of his team’s time on this work. “Ultimately, Ellie Mae’s business was not running Elasticsearch. They just want to help people refinance.” The work to maintain the analytics system became tedious instead of fulfilling, and getting everyone to use it to its full potential became a source of frustration and learning.
Anthony is an interesting combination of idealistic and pragmatic. He envisions a perfect event data cloud, and he knows that in any environment you have to make trade-offs. “Getting teams to adhere to a consistent schema was impossible. But flexible schema isn’t one of the strengths of an index-based system like Elasticsearch. We started to get data loss and queries not working, because the shards were incompatible between indexes.”
In the real world, data is messy, and you need a system to deal with messy data. Users and data are fluid, and you can’t expect them to be perfect — or even consistent — over time.
“With some tender loving care the system would support 4TB/day, but I don’t consider it a success. And the reason I don’t consider it a success is continuity. Every company has clever engineers — or as they are affectionately called at Ellie Mae, clever troublemakers. But the really great engineers want to work on projects that are central to the mission of the business. And when they do build a core infrastructure product, they don’t want to hang around to care and feed it — it’s not challenging. The reality is, I built a system that works, but it’s a distraction for the business. It’s not Ellie Mae’s core business case.”
So why does it matter? Why not just let clever troublemakers build stuff? After all, you get full control, it’s built to your specifications, and it doesn’t take THAT many resources. So what’s the real problem?
The biggest challenge is continuity.
“I fell in love with the Scalyr platform and I left. If it wasn’t that, it would have been something else eventually, and now who is going to run this thing?” But continuity isn’t the only challenge Anthony identified. “Kibana can be a bit daunting. It’s a great product, but there is a lot to take in. And training was definitely an issue. The pivot from Splunk to Kibana was hard because there were a lot of shortcomings with how you query.”
Scalability became a challenge too. “Indexes can grow exponentially. You need to influence the indexer to reduce the index size. That’s how you scale. I don’t want to waste my time doing that, I have better things to do. And now that I know that we could have purchased a full-featured SaaS offering from Scalyr for less than my AWS infrastructure bill, I feel like I did the company a disservice despite my best intentions.”
One of the things that attracted Anthony to Scalyr, despite having been an Elasticsearch and open source advocate and evangelist for years, is that Scalyr doesn’t use indexes. “The strategy here is not one index to rule them all.” Scalyr uses a streamlined columnar store and allows every team to have access to their own data, or to logically group data by application. “When you have a field that is changing very rapidly, or is always unique, and then you throw in a billion records, an index doesn’t really add that much value. In fact, it’s detrimental.”
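To make the point about rapidly changing or always-unique fields concrete, here is a minimal sketch in Python. It is a toy inverted index, not how Elasticsearch or Scalyr work internally, and the field names `status` and `request_id` are made up for illustration. It simply counts how many keys an index needs for a low-cardinality field versus a unique one.

```python
from collections import defaultdict


def build_inverted_index(records, field):
    """Map each distinct value of `field` to the positions of the records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


# Hypothetical log events: `status` has two distinct values, `request_id` is unique.
records = [
    {"status": "ok" if i % 10 else "error", "request_id": f"req-{i}"}
    for i in range(100_000)
]

status_index = build_inverted_index(records, "status")
request_index = build_inverted_index(records, "request_id")

status_keys = len(status_index)    # a couple of keys, each pointing at many records
request_keys = len(request_index)  # one key per record: the index is as big as the data
```

In this toy setup the `status` index has two keys, while the `request_id` index has one key for every record, so it costs storage and write time without narrowing anything down.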
“I started out in the schema-on-write camp.” But this ideal fell short in the real and imperfect world of data and users. Schema-on-write became a problem and limitation. “I’m much more in the schema-on-read camp now, because the world is not perfect. Nobody understands the questions you may need to ask up front, or the value of data in advance, nor do they want to invest the time to figure it out until they need it. They just want to ask the questions and get answers.”
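A rough illustration of what schema-on-read means in practice: raw lines are stored untouched, and structure is applied only when someone asks a question. The sketch below assumes JSON log lines and uses a hypothetical `query` helper and made-up field names. It is not Scalyr’s query engine, just the general idea.

```python
import json


def query(raw_lines, field, predicate):
    """Schema-on-read: parse each raw line at query time and keep matching events."""
    matches = []
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # messy data: skip lines that are not valid JSON
        if isinstance(event, dict) and field in event and predicate(event[field]):
            matches.append(event)
    return matches


# Raw lines are stored as-is; teams disagree on field names and even on types.
raw_lines = [
    '{"service": "loans", "latency_ms": 120}',
    '{"service": "docs", "latency_ms": "340"}',  # latency sent as a string
    '{"svc": "auth", "duration": 15}',           # different field names entirely
    'not json at all',
]

# The question decides how the field is interpreted, not the ingest pipeline.
slow = query(raw_lines, "latency_ms", lambda value: float(value) > 100)
```

Nothing is rejected at write time. The inconsistent type on `latency_ms` is handled by the query itself, and events that lack the field are simply ignored for this question.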
You’ve got to be able to execute queries as fast as possible. Users are impatient.
There are signs that tell you when you start to outgrow indexing. “Every successful logging service, observability, or telemetry has buffers throughout the system. If those buffers start backing up, and your users start complaining about latency, you know you’re in bad shape and dealing with scalability challenges. The biggest issue that I ran into was index and shard growth. I had to write a whole bunch of scripting around this thing to be able to deal with the limitations of shards limited to 50 gigs in size, network challenges, ingest challenges, and Elasticsearch’s scalability limits. It becomes painful.”
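To get a feel for the shard arithmetic behind that scripting, here is a back-of-the-envelope sketch. It assumes one index per day, on-disk size roughly equal to raw ingest, and no replicas. None of that is stated in the interview, and the `primary_shards_needed` helper is hypothetical.

```python
import math

MAX_SHARD_GB = 50  # shard size ceiling mentioned in the quote above


def primary_shards_needed(daily_ingest_gb, max_shard_gb=MAX_SHARD_GB):
    """Smallest number of primary shards that keeps each shard at or under the cap."""
    return max(1, math.ceil(daily_ingest_gb / max_shard_gb))


# Assumption: 1 TB is treated as 1,000 GB, with one index created per day.
shards_at_2tb = primary_shards_needed(2_000)
shards_at_4tb = primary_shards_needed(4_000)
```

Under these assumptions, 2 TB a day works out to 40 primary shards per daily index and 4 TB a day to 80, before replicas. That is the kind of number that has to be recomputed and re-applied as volume grows.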
Anthony now helps to build products that are delivered as a service and supports Scalyr customers who are considering a migration to full-service SaaS. “Smart engineers need to be careful. Because the smart engineer always wants an engineering challenge and always wants to show off.” At Ellie Mae, the challenge was to replace an expensive vendor with a more affordable system. “And I said, ‘well, gee, I’ll go and build a system.’ I think this happens with a lot of engineers. But at the end of the day, it probably shouldn’t have happened. There’s no continuity in the business now that I’m gone. I anchored myself because it was the engineering challenge at the time. Having it done for you is probably the right choice.”
Even if it costs more, it’s probably the right choice. The advantage is that Scalyr typically costs less, which is awesome.
About the Author
Christine Heckart is CEO of Scalyr, which provides the industry’s most scaled log analytics SaaS offering and a unique Event Data Cloud, a DPaaS that delivers analytics as a service for event data and can be integrated with existing dashboards, user interfaces, and custom applications. Scalyr also created and curates a peer-network of VPEs, CTOs, and top technical executives at leading SaaS companies called ENG (Engage, Network, Grow). To learn more about Scalyr or to join ENG, visit www.scalyr.com.