Code Runs the World — But No One Understands It
An Interview with Scalyr CEO Christine Heckart
“The services that we use, they seem so simple, because everything is one touch, one click, one swipe,” says Scalyr CEO Christine Heckart.
“But what is happening underneath to give you that very simple, elegant experience is an extremely complex labyrinth of interconnected microservices.”
“And it has become so intertwined, so interdependent, so complex, that the average human brain, any human brain, even an extraordinary one, can’t keep the whole system in their head in real time because it’s always changing.”
I interviewed Christine recently for my podcast, Kotecki On Tech. Here’s the interview (recorded June 10th, 2019), followed by a lightly edited transcript.
Scalyr does log management. What’s the context for that, and why should the average person care about it?
We live in a world of always-on connected services. We all expect instant gratification: everything, everyone at our fingertips. And when something goes wrong with the delivery of that experience, it’s an engineer’s job to figure out what happened or what is happening in real time. Scalyr gives them real-time data that lets them troubleshoot what is happening and resolve it very rapidly.
We hear all these stories about software glitches and things crashing. In fact, I think in the news just today or yesterday, there was some kind of GPS software glitch that I’m reading about that grounded some planes. How do you think about the scope of what you do in the context of just kind of our societal reliance on these apps?
The services that we use, they seem so simple because everything is one touch, one click, one swipe. But what is happening underneath to give you that very simple, elegant experience is an extremely complex labyrinth of interconnected microservices. And it has become so intertwined, so interdependent, so complex that the average human brain, any human brain, even an extraordinary one, can’t keep the whole system in their head in real time because it’s always changing.
These microservices come and go. You might call dozens or even hundreds of individual microservices, which can be running on any system, anywhere in the world, all together in real time to deliver that one-click experience. So when something goes wrong, it is very difficult to know what just happened, where it happened in this ephemeral environment (things come and go in real time; they don’t live in one permanent place), and how to resolve the thing that went wrong.
And so tools like the ones Scalyr offers, which help engineers get that data in real time and resolve the issue, are increasingly important because the world is increasingly complex and interdependent.
How does the tool work then, if a human brain’s not able to comprehend it? Are you using things like artificial intelligence to pinpoint what went wrong?
Sometimes, but the most important piece is to collect the data and collect it in real time, which is a hard thing to do, typically because of scale and because of the way some more traditional systems are built. Organizing it can be quite difficult and sometimes very costly and time-consuming. And then there is making it available in real time or after the fact so that you can search through it all at rapid speeds, the speed you’re used to searching the Internet at. Each of those three things is a difficult task in and of itself. And when you put all three things together, it’s a very difficult task.
Scalyr was born about seven years ago — our founder is a Google engineer who had that task at Google, running Google Drive for Google Docs, and to do this kind of thing at scale in real time quickly was a hard enough problem that he left Google and founded Scalyr to solve it.
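As a rough illustration of the three steps described in this answer (collect, organize, search), here is a minimal in-memory sketch. It is not Scalyr’s implementation: the `LogStore` class, the `LogEvent` record, the service names, and the log line format are all hypothetical, and a real system would have to do this at far greater scale and speed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    timestamp: float  # seconds since an arbitrary epoch (assumed)
    service: str      # hypothetical microservice name
    message: str


class LogStore:
    """Toy in-memory store: collect events, organize by service, search."""

    def __init__(self):
        self._by_service = defaultdict(list)

    def collect(self, raw_line: str) -> LogEvent:
        # Assumed line format: "<timestamp> <service> <free-text message>"
        ts, service, message = raw_line.split(" ", 2)
        event = LogEvent(float(ts), service, message)
        self._by_service[service].append(event)  # organize while ingesting
        return event

    def search(self, term: str, since: float = 0.0) -> list[LogEvent]:
        # Plain scan over every stored event, returned in time order.
        hits = [
            event
            for events in self._by_service.values()
            for event in events
            if event.timestamp >= since and term in event.message
        ]
        return sorted(hits, key=lambda event: event.timestamp)


store = LogStore()
for line in [
    "100.0 checkout payment accepted",
    "101.5 auth token refresh failed: timeout",
    "102.0 checkout inventory lookup failed: timeout",
]:
    store.collect(line)

# Which services reported a timeout at or after t=101.0?
timeouts = store.search("timeout", since=101.0)
```

Even in this toy version, the search has to look across every service at once, which hints at why the combination of the three steps gets hard once events arrive from hundreds of short-lived services.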
Because of your closeness to this issue, do you look at the world with a little bit more concern than somebody else because of the amount of code that’s controlling everything in our lives, because of the glitches that we sometimes see, or at least the ones that we become aware of in the news? I mean, it seems like it’s such a hodgepodge of code built on top of older code connected with outside code — as you’re mentioning, these microservices could be all over the place. Are we kind of dancing closer to the edge as a society or are tools like this actually able to help us keep up?
It is the quintessential question right now. So Google had a big outage last weekend, Sunday kind of bleeding into Monday. In the middle of the week, I popped out an op-ed about that and about interconnected systems, and Google was one example.
It just so happened that the airlines were another example that I pulled out. And sure enough, as you say, there was a problem over the weekend with airline systems going down because of a GPS system failure: highly interconnected systems, third party system, but still impacted a lot of users.
The financial crisis is frankly another one. It was a decade ago, but it really was about interdependencies of systems (in that case financial systems and financial tools) that were not understood. They were so complex, and the systems were so interdependent, that people didn’t understand the cascading failure that could happen. And that is the world that we live in today.
The entire physical world is based on this underlying logical, interconnected, always-on cloud environment, to degrees that none of us fully appreciate. The way goods and services are shipped around the world and across the country so that we have food on our store shelves and we have gasoline and we have all these physical goods, not to mention the information flow that we’re used to always having at our fingertips.
All of that is based on this underlying, highly interconnected system of data, bits and bytes and services and microservices. And to keep that world up and running is an increasingly complex, difficult and real-time task.
Transportation, food, all of these different aspects of our lives: we as a society used to do all those things without any of this. And now of course, we’re completely reliant on all of it to the extent that if one of those systems goes down, we’re not able to meet those basic needs or have that element of our economy. So is there any sense that we need to build a parallel system, or at least some kind of redundancies in our society? It just seems like all this stuff has grown organically and now we can’t not depend on it.
If you go back to the problem the weekend before, where Google’s system went down, it was a pretty simple configuration error. But what happened is that the very tools that would normally have helped them find, diagnose and fix that problem quickly were also impacted by the problem.
And what that tells us (and again, we can look at it with the airline systems, we can look at it with the financial systems, you can look at any complex interdependent system) is that you need to design redundancies into the system itself. And that’s where a lot of good best-practice engineering has already happened over the past decade, and it will only get better. And you need to make sure that the tools that you use to find, isolate and resolve the problems in the system are themselves separate from the system, so a problem in one does not impair your ability to then resolve that problem.
One of the interesting things about Scalyr’s story specifically is not just what you’re doing, but the people who are actually doing it. If you say the words “average tech worker” to somebody, they’re probably picturing a young, white male. And it seems to me that it’s not just an issue of equal opportunity for individuals, although it certainly is that. If we’re talking, as we are, about code that controls so much of society, and the individuals making the specific decisions about that code are not reflective of society, that can cause problems. But it seems like with Scalyr you’ve been able to do something to counter that narrative, at least in your own company. So can you talk a little bit about what you’ve done there?
I’m very fortunate to have joined a company that had a very diverse workforce to begin with, including in engineering. 40% of our employees are diverse. 30% of our engineers are female. And the challenge for us and all companies is to maintain those ratios as you scale. The reason that’s important is because cognitive diversity, when you’re developing code, when you’re trying to problem-solve, is one of the single most important elements that you can bring into the problem-solving environment, into the code development environment, to make sure you’re delivering and creating the best possible product and the best possible experience.
And I say cognitive diversity because I really don’t think it has anything to do with gender or race or socioeconomic status or any one thing. It’s all of those things combined and it’s our natural coding as humans. People think very differently. Two white females can themselves think very differently about problems, despite the fact that they’re both white females, and the same goes for two African American males. You really need to bring in a wide variety of cognitive thinking, which means a wide variety of backgrounds, so that you have the best possible minds thinking about the problem that you’re trying to solve or the experience you’re trying to deliver through every potential facet, and you can, together, come up with what will be the best experience.
Do you have a specific example of a time when that kind of cognitive diversity helped you out?
There are so many examples. It’s hard to pick out a single one. I see it every single day. And if you are one person sitting in a room writing code and you don’t have the benefit of conversation, if you don’t have the benefit of understanding how that experience will be viewed or how that algorithm will change as different kinds of people interact with it, then you aren’t going to be able to produce the best possible experience. Or, on the flip side, as you’re saying, the rest of us all end up in a world that’s been programmed by a single type of person, a single type of personality. And that may not be the optimal world for all of us to live in.
You’re a female leader, and I’m just curious how you’ve seen the situation for female leaders in tech improve and how far you think we still have to go?
I would say most men in our industry and most others — and the men are still largely in power — the men are mostly gender blind. They’re not intentionally trying to keep experiences, promotions, influence away from anybody, female or otherwise. But we’ve had a system that has developed over time that has unconscious bias built into it. And we all — female, male, young, old, black, white — we all have these unconscious biases.
I think what’s happened over the last five years especially is we’ve started to expose and challenge many of the unconscious biases that are built, unawares and unintentionally, into all of the systems that operate our daily environment, whether they’re cultural and decision-making, or they’re physical, like the way we interact with software algorithms. And the more we do that, the more we have conversation, the more we challenge the status quo, the more we bring diversity, cognitive diversity, into all elements of business at all levels, the more we will see improvement. And that’s a self-reinforcing cycle.