Gordon Wong has been working with data since the early ’90s. His accomplishments include building Fitbit’s data analytics platform from the ground up and serving as HubSpot’s VP of Business Intelligence. He has also worked or consulted for leading brands such as ezCater, edX, and Zipcar. In all his engagements, Gordon focuses on repeatable processes, continuous insights, and team building. We spoke to him about data science, the surprising staying power of SQL, and data quality.
Data Culpa: How did you get into this field? What is it that drew you to data engineering and data science?
Gordon Wong: Well, that’s really interesting. When you’re younger, you talk about finding your passion. But most research shows that you don’t actually find your passion; you develop your passion over time. And that was true for me.
I got into databases almost 30 years ago now. I graduated during a recession. I had degrees in psychology and philosophy, so I was not extremely employable at a bachelor’s level. I ended up at the GIS lab. And they were working on doing something that not a lot of people were doing at that time. We were building 911 databases, and we were tying them to a GPS point, so the ambulance could find your house. This was pretty cool in the early ’90s, and this was really valuable in rural communities where people’s addresses were listed as Rural Route 1, Box 15. The ambulance can’t find you that way. So, if the ambulance driver didn’t know you, you were out of luck.
That’s how I got my start. At GIS, we were using an Oracle database. And I taught myself SQL. And in those days, documentation was good. The Oracle publication manual set was $3000, and the concept manual was amazing.
And that’s a good way to stay employed. Someone told me you should dive into SQL more. It’s sort of fading a little bit, they thought, but it’ll probably be useful for a few years. Well, it turns out a few decades, right?
Then I ended up getting into warehousing, because I realized I didn’t like being woken up in the middle of the night as a transactional DBA. And I liked dealing with performance. We were using DEC Alphas, and we were working with these EMC disk arrays and so on, and it was really super fun. Then I eventually got into consulting.
“Consulting really taught me to put the mission in focus.”
And consulting—being part of a professional services team—really taught me to put the mission in focus, and really put the client in focus, because you need to get paid. Then, over time, I evolved my own personal methodology of always starting at the end.
Don’t start with the data. Start at the end. What is the question? What is the action the user is trying to drive? What question do they want to answer? And in order to have a safe answer, what do you need? What are the principles you need? Build your solutions that way.
“Don’t start with the data. Start at the end. What is the question? What is the action the user is trying to drive?”
And then I got into my—to use the Japanese term, Ikigai—my professional mission in life. Mine is to drive better outcomes through access to data.
Data Culpa: That’s a great mission.
Taking a Customer-Centric Approach to Data Science
Gordon Wong: It’s an inherently customer service-focused position, because I don’t take that many end actions myself. I don’t sign up customers, I don’t sell cars, I don’t do that kind of stuff. What I do is I try to provide information that generates insight for people to take better actions. And as a kid coming from a restaurant family, that kind of works.
Data Culpa: Right. In restaurants, you have to have a real customer focus.
Gordon Wong: Absolutely, and I bring that focus to all my teams, when I have teams. And I’m very comfortable training my teams that way, because I’ve seen that my former employees who have really grasped that customer focus have thrived in their careers. And the ones who lack that focus are the ones who struggle more. And you’re not really certain you would hire them again.
Data Culpa: That orientation is really important. And in IT, sometimes people can get distracted by the technology.
Gordon Wong: Oh, yeah. I’m a technologist, right? But the technology doesn’t matter if I’m not solving a problem.
Data Culpa: Let’s talk about solving problems and data quality. You’ve done things like set up the data analytics for Fitbit and worked on data quality for HubSpot and other companies. So I understand your customer focus now.
Can you talk about data quality? It sounds like you begin with a question. I imagine part of getting the right answer is making sure that data is good along the way. Can you tell us about how you approach that?
Gordon Wong: My approach to data quality is shaped by my belief about what the product is; how we work for the end user.
Continuous Insights from Data Science
Gordon Wong: There’s this unfortunate leaning in the analytics and data science world where we think the only goal is building models. Let me go develop that perfect model that I can base my entire company on, or let me produce that one report that the CFO needs, and then I’m done.
What I’ve learned is that there’s a lot of value in the continuous delivery of insights. We all know data warehousing and analytics and data science projects typically fail, right? They either fail outright or they underdeliver. Where I’ve seen them be successful is when they get on a continuous process, with continuous incremental insight delivery every month, every week, every day if possible.
“We all know data warehousing and analytics and data science projects typically fail, right?”
When you do this right, two things happen. You have this continuous delivery of insights about questions you know about, but you also evolve better questions over time. And the only way to evolve those better questions is if you deliver insight to people who can use it and who can drive better questions. People or algorithms, right?
“When you do this right, two things happen. You have this continuous delivery of insights about questions you know about, but you also evolve better questions.”
So, you say to the CFO, here’s our revenue from 2019. Great. Well, of course, they’re going to have a new question. Their first question was, “How much revenue do I have in 2019?” But then later on they might ask, “Can you break it down by month?” Okay, that’s a little bit more useful, but it’s still not very useful. “Oh, but can you break it down by month and customer cohort for 2019?” Oh, that gets more interesting. Now let’s include 2020 and get year-over-year analysis. And let’s tie in our marketing spend.
I think one thing I’ve learned is that in the beginning, users come to you with a question. They think that’s their last question. It’s really their first question, and it’s usually a shitty question. It doesn’t really tell them anything, but they’ve been flying blind, so they’re just desperate to get this one thing. So, if we accept this as a continuous product, then we’ve got to think about solutions that will deliver continuously.
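As a rough illustration of how that one question keeps getting refined, here is a minimal Python sketch. The order records, the cohort labels, and the `revenue_by` helper are all invented for this example; they are not from the interview or from any real system.

```python
from collections import defaultdict

# Hypothetical order records; dates, cohort names, and amounts are made up.
orders = [
    {"date": "2019-01-15", "cohort": "A", "revenue": 100.0},
    {"date": "2019-01-20", "cohort": "B", "revenue": 50.0},
    {"date": "2019-02-03", "cohort": "A", "revenue": 70.0},
    {"date": "2020-01-10", "cohort": "A", "revenue": 130.0},
]

def revenue_by(rows, key_fn):
    """Sum revenue per group, where key_fn maps a row to its group key."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row["revenue"]
    return dict(totals)

# First question: how much revenue per year?
by_year = revenue_by(orders, lambda o: o["date"][:4])

# Follow-up: break it down by month.
by_month = revenue_by(orders, lambda o: o["date"][:7])

# Next follow-up: by month and customer cohort.
by_month_cohort = revenue_by(orders, lambda o: (o["date"][:7], o["cohort"]))

# And then year over year, here for January only.
yoy_january = by_month["2020-01"] / by_month["2019-01"] - 1

print(by_year)
print(by_month)
print(by_month_cohort)
print(round(yoy_january, 3))
```

The same grouping helper answers each successive question, which is one way to read the point about building for continuous delivery rather than for a single report.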
“In the beginning, users come to you with a question. They think that’s their last question. It’s really their first question.”
That’s where I get into data quality. My approach is based on test-driven development; at least, I come at it from the test-driven development side. One of my rules is, test how you’re going to use the data. If you come from the data side, you’d be loading an entire huge data set. Say I’m Fitbit, and I have a petabyte of data, literally a petabyte.
And if I load this data and I start testing, well, there’s any number of different combinations you can test. But you’re also coming at it from the perspective of an engineer who doesn’t actually know the data. You can only describe its technical aspects. Is it unique, is it not null, does it have referential integrity? You can ask questions like these, but are they ultimately useful?
Now, how about the business rules? If I tell you that the average person took 14,000 steps per day in August, is that a good number or a bad number? You don’t know the data, so you’re not going to realize that number is a little too high. And so there are so many different combinations to test, and you still have to test to determine how much value the data has.
But suppose I come at it from the other side, as an end user saying, “Hey, I am trying to study the difference in step counts between New England and Florida in the month of May, and I have a hypothesis.” Well, I can build my test cases off that.
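For readers who have not written them, the three technical checks mentioned here (uniqueness, not null, referential integrity) might look something like the following minimal Python sketch. The table layout, the column names, and the helper functions are assumptions made up for illustration.

```python
# Hypothetical sample rows; table and column names are invented.
users = [
    {"user_id": 1, "region": "Florida"},
    {"user_id": 2, "region": "New England"},
]
daily_steps = [
    {"user_id": 1, "day": "2020-05-01", "steps": 8200},
    {"user_id": 2, "day": "2020-05-01", "steps": 7400},
    {"user_id": 3, "day": "2020-05-01", "steps": None},  # null value, unknown user
]

def is_unique(rows, key):
    """True if no two rows share the same value for key."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))

def null_count(rows, key):
    """Number of rows where key is missing a value (None)."""
    return sum(1 for row in rows if row[key] is None)

def orphan_keys(child_rows, parent_rows, key):
    """Key values in child_rows that have no matching row in parent_rows."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)

print(is_unique(users, "user_id"))                 # uniqueness
print(null_count(daily_steps, "steps"))            # not null
print(orphan_keys(daily_steps, users, "user_id"))  # referential integrity
```

Checks like these say whether the data is structurally sound. They say nothing about whether the values make sense to someone who knows the business.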
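A business-rule check of this kind could be sketched in Python as below. The plausible range is a placeholder assumption, not a figure from the interview or from Fitbit; in practice it would come from someone who knows the data. The sample step counts are also invented.

```python
from statistics import mean

# ASSUMPTION: placeholder band for a believable average of daily steps.
# A domain expert, not the engineer, should supply the real bounds.
PLAUSIBLE_AVG_STEPS = (3_000, 12_000)

def check_average_steps(step_counts, plausible_range=PLAUSIBLE_AVG_STEPS):
    """Return the average and whether it falls inside the plausible range."""
    low, high = plausible_range
    average = mean(step_counts)
    return average, low <= average <= high

# Invented sample whose average is exactly 14,000 steps per day.
average, plausible = check_average_steps([13_500, 14_200, 14_300])
print(average, plausible)
```

With these assumed bounds, the 14,000 average is flagged as implausible. Every row here would still pass the uniqueness, not-null, and referential integrity checks.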
In fact, you should build your test data ahead of time by taking a known case. For example, you could ask, do we have a Fitbit employee in Florida? Okay, let’s get that user’s data. Then we know what that data is supposed to be. Let’s make that our test case. Do we have a Fitbit employee in New England? That’s another test case. And have all your tests fail up front.
“You should build your test data ahead of time by taking a known case.”
In theory, if you do this correctly, when all your tests pass, maybe you’re done. At a minimum, at least you haven’t done that terrible thing where someone says, “Oh, I’ve delivered this whole project; now let’s do some testing.” That’s awful.
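The test-first idea described above can be sketched in a few lines of Python. Everything named here is hypothetical: the user IDs, the expected totals, the row layout, and the `monthly_steps` and `run_known_cases` helpers are stand-ins for whatever the real pipeline and known users would be.

```python
# Known cases: hypothetical users whose true monthly totals were confirmed
# independently, before any pipeline work started. Totals are toy numbers.
KNOWN_CASES = [
    {"user_id": "fl-employee", "month": "2020-05", "expected_steps": 17_000},
    {"user_id": "ne-employee", "month": "2020-05", "expected_steps": 21_000},
]

def monthly_steps(warehouse, user_id, month):
    """Sum one user's steps for a month; None if no rows were loaded."""
    rows = [
        row for row in warehouse
        if row["user_id"] == user_id and row["day"].startswith(month)
    ]
    if not rows:
        return None
    return sum(row["steps"] for row in rows)

def run_known_cases(warehouse):
    """Return (user_id, expected, actual) for every known case that fails."""
    failures = []
    for case in KNOWN_CASES:
        actual = monthly_steps(warehouse, case["user_id"], case["month"])
        if actual != case["expected_steps"]:
            failures.append((case["user_id"], case["expected_steps"], actual))
    return failures

# Before anything is loaded, every test fails. That is the starting point.
failures_before = run_known_cases([])

# Once the (toy) data is loaded, the same tests should pass.
loaded_rows = [
    {"user_id": "fl-employee", "day": "2020-05-01", "steps": 8_000},
    {"user_id": "fl-employee", "day": "2020-05-02", "steps": 9_000},
    {"user_id": "ne-employee", "day": "2020-05-01", "steps": 10_000},
    {"user_id": "ne-employee", "day": "2020-05-02", "steps": 11_000},
]
failures_after = run_known_cases(loaded_rows)

print(len(failures_before), len(failures_after))
```

Because the expected values come from cases that are already understood, a failure after loading points at the pipeline rather than at the question.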
A Continuous Process to Deliver Continuous Insights
For me, data quality is a continuous process. You test all along the way, like any manufacturing business. We talked about this being a continuous process. Well, if I’m producing a widget, or if I am selling baby arugula to a Sue’s Market, I don’t test one time. I test all the time, continuously. Because just like bad food, bad data can poison the business.
[ End of Part I ]