Russ Taylor, Part One: Continuous Integration
Published 6/30/2015
During today's first half of a two-part interview, I have the pleasure of talking with Russ Taylor. Russ works on the continuous integration team at Etsy.
We had a chance to talk about test driven development, all of the processes that Russ is involved with at Etsy, and how big applications like Etsy come together.
If you don't want to miss out on the second half of my interview with Russ, you can head on over to DeveloperTea.com and subscribe through our RSS feed, on Stitcher, on iTunes, or on whichever podcasting platform you prefer.
This episode is sponsored by OneMonth.com. Head over to OneMonth.com/developertea to get started learning Ruby on Rails in just one month, and receive a limited-time 25% discount!
I hope you enjoy the show, and until next time,
Enjoy your tea.
Transcript (Generated by OpenAI Whisper)
Hey, everyone, and welcome to Developer Tea. My name is Jonathan Cutrell, and today I have the pleasure of interviewing Russ Taylor. Russ works at Etsy on the Continuous Integration team. We had a fantastic conversation. It actually spanned two episodes, just like normal. So I hope you really enjoy the conversation that I have with Russ. He is very interesting to talk to because what he works on is used by so many people. It's such a big platform. We had a chance to talk about test-driven development and all of the processes that Russ is a part of using at Etsy. I hope you enjoy this episode of Developer Tea, my interview with Russ Taylor. Russ, welcome to the show. Thank you. It's really a pleasure to be here. I'm glad you have a chance to talk with me today because you do work for a company that probably everyone listening to this podcast has heard of. You work at Etsy. Can you tell everyone what your role is at Etsy? So I am a software engineer on the Continuous Deployment team at Etsy, which basically means that I rarely touch anything on the front end. I'm more focused on making developers' lives easier and making the process of updating the site and adding new code, what have you, as easy as possible. Sure. That's incredible. I've always been interested in how big applications like Etsy come together, how the team works on something that big. And so I'm excited to talk to you about that kind of stuff. But before we get into it, I want to just add to that. For listeners who may not know what continuous integration is, can you explain just a basic overview of what continuous integration actually is? Yeah. So continuous integration, in a nutshell, is running tests and builds on your software for perhaps every commit or every few commits.
So, if you're running your own software, you're going to need to run your own tests, even if just on a daily basis, actually running those tests to make sure that everything is still in good shape, and, depending on how you have it set up, either automatically deploying those changes when the tests pass or making it very easy to deploy those changes. So, for example, in Etsy's case, we deploy code to the site about 30 times a day. Oh, wow. Yeah, like new code. And it's pretty amazing. It only takes about 15 minutes to go from, I have this commit on my local machine, and I want, you know, to put this live on the site. Sure. And so we have a whole suite of tests that we run on those, I mean, unit tests, integration tests, things like that, to make sure that the code is actually working. And we ask each developer to kind of, during the deployment process, actually verify that their code is working. And they verify that through, you said, unit tests, feature tests, or integration tests. And for anyone who's not familiar with the way testing actually works, basically, tests are written code that tests your code. It's like a program that you write alongside your other code to make sure that your code is actually working, that the original code is correct. Exactly. It's kind of hard to explain until you actually see it, right? And once you've actually had a chance to experience it, and especially experience the benefits of it. I'm kind of biased, I like TDD, or whatever you want to call it. And I know that carries some baggage with it. But I appreciate tests more now than I ever have. And I think that's because I've run into so many issues. And I've had a lot of problems where I didn't have tests, for whatever reason.
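To make "written code that tests your code" concrete, here is a minimal sketch in Python. Etsy's site is PHP and is tested with PHPUnit, so this is only an illustration of the idea, not anything from Etsy's codebase; the function `apply_discount` and the numbers are made up for the example.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after taking `percent` off, rounded down to whole cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


def test_apply_discount() -> None:
    # Each assert states what the production code is supposed to do.
    assert apply_discount(2000, 25) == 1500
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 100) == 0


def test_rejects_bad_percent() -> None:
    # Invalid input should raise instead of silently returning a wrong price.
    try:
        apply_discount(1000, 150)
    except ValueError:
        return
    raise AssertionError("expected ValueError for percent > 100")


if __name__ == "__main__":
    test_apply_discount()
    test_rejects_bad_percent()
    print("all tests passed")
```

A continuous integration server would run functions like these on every commit, so a later change that breaks `apply_discount` is caught before it ships.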
And, you know, I kick myself, because I go back and I think, man, if I had tested this particular thing, especially just like unit tests, I mean, that's the simplest level of testing, I feel like, if I had just had unit tests, I would have saved myself a lot of pain from breaking stuff in the future. Absolutely. They're incredibly powerful. And you don't really recognize that power until, you know, you've been on the other side, where you realize how you could have easily resolved a problem by just writing a test. I'm interested to know, workflow-wise, is Etsy actually fully test-driven development? Or is there another kind of approach that you guys take? There is another approach. We aren't fully test-driven. We try to have very good test coverage. And, you know, obviously, we'd like to have 100% test coverage, which means, you know, every bit of code is tested in one way or another. But at the same time, we don't want to impose necessarily, we want to trust developers to do what is right. And we don't want to say, you know, well, before you put any code on the site, you need to have all of it tested. So we try not to kind of impose those strict rules. But at the same time, we like to have very good test coverage. We like to have as much as we can covered by tests, because it's by having those tests and having that fairly complete coverage that you can actually be confident in your code. Yeah, especially with reintroducing old bugs, or creating new bugs, but in old code as you're changing it. Yeah, absolutely. And what is Etsy? Is it Ruby driven? Or what is it? It's mostly PHP. Oh, okay.
Yeah, yeah. On my team, in particular, we use whatever we feel is best suited, but we're not necessarily actually writing the production code most of the time, the code that goes on the site. But the site itself is basically entirely PHP. Okay, and JavaScript for the front end. Sure. And yeah, you can't really get away from JavaScript these days. No, no, you can't. Do you guys write JavaScript tests as well? Um, yes, we do. Okay. I'm not sure that we have as much coverage on the JavaScript side. But yes, we definitely still do. Yeah. And so you mentioned that you ship code 30-ish times a day, right? What kind of things are you guys shipping every day? Well, anything from the smallest change to big changes. So one of the things that we do, that was new to me, that I really like at Etsy, is that basically everything on the site is controlled by some configuration flag. We do a huge amount of, like, A/B testing to kind of verify that a change is, number one, not going to create huge issues for users, but number two, that it's actually going to be something that the users will like. So everything down to the entire site itself is behind a configuration flag. Wow. Yeah. So all we have to do to disable the entire site is change one configuration setting. It does mean that we have some pretty bulky configuration files, because basically every feature is in there. But at the same time, we can deploy a change that only affects 1% of users, or that only, say, Etsy employees actually see, so that we can test things out, even in production, without actually having them affect users. But the changes we deploy are, I mean, that's how all of the changes happen to our site, through these little incremental changes every day.
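The flag behavior described here (fully on, fully off, employees only, or a percentage of users) can be sketched roughly as follows. This is a hypothetical illustration in Python, not Etsy's actual configuration format or code: the flag names, the `FLAGS` structure, and the functions `bucket` and `is_enabled` are all assumptions. It also assumes hash-based bucketing so that a given user stays in or out of a rollout between requests, which the interview does not specify.

```python
import hashlib

# Hypothetical flag definitions; names and structure are illustrative only.
FLAGS = {
    "site": {"enabled": "on"},                                  # the master switch
    "old_banner": {"enabled": "off"},
    "redesigned_cart": {"enabled": "staff"},                    # employees only
    "new_listing_form": {"enabled": "percent", "percent": 1},   # 1% of users
}


def bucket(flag_name: str, user_id: int) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: int, is_staff: bool = False,
               flags: dict = FLAGS) -> bool:
    """Decide whether this user should see the feature behind `flag_name`."""
    rule = flags.get(flag_name, {"enabled": "off"})  # unknown flags default to off
    mode = rule["enabled"]
    if mode == "on":
        return True
    if mode == "staff":
        return is_staff
    if mode == "percent":
        return bucket(flag_name, user_id) < rule["percent"]
    return False


if __name__ == "__main__":
    print(is_enabled("redesigned_cart", user_id=42, is_staff=True))
    print(is_enabled("old_banner", user_id=42))
```

Because unfinished code sits behind a flag that defaults to off, it can be merged and deployed without any user seeing it, which is the property the conversation goes on to describe.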
The other thing that's nice about the configuration flags is that developers can work on a new feature, and before it's complete, they can still be committing that code, and actually having that code in production, just not having any users actually see it. But it makes it so that we're more focused on actually getting your code merged with kind of our master branch. We're having all of that code live in one branch as much as possible, versus, you know, having to deal with huge merge conflicts when you've been working on a feature for two weeks, and, you know, suddenly everything has changed, and you have to spend three hours just resolving conflicts. Yeah. And so I'm assuming you guys are using Git. You are using Git? Yes. Yes, we are. You know, I don't know many people who don't, actually, at least in some capacity, use Git for all their version control at this point. Yeah. It's a very powerful tool. Sure. And you're saying you try to keep things in one branch as much as possible. Absolutely. And that makes for faster deployment, I would assume, and a little less overhead of trying to do, you know, resolving merge conflicts, as you kind of alluded to a second ago. Yeah, definitely. It's something that it's been a little more difficult to apply to, say, our mobile applications, but it's something we're even trying to do as much as possible there. All of this is so interesting to me because, I mean, getting a huge application like Etsy running is just, I mean, there's so many moving parts and pieces, and making it simple, I feel like, would be nearly impossible. But it doesn't mean that... I mean, complicated doesn't necessarily mean hard to manage. But I wanted to ask you about those, the flags that you mentioned. I would assume that because you do A/B testing, those flags are probably like a cascade of different points that they can be set at, right?
So you have like a default file that has all the flags that are set like in the normal use case, but then a given user for a given session could, you know, be able to set up a different... Yeah, a whole range of settings. There are a few different options. So like I mentioned, we can just enable it for employees of Etsy, so that any Etsy employee who's logged into the site will see that. We can also enable it for any percentage of users, so we can say, for example, okay, I want to test this change out with one percent of users, and the configuration flag will pick a random one percent to apply that setting to, and then you can kind of compare the behavior on that one percent. Yeah, which is super powerful. The other thing that it enables us to do is, kind of in our forums on Etsy, users can join groups, and so for some, we actually enable people to opt in. Yeah. So we say you can join this test group, and then you will see this new setting. So for example, if we're making changes to the way that sellers can list their new items, we can say, okay, any sellers that are interested in this, you can join the group and see what we're working on, and then we can interact with those sellers in particular and get their feedback on the feature as we're developing it. Oh, okay. Yeah, very interesting. It makes for a really strong business case, right? Because it definitely helps you guys keep things under control on the development side. It makes it flexible so that, you know, introducing a change to the site at large doesn't really make a huge... there's not a huge problem. In fact, it's enabling, right? You can test it more readily because those configuration flags are really kind of the master switches. Exactly. It's really interesting. Well, let's take a quick sponsor break, and then I want to come back and
talk to you about kind of how workflow actually happens when a given change is going to be made, all the way from that change request coming in, maybe, or however you guys do it. I just want to talk to you about that because I think that's going to be enlightening for me and for the listeners of the show. What if you could learn to build anything in one month? Well, with OneMonth.com, you can. Just ask any one of the 20,000 students who have learned to code on OneMonth.com by building real websites and applications, complete with payment systems, security solutions, and full stack deployment. You can start without any prior experience in just 15 minutes a day for 30 days, all online. That's because One Month hyper-focuses on applied techniques that you use immediately in the apps you are building as part of the courses. One Month's courses are the easiest way to learn new tech skills, including Ruby on Rails, content marketing, growth hacking, and more. And the best part is, if you get stuck, there's always someone there to help you out while you learn. Yes, that's a real person, not an automated computer. So enroll now at OneMonth.com/developertea and get 25% off your first month. Now normally, access to all courses costs $99, and access to one course usually costs $49, but with the special URL, you get full access for just $74, or one course for $37. That's less than $3 a day, or if you do a single course, it's just over $1 a day. Enroll now for 25% off your first month at OneMonth.com/developertea. So Russ, we've been talking about continuous integration and programmatic tests and test coverage and all of these things that sound great in theory and seem like the ultimate way of doing things, but it really comes down to actually doing the work, actually sitting down and implementing these things in your day-to-day work. And so I want to talk to you a little bit about how the developers at Etsy actually do this stuff.
Like what is the workflow from start to finish for a given feature, for example? So for any change that is deployed to the site, well, I guess I need to back up a little bit. The communication tool we use just company-wide is IRC. So just normal old Internet Relay Chat. You know, as old as it is, it's still a really powerful tool. So we have a channel on our IRC server that's just called push, and we call it the push queue. So the push queue is where a developer goes anytime they want to actually make a change to the site, anytime they want to push a change to the site. And so we have a little IRC bot that kind of manages that push queue. And we have an internal application on the side called Deployinator that kind of goes hand-in-hand with this push queue. So when a developer has a change that's ready to go, they will generally do that work kind of on just the master branch, and they'll keep that change local on their computer, and they'll go into this channel in IRC, and they'll tell the bot that they want to join. So they'll just say join, and the bot adds them to this queue. So for each change, there can be, you know, I think, I don't know, there's some arbitrary number of people, the maximum number of people that can be in the queue at any one point. And those people are the ones who are going to push their code together. So when their turn comes, when it's their turn to actually make the changes, the bot will say, okay, you're ready to go. So go ahead and put your code in. And so at that point, everyone actually pushes their code to kind of the master branch. And once they've pushed that code into the master branch, they say, okay, I'm in. And then once everyone has actually put their code in, the bot says, okay, we're ready to go. And then we have, we call them the driver, who actually handles it. On the companion site, this Deployinator, there's a button that they push that deploys all of that new code to a staging version of the site.
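The grouping behavior of the push queue can be sketched as a small data structure. This is a hypothetical model in Python, not the real IRC bot: the class `PushQueue`, its method names, and the group size of 3 are assumptions (Russ says only that there is some maximum number of people per group). For simplicity it also lets a developer join the group at the head of the queue while it still has room, which the real bot may not allow.

```python
from collections import deque


class PushQueue:
    """Groups developers into batches that push their code together."""

    def __init__(self, max_group_size: int = 3) -> None:
        # max_group_size is an assumed value; the interview does not give one.
        self.max_group_size = max_group_size
        self.groups = deque()  # each entry is a list of developer names

    def join(self, developer: str) -> int:
        """Add a developer to the queue and return their group's position (0 = next up)."""
        for position, group in enumerate(self.groups):
            if developer in group:
                return position  # already queued; do not add twice
        if not self.groups or len(self.groups[-1]) >= self.max_group_size:
            self.groups.append([])  # last group is full, start a new one
        self.groups[-1].append(developer)
        return len(self.groups) - 1

    def current_group(self) -> list:
        """Return the developers whose turn it is to push."""
        return list(self.groups[0]) if self.groups else []

    def done(self) -> list:
        """Called when the driver says done; returns the next group up."""
        if self.groups:
            self.groups.popleft()
        return self.current_group()


if __name__ == "__main__":
    queue = PushQueue(max_group_size=2)
    for name in ["alice", "bob", "carol"]:
        queue.join(name)
    print(queue.current_group())
    print(queue.done())
```

The first member of each group could be treated as the driver, though how the driver is actually chosen is not covered in the conversation.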
And so they deploy that code to the staging version of the site, which overall takes about a minute to actually deploy the code and make the changes. And at that point, the bot also starts our whole test suite on those changes. So the tests begin and the code is live on the staging site. Now, are those tests done on a separate server or are they done on the staging server or where do those happen? Yeah, they're done on a separate server. So we have an instance of Jenkins, which is kind of a testing, how do you say? Like a test runner? Yeah, yeah, a test runner. So it's really flexible. You can give it basically any type of task you want. And so it has a bunch of jobs that it will run to actually run the tests. And if any one of those fails, it'll return and say, hey, this failed. So what we do there, there are some key things. We, number one, try to make the tests run in parallel as much as possible. So we have literally a couple hundred servers that those tests can run on. And so for each push, we probably have, I would guess, about 20 different servers that will actually run. Wow. So definitely not just one staging server running the tests then. No, no. We have the staging server just so that they can test their changes on the front end. But while that's going on, the 20 servers are actually running the tests. So we split them up as much as we can. And we try to have all of the tests complete in under five minutes, which is really difficult when you're testing a large application. Do you know, offhand, how many tests you have? I don't. I know it's a lot. A lot. Yeah. And quick question for you, just speaking of tests, what frameworks are you using for your PHP tests? We are using PHPUnit. Okay. We also have a tool called Pake that I think is just kind of a wrapper around PHPUnit to simplify some things. Okay. So let's see. Okay.
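One way to split a test suite across many servers so the whole run finishes quickly is to balance the shards by expected duration. The sketch below is an assumption about how such a split could work, not a description of Etsy's Jenkins setup: it uses a greedy "longest test first, onto the least-loaded shard" heuristic, and the function name `split_into_shards` and the timing numbers are made up.

```python
import heapq


def split_into_shards(durations: dict[str, float], shard_count: int) -> list[list[str]]:
    """Assign tests to shards, always giving the next-longest test to the lightest shard."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    shards: list[list[str]] = [[] for _ in range(shard_count)]
    # Min-heap of (total seconds assigned so far, shard index).
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    # Longest tests first; ties broken by name so the result is deterministic.
    for name, seconds in sorted(durations.items(), key=lambda item: (-item[1], item[0])):
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return shards


if __name__ == "__main__":
    # Hypothetical per-test timings in seconds, e.g. taken from a previous run.
    timings = {"checkout": 60, "search": 50, "listing": 40, "cart": 30, "login": 20}
    for shard in split_into_shards(timings, shard_count=2):
        print(shard, sum(timings[name] for name in shard))
```

With the timings above, the two shards take 110 and 90 seconds instead of 200 seconds on a single server; the total wall-clock time is set by the slowest shard, which is why keeping shards balanced matters for a five-minute target.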
So once the changes are on the staging server, the bot in the IRC channel notifies everyone, hey, your changes are on the staging server. Please test them. And everyone's expected to actually test their changes. And at that point, we hope that we catch any issues that might arise. For example, if the staging site is broken, that's a bad sign. Sure. And so we give developers the chance to revert their changes if they need to. Mm-hmm. And do they do that through the IRC bot as well? Yes, they do. So they would do a git revert and then push the new changes and then deploy it to the staging site once again. Okay. To make sure that things were good. But yeah, it's still in conjunction with the IRC bot. Sure. So once each developer tests their changes, they tell the IRC bot, we call it .good because it's just a period, good, and the bot recognizes that as okay. They're saying that their changes are good. And once everyone has said they're good and the tests have come back, the tests have successfully passed. Mm-hmm. That's when we actually deploy to the production server. So once again, the driver of the train is the one that actually does that. We literally have a button that says deploy to production. So you click that button and in the course of about a minute and a half to two minutes, those changes are uploaded to all of our production servers and your code is live. Once that happens, developers are expected to test their change one more time on the production server. So they actually go to etsy.com and test their changes. Mm-hmm. And once again, say good. Once everyone has said that they're good, the driver says done. And at that point, the IRC bot says, okay, all of you are done and moves on to the next group of people that are ready to come in. Gotcha. Okay. That's very interesting. And how long has the IRC bot been a part of that workflow? You know, I'm not sure. I think. Since before you got there? Definitely since before I got there.
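The two gates in this workflow (everyone says .good on staging and the tests pass before production; everyone says .good on production before the driver says done) can be modeled in a few lines. This is a hypothetical sketch in Python; the class `PushTrain` and its method names are assumptions, not the bot's real commands or code.

```python
class PushTrain:
    """Tracks sign-offs for one group of developers pushing together."""

    STAGES = ("staging", "production")

    def __init__(self, members: list) -> None:
        self.members = set(members)
        self.good = {stage: set() for stage in self.STAGES}
        self.tests_passed = False  # set to True once the test suite comes back green

    def mark_good(self, developer: str, stage: str) -> None:
        """Record that a developer has checked their change on the given stage."""
        if developer not in self.members:
            raise ValueError(f"{developer} is not part of this push")
        if stage not in self.STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.good[stage].add(developer)

    def everyone_good(self, stage: str) -> bool:
        return self.good[stage] == self.members

    def can_deploy_to_production(self) -> bool:
        """Production deploy needs passing tests and every member's staging sign-off."""
        return self.tests_passed and self.everyone_good("staging")

    def can_finish(self) -> bool:
        """The driver can say done only after every member signs off on production."""
        return self.can_deploy_to_production() and self.everyone_good("production")


if __name__ == "__main__":
    train = PushTrain(["alice", "bob"])
    train.mark_good("alice", "staging")
    train.mark_good("bob", "staging")
    train.tests_passed = True
    print(train.can_deploy_to_production())
```

Requiring both the automated tests and a human sign-off at each stage means a change that passes the suite but looks wrong on the page can still be caught, and reverted, before it reaches production.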
I would say at least for the last three or four years, but maybe even longer than that. So. You know, what's so interesting about that to me is that everybody has a different way of doing this. Mm-hmm. Right? Some people have IRC bots or, you know, of course like the campfire bots, which are basically the same thing. We use just direct to server because we have a small team. So we're, you know, we're managing, we're yelling across the room, hey, are you good? I'm good. Let's do this, you know, and we're pushing. And we don't have to, typically we don't have many people working remotely, so we don't have to, you know, share servers. And it's not that big of a production to actually push to production. Mm-hmm. But that's not true for a larger application. Okay. How many developers does Etsy have working on Etsy.com? That would be hard to say. I think our engineering team in total is 250 to 300. Okay. How many of those are actually on Etsy.com? I'm not sure. We've obviously, our mobile teams have grown a lot recently. Sure. Mm-hmm. So. I'm sure a lot of people touch Etsy.com that aren't necessarily working all the time on Etsy.com sometimes too. Absolutely. Absolutely. And almost everyone will touch it at one point or another. Generally, for all of our engineers on their first day at the company, we try to have them push code that adds their little picture to our kind of about page, our team. Sure. Yeah. Which is neat. Yeah. It's like a rite of passage. Exactly. And you have to go through the IRC bot to make that happen. Yes, you do. Which is a little stressful the first time. Oh, I bet. Yeah. Well, because there's all these little domain specific things that you're doing with the IRC bot that if you were to go work at, I don't know, Facebook or something, it would be totally different there because they do things differently. Absolutely. And there's nothing wrong with that. In fact, obviously it's working well for Etsy because you guys are deploying 30 times a day. 
Like that process happens 30 times a day. Yeah. And it's obviously of necessity. Different companies have different ways of doing it because different things work for their specific situations. The other side of this is that there's so many checks along the way to make sure that things are not breaking. And that really is what, that's kind of the spirit of continuous integration is two things. One, don't break things. It's never smart to break things. It doesn't matter what anyone said in the past. I'm not going to point any fingers. It doesn't matter. It doesn't make sense to break things. But also ship a lot, right? Like constantly be shipping. Exactly. I'm just saying memes at this point. But just the continuous integration idea is that you don't need to do all of your pushes on Thursday. Like that doesn't make any sense because if something is done, then how quickly can you get value out of it? How quickly can Etsy customers? How can they get value out of that feature? Well, immediately, usually, right? Yeah. And that's, that's one of the big things. And some of the most important parts of continuous integration are making testing easy, making it so that for a developer it's, it's super easy and maybe even fun to write these tests that they can watch pass. Well, it spins up 20 servers when you press a button. It's true. Who doesn't like spinning up 20 servers just to run your tests? It's very true. It's enjoyable. I've watched, we had a project that we had 700, I think like 700 tests on and we watch, every time we run the test, we watch those little green dots come in and there is something kind of mesmerizing about it to see that many things working, right? Yeah. All working kind of in harmony with each other. There's, there's some, something strangely interesting. Yeah. It's just hypnotizing, I guess, about a 100% test passing moment. Exactly. 
It's so important to have that too, because when you start to allow failures to come through the cracks, you know, if someone says, oh, you know, this, this test maybe isn't so important. So we're just going to ignore when it fails. Suddenly you, you get in the habit of, of just seeing failures all the time. And at that point it's. There's a lot less signal and a lot more noise. Thanks so much for listening to this episode of Developer Tea. I hope you enjoyed the first part of the interview with Russ. If you want to make sure that you don't miss the second part, you can always subscribe in just about any podcasting app, iTunes, Stitcher, Developer Tea is pretty much everywhere and you should be able to find it pretty easily. If you would like to subscribe to the RSS feed, you can go to developertea.com and there's an RSS feed. And you can also find it in the header. Developertea.com also has the show notes for this episode and it has every other episode that I've ever done. So go check it out. Developertea.com. Thank you so much for listening and until next time, enjoy your tea.