
Russ Taylor, Part One: Continuous Integration

Published 6/30/2015

During today's first half of a two-part interview, I have the pleasure of talking with Russ Taylor. Russ works on the continuous integration team at Etsy.

We had a chance to talk about test driven development, all of the processes that Russ is involved with at Etsy, and how big applications like Etsy come together.

If you don't want to miss out on the second half of my interview with Russ, you can head on over to DeveloperTea.com and subscribe through our RSS feed, Stitcher, iTunes, or whichever podcasting platform you prefer.

This episode is sponsored by OneMonth.com. Head over to OneMonth.com/developertea to get started learning Ruby on Rails in just one month, and receive a limited-time 25% discount!

I hope you enjoy the show, and until next time,

Enjoy your tea.

Transcript (Generated by OpenAI Whisper)
Hey everyone and welcome to Developer Tea. My name is Jonathan Cutrell and today I have the pleasure of interviewing Russ Taylor. Russ works at Etsy on the continuous integration team. We had a fantastic conversation and actually spanned over two episodes just like Norm also. I hope you really enjoy the conversation that I have with Russ. He is very interesting to talk to because what he works on is used by so many people. It's such a big platform. We had a chance to talk about test-driven development and all of the processes that Russ is a part of using at Etsy. I hope you enjoy this episode of Developer Tea, an interview with Russ Taylor. Russ, welcome to the show. Thank you. It's really a pleasure to be here. I'm glad you have a chance to talk with me today because you do work for a company that probably everyone listening to this podcast has heard of. You work at Etsy. Can you tell everyone what your role is at Etsy? I am a software engineer on the continuous deployment team at Etsy, which basically means that I rarely touch anything on the front end. I'm more focused on making other developers' lives easier and making the process of updating the site and adding new code, what have you, as easy as possible. Sure. That's incredible. I've always been interested in how big applications like Etsy come together. How the team works on something that big. I'm excited to talk to you about that kind of stuff. Before we get into that, for listeners who may not know what continuous integration is, can you explain just a basic overview of what continuous integration actually is? Yeah. Continuous integration in a nutshell is running tests and builds on your software for perhaps every commit or every few commits or even just on a daily basis. Actually running those tests to make sure that everything is still in good shape and, depending on how you have it set up, either automatically deploying those changes when the tests pass or making it very easy to deploy those changes.
For example, in Etsy's case, we deploy code to the site about 30 times a day. Wow. New code. It's pretty amazing. It only takes about 15 minutes to go from, I have this commit on my local machine and I want to put this live on the site. Sure. We have a whole suite of integrated tests that we run on those, both unit tests, integration tests, things like that, to make sure that the code is actually working. We ask each developer to during the deployment process actually verify that their code is working. They verify that through you said unit tests, feature tests, or integration tests. For anyone who's not familiar with the way testing actually works, basically tests are written code that tests your code. It's like a program that you write alongside your other code to make sure that the original code is correct. Exactly. It's hard to explain until you actually see it. Once you've actually had a chance to experience it, especially experience the benefits of it, I'm biased. I like TDD or whatever you want to call it. I know that carries some baggage with it, but I appreciate tests more now than I ever have. I think that's because I've run into so many issues where I didn't have tests for whatever reason. I kick myself because I go back and I think, man, if I had tested this particular thing, especially just unit tests, that's the simplest level of testing. I feel like if I had just had unit tests, I would have saved myself a lot of pain from breaking stuff in the future. Absolutely. They're incredibly powerful and you don't really recognize that power until you've been on the other side where you realize how you could have easily resolved a problem by just writing a test. I'm interested to know workflow wise. Is Etsy actually fully test-driven development or is there another kind of approach that you guys take? There is another approach. We aren't fully test-driven. 
We try to have very good test coverage, and obviously we'd like to have 100% test coverage, which means every bit of code is tested in one way or another, but at the same time, we don't want to impose necessarily, we want to trust developers to do what is right. We don't want to say, well, before you write, before you put any code on the site, you need to have all of it tested. We try not to impose those strict rules, but at the same time, we like to have very good test coverage. We like to have as much as we can covered by tests, because it's by having those tests and having that fairly complete coverage that you can actually be confident in your code, especially about not reintroducing old bugs or creating new bugs in old code as you're changing it. Yeah, absolutely. What is Etsy written in? Are you guys doing, is it Ruby-driven or what is it? It's mostly PHP. On my team in particular, we use whatever we feel is best suited, but we're not necessarily actually writing the production code most of the time, the code that goes on the site, but the site itself is basically entirely PHP, and JavaScript for the front end. Sure. Yeah, you can't really get away from JavaScript these days. No, no, you can't. Do you guys write JavaScript tests as well? Yes, we do. I'm not sure that we have as much coverage on the JavaScript side, but yes, we definitely still do. You mentioned that you ship code 30-ish times a day, right? What kind of things are you guys shipping every day? Well, anything from the smallest change to big changes. One of the things that was new to me that I really like at Etsy is that basically everything on the site is controlled by some configuration flag. We do a huge amount of, like, A/B testing to verify that a change is, number one, not going to create huge issues for users, but number two, that it's actually going to be something that the users will like. Everything down to the entire site itself is behind a configuration flag. Wow. Yeah.
All we have to do to disable the entire site is change one configuration setting. It does mean that we have some pretty bulky configuration files because basically every feature is in there, but at the same time, we can deploy a change that only affects one percent of users or that only, say, Etsy employees actually see so that we can test things out even in production without actually having them affect users. But the changes we deploy are, I mean, that's how all of the changes happen to our site through these little changes, these little incremental changes every day. The other thing that's nice about the configuration flags is that developers can work on a new feature and before it's complete, they can still be committing that code and actually having that code in production, just not having any users actually see it. But it makes it so that we're more focused on actually getting your code merged with our master branch. We're having all of that code live in one branch as much as possible versus having to deal with huge merge conflicts when you've been working on a feature for two weeks. And suddenly everything has changed and you have to spend three hours just resolving conflicts. Yeah. And so I'm assuming you guys are using Git. Are you using Git? Yes, yes, we are. I don't know many people who don't actually, at least in some capacity use Git for all their version control at this point. Yeah, it's a very powerful tool. Sure. And you're saying you keep everything. You try to keep things in one branch as much as possible. Absolutely. And that makes for faster deployment, I would assume. And a little less overhead of trying to do resolving merge conflicts as you kind of alluded to a second ago. Yeah, definitely. It's something that it's been a little more difficult to apply it to say our mobile applications. But it's something we're even trying to do as much as possible there. 
All of this is so interesting to me because, I mean, getting a huge application like Etsy running is just, I mean, there's so many moving parts and pieces, and making it simple, I feel like, would be nearly impossible. But it doesn't mean that, I mean, complicated doesn't necessarily mean hard to manage. But I wanted to ask you about those flags that you mentioned. I would assume that because you do A/B testing, those flags are probably like a cascade of different points that they can be set at. So you have like a default file that has all the flags that are set like in the normal use case. But then a given user for a given session could have one of those flags overridden. Is that how that works or how does that whole flow actually happen? I believe that's how it is. I'm actually not entirely sure how the configuration files are applied. But we definitely have a whole range of settings. There are a few different options. So like I mentioned, we can just enable it for employees of Etsy. So any Etsy employee who's logged into the site will see that. We can also enable it for any percentage of users. So we can say, for example, okay, I want to test this change out with 1% of users. And the configuration flag will pick a random 1% to apply that setting to. And then you can kind of compare the behavior on that 1%. Which is super powerful. The other thing that it enables us to do is, in our forums on Etsy, users can join groups. And so for some settings, we actually enable people to opt in. So we say, you can join this test group. And then you will see this new setting. So for example, if we're making changes to the way that sellers can list their new items, we can say, okay, any sellers that are interested in this, you can join the group and see what we're working on. And then we can interact with those sellers in particular and get their feedback on the feature as we're developing it. Okay. Yeah. Very interesting.
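The three rollout options Russ describes here (employees only, a percentage of users, and an opt-in group) can be illustrated with a minimal Python sketch. This is not Etsy's implementation: Russ says he is not sure how the configuration files are applied, and the production site is PHP. The names `Flag`, `User`, `bucket`, and `is_enabled` are hypothetical, and hashing the user ID so that a user stays in the same bucket is an assumption; the interview only says the flag picks a random 1%.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Flag:
    """Hypothetical rollout rule for a single feature."""
    enabled: bool = False                # master switch for the feature
    employees_only: bool = False         # only logged-in staff see it
    percent: float = 0.0                 # share of users, from 0 to 100
    opt_in_group: Optional[str] = None   # members of this group always see it


@dataclass
class User:
    user_id: int
    is_employee: bool = False
    groups: Set[str] = field(default_factory=set)


def bucket(flag_name: str, user_id: int) -> float:
    """Map (flag, user) to a stable value in [0, 100).

    Assumption: hashing keeps a user in the same bucket on every request,
    which makes the behavior of the sampled group comparable over time.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return (int(digest[:8], 16) % 10000) / 100.0


def is_enabled(name: str, flag: Flag, user: User) -> bool:
    """Decide whether this user sees the feature behind the flag."""
    if not flag.enabled:
        return False
    if flag.employees_only:
        return user.is_employee
    if flag.opt_in_group is not None and flag.opt_in_group in user.groups:
        return True
    return bucket(name, user.user_id) < flag.percent


if __name__ == "__main__":
    staff_only = Flag(enabled=True, employees_only=True)
    print(is_enabled("new_listing_form", staff_only, User(1, is_employee=True)))  # True
    print(is_enabled("new_listing_form", staff_only, User(2)))                    # False
```

With `percent=1.0` and no other rule, roughly one user in a hundred falls below the threshold, which corresponds to the "test this change out with 1% of users" case.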
It makes for a really strong business case, right? Because it helps you guys keep things under control on the development side. It makes it flexible so that introducing a change to the site at large doesn't really make a huge, there's not a huge problem. In fact, it's enabling, right? You can test it more readily because those configuration flags are really kind of the master switches. Exactly. It's really interesting. Well, let's take a quick sponsor break and then I want to come back and talk to you about kind of how workflow actually happens when a given change is going to be made, all the way from that change request coming in. Maybe, or however you guys do it. I just want to talk to you about that because I think that's going to be enlightening for me and for the listeners of the show. What if you could learn to build anything in one month? Well, with OneMonth.com, you can. Just ask any one of the 20,000 students who have learned to code on OneMonth.com by building real websites and applications complete with payment systems, security solutions, and full stack deployment. You can start without any prior experience in just 15 minutes a day for 30 days, all online. That's because One Month hyperfocuses on applied techniques that you use immediately in the apps you are building as part of the courses. One Month's courses are the easiest way to learn new tech skills, including Ruby on Rails, Python, content marketing, growth hacking, and more. And the best part is if you get stuck, there's always someone there to help you out while you learn. Yes, that's a real person, not an automated computer. So enroll now at OneMonth.com/developertea and get 25% off your first month. Now normally access to all courses costs $99 and access to one course usually costs $49, but with the special URL you get full access for just $74 or one course for $37. That's less than $3 a day, or if you do a single course it's just over $1 a day.
Enroll now for 25% off your first month at OneMonth.com/developertea. So Russ, we've been talking about continuous integration and programmatic tests and test coverage and all of these things that sound great in theory and seem like the ultimate way of doing things, but it really comes down to actually doing the work, actually sitting down and implementing these things in your day-to-day work. And so I want to talk to you a little bit about how the developers at Etsy actually do this stuff. Like what is the workflow from start to finish for a given feature, for example? So for any change that is deployed to the site, well, I guess I need to back up a little bit. The communication tool we use just company-wide is IRC, so just normal old Internet Relay Chat. As old as it is, it's still a really powerful tool. So we have a channel on our IRC server that's just called push and we call it the push queue. So the push queue is where a developer goes anytime they want to actually make a change to the site. Anytime they want to push a change to the site. And so we have a little IRC bot that kind of manages that push queue and we have an internal application on the side called Deployinator that kind of goes hand in hand with this push queue. So when a developer has a change that's ready to go, they will generally do that work kind of on just the master branch and they'll keep that change local on their computer and they'll go into this channel in IRC and they'll tell the bot that they want to join. So they'll just say join and the bot adds them to this queue. So for each change there can be, you know, I think, I don't know, there's some arbitrary number of people, the maximum number of people that can be in the queue at any one point. And those people are the ones who are going to push their code together. So when their turn comes, when it's their turn to actually make the changes, the bot will say, okay, you're ready to go. So go ahead and put your code in.
And so at that point everyone actually pushes their code to kind of the master branch. And once they've pushed that code into the master branch, they say, okay, I'm in. And then once everyone has actually put their code in, the bot says, okay, we're ready to go. And then we have what we call the driver, who actually handles it on the companion site, this Deployinator. There's a button that they push that deploys all of that new code to a staging version of the site. And so they deploy that code to the staging version of the site, which overall takes about a minute to actually deploy the code and make the changes. And at that point, the bot also starts our whole test suite on those changes. Yeah. So the tests begin and the code is made on the staging site. Now are those tests done on a separate server or are they done on the staging server or where do those happen? Yeah, they're done on a separate server. So we have an instance of Jenkins, which is kind of a testing, how do you say, a test runner? Yeah, yeah, a test runner. So it's really flexible. You can give it basically any type of task you want. And so it has a bunch of jobs that it will run to actually run the tests. And if any one of those fails, it'll return and say, hey, this failed. So what we do there, there are some key things. We, number one, try to make the tests run in parallel as much as possible. So we have literally a couple hundred servers that those tests can run on. And so for each push, we probably have, I would guess, about 20 different servers that will actually run the tests for that. So definitely not just one staging server running the tests then. No, no, we have the staging server just so that they can test their changes on the front end. But while that's going on, the 20 servers are actually running the tests. So we split them up as much as we can.
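One way to split a suite across about 20 servers so that they all finish at roughly the same time is to hand each test, longest first, to whichever server currently has the least work. The Python sketch below shows that greedy scheme. It is an illustration only: the interview does not say how Etsy's Jenkins jobs divide the tests, and `split_tests` and the example durations are made up.

```python
import heapq
from typing import Dict, List


def split_tests(durations: Dict[str, float], shards: int) -> List[List[str]]:
    """Assign tests to shards, longest first, always to the least-loaded shard."""
    if shards < 1:
        raise ValueError("need at least one shard")
    # Heap of (total seconds assigned so far, shard index).
    heap = [(0.0, i) for i in range(shards)]
    heapq.heapify(heap)
    buckets: List[List[str]] = [[] for _ in range(shards)]
    # Sort by descending duration; ties are broken by name for stable output.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        buckets[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return buckets


if __name__ == "__main__":
    # Hypothetical test files and their past run times in seconds.
    timings = {"CartTest": 5, "SearchTest": 4, "ListingTest": 3, "ShopTest": 2, "UserTest": 1}
    for i, shard in enumerate(split_tests(timings, shards=2)):
        print(i, shard, sum(timings[t] for t in shard))
```

The wall-clock time of the whole run is then bounded by the slowest shard rather than by the sum of all tests, which is what makes a target like "under five minutes" reachable for a large suite.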
And we try to have all of the tests complete in under five minutes, which is really difficult when you're testing a large application. Do you know offhand how many tests you have? No, I don't. I know it's a lot. A lot. Yeah. Yeah. And quick question for you, just speaking of tests, what frameworks are you using for your PHP tests? We are using PHPUnit. Okay. We also have a tool called PAK that I think is just kind of a wrapper around PHPUnit to simplify some things. Okay. So let's see. Okay. So once the changes are on the staging server, um, the bot in the IRC channel notifies everyone, hey, your changes are on the staging server. Please test them. And everyone's expected to actually test their changes. And at that point, we hope that we catch any issues that might arise. For example, if the staging site is broken, that's a bad sign. Sure. And so we give developers the chance to revert their changes if they need to. Mm-hmm. And do they do that through the IRC bot as well? Yes, they do. So they would, um, do a git revert and then push the new changes and then deploy to the staging site once again. Okay. To make sure that things were good. But yeah, it's still in conjunction with the IRC bot. Sure. Um, so once each developer tests their changes, they tell the IRC bot, we call it dot good, because it's just a period and then good, and the bot recognizes that as, okay, they're saying that their changes are good. Um, and once everyone has said they're good and the tests have come back, the tests have successfully passed. Mm-hmm. That's when we actually deploy to the production server. Um, so once again, the driver of the train is the one that actually does that. We literally have a button that says Deploy to Production. So you click that button and in the course of about a minute and a half to two minutes, those changes are uploaded to all of our production servers and your code is live.
But once that happens, developers are expected to test their change one more time on the production server. So they actually go to Etsy.com and test their changes. Mm-hmm. And once again, say good. And once everyone has said that they're good, the driver says done. And at that point, the IRC bot says, okay, all of you are done and moves on to the next group of people that are ready to commit. Gotcha. Okay. That's very interesting. And how long has the IRC bot been a part of that workflow? You know, I'm not sure. I think since before you got, you got there. Definitely since before I got there. I would say at least for the last three or four years, but maybe even longer than that. So. You know what's so interesting about that to me is that everybody has a different way of doing this. Some people have IRC bots or, you know, of course, like the campfire bots, which are basically the same thing. We use just direct to server because we have a small team. So we're managing, we're yelling across the room, hey, are you good? I'm good. Let's do this. You know, and we're pushing and we don't have to, typically we don't have many people working remotely. So we don't have to, you know, share servers. And it's not that big of a production to actually push to production. But that's not true for a larger application. How many developers does Etsy have working on Etsy.com? Um, that would be hard to say. I think our engineering team in total is 250 to 300. Okay. How many of those are actually on Etsy.com? I'm not sure. We've obviously our mobile teams have grown a lot recently. Sure. Mm hmm. So I'm sure a lot of people touch Etsy.com that aren't necessarily working all the time on Etsy.com sometimes too. Absolutely. Absolutely. And almost everyone will touch it at one point or another. Generally for all of our engineers on their first day at the company, we try to have them push code that adds their little picture to our kind of about page, our team. Sure. Yeah. Which is neat. Yeah. 
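The push train Russ walks through is essentially a small state machine with a gate at each step: everyone reports that their code is in, everyone confirms on staging and the tests pass, then everyone confirms on production. The Python sketch below models only those gates. It is not Etsy's IRC bot or Deployinator; the `PushTrain` and `Stage` names are hypothetical, and treating the first person in the group as the driver is an assumption, since the interview does not say how the driver is chosen.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    COLLECTING = "waiting for everyone to push to master"
    STAGING = "deployed to staging, test suite running"
    PRODUCTION = "deployed to production"
    DONE = "done"


# The order in which the driver moves the train forward.
NEXT_STAGE = {
    Stage.COLLECTING: Stage.STAGING,
    Stage.STAGING: Stage.PRODUCTION,
    Stage.PRODUCTION: Stage.DONE,
}


@dataclass
class PushTrain:
    members: list                      # assumption: the first member is the driver
    stage: Stage = Stage.COLLECTING
    confirmed: set = field(default_factory=set)
    tests_passed: bool = False         # set to True when the test runner reports success

    @property
    def driver(self):
        return self.members[0]

    def confirm(self, who):
        """Record "I'm in" (while collecting) or ".good" (on staging or production)."""
        if who not in self.members:
            raise ValueError(f"{who} is not on this train")
        self.confirmed.add(who)

    def advance(self, who):
        """Driver action: deploy to the next environment, or mark the train done."""
        if who != self.driver:
            raise PermissionError("only the driver can advance the train")
        if self.stage is Stage.DONE:
            raise RuntimeError("train is already done")
        missing = [m for m in self.members if m not in self.confirmed]
        if missing:
            raise RuntimeError("still waiting on: " + ", ".join(missing))
        if self.stage is Stage.STAGING and not self.tests_passed:
            raise RuntimeError("test suite has not passed yet")
        self.stage = NEXT_STAGE[self.stage]
        self.confirmed = set()         # everyone must confirm again at the next stage
        return self.stage
```

Resetting `confirmed` after each step captures the detail that developers verify their change twice, once on staging and once on Etsy.com, before the bot moves on to the next group.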
It's like a rite of passage. Exactly. And you have to go through the IRC bot to make it happen. Which is a little stressful the first time. Oh, I bet. Yeah. Well, because there's all these little domain-specific things that you're doing with the IRC bot that if you were to go, you know, work at, I don't know, Facebook or something, it would be totally different there because they do things differently. Absolutely. And there's nothing wrong with that. In fact, no. Obviously it's working well for Etsy because you guys are deploying 30 times a day like that. That process happens 30 times a day. Yeah. But obviously, of necessity, different companies have different ways of doing it because different things work for their specific situations. The other side of this is that there's so many checks along the way to make sure that things are not breaking. And that really is kind of the spirit of continuous integration. There's two things. One, don't break things. It's never smart to break things. It doesn't matter what anyone said in the past. I'm not going to point any fingers. It doesn't make sense to break things. But also ship a lot, right? Like constantly be shipping. Exactly. I know. I'm just saying memes at this point. But just the continuous integration idea is that you don't need to do all of your pushes on Thursday. Like that doesn't make any sense because if something is done, then how quickly can you get value out of, how quickly can Etsy customers get value out of that feature? Well, immediately, usually, right? Yeah. And that's one of the big things. And some of the most important parts of continuous integration are making testing easy, making it so that for a developer, it's super easy and maybe even fun to write these tests that they can watch pass. Mm-hmm. Well, it spins up 20 servers when you press a button. It's true. Who doesn't like spinning up 20 servers just to run your tests? It's very true.
I've watched a project that we had 700, I think, like 700 tests on. And every time we run the tests, we watch those little green dots come in. And there is something kind of mesmerizing about it, to see that many things working, right? Yeah. All working in harmony with each other. There's something strangely hypnotizing, I guess, about a 100% test passing moment. Exactly. It's so important to have that too, because when you start to allow failures to slip through the cracks, if someone says, oh, this test maybe isn't so important, so we're just going to ignore when it fails, suddenly you get in the habit of just seeing failures all the time. And at that point, there's a lot less signal and a lot more noise. Thanks so much for listening to this episode of Developer Tea. I hope you enjoyed the first part of the interview with Russ. If you want to make sure that you don't miss the second part, you can always subscribe in just about any podcasting app. iTunes, Stitcher, Developer Tea is pretty much everywhere. And you should be able to find it pretty easily. If you would like to subscribe to the RSS feed, you can go to developertea.com. And there's an RSS feed in the header. developertea.com also has the show notes for this episode. And it has every other episode that I've ever done. So go check it out at developertea.com. Thank you so much for listening. And until next time, enjoy your tea.