I had the pleasure of talking with Grant Ingersoll. Grant is a search and NLP engineer, author of Taming Text and CTO at Lucidworks.
Code School has recently launched a second course on SQL, The Sequel to SQL. In this course you'll learn the most important parts of the SQL language so you can create tables with constraints, use relationships, and write powerful join queries. Learn more at codeschool.com/developertea.
Enjoy your tea.
Transcript (Generated by OpenAI Whisper)
Hey everyone, and welcome to Developer Tea. My name is Jonathan Cutrell, and today I am speaking with Grant Ingersoll. Grant, thanks so much for coming on the show today.

Thanks for having me, Jonathan.

So I reached out to you because you wrote on a very interesting topic, text, and that's probably huge, so I want to get into the details of what your book, Taming Text, is about. It was released back in 2013, but everything that was relevant in a book about that kind of topic two years ago is probably relevant today as well. Wouldn't you agree?

Yeah, I certainly hope so, and I think, you know, based off of sales, that's proving to be true, which makes me happy. I mean, Taming Text is really geared towards, in many ways, who I was as an engineer back when I started on this problem, which is, I knew how to program, you know, I had a computer science degree and knew algorithms and data structures and all that fun stuff, and I knew databases. And I had no clue about how to work with text. Well, I shouldn't say no clue. I mean, I knew kind of the basics of regular expressions and I could build those things out. And so in a lot of ways, what I wanted to do is take engineers who have never done this text stuff before, they haven't done search, they haven't done classification, they haven't done clustering, they haven't done named entity extraction, and take these common problems in natural language processing or text and make them accessible to engineers without all of the math. So, you know, the book is really geared so that each chapter kind of starts off with, here's the concepts, here's the examples, here's some real-world things that you see. And then it goes and shows open source examples of how to do that. Very little math, but a lot of code.
And then we ultimately end up with a working, fairly simple question answering system that can do, you know, fact-based questions like, who is the president of the United States, or, you know, what is Developer Tea?

Right. Mm-hmm. What's so interesting about text to me is that it is only a representation of kind of an idea, right? So, you know, with numbers, you can see that numbers are working out in a particular way based on a set of constructs, a couple of rules that you learn, maybe early in life, the base 10 system, for example. And you can see that kind of played out. You know, one item is just one item, and one means one in every language. And so if you approach the idea of text, it's significantly more complicated. So you have to, you know, surround text in those rules. And that's kind of that idea of taming it, right?

Yeah, that's exactly it. I mean, not to pat myself too much on the back here, but one of my favorite lines in the book is this notion that, you know, there's a reason why the saying is "the numbers don't lie," right? And not "the text doesn't lie." Text lies all the time. I mean, not necessarily on purpose, right? But, you know, just take the word bank. What does that mean? If I just say the word bank, well, especially given what's happening on Wall Street and the New York Stock Exchange yesterday, you probably conjure up that view of it. But if you and I were going fishing and we wanted to stand on the bank, then it would be a whole different context. And so, you know, text is this really complicated thing that's evolved over all of these years, and by text, of course, we mean voice and essentially human language, right? And, you know, not everybody speaks the same language. We pretty much all agree on the same numeric systems these days and all of those kinds of things.
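The "bank" example above can be made concrete with a very small sketch. This is an illustration only, not code from the book: it guesses a sense by counting how many hand-picked cue words appear in the surrounding sentence, which is a much-simplified cousin of overlap-based word sense disambiguation. The cue-word lists and the `guess_sense` name are assumptions made up for this example.

```python
import re

# Hypothetical, hand-written cue words for two senses of "bank".
# A real system would learn these from data or a dictionary.
SENSE_CUES = {
    "financial institution": {"money", "loan", "deposit", "account", "stock", "exchange"},
    "river edge": {"river", "fishing", "water", "shore", "stand", "mud"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def guess_sense(sentence):
    """Return the sense whose cue words overlap most with the sentence,
    or None when no cue word appears at all."""
    words = tokenize(sentence)
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("We went fishing and wanted to stand on the bank of the river."))
print(guess_sense("I opened an account at the bank to deposit some money."))
print(guess_sense("bank"))
```

With no context at all, the sketch declines to answer, which mirrors the point made in the conversation: the bare word does not carry enough information.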
We don't agree on language at all. And so it's this really hard thing to grasp. And in fact, as hard as it is for us as humans, think about how we then translate that into a computer system and help a computer better understand it. And that really is what we're trying to get at here: what are some of the tips and tricks and techniques for engineers, for programmers, to go in and better handle this stuff? Because let's face it, almost all of our content actually is text based, or at least has undertones, if you will, of text.

Sure. Yeah. Well, it's the fundamental way in which we communicate, right? I speak to you on this podcast and, you know, you could write out everything that we're saying, and it's a conversation between two people. And if you break it down, the computer is just representing this with numbers. And so it's not just about representing text, but also understanding and communicating the meaning of the text itself to the computer.

Yeah, that's exactly it. I mean, it's been really amazing to me, as somebody who's been in this field for a long time, how much better things have gotten in, say, the last five years. I mean, there's so much capability out there now to weave together, you know, open source with your own ideas and proprietary systems, etc. I wouldn't quite go as far as calling it a golden age of handling text in computers, but it has gotten significantly better from when I started. I always used to look at it as, yeah, this is really fun as an engineer to work on, but, you know, wink wink, nod nod, this stuff doesn't work. Right? Whereas nowadays, yeah, it can work, and it can work pretty well in a lot of situations. And we're getting better every day at it. There's still some spots where, you know, of course, it's going to screw up, but hey, guess what?
You know, just in this one conversation alone, we've screwed up already as well, just with ums and ahs and, you know, not picking our words right or whatever. So you kind of have to keep that in mind, that we shouldn't be too hard on the machines.

Right. Yeah. And to say that our language has the possibility of being perfect is to misunderstand that we have different understandings of the language itself. You know, when you say the word bank, like you're saying, even between you and me, there is a little bit of loss of resolution there. And so how much more resolution is lost when we're trying to communicate to something that doesn't understand inflection yet?

Yeah, exactly. I mean, and I think what you're starting to see is more and more tools that are kind of working up the NLP stack, if you will. You know, the low level of the stack is things like character encodings and syntax and punctuation, and higher up the stack is things like world knowledge or, you know, at least tribal knowledge anyway: the fact that you and I both speak English. We didn't even have to assert to each other that we both spoke English. We just took it for granted, right? Or, you know, whatever our cultural references are right now that are implied. But if you're somebody from a different country coming in, you may not understand those. And so all of those things create friction in, you know, essentially the communication transaction, right? And so what's been really amazing, I think, in recent years is the tools that are getting better and better at moving up that stack. You just look at, for instance, the machine translation space, right, of being able to take and automatically translate between English and French or whatever else.
I mean, those tools have gotten significantly better, in no small part due to the big data movement and the internet and the fact that we have access to so much more content across so many different places.

Understanding the differences between meanings of a given text is such a huge part of NLP, and, you know, people are doing a lot of research on NLP. They have for a long time, but even more now, because things like multi-core and higher computing power are going to make NLP much more accessible, I think, to lower-level devices or, you know, everyday devices, handheld devices. And if it's sending up to a server, like voice recognition on an iPhone does, that kind of availability is here. It's a real thing. And there's so many interesting facets and tons of research. I'd be interested to know, beyond your book, if somebody's particularly interested in NLP specifically, where you would point them to start doing research. Of course, starting at Taming Text, especially if you're a programmer or a developer, to get some really practical, useful knowledge and start working with this stuff. But what other resources would you suggest for NLP?

Yeah, that's a great question. I mean, well, first off, with the book, some of the examples, I think, are already outdated, although we have a branch on our GitHub where we've been slowly updating the code to newer versions of libraries like Solr and all that kind of stuff. But I think the core principles of the book are still right. I did actually just have an article published recently on opensource.com on five tools, five libraries for getting started with natural language processing. And in there, you know, there's some really good libraries out there. The Stanford NLP group puts out a really good library.
It does a lot of different and interesting things around NLP. Apache has an OpenNLP project, which I cover some in the book. It's a good starting place. I wouldn't say it's necessarily best in class these days, but it's still pretty good. You know, there's a whole bunch of open source libraries out there. Other books, if you like the more academic version, if you want more of that rigor around the math and all that: Manning and Schütze have a great book called Foundations of Statistical Natural Language Processing. That's a mouthful, but it covers kind of all of the gory details. And of course, academia obviously has a ton of work going on here. There's conferences and papers and all that constantly being published. Though, from a developer standpoint, I would just go check out some open source libraries, like some of the ones I mentioned. NLTK, the Natural Language Toolkit for Python developers, is a fantastic library for getting started as well and has a lot of capabilities. There's one called GATE, G-A-T-E. That's another great framework that has a lot of pieces that you can kind of plug together to build out NLP stuff. You know, lots and lots of good stuff out and available on the web. That's probably where I would start: just with some simple Google searches, go look at some open source projects, and then really just get your hands dirty trying out some of these libraries. I mean, you know, go download your favorite book or whatever, or go to Gutenberg and get some public domain novel and just chunk it through the code and see what happens. You know, that's kind of how I started. You just start a little bit at a time, and then pretty soon you start having these flashbacks to high school where you're like, oh, yeah, I remember my grammar teacher teaching me the not.
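The "download a public domain novel and chunk it through the code" suggestion above can start as small as the sketch below. It uses only the Python standard library rather than any of the libraries named in the conversation, so that it runs with nothing installed; the function names, the tiny stopword list, and the sample sentence are assumptions chosen for illustration.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list for illustration; real lists are much longer.
STOPWORDS = frozenset({"it", "was", "the", "of", "a", "and", "to", "in"})


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent tokens that are not stopwords, with their counts."""
    counts = Counter(token for token in tokenize(text) if token not in stopwords)
    return counts.most_common(n)


# Opening words of a public domain novel, used here as a stand-in for a full text.
sample = "It was the best of times, it was the worst of times."
print(top_words(sample, n=3))

# To try a whole book, read a plain-text file you have downloaded yourself.
# The filename below is hypothetical:
#     with open("novel.txt", encoding="utf-8") as handle:
#         print(top_words(handle.read(), n=20))
```

Even a count this crude tends to raise the questions the rest of the conversation touches on: what counts as a word, which words matter, and how much grammar is needed to go further.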
And then next thing you know, you're immersed.

Sure. Yeah, and because, you know, this comes down to grammar a lot of the time. To do things like sentiment analysis... well, sentiment analysis, I guess, is something slightly different. To be able to classify a given word in its context in a given sentence, you know, to make good decisions about what that sentence means, you have to understand fundamentally what that word actually is in terms of grammar: where does it fit next to another word, does it modify another thing in that sentence? You know, if I say one cup versus two cups, well, that two is modifying cups, right? There's so many different things like that that you will run into as you actually try to do things. For example, if you just open Google.com and start typing things into Google, that will give you an idea of what types of problems you will run into doing text analysis in general. But there's nothing like actually doing a project for yourself.

That's right.

Do something that analyzes text in some way.

Yep, that's exactly right. And I think that's, yeah, I mean, that's especially kind of the engineer mindset: dig in and figure it out and then go learn what you don't know.

Sure. Yeah. Seeing what happens and looking at some examples, and then going from there.

Yeah, absolutely.

Grant, I'm going to take a quick sponsor break and then I'm going to come back. I have a couple more questions before we wrap up.

Great.

Thanks so much to today's sponsor, Code School. If you're building a web application, you're very likely going to need to know how to use SQL. SQL is the language that communicates with your database. Code School recently launched a second course on SQL. It's titled The Sequel to SQL. In this course, you'll learn the most important parts of the SQL language.
So you can create tables with constraints, use relationships, and write powerful join queries. With levels like aggregate functions, subqueries and normalization relationships, this brand new course will help you become fluent in the most widely used database language on the web. You can learn more today at codeschool.com/developertea. Of course, as always, that link will be in the show notes. Go to codeschool.com/developertea.

Grant, we've been talking about Taming Text, your book, as well as the subject of taming text, talking about natural language processing, which, by the way, I don't know if we actually mentioned what NLP means. For those of you who haven't had an introduction to NLP, it's the idea that we've been talking about: basically, processing language that comes from a human in some way to be able to understand the meaning of that language. I have a few questions for you now that are kind of hard, and I told Grant already off the record what these are, and he said, yeah, those are hard. So the first one has to do, once again, with the domain of analyzing text. What would you say right now is kind of the next big problem to be solved in text analysis and language analysis?

Wow, you know, that's a great question. I mean, I think there's a lot of places you could go with that. You know, we talked about machine translation earlier; that's got to be one of them. I think, when I think about what is still hard, it's really the understanding and leveraging NLP to figure out what's important in text, and importance means a lot of different things to somebody, right? And so, you know, if you imagine being able to take in just lots and lots of this data, how do you sift through all of it? To me, it's kind of the end user application of, how do we make the machines do a lot more of this grunt work.
The other thing with NLP is, you know, there's obviously some just practical things: this stuff is usually really CPU and compute intensive. Are there better algorithms, better data structures, so that we can do more, faster, better, you know, stronger, all of those kinds of things, at larger scales? You know, the deep learning stuff that's going on, I think, is pretty interesting around NLP and applications there, and building up much more complex statistical models. At the end of the day, though, I think it still ties back into the human level of how do we make these interfaces complementary and natural for us, right? You know, the phone extension is great. I can just talk to my phone and I can get a lot of that. But how can we start to bridge the gaps between the ways people communicate? To take your example earlier, we used this acronym NLP and we didn't actually define it. Well, what if, for somebody listening to this, the system automatically said, oh, here's this acronym that was used, let me fill that in for you? Right? You think about how much miscommunication happens in this world that causes, you know, everything ranging from war all the way down to you overpaying or getting mad at somebody or whatever, right?

Yeah.

Are there ways we can leverage this stuff to be, you know, better at those kinds of things? Now, that's really pie in the sky and has a nice ring to it and all of that. But the practical side of it too, I think, is just being better, faster, stronger at this stuff, and dealing with more of the ambiguity in language. Me personally, I spend a lot of time thinking about question answering and search. And so how do I get the right information, the right amount of information, to the right people at the right time?

Yeah.
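The "fill in the acronym for the listener" idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a description of any existing system: it looks for all-caps tokens with a regular expression and expands only the first occurrence of each one, using a hand-written glossary. The `GLOSSARY` contents and the `expand_first_use` name are made up for this example.

```python
import re

# Hypothetical glossary; a real system would need a much larger, context-aware one.
GLOSSARY = {
    "NLP": "natural language processing",
    "SQL": "Structured Query Language",
}


def expand_first_use(text, glossary=GLOSSARY):
    """Append the expansion in parentheses after the first use of each known acronym."""
    seen = set()

    def replace(match):
        acronym = match.group(0)
        if acronym in glossary and acronym not in seen:
            seen.add(acronym)
            return f"{acronym} ({glossary[acronym]})"
        return acronym  # unknown or already-expanded acronyms are left alone

    # Two or more consecutive capital letters, bounded by word breaks.
    return re.sub(r"\b[A-Z]{2,}\b", replace, text)


print(expand_first_use("We used NLP earlier. NLP is usually CPU intensive."))
```

The hard part that the sketch skips is exactly the ambiguity discussed in the conversation: the same letters can stand for different things depending on who is speaking and about what.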
So, you know, my company Lucidworks focuses a lot of our time on building off of open source search and NLP and all those kinds of things. You know, so I mean, really, and this is the beauty of it, right? It's just wide open.

Yeah. You know, that's the kind of problem you want to work on as a developer, right? The ones that are wide open, and they're limitless in size and scope. And so you're never going to get bored. I think the best problems are the ones that have the pie-in-the-sky answers. You know, you mentioned that, and I don't think it's totally out of bounds for us to say that should be our goal as developers. We can actually solve human problems. We can actually make people's lives better. When somebody says, oh, did I miss my turn? It'd be really interesting to hear my car say, no, no, you haven't missed your turn yet. Keep on going. You know, little things like that, it seems insignificant and kind of hyper-futurist or whatever you want to call it. But that is a reality that we are not far away from.

Yeah. Well, you know, I mean, I think, especially lately with some of the talk around AI and people getting all freaked out about it, I'm personally a glass-half-full kind of guy. I think there's just limitless possibilities around what we can do with this stuff to enhance our lives and make ourselves better as people. And so, sure, Terminator's fun to watch as a movie, and I think we all sometimes get worried about that kind of view of AI, but it doesn't have to be that way, right? We can use this stuff to make our lives better. And, you know, it's our choice, right?

Yeah, nothing is particularly inevitable, necessarily, right?
Wow, we've gone a long ways. At one point we were talking about algorithms, and now we're talking about, you know, self-driving cars. But I think that's the power of language, right? It inspires us to dream and communicate those dreams with people, and, you know, why not have the computers help us with those things? They help us with so many other things.

They certainly help us with numbers.

Yeah, exactly.

Yeah, bringing it back home.

Yeah, there we are, right? Right on track.

Well, let me ask you one more question, Grant, and then we'll wrap up for today. And, you know, I think I'd love to have you back on the show sometime in the future and maybe talk a little bit more about this subject, because it is wide open. Like you said, I think even half a year from now, everything is going to change pretty significantly. You know, I actually recently got the Amazon Echo. And, I mean, it's really, really cool. I'm walking around my house, I've set it up to be connected to my Hue lights, and I can tell Alexa to turn on my lights, and it turns on my lights. You know, it's very different from opening my phone and touching buttons to do that.

Yeah.

There's something really interesting. So I think, with people like Amazon investing in that kind of technology, we're going to see it shift faster and faster all the time. My last question for you today, and I ask all of my guests this question, so this one's not pointed at you directly: if you were to sit down and talk with any kind of developer, whether they are well into their career or they're younger and just now starting out, what kind of advice would you give that developer?

Yeah, that's a great question with a lot of responsibility.
You know, when I look back at my own career, I think there were a couple of key inflection points in that. And I think if anybody did that... I actually give a talk on this, and I have at a couple of different conferences. But one of the key inflection points was very early on in my career. I interviewed at this company that wanted me to do, you know, some really basic programming kinds of things. I won't get into it too much, at the risk of putting down anybody who works in a particular area. It wasn't right for me. And the reason it wasn't right for me is I didn't feel like I would be challenged by it. And now, in hindsight, actually, that whole industry has taken off quite a bit and has evolved very significantly to be really challenging. But at the time, it was not. And so the thing that I kind of took away from that was this notion that you always have to work on hard problems, and a hard problem, you know, is something that's hard for you. It doesn't necessarily have to be that you're solving the hardest problem in the world, the grand challenges. But it has to be something that is going to captivate you and make you want to continue to learn and invest in it. And when it doesn't, then, you know, it's time to move on and find the next one. And so my advice, almost always, especially to people who are just getting started in their careers, is go find a hard problem to work on. Dig into it. You know, the corollary that goes with this as a programmer, I think, is go find an open source project that tackles that hard problem and start contributing to it. You know, spend your nights and weekends doing that. And your whole career path, assuming it's an area that people care about, or reasonably care about, and you're not off just doing something that nobody cares about.
Your whole career path just gets so much easier in some regards, because you now have a body of work that is public that you can point to. So never mind all the other benefits of open source, which are fantastic. This is essentially a body of work that you can point back to, much like an artist would have a body of work and say, look, I painted this, I painted this, I painted this, and this is why you should pay me to paint yours, right? Or a photographer, right? It's the same kind of thing: you've got to develop and curate your portfolio of work, and there's no better way to do that than open source, especially if you can find a good project. Right now, you know, I'd put my own hat on there and say, well, Lucene and Solr are great places to do that, or Mahout or some of these projects. But it really can be anything. Go and find one of those and get over the fear of contributing to them. I think a lot of people end up being blocked because they're afraid: oh, what if somebody doesn't like my code, or what if I don't do it right? And, well, good: now you have somebody telling you how to fix things. You'll learn from that, right? And by doing that, you know, I can remember multiple times in my career where you would walk into the room and people would be like, oh, well, I've already read all your code, here's a few questions. And it's like they're trying to convince you as opposed to you trying to convince them, and that's a really nice position to be in as somebody seeking employment, right? And so, you know, go spend the time doing that kind of thing. Get up on a hard problem and show that you're capable of making incremental improvement on that problem.

Yeah, that goes for people who are interested in anything. It's not just text, it's not just searching text or categorizing or doing NLP. It's anything: interface work, even, you know, front-end work.
There are so many problems that have yet to be solved, and, you know, you might hear somebody say that's not true, but I guarantee you it is. There are so many problems yet to be solved.

Just go look at my issue backlog and I could show you.

Yeah, I mean, all over the place. And, you know, if you are interested in working on text problems, follow Grant. That's another great way to find out what is going on in that particular niche of the programming world, of the engineering world. But there are so many things out there still left to be solved. So don't feel like you have to make up your own problems to solve. You can go and create value in the world today without even having a job. Now, of course, a job is nice to have, right?

Exactly.

But you can create value in the world and make yourself even more employable today by working on something that's open source.

Absolutely. Exactly.

Very cool. Grant, thank you so much for being on the show today. I appreciate your time.

That was my pleasure, Jonathan. Thank you.