Introduction to Text as Data
What information can we get from text? How do we get it? What are the different options for analyzing words as data? How do we know whether a method is doing a good job of capturing what’s in a document?
This class introduces computational approaches to collecting, organizing, and analyzing text as data. Computational methods can assist in forming impressions and formulating hypotheses when a lot of data is involved, and in applying classification schemes at scale. Like most methods, computational methods are never a substitute for careful theorizing, and using them will not rescue a poorly conceived project. For this reason, we begin by learning about the practice of content analysis more generally.
Python is the go-to language for large-scale text processing. R also includes a number of valuable packages, but has important limits where processing is concerned. We will learn the basics of the Python programming language and apply some important text analytic methods. We will be working with code, so you either need to have some minimal experience using a coding language (e.g., you have written scripts in Stata or R), or you need to recognize that you will probably have some catching up to do. If you are already a Python user but have not done much with text, we would be happy to have you participate.
Readings and Resources
General comments. A central virtue of text methods is that we have access to new data (text!). So you should be striving for a non-obvious, ‘transformative’ opportunity in your area of interest. I would much rather see ambitious, albeit half-completed, research projects than projects that merely demonstrate your ability to apply what we covered in class.
What particular part of your code is causing the problem? Test parts of your code in sequence. Test that your code works with a simple dataset. Consult common error messages. The Python Tutorial offers useful examples, and Stack Overflow has a lot of answers.
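A minimal sketch of the advice above, testing one step at a time on a dataset small enough to check by hand. The function name count_words and the two sample sentences are made up for illustration; they are not part of the course materials.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, pull out word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# A tiny dataset (hypothetical) whose correct answer is obvious.
docs = ["The cat sat.", "The dog sat, too!"]

# Step 1: test the function on a single document before anything else.
first = count_words(docs[0])
assert first == Counter({"the": 1, "cat": 1, "sat": 1})

# Step 2: only once step 1 works, combine counts across documents.
total = Counter()
for doc in docs:
    total.update(count_words(doc))

print(total)
```

If the assertion in step 1 fails, you know the problem is in count_words and not in the loop that follows, which is usually faster than debugging the whole script against your full corpus.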
Books (additional articles and papers are listed on the weekly schedule)
Saldana, The Coding Manual for Qualitative Researchers
    o Covers potential objectives of many content analysis projects, how to develop a coding scheme, and how to assess validity and reliability. We will not be using the software Saldana references.
Bird, Klein, Loper, Natural Language Processing with Python
    o There is a free downloadable version of this book, but it can also be purchased.
Lutz, Learning Python (a big, very readable introduction to the Python programming language)
Pythex https://pythex.org/ (Testing regular expressions)
Python Tutor http://www.pythontutor.com/visualize.html (see how the programming language actually works)
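The kind of quick check that Pythex offers in the browser can also be done with Python's built-in re module. A minimal sketch; the sample sentence and the pattern are assumptions chosen only for illustration.

```python
import re

# Hypothetical sample text, short enough to verify by eye.
sample = "The bills passed in 1994 and 2010, not in '94."

# Match exactly four digits with a word boundary on each side.
year_pattern = re.compile(r"\b\d{4}\b")

years = year_pattern.findall(sample)
print(years)  # ['1994', '2010']
```

Trying a pattern on one short string first makes it easy to see what it does and does not match (here, the two-digit '94 is skipped) before applying it to a large collection of documents.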
Participation (20%)
    o Come to class prepared and participate in class activities (including assisting classmates). Please refrain from spending class time answering emails, etc.
Homework (40%)
    o Coding homework shouldn’t take hours. If you are stuck, you are welcome to seek advice (from classmates or Andreu).
Research Project (40%) - Must be pre-approved by the instructor. A central purpose of this class is to open up a whole new world of data opportunities – text as data. Be ambitious! Your project will pose an interesting question, gather a substantial amount of text, convert it to data, and analyze it using the tools of this course (or beyond).
Papers should be about 15 pages in length. Substance is more important than length. Papers can be light in terms of theory and literature review, but should be heavy in terms of having an ambitious and well-supported research design. Be sure to include your carefully annotated IPython Notebook as a separate attachment.