Transforming Text into Data Structures

Text data poses a unique challenge: unlike numerical measurements, it has no direct numerical representation, and computers only understand numbers. Representing text using numbers is therefore a challenge. At the same time, it is an opportunity to invent and try out approaches that capture as much information as possible in the process. In this chapter, we will look at how text and math interface. Let's take baby steps toward transforming text data into mathematical data structures that will provide insights into how to represent text using numbers and, consequently, build Natural Language Processing (NLP) models.

Pause for a moment here and think about how you would try to solve this problem.

By the end of this chapter, we will be better equipped to handle text data, having understood techniques including count vectorization and term frequency-inverse document frequency (TF-IDF) vectorization, among others.

Before we proceed to discuss various possible approaches such as count vectors and TF-IDF vectors in this chapter, and further approaches such as Word2vec in future chapters, we need to understand two supremely important concepts that underpin every language: syntax and semantics. Syntax defines the grammatical structure, or the set of rules, of a language. It can be thought of as a set of guiding principles that define how words can be arranged to form sentences or phrases. However, syntactically correct sentences may not be meaningful. Semantics takes care of meaning and defines how to put words together so that they actually make sense when organized according to the available syntactic rules.

In this chapter, we will primarily focus on the syntactical aspects, where we use information such as how many times a word occurred in a document or in a set of documents as potential features to represent documents. Let's see how these approaches pan out in solving the representation problem we have.
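To make the idea of using word counts as features concrete, here is a minimal, illustrative sketch of count vectorization in plain Python. The documents and words are made up for illustration; the chapter covers full Bag-of-Words and TF-IDF vectorization in detail later.

```python
from collections import Counter

# Two toy documents (illustrative data, not from the chapter).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build a shared vocabulary from all documents; sorting gives a
# stable column order for the resulting vectors.
vocab = sorted({word for doc in docs for word in doc.split()})

# Represent each document as a vector of counts of each vocabulary word.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)     # the shared vocabulary (feature names)
for vec in vectors:
    print(vec)   # one count vector per document
```

Each document is now a fixed-length numerical vector, which is exactly the kind of representation the approaches in this chapter build on.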

The following topics will be covered in this chapter:

  • Understanding vectors and matrices
  • Exploring the Bag-of-Words (BoW) architecture
  • TF-IDF vectors
  • Distance/similarity calculation between document vectors
  • One-hot vectorization
  • Building a basic chatbot