- Python Natural Language Processing
- Jalaj Thanaki
Exploring different file formats for corpora
Corpora come in many different file formats. In practice, the following formats are the ones most commonly used. All of them are generally used to store features, which we will later feed into our machine learning algorithms. Practical details of working with these file formats appear from Chapter 4, Preprocessing, onward. The file formats are as follows:
- .txt: This is usually the format in which a raw dataset is given to us. The Gutenberg corpus is one example corpus. Some real-life applications need parallel corpora: for instance, if you want to build grammar-correction software similar to Grammarly, you will need a parallel corpus.
- .csv: This kind of file format is generally given to us if we are participating in hackathons or in Kaggle competitions. We use this format to save the features that we derive from raw text, and the feature .csv file is then used to train our models for NLP applications.
- .tsv: To understand when this file format is useful, consider an example. Suppose you want to build an NLP system that suggests where to put a comma in a sentence. In this case, we cannot use the .csv format to store our features, because some of the feature attributes themselves contain commas, and this would corrupt the feature file when we process it. You can also use any custom delimiter, such as \t or ||, for ease of further processing.
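To illustrate why the delimiter matters, the sketch below (the sample rows are invented for illustration) writes a feature row whose text attribute contains a comma and reads it back; with a tab delimiter, the field survives intact:

```python
import csv
import io

# Feature rows where one attribute itself contains a comma,
# which would break a naive comma-separated file.
rows = [
    ["sentence", "label"],
    ["I went home, and slept", "comma_needed"],
    ["I went home", "no_comma"],
]

# Write and read back using a tab delimiter (an in-memory buffer
# stands in for a .tsv file on disk).
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)
buf.seek(0)
data = list(csv.reader(buf, delimiter="\t"))

print(data[1][0])  # prints: I went home, and slept
```

The comma inside the sentence is preserved as part of a single field, which is exactly what a comma-delimited file cannot guarantee without quoting.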
- .xml: Some well-known NLP parsers and tools provide results in the .xml format. For example, the Stanford CoreNLP toolkit provides parser results in the .xml format. This kind of file format is mainly used to store the results of NLP applications.
- .json: The Stanford CoreNLP toolkit provides its results in the .json format. This kind of file format is mainly used to store results of NLP applications, and it is easy to display and integrate with web applications.
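As a minimal sketch of consuming such JSON output, the snippet below parses a tiny hand-written result in the spirit of CoreNLP's JSON. The exact keys ("sentences", "tokens", "word", "pos") are an assumption based on typical CoreNLP server responses; check the structure of your own output before relying on them:

```python
import json

# A tiny, hand-written result imitating Stanford CoreNLP's JSON layout.
# The key names here are assumptions, not the book's data.
raw = """
{
  "sentences": [
    {"index": 0,
     "tokens": [
       {"index": 1, "word": "Corpora", "pos": "NNS"},
       {"index": 2, "word": "vary", "pos": "VBP"}
     ]}
  ]
}
"""

result = json.loads(raw)

# Flatten the nested structure into (word, POS tag) pairs.
pairs = [(t["word"], t["pos"])
         for s in result["sentences"] for t in s["tokens"]]
print(pairs)  # [('Corpora', 'NNS'), ('vary', 'VBP')]
```

Because JSON maps directly onto Python dictionaries and lists, this kind of result is easy to traverse and to hand off to a web application.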
- LibSVM: This is one of the special file formats. Refer to the following Figure 2.4:
Figure 2.4: LibSVM file format example
- LibSVM allows for sparse training data: only the non-zero values are included in the training dataset, and each index specifies the column of the instance data (the feature index). To convert a conventional dataset, just iterate over the data and, whenever the value of X(i,j) is non-zero, print j + 1:X(i,j).
- X(i,j): This is a sparse matrix:
- If the value of X(i,j) is non-zero, include it in the LibSVM format as j+1:X(i,j)
- Here, j+1 is the feature index: j is the column index of the matrix starting at 0, so we add 1 to get LibSVM's 1-based index, and X(i,j) is the value
- Otherwise, do not include it in the LibSVM format
- Let's take the following example:
- Example: 1 5:1 7:1 14:1 19:1
- Here, 1 is the class or label
- In the preceding example, consider 5:1: it is a key : value pair, where 5 is the key and 1 is the value
- 5 is the column number, or data attribute number, which serves as the key; since the LibSVM format includes only the data columns that contain non-zero values, the value here is 1
- The features with indexes 1, 2, 3, 4, 6, and every other unmentioned index have the value 0, so they are not included in the example
- This kind of data format is used in Apache Spark to train models on your data, and you will learn how to convert text data to the LibSVM format from Chapter 5, Feature Engineering and NLP Algorithms, onward.
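The conversion rule described above can be sketched in a few lines of Python. The function name `to_libsvm` and the sample row are illustrative; the row is constructed so that it reproduces the example line 1 5:1 7:1 14:1 19:1:

```python
def to_libsvm(label, row):
    """Convert one dense feature row to a LibSVM-format line.

    Only non-zero values are emitted; the feature index is j + 1
    because LibSVM indices start at 1 while Python's start at 0.
    """
    parts = [str(label)]
    parts += [f"{j + 1}:{v}" for j, v in enumerate(row) if v != 0]
    return " ".join(parts)

# A row whose non-zero entries sit at 0-based columns 4, 6, 13, 18,
# which become 1-based indexes 5, 7, 14, 19 in the output.
row = [0] * 20
for j in (4, 6, 13, 18):
    row[j] = 1

print(to_libsvm(1, row))  # prints: 1 5:1 7:1 14:1 19:1
```

All sixteen zero-valued columns simply disappear from the output line, which is what makes the format attractive for the very sparse matrices that text features typically produce.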
- Customized format: You can also build your feature file in a customized file format; the CoNLL dataset is one example of such a format.
There are many different CoNLL formats since CoNLL is a different shared task each year. Figure 2.5 shows a data sample in the CoNLL format:
Figure 2.5: Data sample in CoNLL format
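A minimal reader for CoNLL-style data can be sketched as follows. The sample uses the four-column layout of the CoNLL-2003 shared task (word, POS tag, chunk tag, NER tag), with blank lines separating sentences; as noted above, the exact columns differ between shared tasks, so treat this as an assumption to adjust for your own data:

```python
# Sample lines in the CoNLL-2003 style: one token per line,
# whitespace-separated columns, blank line between sentences.
sample = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def read_conll(text):
    """Split CoNLL-style text into sentences of column tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tuple(line.split()))
    if current:                       # flush the final sentence
        sentences.append(current)
    return sentences

sents = read_conll(sample)
print(len(sents))        # prints: 2
print(sents[1][0][0])    # prints: Peter
```

Each sentence comes back as a list of tuples, one per token, which is a convenient shape for feeding into a sequence-labeling pipeline later.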