Data Vectorization

Vectorization is the practice of restructuring data operations so that they act on whole arrays at once rather than on individual values, which makes them far more efficient. It is particularly beneficial for processing large data sets, since it operates simultaneously on multiple data points, and it is a fundamental aspect of array programming.

Traditionally, data management tasks have relied on loops to execute operations sequentially on each piece of data, a method known as scalar computing. This scalar approach tends to be sluggish when processing large data volumes.

By contrast, vectorization facilitates parallel execution: operations are carried out on entire data arrays at once rather than on individual data points one at a time. In doing so, vectorization capitalizes on the design of contemporary hardware, such as CPUs with SIMD instructions and GPUs, which handle parallel operations efficiently. Libraries like NumPy in Python exemplify this and are highly optimized for vectorized operations.
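For instance, a minimal sketch comparing a scalar loop with its vectorized NumPy equivalent:

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# Scalar approach: a Python loop visits one element at a time.
squared_loop = [x * x for x in data]

# Vectorized approach: one expression operates on the whole array at once.
squared_vec = data * data
```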

In the NLP world, the analogous process of turning text into numeric vectors is known as word embedding.

Vectorization, or word embedding, therefore converts text data into numerical vectors. Those vectors are later used to build various machine-learning models. In this sense, the process can be viewed as extracting features from text in order to construct natural language processing models. There are different ways to convert text data into numerical vectors, which we discuss below.

This process requires significant computational power, typically supplied by the GPU.

Different types of Word Embeddings

Broadly, the word embeddings can be classified into the following two categories:

  • Frequency-based or Statistical Word Embedding

  • Prediction-based Word Embedding

There are six commonly used word embedding techniques: One-Hot Encoding, Count Vectorizer, Matrix Formulation, Bag-of-Words (BoW), N-grams Vectorization, and TF-IDF Vectorization.

1. One-Hot Encoding (OHE)

In this technique, each unique word in the vocabulary is represented by a vector that has the value 1 at that word's position and 0 at every other position. In simple words, a one-hot encoded vector contains only 1s and 0s, where 1 marks the position of the word and 0 appears everywhere else.

Let’s consider the following sentence:

Sentence: I am teaching NLP in Python

A word in this sentence may be “NLP”, “Python”, “teaching”, etc.

A dictionary is a list of all unique words in a sentence. So, a dictionary may look like –

Dictionary: ['I', 'am', 'teaching', 'NLP', 'in', 'Python']

Therefore, the vector representation in this format, according to the above dictionary, is

Vector for NLP: [0,0,0,1,0,0] 
Vector for Python:  [0,0,0,0,0,1]

This is just a straightforward method to represent a word in vector form.
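A minimal sketch of this representation in plain Python, using the dictionary above:

```python
dictionary = ["I", "am", "teaching", "NLP", "in", "Python"]

def one_hot(word):
    """Return a vector with 1 at the word's position in the dictionary and 0 elsewhere."""
    return [1 if token == word else 0 for token in dictionary]

print(one_hot("NLP"))     # [0, 0, 0, 1, 0, 0]
print(one_hot("Python"))  # [0, 0, 0, 0, 0, 1]
```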

Disadvantages of One-hot Encoding

1. One disadvantage of one-hot encoding is that the vector size equals the number of unique words in the vocabulary, so the vectors become very large and sparse for realistic corpora.

2. One-hot encoding does not capture the relationships between different words. Therefore, it does not convey information about the context.

2. Count Vectorizer

Count vectorizer is one of the simplest ways of doing text vectorization. It creates a document-term matrix, in which each column corresponds to a word and each cell records whether, and how often, that word appears in a given document.

The count vectorizer fits and learns the word vocabulary and builds a document-term matrix in which each cell holds the frequency of a word in a particular document, also known as the term frequency. One column is dedicated to each word in the corpus.
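A minimal sketch using scikit-learn's CountVectorizer (scikit-learn is an assumption here; the section does not name a specific library):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I am teaching NLP in Python",
    "Python is popular for NLP",
]

vectorizer = CountVectorizer()
# fit_transform() learns the vocabulary and builds the document-term matrix.
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # columns: one per word in the corpus
print(dtm.toarray())                       # cells: term frequency per document
```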

3. Matrix Formulation

Consider a corpus C containing D documents {d1, d2, ..., dD} from which we extract N unique tokens. The dictionary consists of these N tokens, and the Count Vector matrix M has size D × N. Each row of M records the frequencies of the N tokens in document d_i.
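As a small worked example (a hypothetical two-document corpus assumed for illustration): with D = 2 documents, d1 = "I am teaching NLP" and d2 = "NLP in Python", the dictionary contains the N = 6 tokens ['I', 'am', 'teaching', 'NLP', 'in', 'Python'], so M is a 2 × 6 matrix:

Row for d1: [1, 1, 1, 1, 0, 0]
Row for d2: [0, 0, 0, 1, 1, 1]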

4. Bag-of-Words (BoW)

This vectorization technique converts the text content into numerical feature vectors. Bag of Words takes each document in a corpus and converts it into a numeric vector by counting how often each vocabulary word occurs in that document, producing a feature vector for the machine learning model.
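A minimal sketch of this mapping using only the Python standard library (the toy corpus is assumed for illustration):

```python
from collections import Counter

corpus = ["I am teaching NLP in Python", "NLP in Python is fun"]

# Build the dictionary of unique tokens across the corpus.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def bag_of_words(document):
    """Map a document to a feature vector of raw word counts."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc) for doc in corpus]
# Each row is one document; each column is one word in the vocabulary.
print(vocabulary)
print(vectors)
```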

5. N-grams Vectorization

As in the count vectorization technique, the N-gram method generates a document-term matrix in which each cell holds a count. Here, however, the columns represent all sequences of n adjacent words (n-grams); count vectorization is the particular case with n = 1. N-grams consider sequences of n words in the text, where n can be 1, 2, 3, and so on (1-gram, 2-gram, etc.). Unlike BoW, this approach preserves local word order.
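A short sketch of N-gram vectorization, again assuming scikit-learn's CountVectorizer, which exposes the n-gram length through its ngram_range parameter:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am teaching NLP in Python"]

# ngram_range=(2, 2) builds 2-grams (adjacent word pairs);
# ngram_range=(1, 1) reduces to plain count vectorization.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
dtm = bigram_vectorizer.fit_transform(corpus)

print(bigram_vectorizer.get_feature_names_out())
# e.g. ['am teaching', 'in python', 'nlp in', 'teaching nlp']: word order is kept inside each n-gram
print(dtm.toarray())
```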

6. TF-IDF Vectorization

As discussed in the above techniques, the BOW method is simple and works well, but it treats all words equally. As a result, it cannot distinguish prevalent words from rare words. So, to solve this problem, TF-IDF comes into the picture!

Term frequency-inverse document frequency ( TF-IDF) is a measure that considers the importance of a word depending on how frequently it occurs in a document and a corpus.
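A brief sketch of TF-IDF vectorization, assuming scikit-learn's TfidfVectorizer and a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I am teaching NLP in Python",
    "Python is popular for NLP",
    "NLP models need numerical vectors",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Words that appear in every document (e.g. "nlp") receive a lower IDF weight
# than words that are rare across the corpus.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```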

The Computation Node in the Network

In the depicted network (see Fig.1), the computation nodes are highlighted as blue rectangles. These nodes accept the validated dataset from the validation node and transform it into a vector format, preparing it for training.

To produce the vectors, the computation node performs the following steps. Take the processing of a Tweet as an example:

Feature Extraction: In this step, the Bag of Words (BoW) model is used to extract features from Twitter feeds. In this model, each document sample is represented by associating each word with its observed frequency, and the order of the words in the text is treated as irrelevant. After obtaining the BoW features, stopword removal, stemming, and text normalization are applied using the Zemberek and Lucene approach.
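A minimal sketch of this feature-extraction step; the hard-coded stopword list and the naive suffix-stripping stemmer below are simplified stand-ins for the Zemberek and Lucene processing mentioned above:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "in", "on", "of", "to", "and"}

def naive_stem(word):
    """Very rough stemmer used only for illustration (strips a few common suffixes)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_bow_features(tweet):
    """Lowercase, drop stopwords, stem, and count word frequencies (word order is ignored)."""
    tokens = tweet.lower().split()
    tokens = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(tokens)

print(extract_bow_features("Training NLP models on the validated tweets"))
```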

Representation and Term Weighting: Each Tweet in the TTF dataset is represented as a feature vector using the VSM (Vector Space Model). In the VSM, each feature is expressed numerically as its term frequency or weight value. Term weighting is therefore an important step, assigning a weight that indicates the importance of each feature in the sample vector. In this step, the TF-IDF (Term Frequency-Inverse Document Frequency) method, the most commonly used weighting scheme in text classification, is applied. TF-IDF weights a feature based on its term frequency and inverse document frequency factors: the more tweets a term appears in, the less discriminative it is and the smaller its weight. TF-IDF can be formulated as:

w_ij = tf_ij × log(N / n_j)

where tf_ij, N, and n_j represent the raw frequency of term j in tweet i, the total number of tweets in the dataset, and the number of tweets in which term j appears, respectively.

GPU Parallel Computing Architecture and CUDA Programming Model:

CUDA is a parallel computing architecture that enables significant increases in computing performance by harnessing the power of NVIDIA GPUs. With the development of the TESLA GPU architecture, NVIDIA showed that a GPU can be programmed much like a general-purpose processor; previously, GPUs had been used mainly for graphics applications. Research on the CUDA architecture has shown that the GPU's performance is far higher than the CPU's for suitable workloads. This difference arises because the GPU architecture was developed for operations that require highly parallel calculation and high arithmetic intensity, such as graphics processing. In contrast to workloads dominated by flow control, as on the CPU, the GPU targets parallel applications such as data processing, image processing, 3D rendering, and signal processing, which consist of repetitive and arithmetically intense operations.

As shown in Fig. 2, the GPU devotes most of its transistors to data processing rather than to the cache memory and flow-control mechanisms found in the CPU. The large performance gap between many-core GPUs and general-purpose multi-core CPUs arises from this basic design difference. Compared to the CPU, the GPU has more ALUs (see Fig. 2) and fewer supporting components (e.g., cache memory and flow control), which is essential for obtaining the high arithmetic throughput and capacity required for parallel arithmetic operations. The GPU also allows the same operation to be carried out on different data elements, in line with SIMD (Single Instruction, Multiple Data) programming, and its small cache memory helps control the bandwidth required by the application, so threads accessing data in the same memory area do not need to return to DRAM (Dynamic Random Access Memory). As a result, any application with parallelizable functions runs much faster on the GPU.

CUDA is practical thanks to its simple thread-management structure and the shared memory it exposes on the GPU. In addition, because CUDA is a C-based parallel programming language, applications written in C for the CPU can be executed with many threads on the graphics processor. In this respect, CUDA programming with NVIDIA GPUs also provides an adequate API for non-graphical applications.
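As an illustration of the SIMD style described above, the sketch below launches one GPU thread per array element. Numba's CUDA support is an assumption here; the section itself does not prescribe a specific GPU programming interface:

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale_kernel(data, factor, out):
    # Every thread executes the same instruction on a different data element (SIMD style).
    i = cuda.grid(1)
    if i < data.size:
        out[i] = data[i] * factor

data = np.arange(1_000_000, dtype=np.float32)
out = np.zeros_like(data)

threads_per_block = 256
blocks = (data.size + threads_per_block - 1) // threads_per_block
scale_kernel[blocks, threads_per_block](data, 2.0, out)
```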

Parallel Implementation: The k-NN algorithm is implemented in a parallel fashion. The main reasons behind our selection are as follows:

• k-NN is generally one of the slowest machine learning algorithms in terms of computational complexity, even though it often outperforms other well-known classifiers (e.g., Decision Tree, Naïve Bayes) in terms of accuracy, especially in text categorization,

• Compared to the other classification algorithms, this method is more suitable for running in parallel mode, as searching for nearest neighbors can be calculated independently. This task is the most time-consuming part of the k-NN.

Implementing other classifiers is also possible, but each algorithm has its own calculation strategy, and not all algorithms have the same potential for parallelization. For instance, compared to k-NN, running the Naïve Bayes algorithm in parallel may not provide a considerable performance gain over its serial execution. Therefore, in this study, we selected the k-NN algorithm, a lazy learner that requires intense calculation, especially in text categorization. As distance calculation is the most time-consuming part of k-NN classification, we performed this calculation in parallel using our distance-calculation kernel.
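A sketch of such a distance-calculation kernel, again assuming Numba CUDA: each thread independently computes the Euclidean distance between the test sample vector and one training vector.

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def distance_kernel(train_vectors, test_vector, distances):
    # One thread per training sample: the distances can be computed independently.
    i = cuda.grid(1)
    if i < train_vectors.shape[0]:
        acc = 0.0
        for j in range(test_vector.size):
            diff = train_vectors[i, j] - test_vector[j]
            acc += diff * diff
        distances[i] = math.sqrt(acc)

train_vectors = np.random.rand(10_000, 300).astype(np.float32)  # e.g. TF-IDF vectors of training tweets
test_vector = np.random.rand(300).astype(np.float32)
distances = np.zeros(train_vectors.shape[0], dtype=np.float32)

threads = 256
blocks = (train_vectors.shape[0] + threads - 1) // threads
distance_kernel[blocks, threads](train_vectors, test_vector, distances)
```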

K-Nearest Neighbor Classifier: k-NN is a supervised learning algorithm based on the distance (or similarity) between samples. The classification process of this method is time-consuming, and estimating the most suitable k value is hard. The k-NN classifier finds the k closest neighbors using a distance function and uses the category weights of these neighbors to assign a category to the test sample. The distance (e.g., Euclidean, Manhattan, Minkowski) or similarity function may differ, but Euclidean distance is generally used to determine the closest neighbors. Let d0, dj ∈ k-NN(d0), and C represent the test sample, its k closest neighbors, and the category set, respectively. Then a category is assigned to the test sample using the category weights of its k closest neighbors as in Eq. (2) and (3):
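A minimal NumPy sketch of this category-assignment step (similarity-weighted voting over the k closest neighbors); it illustrates the idea behind the cited equations rather than reproducing them, and the toy data is assumed for illustration:

```python
import numpy as np

def knn_predict(distances, train_labels, k=5):
    """Assign the category whose k nearest neighbors carry the largest total weight."""
    nearest = np.argsort(distances)[:k]            # indices of the k closest training samples
    weights = 1.0 / (distances[nearest] + 1e-9)    # closer neighbors contribute larger weights
    scores = {}
    for idx, weight in zip(nearest, weights):
        label = int(train_labels[idx])
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

# Toy data: Euclidean distances from one test tweet to 1,000 training tweets.
rng = np.random.default_rng(0)
distances = rng.random(1_000)
train_labels = rng.integers(0, 3, size=1_000)      # three candidate categories
print(knn_predict(distances, train_labels, k=5))
```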

In the vectorization process, the previously validated dataset is transformed into vector form via the computational methods outlined above. This phase demands substantial computational resources, with Graphics Processing Units (GPUs) serving as the foundational hardware for sustaining the continuous transformation of data into vectors.

As delineated in the preceding chapter, computation nodes are designated to execute these vectorization tasks, thereby ensuring the integrity and fidelity of the dataset conversion process. It is important to note that computation nodes operate on a permissioned basis: only nodes meeting the prescribed hardware criteria are authorized to function as computation nodes. This prerequisite guarantees that the computational demands are met with the requisite performance and efficiency.
