Clean text with nimbletext

3/21/2023

This article was published as a part of the Data Science Blogathon.

Introduction

This is the first post of the NLP tutorial series. This guide will let you understand, step by step, how to work with text data: clean it, create new features using state-of-the-art methods, and then make predictions or other types of analysis.

When you start to deal with huge amounts of text, like tweets, there is still work to do before applying any model. The initial text contains many noisy and redundant words. The goal is to obtain only the most significant words from the dataset of text documents. Text pre-processing is the most critical phase for cleaning and preparing text data for applications like topic modeling, text classification, and sentiment analysis.

To pre-process the text, there are some operations to apply. We need to remove non-informative features, like punctuation, common words like “the” and “a”, numbers, and so on.

In this post, I want to focus on this task, because I remember the effort I put into my Data Science project for my master’s degree. At the time, I had no previous knowledge of Machine Learning or Deep Learning. Only after much research on the web and comparisons with classmates was I able to solve the problem.

Let’s import the libraries and the dataset from Kaggle, which contains tweets collected using the Twitter API from July 2020. A filter was applied: only the tweets with #covid19 were extracted.

    import numpy as np
    import pandas as pd

    df = pd.read_csv('./input/covid19-tweets/covid19_tweets.csv')

We can easily see the fields of the dataset by printing its information. Most of the columns won’t be used in this post. We are only interested in working with the text field.
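The cleaning operations mentioned in this post (removing punctuation, numbers, and common words like “the” and “a”) can be sketched with the Python standard library alone. This is only a minimal illustration, not necessarily the method used later in the series: the function name clean_text, the tiny STOP_WORDS set, and the sample tweets are all assumptions made up for the example.

```python
import re
import string

# Small illustrative stop-word list (an assumption for this sketch; real
# projects usually take a fuller list from an NLP library).
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "of", "to", "for", "on"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Lower-case the text and drop URLs, standalone numbers,
    punctuation, and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"\b\d+\b", " ", text)       # remove standalone numbers, keeps "covid19"
    text = text.translate(PUNCT_TABLE)         # remove punctuation such as "#" and "!"
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


# Hypothetical tweets, invented for the example (not taken from the dataset).
sample_tweets = [
    "The number of cases rose to 1500 in July! #covid19 https://example.com/news",
    "Wearing a mask is the best way to stay safe. #COVID19",
]
cleaned = [clean_text(tweet) for tweet in sample_tweets]
print(cleaned)
```

With a DataFrame like the one loaded from the Kaggle file, the same function could be applied to the text field, for example with df['text'].apply(clean_text), assuming the column is named text.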
