Understanding Text Data

Text Mining

Text Mining with R

Data analytics is becoming increasingly important for organisations across all industries worldwide. Two developments drive this trend. First, more and more high-quality data is available to describe business activities, consumer behaviour and workforce matters. Second, powerful hardware and software is available to handle, analyse and store huge amounts of data.

R is one of the popular analysis environments. Data analytics with R and RStudio is an easy way to avoid the high cost of powerful analysis tools. In the following, we offer some examples of data analytics with R.

Understanding Text Data for Text Mining

Text mining, also known as text analytics, is the process of extracting meaningful insights and patterns from unstructured textual data. By applying techniques such as natural language processing (NLP), sentiment analysis, and topic modelling, text mining transforms raw text into structured data for analysis.

It is widely used in applications like customer feedback analysis, spam detection, social media monitoring, and document classification. This powerful approach helps uncover hidden trends, relationships, and actionable insights, enabling businesses and researchers to make data-driven decisions.

This case shows a comprehensive example of Text Analytics with R.

Get Data and Add ID

Purpose: To ensure all data is ready for analysis without missing or incompatible entries.

Explanation: This phase gathers the data and loads it into R for processing. The dataset, which contains text data such as customer comments or feedback, is retrieved from a URL and imported into the R environment using read.csv(). Encoding issues are handled to ensure compatibility. The original ID number is replaced by a new ID number in date sequence, and the amended data file is saved.
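The loading and re-numbering step could look roughly like the following sketch. The column names (ID, Date, Business, Type, Comment) and the sample rows are assumptions made for illustration; in practice the first argument of read.csv() would be the URL of the actual file.

```r
# Hypothetical sample standing in for the downloaded file. With real data,
# use read.csv("<URL of the CSV file>", fileEncoding = "UTF-8") instead.
csv_text <- "ID,Date,Business,Type,Comment
17,2023-03-02,Retail,Complaint,Card arrived late
4,2023-01-15,Online,Compliment,Great service
9,2023-02-20,Retail,Suggestion,Please add more card designs"

comments <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Sort the rows by date and replace the original ID with a running number
comments <- comments[order(as.Date(comments$Date)), ]
comments$ID <- seq_len(nrow(comments))
rownames(comments) <- NULL

# Save the amended data file (a temporary file is used in this sketch)
out_file <- tempfile(fileext = ".csv")
write.csv(comments, out_file, row.names = FALSE)
```

After this step the oldest comment carries ID 1 and the IDs increase with the date.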

Clean and Preprocess Data

Purpose: To prepare the text data for analysis by removing noise and ensuring that only meaningful content is retained.

Explanation: The text is encoded in UTF-8 to ensure compatibility with all tools and converted to lower case for consistency. Non-ASCII characters, punctuation and numbers are removed. Date columns are converted into R date format, and a month column is created.

Stopwords (common words like ‘the’, ‘and’) and names are removed. All comments are split into individual words, i.e., tokens. Comments marked as “no information” or containing NA values are excluded.
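A minimal base R sketch of these cleaning steps is shown below. The data frame, its column names and the list of names to remove are invented for illustration; stopword removal is left to the tokenising step.

```r
# Hypothetical comments; column names are assumptions
comments <- data.frame(
  ID = 1:4,
  Date = c("2023-01-15", "2023-02-20", "2023-03-02", "2023-03-05"),
  Comment = c("Great service, thanks Anna!", "No information",
              "Card No. 12 arrived late \u2026", NA),
  stringsAsFactors = FALSE
)

# Exclude comments that are missing or marked as "no information"
keep <- !is.na(comments$Comment) &
  tolower(trimws(comments$Comment)) != "no information"
comments <- comments[keep, ]

# Convert the date column to R date format and derive a month column
comments$Date <- as.Date(comments$Date)
comments$Month <- format(comments$Date, "%Y-%m")

# Clean the text: UTF-8, drop non-ASCII, lower case, no punctuation or digits
clean <- enc2utf8(comments$Comment)
clean <- iconv(clean, from = "UTF-8", to = "ASCII", sub = "")
clean <- tolower(clean)
clean <- gsub("[[:punct:]]+", " ", clean)
clean <- gsub("[[:digit:]]+", " ", clean)

# Remove names (hypothetical list) as whole words
names_to_remove <- c("anna")
name_pattern <- paste0("\\b(", paste(names_to_remove, collapse = "|"), ")\\b")
clean <- gsub(name_pattern, " ", clean, perl = TRUE)

# Collapse repeated spaces and trim
comments$Comment <- trimws(gsub("[[:space:]]+", " ", clean))
```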

Tokenise Data for Text Mining

Purpose: To enable word-level analysis, which is crucial for frequency analysis, sentiment analysis, and topic modelling.

Explanation: Text data were split into individual words (tokens) using unnest_tokens().

This step transforms unstructured text into structured data, making it easier to analyse.
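As a rough sketch, tokenising with unnest_tokens() from the tidytext package could look like this. The two comments and the column names are assumptions; stop_words is the stopword table that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical comments; column names are assumptions
comments <- tibble(
  ID = 1:2,
  Comment = c("the card arrived late", "great service and friendly staff")
)

tokens <- comments %>%
  unnest_tokens(word, Comment) %>%     # one row per word, lower-cased
  anti_join(stop_words, by = "word")   # drop common English stopwords
```

The result has one row per remaining word and keeps the ID column, so each token can still be traced back to its comment.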

This dataset includes:

Sentiment Analysis by Business, Month and Type Using Bing Lexicon

Purpose: To assess the emotional tone of the text data, identifying areas with strong positive or negative sentiments.

Explanation: Sentiment lexicons like Bing or NRC are used to classify words as positive, negative, or associated with specific emotions.

Sentiment scores are calculated and visualised over time and by business type.

The counts of positive and negative sentiments by business and month are shown in bar charts.
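The following sketch shows one way to score tokens with the Bing lexicon and prepare a bar chart. The token table and its column names are invented for illustration; get_sentiments("bing") is the lexicon lookup provided by tidytext.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

# Hypothetical tokens; column names are assumptions
tokens <- tibble(
  Business = c("Retail", "Retail", "Online", "Online", "Online"),
  Month    = c("2023-01", "2023-01", "2023-01", "2023-01", "2023-02"),
  word     = c("great", "poor", "excellent", "happy", "terrible")
)

# Match each word with its Bing label and count by group
sentiment_counts <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(Business, Month, sentiment)

# Net sentiment = positive minus negative words per business and month
sentiment_net <- sentiment_counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

# Bar chart of positive and negative counts, one panel per business
p <- ggplot(sentiment_counts, aes(x = Month, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  facet_wrap(~ Business)
```

Printing p draws the chart; sentiment_net holds the same information as one row per business and month.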

Pareto Chart of Comment Types

The number of comments is stratified by comment types and shown in a Pareto chart.
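A Pareto chart sorts the categories by frequency and adds the cumulative share. A minimal base R sketch, using invented comment types, might look like this:

```r
# Hypothetical comment types
types <- c(rep("Complaint", 5), rep("Suggestion", 3), rep("Compliment", 2))

# Count per type, largest first, with cumulative percentage
counts <- sort(table(types), decreasing = TRUE)
pareto <- data.frame(
  Type = names(counts),
  Count = as.integer(counts),
  CumulativePercent = 100 * cumsum(as.integer(counts)) / sum(counts)
)

# Bars for the counts, line for the cumulative count, percent axis on the right
total <- sum(pareto$Count)
bar_mid <- barplot(pareto$Count, names.arg = pareto$Type,
                   ylim = c(0, total), ylab = "Number of comments")
lines(bar_mid, cumsum(pareto$Count), type = "b")
axis(4, at = seq(0, total, length.out = 5),
     labels = paste0(seq(0, 100, by = 25), "%"))
```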

Top Words by Business and Type

Purpose: To understand the key terms or themes in the dataset and their relevance across different categories.

Explanation: The most frequently occurring words were identified and visualised.

The most common words in comments are extracted by business and comment type and displayed in bar charts.
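Extracting the top words per group can be sketched with dplyr as follows. The token table and column names are assumptions; here the two most frequent words per business and comment type are kept.

```r
library(dplyr)

# Hypothetical tokens; column names are assumptions
tokens <- tibble(
  Business = c(rep("Retail", 6), rep("Online", 3)),
  Type     = c(rep("Complaint", 6), rep("Suggestion", 3)),
  word     = c("card", "card", "late", "card", "late", "price",
               "design", "design", "delivery")
)

top_words <- tokens %>%
  count(Business, Type, word, sort = TRUE) %>%   # frequency per group and word
  group_by(Business, Type) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%     # keep the two most frequent
  ungroup()
```

The resulting table can be passed to a bar chart, for example with ggplot2 and one panel per business and type.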

Topic Modelling Using LDA for Text Mining

Purpose: To uncover the underlying themes or categories within the dataset, providing a high-level view of the content.

Explanation: Latent Dirichlet Allocation (LDA) is applied to discover hidden topics within the text. The most frequent terms within each topic are extracted to assign meaningful labels to the topics.

First, the number of topics needs to be decided based on the indicators by Arun (2010), CaoJuan (2009), Deveaud (2014) and Griffiths (2004). The indicators by Arun and CaoJuan should be minimised, whereas the indicators by Deveaud and Griffiths should be maximised.

Shown is the abbreviated version using only CaoJuan and Deveaud.
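The metric names Griffiths2004 and CaoJuan2009 match those used by the ldatuning package, so the sketch below assumes that package. The toy documents, the candidate range of 2 to 5 topics and the seed are assumptions; the text is taken to be cleaned already.

```r
library(dplyr)
library(tidytext)
library(ldatuning)

# Hypothetical, already cleaned comments
docs <- tibble(
  doc_id = 1:8,
  text = c("card design outdated message card",
           "delivery late parcel delivery delayed",
           "staff friendly helpful staff service",
           "price expensive discount price voucher",
           "message card wording design change",
           "parcel delayed delivery courier late",
           "service helpful staff friendly counter",
           "voucher discount price expensive offer")
)

# Document-term matrix: one row per comment, one column per word
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)

# Fit LDA for each candidate number of topics and compute both indicators
tuning <- FindTopicsNumber(
  dtm,
  topics   = 2:5,
  metrics  = c("CaoJuan2009", "Deveaud2014"),
  method   = "Gibbs",
  control  = list(seed = 1234),
  mc.cores = 1L,
  verbose  = FALSE
)

# Plot the indicators: look for a low CaoJuan2009 and a high Deveaud2014
FindTopicsNumber_plot(tuning)
```

On a corpus this small the indicators are not meaningful; the sketch only shows the mechanics.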

Topic Modelling by Business, Month, and Type Using LDA

Purpose: To uncover the underlying themes or categories within the dataset, providing a high-level view of the content.

Latent Dirichlet Allocation (LDA) is conducted with the number of topics derived in the previous step.

For practical reasons, four topics are chosen for further analysis, based on the CaoJuan and Deveaud indicators.
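Fitting the model with four topics could be sketched as follows, assuming the topicmodels package for LDA() and tidytext for tidying the result. The toy documents and the seed are assumptions.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical, already cleaned comments
docs <- tibble(
  doc_id = 1:8,
  text = c("card design outdated message card",
           "delivery late parcel delivery delayed",
           "staff friendly helpful staff service",
           "price expensive discount price voucher",
           "message card wording design change",
           "parcel delayed delivery courier late",
           "service helpful staff friendly counter",
           "voucher discount price expensive offer")
)

dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)

# Fit LDA with four topics; the seed makes the run reproducible
lda_model <- LDA(dtm, k = 4, control = list(seed = 1234))

# beta = per-topic word probabilities; keep the three top terms per topic
top_terms <- tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()
```

The top terms per topic are the basis for giving each topic a label.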

Result of Topic Modelling Using LDA

Purpose: To make the results accessible and actionable for stakeholders.

Explanation: Various visualisations were created, such as bar charts, word clouds, and topic distributions, to present findings in an easily interpretable format.

Each visualisation was tailored to highlight specific insights, such as topic importance or sentiment trends.

Shown here is the Word Cloud for Topic 4.

Topic 4 is about message cards that might need to be changed.

Visualise Topic Distribution within Documents

Purpose: Identify the distribution of topics in documents, i.e., comments.

Explanation: Each comment contributes to the topic modelling results. It can therefore be of interest to know which comments, for which Business and Type and in which Month, contributed to the generated topics.

Shown here is the analysis of topics represented in randomly selected comments 55, 62, 1111 and 2224.
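The per-comment topic shares can be read from the gamma matrix of a fitted model. The sketch below assumes the topicmodels and tidytext packages; the toy documents, the metadata and the selected comments 2 and 5 are invented for illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)
library(ggplot2)

# Hypothetical, already cleaned comments with metadata
docs <- tibble(
  doc_id = 1:8,
  Business = rep(c("Retail", "Online"), 4),
  text = c("card design outdated message card",
           "delivery late parcel delivery delayed",
           "staff friendly helpful staff service",
           "price expensive discount price voucher",
           "message card wording design change",
           "parcel delayed delivery courier late",
           "service helpful staff friendly counter",
           "voucher discount price expensive offer")
)

dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)

lda_model <- LDA(dtm, k = 4, control = list(seed = 1234))

# gamma = share of each topic within each document; add the metadata back
meta <- docs %>%
  transmute(document = as.character(doc_id), Business)
doc_topics <- tidy(lda_model, matrix = "gamma") %>%
  left_join(meta, by = "document")

# Topic shares for a few selected comments
selected <- doc_topics %>% filter(document %in% c("2", "5"))
p <- ggplot(selected, aes(x = factor(topic), y = gamma)) +
  geom_col() +
  facet_wrap(~ document) +
  labs(x = "Topic", y = "Share of topic in comment")
```

Within each comment the gamma values sum to one, so each panel shows how that comment is split across the four topics.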

Emotion Analysis by Business, Month, and Type

Purpose: Emotion analysis in text mining serves to uncover the emotional tone and sentiment expressed in customer comments or complaints. Its primary purpose is to gain deeper insights into how customers feel about products, services, or experiences, going beyond simple positive or negative sentiment.

Explanation: The NRC lexicon assigns words to specific emotions such as anger, joy, sadness, and more, along with general positive or negative sentiment.

This step ensures that the emotions are defined for words in the tokenised text data.

Each word in the dataset is matched with its corresponding emotion. Words are aggregated by Business, Month, and Type. This allows the data to be reshaped, showing how often each emotion appears within different categories.

Assigning a distinct colour to each emotion enhances the readability of the visualisations. Grouping similar emotions (e.g., red shades for negative emotions, green shades for positive ones) could improve clarity.
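The matching and aggregation can be sketched as below. Because the NRC lexicon has to be downloaded separately (in tidytext via get_sentiments("nrc")), this sketch uses a tiny stand-in lexicon with the same two columns; its word and emotion pairs are made up for illustration and are not NRC entries. The token table, column names and colours are assumptions as well.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Stand-in for the NRC lexicon (made-up pairs, same columns: word, sentiment)
emotion_lexicon <- tibble(
  word      = c("late", "late", "great", "great", "broken", "thanks"),
  sentiment = c("anger", "negative", "joy", "positive", "sadness", "joy")
)

# Hypothetical tokens; column names are assumptions
tokens <- tibble(
  Business = c("Retail", "Retail", "Online", "Online", "Retail"),
  Month    = c("2023-01", "2023-01", "2023-01", "2023-01", "2023-02"),
  Type     = c("Complaint", "Complaint", "Compliment", "Compliment", "Complaint"),
  word     = c("late", "broken", "great", "thanks", "late")
)

# A word can carry several emotions, hence the many-to-many join
# (the relationship argument needs dplyr 1.1.0 or later)
emotion_counts <- tokens %>%
  inner_join(emotion_lexicon, by = "word", relationship = "many-to-many") %>%
  count(Business, Month, Type, sentiment)

# Reshape: one row per category, one column per emotion
emotion_wide <- emotion_counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)

# Red shades for negative emotions, green shades for positive ones
emotion_colours <- c(anger = "firebrick", negative = "tomato",
                     sadness = "darkred", joy = "forestgreen",
                     positive = "seagreen")
p <- ggplot(emotion_counts, aes(x = Month, y = n, fill = sentiment)) +
  geom_col() +
  facet_grid(Type ~ Business) +
  scale_fill_manual(values = emotion_colours)
```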

Conclusion

Data analytics with R and RStudio offers an easy way to avoid the high cost of powerful analysis tools. As shown above, we have performed a rather complex analysis with R, namely a series of text mining procedures. There are practically no limitations to your research when you choose this environment.

And whenever you have questions or are looking for a workshop to implement data analytics with R, talk to us (+65 6100 0263). Our partners, SUTD and SMU, will help to design a package that suits your needs.

15 May 2023
