Data Mining and Text Mining

This course covers principles, concepts, and methods in the fields of data mining and knowledge discovery. Algorithm development, current tools, and real-world applications are explored. Topics include data visualization, exploration, dimensionality reduction, clustering, association rule mining, anomaly detection, sentiment analysis, and topic modeling, among others.

Final Projects

Sample final project topics from previous years include:

  • Text mining for analysis of topics discussed in social media platforms

  • Finding patterns in performance of recent winning sports teams

  • Clustering and recommendation algorithms for video streaming services

  • Analysis of purchasing patterns for retail customers

  • Sentiment analysis of lyrics from top songs in recent years

  • Characterization of street network spatial features

  • Clustering of traffic crashes and their relationship with inclement weather

Sample Problems

Some sample problems are listed below.

Data Exploration


Using tools from the tidyverse, answer the following questions:

(a) From the list of flights that left NYC in 2013 (using the flights data frame), find those with a departure delay larger than 60 minutes.

(b) How would you find the flights with a scheduled departure time later than 10:30 PM?

(c) Find all flights that were operated by United, American, or Delta.
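
One possible way to express the three filters with dplyr is sketched below. It assumes the nycflights13 column names dep_delay (minutes), sched_dep_time (local time stored as an HHMM integer, so 10:30 PM is 2230), and carrier (two-letter codes such as UA, AA, DL). The small data frame flights_demo is a hypothetical stand-in with made-up rows, used only so the snippet runs without the nycflights13 package; with the real data, replace it with flights.

```r
library(dplyr)

# Hypothetical stand-in for the flights data frame (same column names, made-up rows)
flights_demo <- data.frame(
  carrier        = c("UA", "AA", "DL", "B6", "UA"),
  dep_delay      = c(5, 75, 61, 120, 60),
  sched_dep_time = c(600, 2245, 2230, 2359, 1015)
)

# (a) departure delay larger than 60 minutes
delayed <- filter(flights_demo, dep_delay > 60)

# (b) scheduled departure later than 10:30 PM (HHMM integer above 2230)
late_night <- filter(flights_demo, sched_dep_time > 2230)

# (c) operated by United, American, or Delta
big_three <- filter(flights_demo, carrier %in% c("UA", "AA", "DL"))
```

Using %in% in part (c) avoids chaining three equality tests with the | operator.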

Data Preparation


In this problem you will be using tools from the caret package (short for Classification And REgression Training).

(a) Consider the data from the mtcars data frame (a built-in dataset in R). Find summary statistics for each attribute.

(b) Using tools from the caret package, perform the following transformations to the mtcars data frame: normalization, standardization, and centering.

Comment on your results. Verify that in the standardization process, the transformed variables have unit standard deviation (Hint: the sd() function in R computes the standard deviation). Choose 3 variables to confirm this.
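
A minimal sketch of one way to approach this problem with caret's preProcess() is shown below. It rests on an interpretation that the problem does not spell out: "normalization" is taken to mean min-max rescaling to [0, 1] (method "range"), and "standardization" is taken to mean centering followed by scaling to unit standard deviation. The choice of mpg, hp, and wt for the check is arbitrary.

```r
library(caret)

# (a) summary statistics for every attribute
summary(mtcars)

# (b) estimate each transformation, then apply it with predict()
pp_range  <- preProcess(mtcars, method = "range")               # rescale to [0, 1]
pp_std    <- preProcess(mtcars, method = c("center", "scale"))  # mean 0, sd 1
pp_center <- preProcess(mtcars, method = "center")              # mean 0 only

mtcars_range  <- predict(pp_range, mtcars)
mtcars_std    <- predict(pp_std, mtcars)
mtcars_center <- predict(pp_center, mtcars)

# check: three standardized variables should have unit standard deviation
sd_check <- sapply(mtcars_std[, c("mpg", "hp", "wt")], sd)
print(sd_check)
```

preProcess() only estimates the transformation parameters; the data are transformed when predict() is called on the resulting object.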

Matrix Factorization

The chunk of code below defines a function in R to create a Hilbert matrix.

# create Hilbert matrix of size n
hilbert <- function(n) {
  i <- 1:n
  1 / outer(i - 1, i, "+")
}

(a) Build a Hilbert matrix of size 7, and call it hil_seven.

(b) Select columns 1 through 4 of hil_seven, and call the result X.

(c) Compute the singular value decomposition of X using the svd() function.

(d) Print the singular values of X.

(e) Verify that the product $U^T U$, where $U$ is the matrix of left singular vectors, returns the identity matrix (a square matrix with ones on the main diagonal and zeros everywhere else). Recall that the t() function in R returns the transpose of a matrix.
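
The steps (a) through (e) can be sketched in base R as follows. The hilbert() function is repeated, with its closing brace, so that the snippet is self-contained; the names X_svd and UtU are illustrative and not part of the problem statement.

```r
# Hilbert matrix of size n: entry (i, j) is 1 / (i + j - 1)
hilbert <- function(n) {
  i <- 1:n
  1 / outer(i - 1, i, "+")
}

hil_seven <- hilbert(7)   # (a) 7 x 7 Hilbert matrix
X <- hil_seven[, 1:4]     # (b) first four columns, a 7 x 4 matrix
X_svd <- svd(X)           # (c) list with components d, u, and v
print(X_svd$d)            # (d) singular values, largest first

# (e) t(U) %*% U should equal the 4 x 4 identity up to rounding error
UtU <- t(X_svd$u) %*% X_svd$u
print(all.equal(UtU, diag(4)))
```

Comparing with all.equal() rather than == allows for floating-point rounding in the off-diagonal entries.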

Dimensionality Reduction



(a) Read the Toyota Corollas dataset on sales during mid 2004. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications.

toyota <- read_csv("")

(b) Create a new data frame ignoring the categorical variables. Perform Principal Component Analysis (PCA), and comment on your results.

(c) Produce at least one data visualization to explain the results of the PCA.


Consider the `USArrests` dataset, which contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

(a) Perform PCA on the four numerical variables in this dataset.

(b) Create a bi-plot (in the PC1-PC2 coordinate system) and explain your results. According to your findings: Do Florida, Nevada, California, and Michigan have anything in common? How about New Hampshire, Maine, and North Dakota?
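
A minimal base R sketch for this problem is given below, using prcomp() and biplot(). Standardizing the variables with scale. = TRUE is an assumption made here because the four variables are measured on different scales; the problem statement does not prescribe it. The object name arrests_pca is illustrative.

```r
# (a) PCA on the four numerical variables; scale. = TRUE standardizes them first
arrests_pca <- prcomp(USArrests, scale. = TRUE)
summary(arrests_pca)   # proportion of variance explained by each component

# (b) biplot of state scores and variable loadings in the PC1-PC2 plane
biplot(arrests_pca, choices = 1:2, cex = 0.6)

# PC1 and PC2 scores of the states named in the question, for comparison
states <- c("Florida", "Nevada", "California", "Michigan",
            "New Hampshire", "Maine", "North Dakota")
state_scores <- arrests_pca$x[states, 1:2]
print(state_scores)
```

The sign of each principal component is arbitrary, so the interpretation should rest on the relative positions of the states and the directions of the loading arrows, not on whether a score is positive or negative.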


In this problem you will look at the statistics for 30 NHL teams compiled from and .

(a) Read the data

# read hockey data
tr_link <- ""
hockey <- read_csv(tr_link)

Create summary statistics for the numerical variables, including the mean, median, and standard deviation.

(b) Create a correlation matrix among the different numerical variables. Comment on your results.

(c) Perform PCA on the numerical variables. How many principal components are needed to capture about 70% of the variance? Comment on your results.

(d) Create a biplot showing the different data points and the loadings for each original feature in the PC1-PC2 space.

Association Rules

Apriori 1


The Institute for Statistics Education offers online courses in statistics and analytics, and would like information that will help in sequencing and packaging of courses.

courses_df <- read_csv("")

Each row represents the courses attended by a single customer. The firm wishes to evaluate alternative course offerings. Use association rules to analyze these data, and interpret several of the resulting rules.

(a) Create an item frequency plot.

(b) Set support to 0.01 and confidence to 0.1. Use the apriori() function to perform association rules mining with the apriori algorithm. Present the top 5 rules sorted by lift, and comment on your results.

(c) Create a plot of confidence vs support with the new set of rules.
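
The workflow in (a) through (c) might look like the sketch below, written with the arules package. Because the layout of the course data file is not shown here, the snippet uses a small hypothetical list of course baskets (the course names and baskets are made up); with the real data, courses_df would first have to be converted to a transactions object. The minlen = 2 setting is an added assumption that drops rules with an empty left-hand side.

```r
library(arules)

# Hypothetical course baskets, one per customer (not the real data)
baskets <- list(
  c("Intro", "Regression", "DataMining"),
  c("Intro", "Regression"),
  c("Intro", "DataMining", "Forecast"),
  c("Regression", "Forecast"),
  c("Intro", "Regression", "Forecast"),
  c("DataMining", "Forecast")
)
trans <- as(baskets, "transactions")

# (a) item frequency plot
itemFrequencyPlot(trans)

# (b) mine rules with the apriori algorithm and show the top 5 by lift
rules <- apriori(trans,
                 parameter = list(support = 0.01, confidence = 0.1, minlen = 2))
inspect(head(sort(rules, by = "lift"), 5))

# (c) confidence vs support for the mined rules
rule_quality <- quality(rules)
plot(rule_quality$support, rule_quality$confidence,
     xlab = "support", ylab = "confidence")
```

A lift above 1 indicates that the courses on the two sides of a rule are taken together more often than would be expected if they were chosen independently.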

Apriori 2

lastfm <- read_csv("")

(a) Find the 10 most popular artists based on how often they appear in the dataset (regardless of country).

(b) How many different users and artists are in this dataset?

(c) Think of users as shoppers and artists as items bought. Create an item frequency plot using itemFrequencyPlot() with the option support = 0.08.

(d) Use the apriori algorithm with support = 0.01 and confidence = 0.5. Inspect the rules, and list the top 10 rules by lift.

(e) Use apriori() again to generate rules where RHS is "death cab for cutie". Use support = 0.01 and confidence = 0.2. Inspect the top 6 rules (by lift) and comment on your results.

(f) Create a network visualization of the set of rules created for Death Cab For Cutie.


Clustering

In this problem we use corporate data on twenty-two public utilities in the US. We are interested in forming groups of similar utilities. The clustering will be based on eight measurements on each utility, including cost per kilowatt capacity in place, annual load factor, and others.

utilities_df <- read_csv("")

(a) Provide an example (field, business problem, policy, etc.) of how the clustering of utilities may be useful.

(b) Create a data visualization that shows a relevant characteristic of this dataset (e.g., the relationship between cost and sales, or the correlation between variables).

(c) Use the k-means algorithm to find $k = 6$ groups in this dataset. List the companies that end up in each group. Can you identify any features that characterize some of the clusters?