Data Cleaning & Transformation with Dplyr in R
Last Updated :
24 Apr, 2025
In R, data formatting typically involves preparing and structuring your data in a way that is suitable for analysis or visualization. The exact steps for data formatting may vary depending on your specific dataset and the analysis you want to perform. Here are some common data formatting tasks in the R Programming Language.
Formatting data in R is an essential part of data preprocessing and analysis. Depending on our specific needs, we want to manipulate data types, change the structure of our data frame, or format variables. Here are some common tasks related to data formatting in R:
Import Data
Use functions like read.csv(), read.table(), read.xlsx() from packages like readr, readxl, or others to import our data into R. Ensure that our data is in a supported file format such as CSV, Excel, or a database.
Checking Data Structure
Use str() to check the structure of your data frame. This function displays the data type and structure of each column in your data frame.
str(your_data_frame)
Changing Data Types
Convert columns to appropriate data types using functions like as.numeric(), as.character(), as.Date(), etc. For example, to convert a column to a numeric type.
# Convert a numeric variable to character
df$numeric_var <- as.character(df$numeric_var)
# Convert a character variable to numeric
df$character_var <- as.numeric(df$character_var)
# Convert a variable to a factor (categorical)
df$factor_var <- as.factor(df$factor_var)
Handling Missing Values
Use functions like is.na(), complete.cases(), or na.omit() to identify and handle missing values (NA) appropriately. You can choose to remove rows with missing values or impute missing values with mean, median, or other strategies.
# Remove rows with missing values
your_data_frame <- your_data_frame[complete.cases(your_data_frame), ]
# Impute missing values with mean
your_data_frame$numeric_column[is.na(your_data_frame$numeric_column)] <- mean
(your_data_frame$numeric_column, na.rm = TRUE)
Renaming Columns
Rename columns using the colnames() function or by directly assigning new column names to the names() attribute of your data frame.
# Rename a column
colnames(your_data_frame)[colnames(your_data_frame) == "old_column_name"] <- "new_column_name"
Merging and Joining Data
Use functions like merge(), join(), or dplyr functions like left_join(), inner_join(), etc., to combine data from multiple sources based on common keys.
# Merge two data frames by a common key
merged_data <- merge(data_frame1, data_frame2, by = "common_key")
Formatting Dates and Times
Use functions like as.Date(), as.POSIXct(), or format() to format date and time columns as needed.
your_data_frame$date_column <- as.Date(your_data_frame$date_column, format = "%Y-%m-%d")
Aggregating Data
Use functions like aggregate() or dplyr functions like group_by() and summarize() to aggregate data based on specific variables.
library(dplyr)
aggregated_data <- your_data_frame %>%
group_by(category_column) %>%
summarize(mean_value = mean(numeric_column))
Remember that data formatting is often a crucial step in data analysis and can significantly impact the accuracy and success of your analyses and visualizations. The specific formatting steps you need to take depend on your dataset and analysis objectives.
Let's go through some common data formatting tasks
Selecting Columns
We can select specific columns of the dataset using the $ operator or the subset() function.
R
# load the required package
library(dplyr)
# Select specific columns using the $ operator
selected_data <- mtcars$mpg
# Select specific columns using subset()
selected_data <- subset(mtcars, select = c(mpg, hp))
head(selected_data)
Output:
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
Hornet 4 Drive 21.4 110
Hornet Sportabout 18.7 175
Valiant 18.1 105
Filtering Rows
We can filter rows based on certain conditions using the subset() or [ ] indexing.
R
# Filter rows where mpg is greater than 20
filtered_data <- subset(mtcars, mpg > 20)
# Alternatively, you can use [ ] indexing
filtered_data <- mtcars[mtcars$mpg > 20, ]
filtered_data
Output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Renaming Columns
We can rename columns using the colnames() function.
R
# Rename the "mpg" column to "MilesPerGallon"
colnames(mtcars)[colnames(mtcars) == "mpg"] <- "MilesPerGallon"
names(mtcars)
Output:
[1] "MilesPerGallon" "cyl" "disp" "hp" "drat"
[6] "wt" "qsec" "vs" "am" "gear"
[11] "carb"
Sorting Data
You can sort the data by one or more columns using the order() function.
R
# Sort the data by mpg in descending order
sorted_data <- mtcars[order(-mtcars$MilesPerGallon), ]
head(sorted_data)
Output:
MilesPerGallon cyl disp hp drat wt qsec vs am gear carb
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Creating New Variables
We can create new variables based on existing ones.
R
# Create a new variable "MilesPerGallonCategory" based on "mpg"
mtcars$MilesPerGallonCategory <- ifelse(mtcars$mpg > 20, "High", "Low")
mtcars$MilesPerGallonCategory
Output:
[1] "High" "High" "High" "High" "Low" "Low" "Low" "High" "High" "Low" "Low" "Low" "Low"
[14] "Low" "Low" "Low" "Low" "High" "High" "High" "High" "Low" "Low" "Low" "Low" "High"
[27] "High" "High" "Low" "Low" "Low" "High"
Aggregating Data
We can summarize data using aggregation functions like mean(), sum(), etc.
R
# Calculate the mean mpg for each number of cylinders
mean_mpg_by_cyl <- aggregate(mtcars$MilesPerGallon, by = list(mtcars$cyl),
FUN = mean)
colnames(mean_mpg_by_cyl) <- c("Cylinders", "MeanMPG")
mean_mpg_by_cyl
Output:
Cylinders MeanMPG
1 4 26.66364
2 6 19.74286
3 8 15.10000
Merging Data
We can merge datasets using functions like merge().
R
# Create a second dataset for merging
df2 <- data.frame(Car = rownames(mtcars), Price = runif(nrow(mtcars),
min = 10000, max = 50000))
# Merge the two datasets by the "Car" column
merged_data <- merge(mtcars, df2, by.x = "row.names", by.y = "Car")
head(merged_data)
Output:
Row.names mpg cyl disp hp drat wt qsec vs am gear carb
1 AMC Javelin 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2
2 Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
3 Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
4 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
5 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
6 Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2
MilesPerGallonCategory Price
1 Low 49797.46
2 Low 35230.94
3 Low 14315.48
4 Low 36384.85
5 High 27144.52
6 Low 10147.37
These are some basic data formatting and manipulation tasks we can perform in R using built-in functions and datasets like mtcars. Depending on your specific data analysis needs, you may also want to explore packages like dplyr and tidyr for more advanced data manipulation tasks.
Similar Reads
Loading and Cleaning Data with R and the tidyverse
The tidyverse is a collection of packages that work well together due to shared data representations and API design. The tidyverse package is intended to make it simple to install and load core tidyverse packages with a single command. To install tidyverse, put the following code in RStudio: [GFGTAB
9 min read
Time Series Data Transformation Using R
Time series data transformation is important to prepare the data for modeling and making informed and correct predictions. It involves several steps that prepare the data such that it doesn't give wrong predictions. R is a statistical programming language that is widely used by researchers and analy
7 min read
Sorting DataFrame in R using Dplyr
In this article, we will discuss about how to sort a dataframe in R programming language using Dplyr package. The package Dplyr in R programming language provides a function called arrange() function which is useful for sorting the dataframe. Syntax :Â arrange(.data, ...) The methods given below sho
3 min read
Data Manipulation in R with Dplyr Package
In this article let's discuss manipulating data in the R programming language. In order to manipulate the data, R provides a library called dplyr which consists of many built-in methods to manipulate the data. So to use the data manipulation function, first need to import the dplyr package using lib
5 min read
Filtering row which contains a certain string using Dplyr in R
In this article, we will learn how to filter rows that contain a certain string using dplyr package in R programming language. Functions Used Two main functions which will be used to carry out this task are: filter(): dplyr package's filter function will be used for filtering rows based on condition
4 min read
Create a ranking variable with Dplyr package in R
In this article, we will discuss how to create a ranking variable with the Dplyr package in R. Installation To install this package type the below command in the terminal. install.packages("dplyr") The mutate method can be used to rearrange data into a different orientation by performing various agg
3 min read
How to Reorder data.table Columns (Without Copying) in R
In data analysis with R, the data.table package is a powerful tool for the handling large datasets efficiently. One common task when working with the data.table is reordering columns. This can be necessary for the better organization improved readability or preparing data for the specific analyses.
4 min read
How to Perform Arcsine Transformation in R?
In this article, we will discuss how to perform arcsine transformation in R programming language. Arcsine transformation is used to extend the  data points in the range of 0 -1 Syntax: asin(sqrt(data)) where, data belongs to range of values from 0 to 1 Arcsine Transformation of Values in Range 0 to
2 min read
Data visualization with R and ggplot2
The ggplot2 ( Grammar of Graphics ) is a free, open-source visualization package widely used in R Programming Language. It includes several layers on which it is governed. The layers are as follows: Layers with the grammar of graphicsData: The element is the data set itself.Aesthetics: The data is t
7 min read
How to Transform Data in R?
Data transformation in R can be performed using the tidyverse and dplyr packages, which offer various methods for data manipulation. These packages can be easily installed and provide a range of techniques for data transformation. Installing Required PackagesThe tidyverse and dplyr package can be in
8 min read