Master R Programming

Learn the powerful statistical programming language with modern tools, beautiful visualisations, and practical examples.

Getting Started with R Programming

What is R?

R is an open-source programming language specifically designed for statistical computing, data analysis, and visualisation. It's widely used in academia, research, and industry for data science applications.

# Basic R syntax example
a <- 10
print(a)
# Output: [1] 10

Code Explanation:

  • a <- 10 - Assigns the value 10 to variable 'a'
  • print(a) - Outputs the value of 'a'
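  • The <- operator is R's standard assignment operator (= also works at the top level, but <- is the convention)
  • In the output, [1] is the index of the first element printed on that line; even a single number is a vector of length one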
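
To see where the [1] comes from, it helps to confirm that R treats a single number as a vector of length one. A minimal sketch (the names len_a, first, and b are illustrative and not part of the example above):

```r
a <- 10

len_a <- length(a)   # a single number is a vector of length 1
first <- a[1]        # so it can be indexed like any other vector

b = 20               # = also assigns at the top level; <- is the usual convention

print(len_a)
print(first)
print(a + b)
```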

Understanding the RStudio Interface

RStudio Panes Explained:

  • Source Editor - Write and edit your R scripts and notebooks
  • Console - Execute commands and see immediate results
  • Environment/History - Track variables and command history
  • Files/Plots/Packages/Help - Manage your workspace and documentation

# Example of working with data in R
# Load built-in dataset
data(cars)

# View first few rows
head(cars)
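
Beyond head(), a few base R functions give a quick overview of a dataset from the Console. The sketch below uses only the built-in cars data; once loaded, the object also appears in the Environment pane.

```r
data(cars)

dim(cars)       # number of rows and columns
names(cars)     # column names
str(cars)       # compact overview of each column's type
summary(cars)   # minimum, quartiles, mean, and maximum per column
```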

Understanding the Panes:

RStudio arranges these four panes in a single window, so you can edit a script, run it, inspect the resulting objects, and view plots or help without switching applications.

Essential Data Structures in R

Vectors

The most basic data structure in R - a sequence of data elements of the same type.

Data Frames

Tabular data structure, similar to a spreadsheet: each column is a vector of a single type, and different columns can hold different types.

Lists

Flexible containers whose elements can differ in type and length, including other lists.

# Creating a vector with c()
numbers <- c(1, 2, 3, 4, 5)
print(numbers)
# Output: [1] 1 2 3 4 5
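
Because every element of a vector shares one type, arithmetic applies element by element, and mixing types triggers coercion. A short sketch of both behaviours (the names doubled, third, and mixed are illustrative):

```r
numbers <- c(1, 2, 3, 4, 5)

doubled <- numbers * 2      # arithmetic is applied to every element
third   <- numbers[3]       # indexing starts at 1, not 0

mixed <- c(1, "a", TRUE)    # mixed inputs are coerced to a common type
class(mixed)                # character

print(doubled)
print(third)
```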

Data Structure Types:

  • Vector - Basic one-dimensional data structure
  • Data Frame - Tabular data structure (like Excel)
  • Matrix - Two-dimensional data structure
  • List - Flexible structure that can hold different data types
  • Factor - Used for categorical data
  • Array - Multi-dimensional extension of matrices
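
The other structures in this list can be created with base R constructors. The sketch below is illustrative only: the object names and values (people, record, sizes, and so on) are made up for the example.

```r
# Data frame: equal-length columns, each with its own type
people <- data.frame(
  name = c("Ana", "Ben", "Cy"),   # hypothetical values
  age  = c(31, 25, 40)
)

# List: elements may differ in type and length
record <- list(id = 1L, tags = c("a", "b"), active = TRUE)

# Matrix: two dimensions, a single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Factor: categorical data with a fixed set of levels
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))

# Array: extends a matrix to three or more dimensions
arr <- array(1:24, dim = c(2, 3, 4))

str(people)
str(record)
dim(m)
levels(sizes)
dim(arr)
```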

Creating Beautiful Visualisations with ggplot2

Scatter Plots

Perfect for showing relationships between two continuous variables.

Bar Charts

Useful for comparing categorical data.

Histograms

Great for visualising data distributions.

# Load ggplot2 package
library(ggplot2)

# Create a scatter plot
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species)) +
  labs(title = "Iris Sepal Dimensions",
       x = "Sepal Length (cm)",
       y = "Sepal Width (cm)")

ggplot2 Code Breakdown:

  • ggplot(data = iris) - Sets iris as the default dataset
  • aes(x = Sepal.Length, y = Sepal.Width) - Defines aesthetic mappings
  • geom_point() - Adds points to create scatter plot
  • aes(color = Species) - Colours points by species category
  • labs() - Adds labels and title to the plot
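
The bar chart and histogram described earlier follow the same pattern with a different geom. A minimal sketch, assuming ggplot2 is installed; the choice of the built-in mtcars and iris datasets and the plot titles are illustrative:

```r
library(ggplot2)

# Bar chart: number of cars for each cylinder count (categorical x-axis)
bar_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(title = "Cars by Cylinder Count", x = "Cylinders", y = "Count")

# Histogram: distribution of a single continuous variable
hist_plot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 20) +
  labs(title = "Distribution of Sepal Length", x = "Sepal Length (cm)", y = "Count")

print(bar_plot)
print(hist_plot)
```

Wrapping cyl in factor() tells ggplot2 to treat the cylinder count as a category rather than a number.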

Data Wrangling with dplyr

Using `dplyr` functions:

  • select(): Choose specific columns
  • filter(): Filter rows based on conditions
  • arrange(): Sort data
  • mutate(): Create new columns
  • group_by(): Group data for summaries

# Example: select columns
library(dplyr)
starwars %>% select(name, height, mass)

# Filter example
starwars %>% filter(species == "Droid", height <= 100)

# Arrange example
starwars %>% arrange(desc(height))

# Mutate example
starwars %>% filter(species == "Human") %>% 
  mutate(bmi = mass / (height/100)^2)

# Group and summarise
starwars %>% group_by(species) %>%
  summarise(avg_height = mean(height, na.rm = TRUE),
            avg_mass = mean(mass, na.rm = TRUE))
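
These verbs are usually chained into a single pipeline, with each step receiving the result of the previous one. A sketch assuming dplyr is installed; the column names height_m and avg_height_m and the result name height_by_species are made up for illustration:

```r
library(dplyr)

height_by_species <- starwars %>%
  filter(!is.na(height)) %>%              # keep rows with a recorded height
  select(name, species, height) %>%       # keep only the columns needed
  mutate(height_m = height / 100) %>%     # convert centimetres to metres
  group_by(species) %>%                   # one group per species
  summarise(n = n(), avg_height_m = mean(height_m)) %>%
  arrange(desc(avg_height_m))             # tallest species first

height_by_species
```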

Hypothesis Testing with T-tests

Example: Sleep Study

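The built-in sleep dataset records the extra hours of sleep (extra) that 10 patients gained under each of two drugs (group). We compare the two drugs using a t-test.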

# Example data (built-in dataset)
sleep

# Boxplot
library(ggplot2)
ggplot(sleep, aes(x = group, y = extra)) + geom_boxplot()

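# Welch two-sample t-test (treats the two groups as independent)
t.test(extra ~ group, data = sleep)

# Paired t-test (each patient received both drugs)
# Note: R 4.4.0 and later reject paired = TRUE in the formula interface,
# so the two groups are passed as separate vectors.
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))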

Understanding p-values:

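The p-value is the probability of seeing a difference at least as large as the observed one if the null hypothesis (no difference between groups) were true. A small p-value (conventionally below 0.05) is evidence against the null hypothesis; it does not measure how large or important the difference is.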
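
t.test() returns a list-like object, so the p-value and confidence interval can be pulled out and compared with a chosen threshold. A sketch using the built-in sleep data; the names welch, paired_res, and alpha are illustrative:

```r
# Independent (Welch) and paired versions of the test
welch      <- t.test(extra ~ group, data = sleep)
paired_res <- with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))

welch$p.value        # p-value treating the groups as independent
paired_res$p.value   # p-value using the pairing by patient
welch$conf.int       # confidence interval for the difference in means

alpha <- 0.05        # conventional threshold, chosen before looking at the data
paired_res$p.value < alpha
```

For these data the paired test gives the smaller p-value, because pairing removes the variation between patients.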

Permutation Test (Optional Advanced)

What is a permutation test?

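It's a non-parametric method for testing whether two groups differ: the group labels are shuffled at random many times, and the observed difference is compared with the differences produced by the shuffles.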

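# Pool data
# (assumes a data frame mosquito_preference with a numeric column
# num_mosquitoes is already loaded; it is not a built-in dataset)
pooled <- mosquito_preference$num_mosquitoes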

# Shuffle data
shuffled <- sample(pooled)

# Split into fake groups
fake_group1 <- head(shuffled, 25)
fake_group2 <- tail(shuffled, 18)

# Calculate difference
fake_diff <- mean(fake_group1) - mean(fake_group2)

# Repeat many times
fake_diffs <- numeric(10000)
for(i in 1:10000){
  shuffled <- sample(pooled)
  fake_group1 <- head(shuffled, 25)
  fake_group2 <- tail(shuffled, 18)
  fake_diffs[i] <- mean(fake_group1) - mean(fake_group2)
}

# Plot histogram
library(ggplot2)
ggplot(data.frame(fake_diffs = fake_diffs), aes(x = fake_diffs)) +
  geom_histogram(bins = 30) +
  labs(title = "Permutation Distribution of Difference", x = "Difference", y = "Frequency")

Interpreting results:

Compare your observed difference with this distribution: the proportion of shuffled differences at least as extreme as the observed one is the permutation p-value.
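
The comparison with the observed difference can be sketched end to end. Since the mosquito data are not included here, this example substitutes the built-in ToothGrowth dataset (tooth length for two independent supplement groups) as a stand-in; the variable names and the choice of 10000 shuffles are illustrative assumptions.

```r
set.seed(42)  # make the shuffles reproducible

# Stand-in data: tooth length (len) for two supplement groups (supp: "OJ", "VC")
values <- ToothGrowth$len
labels <- ToothGrowth$supp

# Observed difference in group means
observed_diff <- mean(values[labels == "OJ"]) - mean(values[labels == "VC"])

# Shuffle the labels many times and recompute the difference each time
n_perm <- 10000
perm_diffs <- replicate(n_perm, {
  shuffled <- sample(labels)
  mean(values[shuffled == "OJ"]) - mean(values[shuffled == "VC"])
})

# Two-sided p-value: share of shuffled differences at least as extreme as observed
p_value <- mean(abs(perm_diffs) >= abs(observed_diff))

observed_diff
p_value
```

Shuffling the labels rather than the values keeps the group sizes fixed, which matches the head() and tail() split used above.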