[R Charts] Drawing a scatter plot with ggplot

2023-06-21
R Tutorials, Data Visualization

Last Updated on 2023-10-31

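Want to draw a scatter plot? Let ggplot2 show you how!

Want to draw a scatter plot in R but not sure how to use the relevant functions? When should you use a scatter plot? How do you add text labels? How do you highlight specific data points?

In this article I'll show how to make good use of the R package ggplot2, with real code, to walk you through drawing a scatter plot.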

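What is a scatter plot?

A scatter plot shows the relationship between numeric variables: each point's position on the chart is determined jointly by the values of two variables.

So what do we gain from using a scatter plot?

Figure: Average three-point attempts per game versus three-point percentage, P-league Season 2 Finals

From this scatter plot of the P-league Season 2 Finals, we can pick out a few key points.

周桂羽 sits in the lower left and 布魯斯巫師 in the upper right, 蔡文誠 stands out in the upper left, and in the middle there is a small cluster that includes 林明毅 and 張宗憲.

The chart lets us see how the data points are distributed (who played well, who played poorly), and it also lets us quickly identify the so-called outliers (法獅 clanking shots, 蔡文誠 shooting rarely but with deadly accuracy).

That is where the value of a scatter plot lies.

In other scatter plots we can also see the trend between two variables, for example a relationship such as "shot attempts and average points are positively correlated."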
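
To make the idea of a trend concrete, here is a minimal sketch. It does not use the P-league data; it assumes the mtcars dataset that ships with R, and the choice of the disp and hp columns is mine, purely for illustration. It draws a scatter plot with a straight-line fit and computes the correlation between the two variables.

```r
library(ggplot2)

# mtcars ships with R; disp = engine displacement, hp = horsepower
p_trend <- ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point() +
  # overlay a straight-line fit so the overall trend is easy to see
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Pearson correlation between the two variables
r_trend <- cor(mtcars$disp, mtcars$hp)
r_trend
```

A positive value of r_trend matches the upward slope of the fitted line, which is the kind of relationship the sentence above describes.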

Below, we'll show how to draw a scatter plot in R with the ggplot2 package.

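Hands-on

What the data looks like

Let's first import a cleaned-up dataset of player performance in the P-league Season 2 Finals.

As the output below shows, the data has many columns. Each row is one player, and the columns summarize that player's performance in the Finals, including games played, shooting percentages, and minutes on court.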

library(tidyverse)
df_clean <- read_csv("data/df_clean.csv")
df_clean %>% glimpse()
#> Rows: 30
#> Columns: 28
#> $ name       <chr> "辛特力", "高國豪", "法獅", "辛巴", "布魯斯巫獅", "瓊斯", "…
#> $ number     <dbl> 17, 4, 7, 35, 31, 11, 3, 8, 32, 12, 76, 9, 21, 1, 3, 14, 8,…
#> $ team       <chr> "臺北富邦勇士", "新竹街口攻城獅", "新竹街口攻城獅", "新竹街…
#> $ n_game     <dbl> 5, 5, 3, 5, 2, 3, 5, 5, 2, 5, 4, 2, 5, 5, 3, 5, 5, 5, 2, 5,…
#> $ shot_two   <dbl> 40, 17, 6, 53, 5, 14, 14, 8, 6, 8, 6, 3, 10, 10, 1, 11, 9, …
#> $ try_two    <dbl> 67, 40, 21, 67, 18, 32, 32, 19, 13, 13, 18, 6, 20, 22, 8, 2…
#> $ shot_three <dbl> 18, 12, 8, 0, 2, 3, 5, 5, 3, 15, 3, 2, 4, 3, 3, 8, 1, 3, 3,…
#> $ try_three  <dbl> 42, 30, 27, 0, 14, 11, 20, 21, 8, 32, 8, 4, 8, 11, 11, 11, …
#> $ shot_ft    <dbl> 34, 9, 12, 32, 12, 5, 13, 3, 2, 15, 1, 0, 5, 0, 0, 4, 2, 3,…
#> $ try_ft     <dbl> 42, 10, 12, 56, 13, 5, 18, 7, 2, 20, 4, 0, 8, 0, 0, 7, 5, 6…
#> $ pt         <dbl> 168, 79, 48, 138, 28, 42, 56, 34, 23, 76, 22, 12, 37, 29, 1…
#> $ or         <dbl> 14, 6, 3, 33, 1, 2, 0, 1, 4, 8, 5, 2, 5, 5, 7, 6, 3, 2, 0, …
#> $ dr         <dbl> 56, 21, 16, 47, 8, 15, 9, 12, 2, 21, 12, 3, 8, 10, 8, 22, 4…
#> $ tr         <dbl> 70, 27, 19, 80, 9, 17, 9, 13, 6, 29, 17, 5, 13, 15, 15, 28,…
#> $ as         <dbl> 28, 24, 13, 7, 7, 11, 11, 15, 3, 16, 9, 8, 20, 14, 8, 4, 7,…
#> $ st         <dbl> 10, 9, 4, 0, 8, 3, 10, 4, 0, 1, 6, 2, 4, 4, 2, 8, 4, 1, 1, …
#> $ bl         <dbl> 3, 0, 1, 13, 1, 1, 1, 0, 3, 0, 7, 0, 2, 0, 0, 0, 0, 2, 0, 0…
#> $ to         <dbl> 23, 11, 12, 8, 4, 4, 9, 13, 2, 11, 9, 2, 3, 9, 3, 3, 2, 11,…
#> $ fl         <dbl> 13, 16, 4, 6, 8, 12, 10, 17, 7, 5, 14, 6, 19, 5, 2, 12, 13,…
#> $ min        <dbl> 206, 202, 110, 178, 70, 97, 138, 134, 52, 115, 92, 43, 105,…
#> $ sec        <dbl> 49, 10, 52, 59, 51, 37, 46, 58, 3, 40, 7, 33, 14, 55, 2, 59…
#> $ per_two    <dbl> 0.5970149, 0.4250000, 0.2857143, 0.7910448, 0.2777778, 0.43…
#> $ per_three  <dbl> 0.4285714, 0.4000000, 0.2962963, NA, 0.1428571, 0.2727273, …
#> $ per_ft     <dbl> 0.8095238, 0.9000000, 1.0000000, 0.5714286, 0.9230769, 1.00…
#> $ sec_total  <dbl> 12409, 12130, 6652, 10739, 4251, 5857, 8326, 8098, 3123, 69…
#> $ min_total  <dbl> 206.81667, 202.16667, 110.86667, 178.98333, 70.85000, 97.61…
#> $ sec_per    <dbl> 2481.8000, 2426.0000, 2217.3333, 2147.8000, 2125.5000, 1952…
#> $ min_per    <dbl> 41.363333, 40.433333, 36.955556, 35.796667, 35.425000, 32.5…

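Cleaning the data

Because the chart we want to draw is about three-pointers, we first need to prepare the relevant data.

We first keep only players with at least 2 three-point attempts in total, and then keep only those who averaged at least 1 three-point attempt per game.

This threshold is admittedly subjective and arbitrary; I simply felt it would weed out players who shot too little to be worth comparing.

You can of course choose your own filtering criteria, as long as they are convincing.

The cleaned data looks like this: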

df_three <- df_clean %>%
  filter(!is.na(per_three)) %>%
  filter(try_three > 1) %>%
  mutate(try_avg_three = try_three/n_game, shot_avg_three = shot_three/n_game) %>%
  filter(try_avg_three >= 1) %>%
  select(name, team, n_game, matches("three")) %>% arrange(desc(per_three))
df_three %>% head(5) %>% knitr::kable(booktabs = TRUE)

name	team	n	shot	try	per	try_avg	shot_avg
蔡文誠	臺北富邦勇士	5	8	11	0.73	2.2	1.6
田浩	新竹街口攻城獅	2	2	4	0.50	2.0	1.0
曾祥鈞	臺北富邦勇士	5	4	8	0.50	1.6	0.8
石博恩	臺北富邦勇士	3	3	6	0.50	2.0	1.0
林志傑	臺北富邦勇士	5	15	32	0.47	6.4	3.0

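Sample player statistics

Note: n is short for n_game, and shot, try, per, try_avg, and shot_avg all carry a _three suffix in the data; the suffix was dropped here so that the table displays properly.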
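
If you want to shorten the column names for display in the same way, one possible approach is dplyr's rename_with(). The sketch below is only an assumption about how it could be done, not necessarily how the table above was produced. df_demo is a hypothetical stand-in for df_three with two rows copied from the table.

```r
library(dplyr)
library(stringr)

# Toy stand-in for df_three: two rows from the table above, same column names
df_demo <- tibble::tibble(
  name = c("蔡文誠", "林志傑"),
  n_game = c(5, 5),
  shot_three = c(8, 15),
  try_three = c(11, 32),
  try_avg_three = c(2.2, 6.4)
)

df_display <- df_demo %>%
  rename(n = n_game) %>%
  # strip the trailing "_three" from every column name that ends with it
  rename_with(~ str_remove(.x, "_three$"), .cols = ends_with("_three"))

names(df_display)
```

Renaming only a display copy keeps the original column names intact for the plotting code that follows.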

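Drawing a basic scatter plot: using `geom_point()`

We map average three-point attempts per game to x, three-point percentage to y, and team to color.

With geom_point(), a scatter plot comes together quickly.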

df_three %>% 
  ggplot(aes(x = try_avg_three, y = per_three, color = team)) + 
  geom_point(alpha = 0.5, size = 2)

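Looking at the chart, you may wonder: there are clearly only two teams, so why does it look like there are three colors?

This is not a problem with the data. It happens because some data points sit too close to each other.

Why are they so close? Because those players performed so similarly that their points land almost on top of each other on the axes, and with alpha = 0.5 the overlapping semi-transparent points blend into what looks like a third color.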
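
One way to confirm the overlap is to count how many players share exactly the same coordinates. The sketch below is a small illustration under my own assumptions: df_pts is a hypothetical four-row subset rebuilt from the sample table above, and the n_same_spot column name is made up for this example.

```r
library(dplyr)

# Four rows copied from the sample table above (games, makes, attempts)
df_pts <- tibble::tibble(
  name = c("田浩", "曾祥鈞", "石博恩", "林志傑"),
  team = c("新竹街口攻城獅", "臺北富邦勇士", "臺北富邦勇士", "臺北富邦勇士"),
  n_game = c(2, 5, 3, 5),
  shot_three = c(2, 4, 3, 15),
  try_three = c(4, 8, 6, 32)
) %>%
  mutate(try_avg_three = try_three / n_game,
         per_three = shot_three / try_three)

# Keep only the players whose (x, y) position is shared with someone else
overlaps <- df_pts %>%
  add_count(try_avg_three, per_three, name = "n_same_spot") %>%
  filter(n_same_spot > 1)

overlaps %>% select(name, team, try_avg_three, per_three)
```

In this subset, 田浩 and 石博恩 land on the same spot (2 attempts per game, 50%) while playing for different teams, which is exactly the situation where two semi-transparent colors stack on top of each other.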

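When data points overlap: adding random offsets with `geom_jitter()`

To deal with the overlap, we can add a small random offset to each point's position so that overlapping points no longer blend into a different color. geom_jitter() does exactly that.

Besides the parameters it shares with geom_point(), it has additional width and height parameters that control the size of the offset.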

df_three %>% 
  ggplot(aes(x = try_avg_three, y = per_three, color = team)) + 
  geom_jitter(alpha = 0.5, size = 2, width = 0.1, height = 0.1)

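As you can see in the chart, the points no longer overlap!

This chart actually doesn't have many data points; they just happen to overlap because of the nature of the data.

In general, geom_jitter() is most often used when there are a large number of data points, which is where it helps presentation the most.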
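
Two practical caveats are worth a quick sketch. First, the offsets are random, so the chart changes a little every time it is drawn unless a seed is fixed. Second, the offsets are in data units, so on a percentage axis a height of 0.1 can move a point by up to 10 percentage points. The example below is my own illustration, not part of the original workflow: it uses a hypothetical toy data frame, passes position_jitter() with a seed to geom_point(), and picks a smaller height of 0.02.

```r
library(ggplot2)

# Three hypothetical points stacked on exactly the same spot
df_toy <- data.frame(
  x = c(2, 2, 2),
  y = c(0.5, 0.5, 0.5),
  team = c("A", "B", "A")
)

p_jit <- ggplot(df_toy, aes(x = x, y = y, color = team)) +
  geom_point(
    alpha = 0.5, size = 2,
    # seed makes the random offsets repeatable; height is kept small
    # because the y axis is a proportion between 0 and 1
    position = position_jitter(width = 0.1, height = 0.02, seed = 42)
  )

# Extract the jittered coordinates twice to check they are repeatable
xy_1 <- layer_data(p_jit)[, c("x", "y")]
xy_2 <- layer_data(p_jit)[, c("x", "y")]
all.equal(xy_1, xy_2)
```

geom_jitter() is a shortcut for geom_point() with a jitter position, so the same position_jitter() call should also work there through the position argument.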

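Adding text labels: `geom_text_repel()`

What we want to see is how each player performed, but bare points can't be identified, so we add text labels to tell the players apart.

The geom_text_repel() function from the ggrepel package adds those labels.

You may have seen that the more commonly used function is geom_text(). So what difference does the repel suffix make?

It is somewhat like geom_jitter(): the repel version keeps labels from overlapping, which helps a lot with names that are 3 or even 5 characters long (the import players).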

library(ggrepel)
df_three %>% 
  ggplot(aes(x = try_avg_three, y = per_three, color = team, label = name)) + 
  geom_jitter(alpha = 0.5, size = 2, width = 0.1, height = 0.1) + 
  geom_text_repel(family = "Noto Sans CJK TC Regular", size = 3,
                  max.overlaps = getOption("ggrepel.max.overlaps", default = 15))

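Polishing: adjusting colors, fonts, axes, and background

The font for the team names hasn't been adjusted yet, and I also want each team's color to match the team, so let's fix both.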

value_color = c("新竹街口攻城獅"="#531078","臺北富邦勇士"="#007CB5")

We use scale_color_manual() to specify the colors, matching the two teams' signature colors.

Then we set the font inside theme().

df_three %>% 
  ggplot(aes(x = try_avg_three, y = per_three, color = team, label = name)) + 
  geom_point(alpha = 0.5, size = 2) + 
  geom_text_repel(family = "Noto Sans CJK TC Regular", size = 3,
                  max.overlaps = getOption("ggrepel.max.overlaps", default = 15)) +
  scale_color_manual(values = value_color) +
  theme(text = element_text(family = "Noto Sans CJK TC Regular"))

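The next step is the axes.

Because the variables on both the x and y axes are continuous, we can adjust them through the arguments of scale_x_continuous() and scale_y_continuous(), respectively.

Among these, limits sets the range of the axis (for example, I cap the y axis at 0.8); labels defines how the tick labels look, such as the number of decimal places or whether numbers get thousands separators; and breaks sets where the tick labels are placed.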

df_three %>% 
  ggplot(aes(x = try_avg_three, y = per_three, color = team, label = name)) + 
  geom_point(alpha = 0.5, size = 2) + 
  geom_text_repel(family = "Noto Sans CJK TC Regular", size = 3,
                  max.overlaps = getOption("ggrepel.max.overlaps", default = 15)) +
  scale_color_manual(values = value_color) +
  scale_x_continuous(limits = c(0, 10), breaks = seq(0, 10, 2)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 0.8), breaks = seq(0,0.8,0.1)) +
  theme(text = element_text(family = "Noto Sans CJK TC Regular"))
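
To see what the labels and breaks arguments actually receive, it can help to run the pieces on their own. The sketch below is just an illustration with made-up input values: scales::percent_format() returns a function, and that function turns proportions into percent strings.

```r
library(scales)

# percent_format() returns a labelling function; accuracy = 1 rounds to whole percents
fmt_pct <- percent_format(accuracy = 1)
fmt_pct(c(0.1, 0.456, 0.8))
#> [1] "10%" "46%" "80%"

# The tick positions handed to breaks in the code above
y_breaks <- seq(0, 0.8, 0.1)
length(y_breaks)
#> [1] 9
```

ggplot2 calls the labelling function on the break values, so the y axis ends up with nine ticks labelled from 0% to 80%.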

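Next, let's add a title and explanatory text!

We use labs() with arguments such as title and subtitle, and supply font sizes and other details in theme().

Further reading: [R Resources] Displaying Chinese in R charts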

df_three %>% 
  ggplot(aes(x = try_avg_three, y = per_three, color = team, label = name)) + 
  geom_point(alpha = 0.5, size = 2) + 
  geom_text_repel(family = "Noto Sans CJK TC Regular", size = 3,
                  max.overlaps = getOption("ggrepel.max.overlaps", default = 15)) +
  scale_color_manual(values = value_color) +
  scale_x_continuous(limits = c(0, 10), breaks = seq(0, 10, 2)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 0.8), breaks = seq(0,0.8,0.1)) +
  labs(x= "平均每場三分球出手數\n",y= "三\n分\n球\n命\n中\n率", 
       title = "文誠破七成、法獅巫師雙雙打鐵",
       subtitle = "PLG 球員平均每場三分球出手數與命中率, 2022 冠軍賽",
       caption = "資料來源：P-league 官方網站；僅計算平均單場三分球出手至少 1 球者") +
  theme(plot.background = element_rect(fill = "white"), 
        plot.title = element_text(size = 18, family = "Noto Sans CJK TC Medium", hjust = 0, margin = margin(0,0,12,0)),
        plot.subtitle = element_text(size = 14, family = "Noto Sans CJK TC Medium", hjust = 0, margin = margin(0,0,15,0)),
        plot.caption = element_text(size = 10, family = "Noto Sans CJK TC Regular"),
        plot.title.position = "plot", 
        plot.caption.position =  "plot",
        text = element_text(family = "Noto Sans CJK TC Regular"))

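Reference line: adding the average

Still, this chart has one small problem.

We can tell who did especially well and who did especially badly, but for everyone else it is hard to say whether they did well or not.

Any comparison needs a baseline. That is the first iron rule of data analysis.

So let's compute the average three-point percentage and use it as the yardstick.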

# Overall percentage across all players in df_clean: total makes / total attempts
value_three_avg <- df_clean %>%
  summarise(shot_three = sum(shot_three, na.rm = TRUE), try_three = sum(try_three, na.rm = TRUE)) %>%
  mutate(per_three = shot_three / try_three) %>%
  pull(per_three)
value_three_avg2 <- round(value_three_avg, 2) %>% scales::percent()
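
Note that the code above pools all makes and all attempts before dividing, which is not the same as averaging each player's percentage. A minimal sketch of the difference, using a hypothetical two-row data frame built from two rows of the sample table:

```r
library(dplyr)

# Two players from the sample table: makes and attempts
df_avg <- tibble::tibble(
  name = c("蔡文誠", "林志傑"),
  shot_three = c(8, 15),
  try_three = c(11, 32)
)

res_avg <- df_avg %>%
  summarise(
    pooled = sum(shot_three) / sum(try_three),   # 23 / 43, weighted by attempts
    unweighted = mean(shot_three / try_three)    # mean of 8/11 and 15/32
  )
res_avg
```

The pooled figure gives more weight to players who shoot more often, which seems the more natural baseline here, since a player with 11 attempts should not count as much as one with 32.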

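With the average three-point percentage in hand, we draw a reference line on the chart so we can tell good performances from poor ones!

Combining geom_hline(), which adds a horizontal line, with geom_text(), which adds a text label stating the average percentage, gives the chart a reference standard.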

df_three %>% 
  ggplot(aes(x = try_avg_three, y = per_three, color = team, label = name)) + 
  geom_point(alpha = 0.5, size = 2) + 
  geom_text_repel(family = "Noto Sans CJK TC Regular", size = 3,
                  max.overlaps = getOption("ggrepel.max.overlaps", default = 15)) +
  # linewidth replaces size for line thickness in ggplot2 3.4.0 and later
  geom_hline(yintercept = value_three_avg, linetype = "dashed", color = "#361509", linewidth = 0.4) +
  geom_text(aes(9.2,value_three_avg,label = str_c("平均:", value_three_avg2), vjust = -1),
            family = "Noto Sans CJK TC Regular", color = "black", size = 3) +
  scale_color_manual(values = value_color) +
  scale_x_continuous(limits = c(0, 10), breaks = seq(0, 10, 2)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 0.8), breaks = seq(0,0.8,0.1)) +
  labs(x= "平均每場三分球出手數\n",y= "三分球命中率", 
       title = "文誠破七成、法獅巫師雙雙打鐵",
       subtitle = "球員平均三分球出手數與命中率, 2022冠軍賽",
       caption = "資料來源：P-league 官方網站") +
  theme(plot.background = element_rect(fill = "white"), 
        strip.background = element_rect(fill = "white"), 
        plot.title = element_text(size = 18, family = "Noto Sans CJK TC Medium", hjust = 0, margin = margin(0,0,12,0)),
        plot.subtitle = element_text(size = 14, family = "Noto Sans CJK TC Medium", hjust = 0, margin = margin(0,0,15,0)),
        plot.caption = element_text(size = 10, family = "Noto Sans CJK TC Regular"),
        plot.title.position = "plot", 
        plot.caption.position =  "plot",
        text = element_text(family = "Noto Sans CJK TC Regular"))
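
One detail to be aware of: when geom_text() is given constant positions inside aes(), it still draws one label per row of the data, so the same text is stacked on itself many times. annotate() draws it once. The sketch below is an alternative I'm suggesting, not the code used for the chart above. df_toy and avg_toy are hypothetical values for illustration only.

```r
library(ggplot2)

# Hypothetical data and average, for illustration only
df_toy <- data.frame(x = c(2, 4, 6, 9), y = c(0.5, 0.3, 0.45, 0.3))
avg_toy <- 0.35

# geom_text() with constants inside aes(): one copy of the label per data row
p_geom <- ggplot(df_toy, aes(x, y)) +
  geom_point() +
  geom_text(aes(9.2, avg_toy, label = "avg"), vjust = -1)

# annotate(): a single label, independent of the data
p_annot <- ggplot(df_toy, aes(x, y)) +
  geom_point() +
  annotate("text", x = 9.2, y = avg_toy, label = "avg", vjust = -1)

# Number of labels each text layer will draw (layer 2 is the text layer)
n_geom <- nrow(layer_data(p_geom, 2))
n_annot <- nrow(layer_data(p_annot, 2))
c(geom_text = n_geom, annotate = n_annot)
```

With only a few rows the stacked labels are hard to notice, but with more rows the text can start to look heavier or jagged, so annotate() may be the tidier choice for a single fixed label.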

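That's the complete process for drawing a scatter plot.

Summary

In this article we drew a scatter plot, which helps us capture the relationship between variables, see how the data is distributed, and spot outliers.

In practice, we draw with geom_point(), can use geom_jitter() to add offsets, and add text labels with geom_text() and geom_text_repel().

I hope you enjoyed this article, learned more about ggplot2, and picked up how to draw a scatter plot!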

Dennis Tseng
