Data Preprocessing (1)

邱之宇 Cosmo Chiou
2 min read · Oct 31, 2022


Data Preprocessing Steps

  1. Data Cleaning
  2. Missing Value Imputation
  3. Data Labeling

Data Cleaning

Suppose a company's gender column contains values such as boy, girl, male, female, Male, Female, M, F, etc. Many of these values are duplicates: different spellings of the same category.

If you draw a pie chart of this column, it is split into many slices, and many of those slices represent the same category.
  1. First, decide on the data format.
    - “Boy, Girl”
    - “Male, Female”
    - “M, F”
  2. If the format is set to “M, F”, start the consolidation process: convert every variant of male and female to “M” or “F”.
  3. Finally, if some values do not represent male or female at all, such as a cell phone number or an address, change them to “null”.
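The three steps above can be sketched with pandas. This is only a minimal illustration: the sample values, the `to_mf` mapping, and the variable names are assumptions made for the example, not data from a real company.

```python
import pandas as pd

# Hypothetical raw column (illustrative values only).
raw = pd.Series(["boy", "Girl", "male", "Female", "M", "f", "0900-000-000", "Taipei"])

# Steps 1 and 2: the chosen format is "M, F", so map every known variant to it.
to_mf = {"boy": "M", "male": "M", "m": "M",
         "girl": "F", "female": "F", "f": "F"}

# Strip spaces and lowercase first so "Male" and "male" are treated alike.
# Step 3: anything not in the mapping (phone number, address) becomes NaN (null).
cleaned = raw.str.strip().str.lower().map(to_mf)

print(cleaned.value_counts(dropna=False))
```

After this step a pie chart of `cleaned` has only two slices plus the nulls, instead of one slice per spelling.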

Data Cleaning: Outliers

There is no uniform definition of an outlier. Data analysts or decision makers judge which values count as outliers.

Methods to judge outliers:

  • Draw box plots and treat values beyond a chosen threshold (for example, above a certain percentile) as outliers.
  • Use a normal distribution and treat values far from the mean as outliers.
  • Map the data into a specific space and observe the distances between data points.
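The first two methods can be sketched roughly as follows. The sample values are made up, and the thresholds (1.5 × IQR for the box-plot rule, 3 standard deviations for the normal-distribution rule) are common conventions assumed here for illustration. As noted above, the analyst or decision maker chooses the actual cut-off.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value (95).
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

# Box-plot rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Normal-distribution rule: flag values more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

print(iqr_outliers.tolist())
print(z_outliers.tolist())
```

The two rules can disagree on other data, because a single extreme value inflates the mean and standard deviation but barely moves the quartiles.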

Missing Value Imputation

Should a missing value be ignored, or should it be filled in with the closest value?

Different methods can produce very different results, which may lead to wrong decisions and heavy losses. Missing values should therefore be handled with caution.

Methods

Listwise deletion (drop every row that contains a missing value)

df = df.dropna()

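Pairwise deletion (keep only the columns you need, here the columns labelled 0 and 1, then drop the rows with missing values in them)

df = df[[0, 1]].dropna()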
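To see how the two deletion snippets above differ, here is a small sketch on made-up data. The column labels 0, 1, 2 are chosen only to match the `df[[0, 1]]` snippet, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column 2 has missing values that columns 0 and 1 do not.
df = pd.DataFrame({
    0: [1.0, 2.0, np.nan, 4.0],
    1: [10.0, 20.0, 30.0, 40.0],
    2: [np.nan, 5.0, 6.0, np.nan],
})

# Listwise: a row is dropped if ANY column is missing.
listwise = df.dropna()

# Column subset first: only missing values in columns 0 and 1 matter.
pairwise = df[[0, 1]].dropna()

print(len(listwise))  # 1 row survives
print(len(pairwise))  # 3 rows survive
```

Restricting the check to the columns an analysis actually uses keeps more rows, at the cost of different analyses running on different subsets of the data.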

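fillna (fill with zero)

df = df.fillna(0)

ffill (fill with the value from the previous row; fillna(method='ffill') is deprecated in recent pandas versions)

df = df.ffill()

fillna (fill with the column mean; numeric_only=True skips non-numeric columns)

df = df.fillna(df.mean(numeric_only=True))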

More fillna options

In IPython or Jupyter, run “?df.fillna” to see the other options.
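As a rough illustration of why the choice of method matters, the sketch below applies three of the fill strategies to the same made-up column and compares the resulting averages. The column name `price` and its values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
df = pd.DataFrame({"price": [100.0, np.nan, 130.0, np.nan, 160.0]})

filled_zero = df.fillna(0)                           # missing -> 0
filled_prev = df.ffill()                             # missing -> previous row
filled_mean = df.fillna(df.mean(numeric_only=True))  # missing -> column mean

print(filled_zero["price"].mean())  # 78.0
print(filled_prev["price"].mean())  # 124.0
print(filled_mean["price"].mean())  # 130.0
```

The same data yields an average of 78, 124, or 130 depending on the fill method, which is the kind of deviation that can lead to a wrong decision.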

Data Labeling

  • Label the features in the data (multiple labels).
  • Label the correct answer in the data (True or False).

Data labeling tools

  • Labelbox
  • LabelImg
    - Supports YOLO
