Data Preprocessing(1)
· Data Preprocessing Steps
· Data Cleaning
---------- Data Cleaning: Outliers
---------- Methods to judge outliers:
· Impute missing values
---------- Method
· Data Labeling
---------- Data labeling tools
Data Preprocessing Steps
- Data Cleaning
- Impute missing values
- Data Labeling
Data Cleaning
Suppose a company's data records gender as boy, girl, male, female, Male, Female, M, F, etc. Many of these values are duplicates that mean the same thing.
- First, decide on a single format for the data, for example:
- “Boy, Girl”
- “Male, Female”
- “M, F”
- If the format is set to “M, F”, start the convergence process: change every variant of male and female to “M” or “F”.
- Finally, if some values do not represent male or female, such as a cell phone number or an address, change them to “null”.
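The convergence process above can be sketched with pandas. This is a minimal illustration, not a definitive implementation: the mapping table `GENDER_MAP`, the helper `normalize_gender`, and the sample values are all hypothetical, and “null” is represented by the missing value that pandas produces for unmapped entries.

```python
import pandas as pd

# Hypothetical mapping from every known variant (lower-cased) to the chosen "M, F" format.
GENDER_MAP = {
    "boy": "M", "male": "M", "m": "M",
    "girl": "F", "female": "F", "f": "F",
}

def normalize_gender(column: pd.Series) -> pd.Series:
    """Converge all gender variants to "M" or "F"; anything else becomes missing."""
    # Trim and lower-case strings so "Male", " male" and "MALE" are treated alike.
    cleaned = column.map(lambda v: v.strip().lower() if isinstance(v, str) else v)
    # Values that are not keys of the mapping (phone numbers, addresses) become NaN.
    return cleaned.map(GENDER_MAP)

# Made-up sample column containing valid variants and unrelated values.
raw = pd.Series(["boy", "Female", "M", "0912-345-678", "girl", "Taipei City", None])
result = normalize_gender(raw)
print(result.tolist())
```

Keeping the mapping in one dictionary means a newly discovered variant only needs one extra entry.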
Data Cleaning: Outliers
There is no uniform definition of an outlier. Whether a value counts as an outlier is judged by data analysts or decision makers.
Methods to judge outliers:
- Draw boxplots. Treat values above a certain percentile as outliers.
- Use a normal distribution. Treat values that lie far from the mean as outliers.
- Map the data into a specific space and observe the distances between data points.
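The three methods can be sketched as follows. The notes do not fix any thresholds, so everything numeric here is an assumption: the boxplot rule uses the common 1.5 × IQR whisker convention, the normal-distribution rule uses a z-score threshold, and the distance rule uses Euclidean distance to the centroid. The function names and sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boxplot rule: flag values more than k * IQR outside the quartiles (k = 1.5 is assumed)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Normal-distribution rule: flag values more than `threshold` standard deviations from the mean."""
    z = (values - values.mean()) / values.std()
    return z.abs() > threshold

def distance_outliers(points: np.ndarray, quantile: float = 0.95) -> np.ndarray:
    """Distance rule: flag points whose distance to the centroid exceeds the given quantile."""
    distances = np.linalg.norm(points - points.mean(axis=0), axis=1)
    return distances > np.quantile(distances, quantile)

# Made-up data: nine plausible ages and one suspicious value.
ages = pd.Series([21, 22, 23, 24, 25, 26, 27, 28, 29, 120])
print(iqr_outliers(ages).tolist())
# A small sample caps how large a z-score can get, so a lower threshold is used here.
print(zscore_outliers(ages, threshold=2.5).tolist())

# Made-up 2-D points: four close together and one far away.
points = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [10, 10]], dtype=float)
print(distance_outliers(points).tolist())
```

Since there is no uniform definition of an outlier, the thresholds are parameters that the analyst or decision maker still has to choose.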
Impute missing values
When a value is missing, should the record be ignored, or should the gap be filled with the closest value?
Different methods can produce very different results, which may lead to wrong decisions and heavy losses. Missing values should therefore be handled with caution.
Method
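List-wise deletion (drop rows)
df = df.dropna()  # drop every row that has at least one missing value
Pairwise deletion (drop rows only where the columns in use are missing)
df = df[[0, 1]].dropna()  # keep columns 0 and 1, then drop rows with a missing value in either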
Fillna (fill zero)
df = df.fillna(0)
Fillna (fill with the previous row's value)
df = df.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas versions
Fillna (fill with the column mean)
df = df.fillna(df.mean(numeric_only=True))  # each numeric column is filled with its own mean
More fillna
In IPython or Jupyter, run “?df.fillna” to look up the other options.
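To see how much the result depends on the method, the options above can be run side by side on a small table. This is only a sketch on made-up data with integer column labels 0, 1, 2. Forward fill is written as `df.ffill()`, which recent pandas versions prefer over `fillna(method='ffill')`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every row except the last has one missing value.
df = pd.DataFrame({
    0: [1.0, np.nan, 3.0, 4.0],
    1: [10.0, 20.0, np.nan, 40.0],
    2: [np.nan, 200.0, 300.0, 400.0],
})

listwise = df.dropna()                  # keeps only the one fully complete row
pairwise = df[[0, 1]].dropna()          # keeps rows that are complete in columns 0 and 1
zero_filled = df.fillna(0)              # every gap becomes 0
forward_filled = df.ffill()             # gaps take the value from the row above
mean_filled = df.fillna(df.mean(numeric_only=True))  # gaps take the column mean

print("rows kept:", len(df), len(listwise), len(pairwise))
print(forward_filled)
print(mean_filled)
```

Note that forward fill cannot fill a gap in the first row, because there is no earlier row to copy from, so that value stays missing.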
Data Labeling
- Label the features in the data (multiple labels).
- Label the correct answer in the data (True or False).
Data labeling tools
- Labelbox
- LabelImg
- Supports YOLO