2 Preprocess

using Rtemis, DataFrames, CategoricalArrays


2.1 Synthetic Data

x = DataFrame(ID=collect(1:5),
    a=[missing, randn(4)...],
    b=["a", "a", "b", "b", "c"],
    c=[missing, 3, 5, missing, 7],
    d=categorical(["f", "f", "s", "s", "f"]),
    e=[true, true, false, false, true],
    f=BitArray([0, 1, 1, 1, 0]),
    g=fill(42.0, 5));
# Append a copy of row 4 so the data contains one duplicate row
push!(x, x[4, :]);
x
6×8 DataFrame
 Row │ ID     a            b       c        d     e      f      g
     │ Int64  Float64?     String  Int64?   Cat…  Bool   Bool   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 1      missing      a       missing  f     true   false  42.0
   2 │ 2      0.301659     a       3        f     true   true   42.0
   3 │ 3      0.892823     b       5        s     false  true   42.0
   4 │ 4      -0.843242    b       missing  s     false  true   42.0
   5 │ 5      -0.00549055  c       7        f     true   false  42.0
   6 │ 4      -0.843242    b       missing  s     false  true   42.0

2.2 Check Data

check_data(x)
  DataFrame with 6 rows and 8 columns:

  Data types
  ├ 2 continuous features
  │ ├ 3 integers
  │ └ 1 float
  ├ 2 boolean features
  └ 1 categorical feature
    └ of which 0 are ordered

  Issues
  ├ 1 constant feature
  ├ 1 duplicate row
  └ 2 features contain missing values
    ├ Max percent missing in a feature is 0.50% ('c')
    └ Max percent missing in a row is 0.25% (row #1)

  Recommendations
    ➤ Remove the constant feature 
    ➤ Remove the duplicated case 
    ➤ Consider imputing missing values or using complete cases only
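
The issues listed in the report can be cross-checked by hand. The following is a minimal sketch using only DataFrames.jl, not a description of how `check_data` works internally. The frame `df` and the names `constant_cols`, `n_duplicates`, `col_missing`, and `row_missing` are introduced here for illustration; the values are copied from the displayed table, and column `d` is kept as plain strings to avoid an extra dependency.

```julia
using DataFrames

# Same layout as x above, with the displayed values typed in by hand
df = DataFrame(ID=[1, 2, 3, 4, 5, 4],
    a=[missing, 0.301659, 0.892823, -0.843242, -0.00549055, -0.843242],
    b=["a", "a", "b", "b", "c", "b"],
    c=[missing, 3, 5, missing, 7, missing],
    d=["f", "f", "s", "s", "f", "s"],
    e=[true, true, false, false, true, false],
    f=[false, true, true, true, false, true],
    g=fill(42.0, 6))

# Columns whose non-missing values are all identical
constant_cols = [n for n in names(df) if length(unique(skipmissing(df[!, n]))) == 1]

# Number of rows that repeat an earlier row (missing values compare as equal here)
n_duplicates = count(nonunique(df))

# Proportion of missing values in each column and in each row
col_missing = [count(ismissing, col) / nrow(df) for col in eachcol(df)]
row_missing = [count(j -> ismissing(df[i, j]), 1:ncol(df)) / ncol(df) for i in 1:nrow(df)]
```

In this sketch, column `c` has 3 missing values out of 6 rows and row 1 has 2 missing values out of 8 columns, so the largest entries of `col_missing` and `row_missing` are 0.5 and 0.25. This suggests that the 0.50 and 0.25 figures in the report are proportions, that is, 50% and 25%.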

2.3 Preprocess Data

xp = preprocess(
    x,
    removeConstants=true,
    removeDuplicates=true,
    string2categorical=true,
    boolean2categorical=true)
07-21-23 08:56:07 Removed 1 constant column. [preprocess]
07-21-23 08:56:07 Removed 1 duplicate row. [preprocess]
07-21-23 08:56:07 Converted 1 string variable to categorical. [preprocess]
07-21-23 08:56:08 Converted 2 boolean variables to categorical. [preprocess]
5×7 DataFrame
 Row │ ID     a            b     c        d     e      f
     │ Int64  Float64?     Cat…  Int64?   Cat…  Cat…   Cat…
─────┼──────────────────────────────────────────────────────
   1 │ 1      missing      a     missing  f     true   false
   2 │ 2      0.301659     a     3        f     true   true
   3 │ 3      0.892823     b     5        s     false  true
   4 │ 4      -0.843242    b     missing  s     false  true
   5 │ 5      -0.00549055  c     7        f     true   false
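
For comparison, the same cleaning steps can be sketched by hand with DataFrames.jl and CategoricalArrays.jl. This is an illustrative approximation under simple assumptions, not the Rtemis implementation of `preprocess`; the small frame `df` and the names `keep` and `out` are hypothetical.

```julia
using DataFrames, CategoricalArrays

# Toy data: one constant column (g) and one duplicated row (the last one)
df = DataFrame(ID=[1, 2, 3, 3],
    b=["a", "b", "b", "b"],
    e=[true, false, false, false],
    g=fill(42.0, 4))

# Keep only columns with more than one distinct non-missing value
keep = [n for n in names(df) if length(unique(skipmissing(df[!, n]))) > 1]
out = select(df, keep)

# Drop rows that repeat an earlier row
out = unique(out)

# Convert string and boolean columns to categorical
for n in names(out)
    if eltype(out[!, n]) <: Union{Missing,AbstractString,Bool}
        out[!, n] = categorical(out[!, n])
    end
end
```

After these steps `out` has 3 rows and the columns `ID`, `b`, and `e`, with `b` and `e` stored as categorical arrays.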
xp2 = preprocess(
    x,
    removeConstants=true,
    removeDuplicates=true,
    string2categorical=true,
    boolean2categorical=true,
    oneHot=true)
07-21-23 08:56:08 Removed 1 constant column. [preprocess]
07-21-23 08:56:08 Removed 1 duplicate row. [preprocess]
07-21-23 08:56:08 Converted 1 string variable to categorical. [preprocess]
07-21-23 08:56:08 Converted 2 boolean variables to categorical. [preprocess]
07-21-23 08:56:09 One-hot encoded 4 categorical variables. [preprocess]
5×12 DataFrame
 Row │ ID     a            b_a    b_b    b_c    c        d_f    d_s    e_false  e_true  f_false  f_true
     │ Int64  Float64?     Int64  Int64  Int64  Int64?   Int64  Int64  Int64    Int64   Int64    Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 1      missing      1      0      0      missing  1      0      0        1       1        0
   2 │ 2      0.301659     1      0      0      3        1      0      0        1       0        1
   3 │ 3      0.892823     0      1      0      5        0      1      1        0       0        1
   4 │ 4      -0.843242    0      1      0      missing  0      1      1        0       0        1
   5 │ 5      -0.00549055  0      0      1      7        1      0      0        1       1        0
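
The one-hot layout above (one 0/1 column per level, named `<variable>_<level>`) can be reproduced by hand for a single column. This is a minimal sketch with DataFrames.jl only, assuming a string column with no missing values; it is not the code Rtemis uses, and `df` and `out` are hypothetical names.

```julia
using DataFrames

# Column b as in the synthetic data, after the duplicate row has been removed
df = DataFrame(ID=1:5, b=["a", "a", "b", "b", "c"])

# Start from the columns that are not encoded
out = select(df, :ID)

# Add one 0/1 indicator column per distinct value of b, named b_<value>
# (assumes b has no missing values; `.==` would otherwise yield missing)
for level in sort(unique(df.b))
    out[!, "b_" * level] = Int.(df.b .== level)
end
```

Here `out` has the columns `ID`, `b_a`, `b_b`, and `b_c`, and each row has exactly one 1 among the three indicator columns, as in the `b_a`, `b_b`, `b_c` columns of the table above.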