2 Preprocess

using Rtemis, DataFrames, CategoricalArrays


2.1 Synthetic Data

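# Build a small data frame with missing values (a, c),
# a constant column (g), and mixed column types.
x = DataFrame(ID=collect(1:5),
    a=[missing, randn(4)...],
    b=["a", "a", "b", "b", "c"],
    c=[missing, 3, 5, missing, 7],
    d=categorical(["f", "f", "s", "s", "f"]),
    e=[true, true, false, false, true],
    f=BitArray([0, 1, 1, 1, 0]),
    g=fill(42.0, 5));
# Append a copy of row 4 so the data contain one duplicate row.
push!(x, x[4, :]);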

6×8 DataFrame

Row	ID	a	b	c	d	e	f	g
	Int64	Float64?	String	Int64?	Cat…	Bool	Bool	Float64
1	1	missing	a	missing	f	true	false	42.0
2	2	0.301659	a	3	f	true	true	42.0
3	3	0.892823	b	5	s	false	true	42.0
4	4	-0.843242	b	missing	s	false	true	42.0
5	5	-0.00549055	c	7	f	true	false	42.0
6	4	-0.843242	b	missing	s	false	true	42.0

2.2 Check Data

check_data(x)

  DataFrame with 6 rows and 8 columns:

  Data types
  ├ 2 continuous features
  │ ├ 3 integers
  │ └ 1 float
  ├ 2 boolean features
  └ 1 categorical feature
    └ of which 0 are ordered

  Issues
  ├ 1 constant feature
  ├ 1 duplicate row
  └ 2 features contain missing values
    ├ Max percent missing in a feature is 0.50% ('c')
    └ Max percent missing in a row is 0.25% (row #1)

  Recommendations
    ➤ Remove the constant feature 
    ➤ Remove the duplicated case 
    ➤ Consider imputing missing values or using complete cases only
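
The issues reported above can also be checked by hand with DataFrames alone. The following is a minimal sketch, not the implementation behind check_data: it assumes that a "constant" feature is one with at most one distinct non-missing value, and the frame `df` and the variable names are illustrative only.

```julia
using DataFrames

# Illustrative frame: missing values in a and c, a constant column g,
# and a fifth row that repeats the fourth.
df = DataFrame(
    a=[missing, 0.5, 1.5, 2.5, 2.5],
    c=[missing, 3, 5, missing, missing],
    g=fill(42.0, 5))

# Constant columns (assumed definition): at most one distinct non-missing value
constant_cols = [n for n in names(df) if length(unique(skipmissing(df[!, n]))) <= 1]

# Duplicate rows: nonunique() flags every row that repeats an earlier one
n_duplicates = count(nonunique(df))

# Fraction of missing values in each column
frac_missing = Dict(n => count(ismissing, df[!, n]) / nrow(df) for n in names(df))
```

Here `constant_cols` holds the column names to drop, `n_duplicates` counts the repeated rows, and `frac_missing` maps each column name to a fraction between 0 and 1.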

2.3 Preprocess Data

xp = preprocess(
    x,
    removeConstants=true,
    removeDuplicates=true,
    string2categorical=true,
    boolean2categorical=true)

07-21-23 08:56:07 Removed 1 constant column. [preprocess]
07-21-23 08:56:07 Removed 1 duplicate row. [preprocess]
07-21-23 08:56:07 Converted 1 string variable to categorical. [preprocess]
07-21-23 08:56:08 Converted 2 boolean variables to categorical. [preprocess]

5×7 DataFrame

Row	ID	a	b	c	d	e	f
	Int64	Float64?	Cat…	Int64?	Cat…	Cat…	Cat…
1	1	missing	a	missing	f	true	false
2	2	0.301659	a	3	f	true	true
3	3	0.892823	b	5	s	false	true
4	4	-0.843242	b	missing	s	false	true
5	5	-0.00549055	c	7	f	true	false
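
For comparison, roughly the same cleanup can be written with DataFrames and CategoricalArrays directly. This is a sketch under assumptions, not how preprocess works internally: the order of the steps, the definition of a constant column, and the names `df`, `out`, and `keep` are all illustrative.

```julia
using DataFrames, CategoricalArrays

# Illustrative frame: a string column, a boolean column, a constant column,
# and a third row that repeats the second.
df = DataFrame(
    b=["a", "b", "b"],
    e=[true, false, false],
    g=fill(42.0, 3))

# Drop duplicated rows, keeping the first occurrence
out = unique(df)

# Keep only columns with more than one distinct non-missing value
keep = [n for n in names(out) if length(unique(skipmissing(out[!, n]))) > 1]
out = select(out, keep)

# Convert string and boolean columns to categorical
for n in names(out)
    if nonmissingtype(eltype(out[!, n])) <: Union{AbstractString,Bool}
        out[!, n] = categorical(out[!, n])
    end
end
```

After these steps `out` has two rows and the two categorical columns `b` and `e`.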

xp2 = preprocess(
    x,
    removeConstants=true,
    removeDuplicates=true,
    string2categorical=true,
    boolean2categorical=true,
    oneHot=true)

07-21-23 08:56:08 Removed 1 constant column. [preprocess]
07-21-23 08:56:08 Removed 1 duplicate row. [preprocess]
07-21-23 08:56:08 Converted 1 string variable to categorical. [preprocess]
07-21-23 08:56:08 Converted 2 boolean variables to categorical. [preprocess]
07-21-23 08:56:09 One-hot encoded 4 categorical variables. [preprocess]

5×12 DataFrame

Row	ID	a	b_a	b_b	b_c	c	d_f	d_s	e_false	e_true	f_false	f_true
	Int64	Float64?	Int64	Int64	Int64	Int64?	Int64	Int64	Int64	Int64	Int64	Int64
1	1	missing	1	0	0	missing	1	0	0	1	1	0
2	2	0.301659	1	0	0	3	1	0	0	1	0	1
3	3	0.892823	0	1	0	5	0	1	1	0	0	1
4	4	-0.843242	0	1	0	missing	0	1	1	0	0	1
5	5	-0.00549055	0	0	1	7	1	0	0	1	1	0
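
The column naming in the table above (`d_f`, `d_s`, and so on) suggests one indicator column per level, named `<column>_<level>`. The sketch below reproduces that layout by hand for a single column. The helper `onehot` is hypothetical and is not part of Rtemis; it assumes the column contains no missing values.

```julia
using DataFrames, CategoricalArrays

df = DataFrame(ID=1:3, d=categorical(["f", "s", "f"]))

# Hypothetical helper: replace `col` with one 0/1 indicator column per level.
# Assumes `col` has no missing values.
function onehot(df::AbstractDataFrame, col::Symbol)
    out = select(df, Not(col))        # copy of df without the original column
    vals = unwrap.(df[!, col])        # raw values behind the categorical column
    for lvl in unwrap.(levels(df[!, col]))
        out[!, Symbol(col, "_", lvl)] = Int.(vals .== lvl)
    end
    return out
end

encoded = onehot(df, :d)
```

Calling `onehot(df, :d)` returns a frame with columns `ID`, `d_f`, and `d_s`, in which each row has exactly one indicator set to 1.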