using Rtemis, DataFrames, CategoricalArrays
x = DataFrame(ID=collect(1:5),
    a=[missing, randn(4)...],
    b=["a", "a", "b", "b", "c"],
    c=[missing, 3, 5, missing, 7],
    d=categorical(["f", "f", "s", "s", "f"]),
    e=[true, true, false, false, true],
    f=BitArray([0, 1, 1, 1, 0]),
    g=fill(42.0, 5));
# Append a copy of row 4 so the data contain one duplicate row
push!(x, x[4, :]);
x
Row | ID | a | b | c | d | e | f | g |
---|---|---|---|---|---|---|---|---|
| | Int64 | Float64? | String | Int64? | Cat… | Bool | Bool | Float64 |
1 | 1 | missing | a | missing | f | true | false | 42.0 |
2 | 2 | 0.301659 | a | 3 | f | true | true | 42.0 |
3 | 3 | 0.892823 | b | 5 | s | false | true | 42.0 |
4 | 4 | -0.843242 | b | missing | s | false | true | 42.0 |
5 | 5 | -0.00549055 | c | 7 | f | true | false | 42.0 |
6 | 4 | -0.843242 | b | missing | s | false | true | 42.0 |
check_data(x)
DataFrame with 6 rows and 8 columns:
Data types
├ 2 continuous features
│ ├ 3 integers
│ └ 1 float
├ 2 boolean features
└ 1 categorical feature
  └ of which 0 are ordered
Issues
├ 1 constant feature
├ 1 duplicate row
└ 2 features contain missing values
  ├ Max percent missing in a feature is 0.50% ('c')
  └ Max percent missing in a row is 0.25% (row #1)
Recommendations
➤ Remove the constant feature
➤ Remove the duplicated case
➤ Consider imputing missing values or using complete cases only
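The three issues reported above (constant features, duplicate rows, missing values) can also be checked by hand. The following is a minimal sketch using only DataFrames.jl, not a description of how `check_data` works internally. The frame `df` and the names `missing_by_col`, `constant_cols`, and `n_duplicates` are made up for illustration; `df` is a small deterministic stand-in for `x`.

```julia
using DataFrames

# Illustrative data (an assumption, not the frame from this page):
# one constant column, one repeated row, and some missing values
df = DataFrame(ID=[1, 2, 3, 4, 4],
               a=[missing, 0.5, -1.0, 2.0, 2.0],
               c=[missing, 3, 5, missing, missing],
               g=fill(42.0, 5))

# Fraction of missing values in each column, keyed by column name
missing_by_col = Dict(n => count(ismissing, df[!, n]) / nrow(df) for n in names(df))

# Columns whose non-missing values are all identical
constant_cols = [n for n in names(df) if length(unique(skipmissing(df[!, n]))) == 1]

# Number of rows that repeat an earlier row
n_duplicates = count(nonunique(df))
```

Here `nonunique` flags every row that is equal to an earlier one, so counting its `true` entries gives the number of duplicated cases.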
xp = preprocess(
    x,
    removeConstants=true,
    removeDuplicates=true,
    string2categorical=true,
    boolean2categorical=true)
07-21-23 08:56:07 Removed 1 constant column. [preprocess]
07-21-23 08:56:07 Removed 1 duplicate row. [preprocess]
07-21-23 08:56:07 Converted 1 string variable to categorical. [preprocess]
07-21-23 08:56:08 Converted 2 boolean variables to categorical. [preprocess]
Row | ID | a | b | c | d | e | f |
---|---|---|---|---|---|---|---|
| | Int64 | Float64? | Cat… | Int64? | Cat… | Cat… | Cat… |
1 | 1 | missing | a | missing | f | true | false |
2 | 2 | 0.301659 | a | 3 | f | true | true |
3 | 3 | 0.892823 | b | 5 | s | false | true |
4 | 4 | -0.843242 | b | missing | s | false | true |
5 | 5 | -0.00549055 | c | 7 | f | true | false |
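For comparison, the same four steps can be approximated with DataFrames.jl and CategoricalArrays.jl alone. This is a rough sketch under the assumption that the steps run in the order shown in the log; it is not the implementation behind `preprocess`, and the names `df`, `keep`, and `cleaned` are illustrative only.

```julia
using DataFrames, CategoricalArrays

# Illustrative data (an assumption): column g is constant, row 4 repeats row 3
df = DataFrame(ID=[1, 2, 3, 3],
               b=["a", "b", "b", "b"],
               e=[true, false, true, true],
               g=fill(42.0, 4))

# Keep only columns with more than one distinct non-missing value
keep = [n for n in names(df) if length(unique(skipmissing(df[!, n]))) > 1]

# Copy the kept columns, then drop repeated rows
cleaned = unique(df[:, keep])

# Convert string and Bool columns to categorical
for n in names(cleaned)
    if eltype(cleaned[!, n]) <: Union{AbstractString, Bool}
        cleaned[!, n] = categorical(cleaned[!, n])
    end
end
```

Columns that allow `missing` have an element type such as `Union{Missing, String}` and would be skipped by this simple type test, so a fuller version would need to handle them separately.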
xp2 = preprocess(
    x,
    removeConstants=true,
    removeDuplicates=true,
    string2categorical=true,
    boolean2categorical=true,
    oneHot=true)
07-21-23 08:56:08 Removed 1 constant column. [preprocess]
07-21-23 08:56:08 Removed 1 duplicate row. [preprocess]
07-21-23 08:56:08 Converted 1 string variable to categorical. [preprocess]
07-21-23 08:56:08 Converted 2 boolean variables to categorical. [preprocess]
07-21-23 08:56:09 One-hot encoded 4 categorical variables. [preprocess]
Row | ID | a | b_a | b_b | b_c | c | d_f | d_s | e_false | e_true | f_false | f_true |
---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64? | Int64 | Int64 | Int64 | Int64? | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
1 | 1 | missing | 1 | 0 | 0 | missing | 1 | 0 | 0 | 1 | 1 | 0 |
2 | 2 | 0.301659 | 1 | 0 | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 1 |
3 | 3 | 0.892823 | 0 | 1 | 0 | 5 | 0 | 1 | 1 | 0 | 0 | 1 |
4 | 4 | -0.843242 | 0 | 1 | 0 | missing | 0 | 1 | 1 | 0 | 0 | 1 |
5 | 5 | -0.00549055 | 0 | 0 | 1 | 7 | 1 | 0 | 0 | 1 | 1 | 0 |
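The one-hot columns above follow a `column_level` naming pattern with 0/1 integer values. A minimal sketch of that encoding for a single column, using DataFrames.jl only, could look like the following. `onehot_sketch` is a hypothetical helper written for this illustration and is not part of Rtemis; unlike the output above, it appends the indicator columns at the end rather than in the position of the original column.

```julia
using DataFrames

# Hypothetical helper (not an Rtemis function): replace `col` with one
# 0/1 indicator column per distinct level, named "<col>_<level>"
function onehot_sketch(df::DataFrame, col::Symbol)
    out = select(df, Not(col))                  # all columns except `col`
    for lev in sort(unique(df[!, col]))
        out[!, Symbol(col, "_", lev)] = Int.(df[!, col] .== lev)
    end
    return out
end

df = DataFrame(ID=1:5, b=["a", "a", "b", "b", "c"])
encoded = onehot_sketch(df, :b)
```

The sketch assumes the column has no missing values; a comparison against `missing` would propagate `missing` and the conversion to `Int` would fail.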