This is an update to the documentation pages (i.e., the manual), so you should go through and check that you have done everything required for manual page updates. Also, PRs should be more focused.
src/dataset.jl
Outdated
    using CSV
    using DataFrames
    using Downloads

    export Dataset, csv_dataset
In Julia, we put using/import statements in the main source file. We do the same for export statements.
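A minimal sketch of the layout this comment describes, assuming a hypothetical module name `RidgeRegressionSketch`. A standard-library package stands in for CSV, DataFrames, and Downloads, and the struct is inlined where `include("dataset.jl")` would normally go, so that the example runs without extra files:

```julia
module RidgeRegressionSketch

# All using/import statements live in the main source file.
using LinearAlgebra   # stdlib stand-in; the PR would list CSV, DataFrames, Downloads

# In the real package this would be include("dataset.jl"); the definition is
# inlined here (simplified, hypothetical) so the sketch runs on its own.
struct Dataset
    name::String
end

# Export statements also live in the main file.
export Dataset

end # module
```

With this layout, files such as `dataset.jl` contain only definitions, and the dependency and export lists can be read in one place.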
All dependencies should appear in the Project.toml file. You should activate the package environment and then "add ..." your dependencies to ensure compatibility and a correct environment for the package.
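One way to do this from a script is sketched below, assuming it is run from the repository root so that `"."` is the package directory. The package names are the ones imported in the diff above. The same steps can be done interactively in the Pkg REPL mode with `activate .` followed by `add ...`.

```julia
using Pkg

# Assumption: the current directory is the package root containing Project.toml.
Pkg.activate(".")

# Record the dependencies in Project.toml (and resolve them in Manifest.toml).
Pkg.add(["CSV", "DataFrames", "Downloads"])
```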
src/dataset.jl
Outdated
    # Arguments
    - `path_or_url::String`
    Local file path or web URL that has CSV data.

    - `target_col`
    Column index OR column name containing the response variable.

    - `name::String`
    Dataset name.

    # Returns
    `Dataset`
Need to abide by the style guide as you have done above.
src/dataset.jl
Outdated
    lv = unique(scol)
    ind = scol .== permutedims(lv)

    println("Variable: $name")
We should not have print statements inside of code.
Review unit testing documentation in Julia to see how to do this correctly.
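As a rough illustration of checking behavior with the `Test` standard library instead of printing, the sketch below wraps the two lines from the snippet above in a hypothetical helper, `indicator_matrix` (not a function from the PR), and asserts on its result:

```julia
using Test

# Hypothetical helper mirroring the snippet above: build the indicator
# matrix for a single column.
function indicator_matrix(scol::AbstractVector)
    lv = unique(scol)                # levels, in order of first appearance
    ind = scol .== permutedims(lv)   # n-by-k Boolean matrix via broadcasting
    return lv, ind
end

@testset "indicator_matrix" begin
    lv, ind = indicator_matrix(["a", "b", "a", "c"])
    @test lv == ["a", "b", "c"]
    @test size(ind) == (4, 3)
    @test all(sum(ind; dims = 2) .== 1)   # each row matches exactly one level
    @test ind[:, 1] == [true, false, true, false]
end
```

Assertions like these document the expected values and fail loudly on a regression, which a `println` cannot do.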
src/dataset.jl
Outdated
    # Throws
    - `ArgumentError`: If rows in `X` does not equal length of `y`.

    # Notes
Notes should be admonitions. See Documenter.jl's documentation on admonitions.
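For illustration, a docstring that uses a `!!! note` admonition in place of a `# Notes` section might look like the sketch below. The function `column_means` is hypothetical and exists only to carry the docstring:

```julia
"""
    column_means(X)

Return the mean of each column of `X` as a vector.

!!! note
    This helper is hypothetical and only illustrates the admonition syntax.
    The body of an admonition is indented by four spaces.

# Throws
- `ArgumentError`: If `X` has no rows.
"""
function column_means(X::AbstractMatrix)
    size(X, 1) > 0 || throw(ArgumentError("X must have at least one row"))
    return vec(sum(X; dims = 1)) ./ size(X, 1)
end
```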
    """
    Dataset(name, X, y)

    Contains datasets for ridge regression experiments.

    # Fields
    - `name::String`: Name of dataset
    - `X::Matrix{Float64}`: Matrix of variables/features
    - `y::Vector{Float64}`: Target vector

    # Throws
    - `ArgumentError`: If rows in `X` does not equal length of `y`.
There should be documentation for the struct being created and then there should be documentation for the constructor in the same docstring.
    size(X, 1) == length(y) ||
    throw(ArgumentError("X and y must have same number of rows"))

    new(name, Matrix{Float64}(X), Vector{Float64}(y))
If you are interested in looking at sparse design matrices, this constructor precludes that, because any matrix is converted to Matrix{Float64}, which is dense. You can fix this by using parametric types or Union types for the fields.
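A minimal sketch of the parametric-type option, under the assumption that the fields only need to be real-valued matrices and vectors. This is one possible design, not the PR's implementation: the constructor keeps the row check from the snippet above but stores `X` and `y` without converting them.

```julia
using SparseArrays

struct Dataset{M<:AbstractMatrix{<:Real}, V<:AbstractVector{<:Real}}
    name::String
    X::M
    y::V

    function Dataset(name::AbstractString, X::AbstractMatrix{<:Real},
                     y::AbstractVector{<:Real})
        size(X, 1) == length(y) ||
            throw(ArgumentError("X and y must have same number of rows"))
        # Keep the caller's storage type instead of forcing Matrix{Float64}.
        return new{typeof(X), typeof(y)}(String(name), X, y)
    end
end

dense_ds = Dataset("dense", [1.0 2.0; 3.0 4.0], [1.0, 2.0])
sparse_ds = Dataset("sparse", sparse([1.0 0.0; 0.0 4.0]), [1.0, 2.0])
```

Here `sparse_ds.X` stays a `SparseMatrixCSC`, while `dense_ds.X` stays a `Matrix{Float64}`. The trade-off is that downstream code must accept any `AbstractMatrix` rather than assume dense storage.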
    # Returns
    - `Dataset`: A dataset containing the encoded feature matrix `X`, response vector `y`, and dataset name.
    """
    function csv_dataset(path_or_url::String;

    # Returns
    - `Matrix{Float64}`: A numeric matrix containing the encoded feature.
    """
    function one_hot_encode(Xdf::DataFrame; drop_first::Bool = true)::Matrix{Float64}
Maybe this function should focus on one-hot encoding a specific column provided to the function rather than an entire data frame as we do not always know which columns should be one-hot encoded just from their type. Think of categorical data that is saved in the data set as integers rather than as words.
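A column-level encoder along these lines could be sketched as follows. The name `one_hot_column` is hypothetical; the `drop_first` keyword is borrowed from the signature in the diff above. Because the caller chooses the column, integer-coded categories are handled the same way as strings:

```julia
# Hypothetical column-level encoder: the caller decides which column is categorical.
function one_hot_column(col::AbstractVector; drop_first::Bool = true)
    levels = unique(col)                          # levels in order of first appearance
    keep = drop_first ? levels[2:end] : levels    # optionally drop the reference level
    return Float64.(col .== permutedims(keep))    # n-by-k matrix of 0.0/1.0
end

# Categorical data stored as integer codes rather than words.
codes = [1, 3, 1, 2]
encoded = one_hot_column(codes)
```

In this example the levels are `1`, `3`, `2` in order of appearance, so with `drop_first = true` the two columns of `encoded` indicate levels `3` and `2`.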

    end
    """
    csv_dataset(path_or_url; target_col, name="csv_dataset")
src/RidgeRegression.jl
Outdated
    export Dataset, csv_dataset, one_hot_encode

    include("dataset.jl")
You should include before you export generally, but if it works this is fine too.
Add Dataset struct, CSV loader, one-hot encoding, and corresponding tests.
Description
This pull request introduces dataset.jl, which defines the Dataset struct and supports loading and preprocessing datasets used in ridge regression experiments. The Dataset type holds a feature matrix, a response vector, and the dataset name. The file also implements a function (csv_dataset) that loads datasets from CSV files or URLs and removes rows with missing values. Additionally, categorical variables are converted into numeric form using the one_hot_encode function. Tests were added to verify the dataset constructor, the one-hot encoding process, and the CSV dataset loading functionality.
Review issues addressed: comments that describe code flow and functionality were added, all print statements were removed, the unit tests were updated and expanded to improve code coverage, the dependencies were added to the Project.toml file, and docstrings were revised to follow the standard formatting and documentation style, including the use of admonitions.
Motivation and Context
Many datasets contain categorical variables or missing values, which must be handled before applying ridge regression methods. This change standardizes datasets and preprocessing workflows so that features are numeric, rows with missing data are removed, and categorical variables are converted to one-hot encoded features. This in turn provides consistent experimental units for the ridge regression framework.
Types of changes
Checklists:
Code and Comments
If this PR includes modifications to the code base, please select all that apply.
API Documentation
Manual Documentation
Testing