Skip to content

Dataset.jl and Corresponding Tests#3

Open
etontackett wants to merge 10 commits intomainfrom
feature/datasets
Open

Dataset.jl and Corresponding Tests#3
etontackett wants to merge 10 commits intomainfrom
feature/datasets

Conversation

@etontackett
Copy link
Collaborator

@etontackett etontackett commented Mar 5, 2026

Add Dataset struct, CSV loader, one-hot encoding, and corresponding tests.

Description

This pull request introduces dataset.jl, which defines the Dataset struct and supports loading and preprocessing datasets used in Ridge Regression Experiments. The Dataset type provides a representation consisting of a matrix, response vector, and the dataset name. The file also implements a function (csv_dataset) for loading datasets from CSV files or URLs and removing rows with missing values. Additionally, categorical variables are converted into numeric form using the one_hot_encoding function. Tests were added to verify the correctness of the dataset constructor, the one-hot encoding process, and the CSV dataset loading functionality.

Issues resolved include improving code documentation by adding comments that describe code flow and functionality. Removed any print statements. The unit tests were updated and expanded to better improve code coverage. Any dependencies were added to the .toml file. Doc strings were also revised to follow standard formatting and documentation style, including the use of admonitions.

Motivation and Context

Many datasets contain categorical or missing variables, which must be handled prior to applying ridge regression methods. This change standardizes datasets and preprocessing workflows to ensure that we have numeric features, no missing data, and converting categorical variables to one-hot encoded features. This in turn provides consistent experimental units for the ridge regression framework.

Types of changes

  • CI
  • Docs
  • Feature
  • Fix
  • Performance
  • Refactor
  • Style
  • Test
  • Other (use sparingly):

Checklists:

Code and Comments
If this PR includes modifications to the code base, please select all that apply.

  • My code follows the code style of this project.
  • I have updated all package dependencies (if any).
  • I have included all relevant files to realize the functionality of the PR.
  • I have exported relevant functionality (if any).

API Documentation

  • For every exported function (if any), I have included a detailed docstring.
  • I have checked the spelling and grammar of all docstring updates through an external tool.
  • I have checked that the docstring's function signature is correctly formatted and has all arguments.
  • I have checked that the docstring's list of arguments, fields, or return values match the function.
  • I have compiled the docs locally and read through all docstring updates to check for errors.

Manual Documentation

  • I have checked the spelling and grammar of all manual updates through an external tool.
  • Any code included in the docstring is tested using doc tests to ensure consistency.
  • I have compiled the docs locally and read through all manual updates to check for errors.

Testing

  • I have added unit tests to cover my changes. (For Macros, be sure to check
    @code_lowered and
    @code_typed)
  • All new and existing tests passed.
  • I have achieved sufficient code coverage.

@vp314 vp314 self-requested a review March 10, 2026 16:49
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an update to the pages of the documentation (i.e., the manual). So you should go through and check that you are doing all the things required for manual pages updates. Also, PRs should be more focused.

src/dataset.jl Outdated
Comment on lines +1 to +5
using CSV
using DataFrames
using Downloads

export Dataset, csv_dataset
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Julia, we put using/import statements in the main source file. We do the same for export statements.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All dependencies should appear in the Project.toml file. You should activate the package environment and then "add ..." your dependencies to ensure compatibility and correct environment for the package.

src/dataset.jl Outdated
Comment on lines +95 to +106
# Arguments
- `path_or_url::String`
Local file path or web URL that has CSV data.

- `target_col`
Column index OR column name containing the response variable.

- `name::String`
Dataset name.

# Returns
`Dataset`
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to abide by the style guide as you have done above.

src/dataset.jl Outdated
lv = unique(scol)
ind = scol .== permutedims(lv)

println("Variable: $name")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have print statements inside of code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review unit testing documentation in Julia to see how to do this correctly.

src/dataset.jl Outdated
# Throws
- `ArgumentError`: If rows in `X` does not equal length of `y`.

# Notes
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes should be admonitions. See documenter.jl's documentation on admonitions.

@etontackett etontackett changed the title Feature/datasets Dataset.jl and Corresponding Tests Mar 17, 2026
Comment on lines +1 to +12
"""
Dataset(name, X, y)

Contains datasets for ridge regression experiments.

# Fields
- `name::String`: Name of dataset
- `X::Matrix{Float64}`: Matrix of variables/features
- `y::Vector{Float64}`: Target vector

# Throws
- `ArgumentError`: If rows in `X` does not equal length of `y`.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be documentation for the struct being created and then there should be documentation for the constructor in the same docstring.

size(X, 1) == length(y) ||
throw(ArgumentError("X and y must have same number of rows"))

new(name, Matrix{Float64}(X), Vector{Float64}(y))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are interested in looking at sparse design matrices, this functionality precludes that as any matrix would be converted to Matrix{Float64} type which is dense. You can fix this by considering parametric types or Union types for the fields.

# Returns
- `Dataset`: A dataset containing the encoded feature matrix `X`, response vector `y`, and dataset name.
"""
function csv_dataset(path_or_url::String;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not follow BlueStyle

# Returns
- `Matrix{Float64}`: A numeric matrix containing the encoded feature.
"""
function one_hot_encode(Xdf::DataFrame; drop_first::Bool = true)::Matrix{Float64}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this function should focus on one-hot encoding a specific column provided to the function rather than an entire data frame as we do not always know which columns should be one-hot encoded just from their type. Think of categorical data that is saved in the data set as integers rather than as words.


end
"""
csv_dataset(path_or_url; target_col, name="csv_dataset")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bad function name.

Comment on lines +6 to +8
export Dataset, csv_dataset, one_hot_encode

include("dataset.jl")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should include before you export generally, but if it works this is fine too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants