Dataset.jl and Corresponding Tests by etontackett · Pull Request #3 · vp314/experiments-RidgeRegression

etontackett · 2026-03-05T14:29:45Z

Add Dataset struct, CSV loader, one-hot encoding, and corresponding tests.

Description

This pull request introduces dataset.jl, which defines the Dataset struct and supports loading and preprocessing the datasets used in the ridge regression experiments. The Dataset type consists of a feature matrix, a response vector, and the dataset name. The file also implements csv_dataset, which loads a dataset from a CSV file or URL and removes rows with missing values. Categorical variables are converted to numeric form by the one_hot_encode function. Tests were added to verify the Dataset constructor, the one-hot encoding process, and the CSV loading functionality.

Issues resolved: code documentation was improved by adding comments that describe code flow and functionality. Print statements were removed. The unit tests were updated and expanded to improve code coverage. Dependencies were added to the Project.toml file. Docstrings were revised to follow standard formatting and documentation style, including the use of admonitions.

Motivation and Context

Many datasets contain categorical variables or missing values, which must be handled before applying ridge regression methods. This change standardizes datasets and preprocessing workflows so that features are numeric, rows with missing data are removed, and categorical variables are converted to one-hot encoded features. This in turn provides consistent experimental units for the ridge regression framework.

Types of changes

Checklists:

Code and Comments
If this PR includes modifications to the code base, please select all that apply.

My code follows the code style of this project.
I have updated all package dependencies (if any).
I have included all relevant files to realize the functionality of the PR.
I have exported relevant functionality (if any).

API Documentation

For every exported function (if any), I have included a detailed docstring.
I have checked the spelling and grammar of all docstring updates through an external tool.
I have checked that the docstring's function signature is correctly formatted and has all arguments.
I have checked that the docstring's list of arguments, fields, or return values match the function.
I have compiled the docs locally and read through all docstring updates to check for errors.

Manual Documentation

I have checked the spelling and grammar of all manual updates through an external tool.
Any code included in the docstring is tested using doc tests to ensure consistency.
I have compiled the docs locally and read through all manual updates to check for errors.

Testing

I have added unit tests to cover my changes. (For macros, be sure to check @code_lowered and @code_typed.)
All new and existing tests passed.
I have achieved sufficient code coverage.

vp314 · 2026-03-10T16:52:51Z

docs/src/design.md

This is an update to the documentation pages (i.e., the manual), so you should go through and check that you are doing everything required for manual page updates. Also, PRs should be more focused.

vp314 · 2026-03-10T16:53:45Z

src/dataset.jl

+using CSV
+using DataFrames
+using Downloads
+
+export Dataset, csv_dataset


In Julia, we put using/import statements in the main source file. We do the same for export statements.

vp314 · 2026-03-10T16:55:05Z

src/dataset.jl

All dependencies should appear in the Project.toml file. You should activate the package environment and then "add ..." your dependencies to ensure compatibility and a correct environment for the package.
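As a rough sketch of that workflow, assuming it is run from the repository root and that the three packages in the `using` lines of the diff are the ones to record, the Pkg API can be used like this:

```julia
using Pkg

# Activate the package's own environment (assumes the current directory
# is the repository root, where Project.toml lives).
Pkg.activate(".")

# Record the dependencies in Project.toml. Downloads is a standard library,
# but a package that uses it still has to list it under [deps].
Pkg.add(["CSV", "DataFrames", "Downloads"])
```

The same steps can be done interactively from the Pkg REPL mode with `activate .` followed by `add CSV DataFrames Downloads`.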

vp314 · 2026-03-10T16:55:58Z

src/dataset.jl

+# Arguments
+- `path_or_url::String`
+    Local file path or web URL that has CSV data.
+
+- `target_col`
+    Column index OR column name containing the response variable.
+
+- `name::String`
+    Dataset name.
+
+# Returns
+`Dataset`


Need to abide by the style guide as you have done above.

vp314 · 2026-03-10T16:56:48Z

src/dataset.jl

+        lv = unique(scol)
+        ind = scol .== permutedims(lv)
+
+        println("Variable: $name")


We should not have print statements inside of code.
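One common alternative, sketched here under the assumption that the message is only useful while debugging, is the `@debug` logging macro, which is silent unless debug logging is enabled. The function name `level_indicators` is hypothetical and only wraps the lines quoted above.

```julia
using Logging

# Hypothetical helper wrapping the quoted lines; not part of the PR.
function level_indicators(name::AbstractString, scol::AbstractVector)
    lv = unique(scol)
    ind = scol .== permutedims(lv)

    # Emitted only when a logger with a Debug threshold is active.
    @debug "Encoding variable" name levels = lv

    return ind
end

# Silent by default.
level_indicators("color", ["red", "blue", "red"])

# Opt in to the debug message for one call.
with_logger(ConsoleLogger(stderr, Logging.Debug)) do
    level_indicators("color", ["red", "blue", "red"])
end
```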

vp314 · 2026-03-10T17:00:50Z

test/dataset_tests.jl

Review unit testing documentation in Julia to see how to do this correctly.
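For reference, a minimal sketch of the structure the `Test` standard library expects: a `@testset` grouping `@test` and `@test_throws` checks. The `Dataset` definition below is a stand-in built from the constructor lines quoted elsewhere in this review so that the sketch runs on its own; in the package, the test file would load the real type instead.

```julia
using Test

# Stand-in for the type under review (assumption: mirrors the quoted constructor).
struct Dataset
    name::String
    X::Matrix{Float64}
    y::Vector{Float64}

    function Dataset(name, X, y)
        size(X, 1) == length(y) ||
            throw(ArgumentError("X and y must have same number of rows"))
        return new(name, Matrix{Float64}(X), Vector{Float64}(y))
    end
end

@testset "Dataset constructor" begin
    d = Dataset("toy", [1 2; 3 4], [1, 2])

    # Fields are stored and converted to Float64.
    @test d.name == "toy"
    @test d.X == [1.0 2.0; 3.0 4.0]
    @test d.y isa Vector{Float64}

    # Mismatched dimensions must raise the documented error.
    @test_throws ArgumentError Dataset("bad", [1 2; 3 4], [1, 2, 3])
end
```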

vp314 · 2026-03-10T17:01:10Z

src/dataset.jl

+# Throws
+- `ArgumentError`: If rows in `X` does not equal length of `y`.
+
+# Notes


Notes should be admonitions. See Documenter.jl's documentation on admonitions.
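A small illustration of the admonition syntax inside a docstring, using a hypothetical helper `check_rows` rather than the code under review. The `!!! note` line followed by an indented body replaces a `# Notes` heading.

```julia
"""
    check_rows(X, y)

Return `true` if `X` has as many rows as `y` has entries.

# Throws
- `ArgumentError`: If the number of rows in `X` does not equal the length of `y`.

!!! note
    `X` is not copied or converted; only its dimensions are inspected.
"""
function check_rows(X::AbstractMatrix, y::AbstractVector)
    size(X, 1) == length(y) ||
        throw(ArgumentError("X and y must have the same number of rows"))
    return true
end
```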

vp314 · 2026-03-17T16:55:07Z

src/dataset.jl

+"""
+    Dataset(name, X, y)
+
+Contains datasets for ridge regression experiments.
+
+# Fields
+- `name::String`: Name of dataset
+- `X::Matrix{Float64}`: Matrix of variables/features
+- `y::Vector{Float64}`: Target vector
+
+# Throws
+- `ArgumentError`: If rows in `X` does not equal length of `y`.


There should be documentation for the struct being created and then there should be documentation for the constructor in the same docstring.
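One possible layout, offered as a sketch only: the docstring first describes the type and its fields, then documents the constructor under its own heading. The exact headings should follow the project's style guide, which is not shown in this thread; the module wrapper and the wording of the descriptions are assumptions.

```julia
module DatasetDocSketch  # wrapper so the sketch does not clash with the real type

"""
    Dataset

Container for one dataset used in the ridge regression experiments.

# Fields
- `name::String`: Name of the dataset.
- `X::Matrix{Float64}`: Matrix of variables/features.
- `y::Vector{Float64}`: Target vector.

# Constructor

    Dataset(name, X, y)

Build a `Dataset`, converting `X` and `y` to `Float64` storage.

## Arguments
- `name`: Name of the dataset.
- `X`: Feature matrix with one row per observation.
- `y`: Response vector with one entry per row of `X`.

## Throws
- `ArgumentError`: If the number of rows in `X` does not equal the length of `y`.
"""
struct Dataset
    name::String
    X::Matrix{Float64}
    y::Vector{Float64}

    function Dataset(name, X, y)
        size(X, 1) == length(y) ||
            throw(ArgumentError("X and y must have same number of rows"))
        return new(name, Matrix{Float64}(X), Vector{Float64}(y))
    end
end

end # module
```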

vp314 · 2026-03-17T16:57:58Z

src/dataset.jl

+        size(X, 1) == length(y) ||
+            throw(ArgumentError("X and y must have same number of rows"))
+
+        new(name, Matrix{Float64}(X), Vector{Float64}(y))


If you are interested in looking at sparse design matrices, this functionality precludes that as any matrix would be converted to Matrix{Float64} type which is dense. You can fix this by considering parametric types or Union types for the fields.
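A minimal sketch of the parametric-type option, assuming the element type should stay `Float64` while the container type is left open. The name `ParametricDataset` is hypothetical and chosen only to avoid clashing with the type in the PR.

```julia
using SparseArrays

struct ParametricDataset{M<:AbstractMatrix{Float64},V<:AbstractVector{Float64}}
    name::String
    X::M
    y::V

    function ParametricDataset(
        name::AbstractString, X::AbstractMatrix, y::AbstractVector
    )
        size(X, 1) == length(y) ||
            throw(ArgumentError("X and y must have same number of rows"))

        # Convert the element type only; a sparse input stays sparse and a
        # dense input stays dense. Inputs that are already Float64 are reused.
        Xf = convert(AbstractMatrix{Float64}, X)
        yf = convert(AbstractVector{Float64}, y)

        return new{typeof(Xf),typeof(yf)}(String(name), Xf, yf)
    end
end

# A sparse design matrix keeps its sparse storage.
S = sparse([1, 2], [1, 2], [1, 2])
d = ParametricDataset("sparse_toy", S, [1, 2])
d.X isa SparseMatrixCSC{Float64}
```

Compared with a `Union` of concrete matrix types, type parameters keep the fields concretely typed for each instance, at the cost of a more verbose type signature.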

vp314 · 2026-03-17T17:00:17Z

src/dataset.jl

+# Returns
+- `Dataset`: A dataset containing the encoded feature matrix `X`, response vector `y`, and dataset name.
+"""
+function csv_dataset(path_or_url::String;


This does not follow BlueStyle.
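As I understand BlueStyle, a signature that fits within the 92-character line limit stays on one line, and keyword defaults are written without spaces around `=`. The sketch below takes the signature from the docstring quoted in this review; the type annotations on the keywords and the placeholder body are assumptions, and the real loading logic is omitted.

```julia
function csv_dataset(path_or_url::String; target_col, name::String="csv_dataset")
    # Placeholder body for illustration only: echo the arguments back.
    return (; path_or_url, target_col, name)
end
```

If a signature does not fit on one line, the usual BlueStyle form breaks after the opening parenthesis, indents the arguments by four spaces, and puts the closing parenthesis on its own line.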

vp314 · 2026-03-17T17:01:21Z

src/dataset.jl

+# Returns
+- `Matrix{Float64}`: A numeric matrix containing the encoded feature.
+"""
+function one_hot_encode(Xdf::DataFrame; drop_first::Bool = true)::Matrix{Float64}


Maybe this function should focus on one-hot encoding a specific column provided to the function rather than an entire data frame as we do not always know which columns should be one-hot encoded just from their type. Think of categorical data that is saved in the data set as integers rather than as words.
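A sketch of what a column-level encoder might look like, reusing the `unique`/`permutedims` broadcast from the diff. The name `one_hot_encode_column` is hypothetical, and the sketch assumes the column has no missing values (rows with missing data having been removed earlier).

```julia
# Hypothetical column-level variant; the caller decides which columns to encode.
function one_hot_encode_column(col::AbstractVector; drop_first::Bool=true)
    levels = unique(col)

    # n-by-k indicator matrix: entry (i, j) is 1.0 when col[i] == levels[j].
    ind = Float64.(col .== permutedims(levels))

    # Dropping the first level avoids a redundant (collinear) column.
    return drop_first ? ind[:, 2:end] : ind
end

# Integer-coded categories are encoded because the caller asked for it,
# not because of their element type.
one_hot_encode_column([3, 1, 3, 2]; drop_first=false)
```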

vp314 · 2026-03-17T17:02:27Z

src/dataset.jl

+
+end
+"""
+    csv_dataset(path_or_url; target_col, name="csv_dataset")


This is a bad function name.

vp314 · 2026-03-17T17:03:26Z

src/RidgeRegression.jl

+export Dataset, csv_dataset, one_hot_encode
+
+include("dataset.jl")


You should include before you export generally, but if it works this is fine too.

EtonT471 added 3 commits March 2, 2026 15:32

Add dataset utilities and tests

7dc2968

Adding dataset_tests.jl

47fc954

Small changes to design.md

787a88d

vp314 self-requested a review March 10, 2026 16:49

vp314 requested changes Mar 10, 2026

EtonT471 added 2 commits March 16, 2026 22:28

March 16 Updates

2dd2295

Ridge Regression file

173cdd1

etontackett changed the title ~~Feature/datasets~~ Dataset.jl and Corresponding Tests Mar 17, 2026

EtonT471 added 3 commits March 16, 2026 23:12

dataset.jl small update

042d5d6

Updated Experimental Units and Treatments Sections

70cfa67

Small changes

9b8c48c

vp314 requested changes Mar 17, 2026

EtonT471 added 2 commits March 20, 2026 18:45

Changes

bf9707b

Ridge Regreesion jl changes

214e74d
