--- title: "A Practical Example for Working with Datasets" author: "Zeynel Cebeci" date: "2026-2-14" output: html_document: theme: cosmo highlight: tango toc: true number_sections: true vignette: > %\VignetteIndexEntry{A Practical Example for Working with Datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} oldopts <- options(width = 120) knitr::opts_chunk$set( collapse = TRUE, comment = " ", fig.width=7, fig.height=5 ) ``` # Introduction The Normality Transformation via Optimized Skewness and Kurtosis (OSKT) is a normality method that simultaneously evaluates deviations in skewness and kurtosis of non-normal data. If the package `osktnorm` has already been installed, load it into R working environment by using the following command: ```{r loadlib, eval=TRUE, message=FALSE, warning=FALSE} library(osktnorm) ``` # Normalizing Data from an Excel File ## Load Required Packages for Excel I/O To handle Excel file formats, the readxl package is used for importing data and writexl for exporting results. The following code chunk checks for their presence and installs them if necessary before loading. ```{r load-excel-pkgs, message=FALSE, warning=FALSE} if (!requireNamespace("readxl", quietly = TRUE)) install.packages("readxl") if (!requireNamespace("writexl", quietly = TRUE)) install.packages("writexl") library(readxl) library(writexl) ``` ## Read Excel File The target dataset (sativapheno.xlsx) is imported from the package's internal extdata directory. The code then subsets specific columns of interest and applies a filter to retain only numeric variables, ensuring the data is compatible with the normalization algorithms. ```{r read-excel, message=FALSE, warning=FALSE} file_path <- system.file("extdata", "sativapheno.xlsx", package = "osktnorm") pheno <- read_excel(file_path) # Select some variables from the pheno phenoset <- as.data.frame(pheno[, c(4,5,6,7,9,10,12,13,14)]) phenoset <- phenoset[, sapply(phenoset, is.numeric)] head(phenoset) ``` ## Visualization of Variables This section focuses on evaluating the distributional characteristics of the raw phenotypic data before any transformation. Visual inspection is a critical first step in determining the degree of skewness and kurtosis present in each trait. ### Boxplots A grid of histograms is generated to visualize the distribution of each variable. For each plot, the actual data distribution (black line) is compared against a theoretical normal distribution curve (red dashed line). This comparison highlights which traits deviate significantly from normality. ```{r box-plots, message=FALSE, warning=FALSE} oldpar <- par(no.readonly = TRUE) nvar <- ncol(phenoset) nrow_plot <- ceiling(sqrt(nvar)) ncol_plot <- ceiling(nvar / nrow_plot) par(mfrow = c(nrow_plot, ncol_plot), mar = c(4, 4, 3, 1)) cols <- rainbow(nvar) for(i in seq_len(nvar)) { x <- phenoset[[i]] x <- x[!is.na(x)] if(length(x) == 0) next h_calc <- hist(x, plot = FALSE) d <- density(x) max_y <- max(c(h_calc$density, d$y)) * 1.1 hist(x, breaks = 15, freq = FALSE, col = adjustcolor(cols[i], 0.5), border = "white", main = colnames(phenoset)[i], xlab = "Transformed Value", ylab = "Density", ylim = c(0, max_y)) lines(d, col = "black", lwd = 2) curve(dnorm(x, mean = mean(x), sd = sd(x)), add = TRUE, col = "red", lty = 2, lwd = 2) } par(oldpar) ``` ### Normality Tests for Original Values To provide statistical evidence of non-normality, the Shapiro-Wilk test is applied to each trait. A data frame is created to summarize the W statistic and p-values. Traits with a p-value <0.05 are flagged as "NO" (not normal), justifying the need for OSKT normalization. ```{r densplots, message=FALSE, warning=FALSE} shapiro_results <- data.frame( Trait = colnames(phenoset), W = NA_real_, pval = NA_real_, IsNormal = NA_character_, stringsAsFactors = FALSE ) for(i in seq_len(ncol(phenoset))) { x <- phenoset[[i]] x <- x[!is.na(x)] if(length(x) >= 3) { test <- shapiro.test(x) shapiro_results$W[i] <- test$statistic shapiro_results$pval[i] <- test$p.value shapiro_results$IsNormal[i] <- ifelse(test$p.value > 0.05, "YES", "NO") } else { shapiro_results$W[i] <- NA shapiro_results$pval[i] <- NA shapiro_results$IsNormal[i] <- NA } } shapiro_results ``` ## Normalization with OSKT The core normalization process is executed using the osktnorm function. By setting normtests = FALSE, the function focuses on calculating the optimal transformation parameters and generating the normalized data frame (phenonormal), where each variable is adjusted to achieve a distribution as close to normal as possible. ```{r batch-normalization, message=FALSE, warning=FALSE} # Normalization only norm_res1 <- osktnorm(phenoset, normtests = FALSE) phenonormal <- norm_res1$normalized head(phenonormal) ``` ## Visualizing the Normalized Values Following the transformation, the data is re-visualized to confirm the effectiveness of the OSKT method. This allows for a direct comparison between the initial skewed distributions and the newly normalized results. ### Boxplots Boxplots are utilized to examine the center, spread, and outliers of the normalized traits. This visualization helps verify that the variables now share a similar scale and that the influence of extreme values (outliers) has been effectively stabilized across the entire dataset. ```{r boxplots, message=FALSE, warning=FALSE} oldpar <- par(no.readonly = TRUE) par(mar = c(8,4,4,2)) boxplot( phenonormal, col = rainbow(ncol(phenonormal)), border = "black", outline = TRUE, las = 2, main = "OSKT Normalized Values", ylab = "Values", cex.axis = 0.8 ) par(oldpar) ``` ### Histograms and Density Plots After transformation, a new grid of histograms is generated for the phenonormal dataset. This visualization allows for an empirical evaluation of the OSKT method, confirming that the distributions now exhibit improved symmetry and follow a bell-shaped curve more closely compared to the raw data. ```{r histograms, message=FALSE, warning=FALSE} oldpar <- par(no.readonly = TRUE) nvar <- ncol(phenonormal) nrow_plot <- ceiling(sqrt(nvar)) ncol_plot <- ceiling(nvar / nrow_plot) par(mfrow = c(nrow_plot, ncol_plot), mar = c(4,4,3,1)) cols <- rainbow(nvar) for(i in seq_len(nvar)) { x <- phenonormal[[i]] x <- x[!is.na(x)] if(length(x) == 0) next d <- density(x) h_calc <- hist(x, plot = FALSE) max_y <- max(d$y, h_calc$density) * 1.1 hist(x, freq = FALSE, col = adjustcolor(cols[i], 0.6), border = "white", main = colnames(phenonormal)[i], xlab = "", ylab = "Density", ylim = c(0, max_y)) lines(d, col = "black", lwd = 2) } par(oldpar) ``` ## Optimization Parameters This section extracts the internal parameters estimated during the transformation process. The table provides the optimized values for g (skewness parameter) and h (kurtosis parameter), along with the final A^2 value, which represents the minimized objective function for each trait. ```{r optparams, message=FALSE, warning=FALSE} norm_res2 <- osktnorm(phenoset, normtests = "all") phenonormal <- norm_res2$normalized paramstable <- norm_res2$parameters paramstable ``` ## Normality Test Results A comprehensive suite of statistical diagnostics is performed on the normalized data. By requesting normtests = "all", the function computes multiple metrics including skewness, kurtosis, and various p-values (Shapiro-Wilk, Anderson-Darling, etc.). This table serves as the final validation that the traits have successfully achieved or approached normality. ```{r testresults, message=FALSE, warning=FALSE} testtable <- norm_res2$normtests testtable ``` ## Saving the Normalized Data In the final step, the successfully transformed and normalized dataset is exported to an external file. Using the writexl package, the data is saved as an Excel spreadsheet (sativapheno_normalized.xlsx), making it ready for use in subsequent downstream statistical analyses. ```{r write-excel, eval=FALSE, echo=TRUE, message=FALSE, warning=FALSE} writexl::write_xlsx(phenonormal, path = "sativapheno_normalized.xlsx") ``` ```{r final-checks, message=FALSE, warning=FALSE} options(oldopts) ```