diff_classifier.pca
Performs principal component analysis on input datasets.
This module performs principal component analysis on input datasets using functions from scikit-learn. It is optimized for the data formats used in diff_classifier, but can potentially be extended to other applications.
-
diff_classifier.pca.
build_KNN_model
(rawdata, feature, featvals, equal_sampling=True, tsize=20, n_neighbors=5, from_end=True, input_cols=6)
Builds a K-nearest neighbor model using an input dataset.
Parameters: - rawdata : pandas.core.frame.DataFrame
Raw dataset of n samples and p features.
- feature : string or int
Feature in rawdata containing output values on which KNN model is to be based.
- featvals : list of string or int
All values that feature can take.
- equal_sampling : bool
If True, training dataset will contain an equal number of samples that take each value of featvals. If False, each sample in training dataset will be taken randomly from rawdata.
- tsize : int
Size of training dataset. If equal_sampling is False, training dataset will be exactly this size. If True, training dataset will contain N x tsize samples, where N is the number of unique values in featvals.
- n_neighbors : int
Number of nearest neighbors to be used in KNN algorithm.
- from_end : bool
If True, input_cols is read as an int and the training features are the last input_cols columns of rawdata, e.g. rawdata[:, -6:]. If False, input_cols is read as a tuple giving a column range, e.g. rawdata[:, 10:15].
- input_cols : int or tuple
Defined in from_end above.
Returns: - clf : sklearn.neighbors.classification.KNeighborsClassifier
KNN model
- X : numpy.ndarray
training input dataset used to create clf
- y : numpy.ndarray
training output dataset used to create clf
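The equal-sampling behavior described above can be illustrated with a minimal sketch. This is not the diff_classifier implementation: the helper name knn_sketch, the toy column names, and the fixed random_state are assumptions made for illustration, and only the equal_sampling=True, from_end=True case is covered.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def knn_sketch(rawdata, feature, featvals, tsize=20, n_neighbors=5, input_cols=6):
    """Hypothetical stand-in for the equal_sampling=True, from_end=True case."""
    # Draw tsize rows for every class value, giving N x tsize training rows.
    parts = [
        rawdata[rawdata[feature] == val].sample(n=tsize, random_state=0)
        for val in featvals
    ]
    train = pd.concat(parts)
    # from_end=True: the last input_cols columns are the model inputs.
    X = train.iloc[:, -input_cols:].to_numpy()
    y = train[feature].to_numpy()
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    return clf, X, y


# Toy data: a label column followed by six numeric feature columns.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (50, 6)), rng.normal(5, 1, (50, 6))])
df = pd.DataFrame(feats, columns=[f"f{i}" for i in range(6)])
df.insert(0, "group", ["a"] * 50 + ["b"] * 50)

clf, X, y = knn_sketch(df, "group", ["a", "b"])
print(X.shape, y.shape)
```

With two class values and tsize=20, the training set holds 40 rows, 20 per class.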
-
diff_classifier.pca.
feature_plot_2D
(dataset, label, features=[0, 1], randsel=True, randcount=200, **kwargs)
Plots two features against each other from feature dataset.
Parameters: - dataset : pandas.core.frame.DataFrame
Must contain a group column and numerical feature columns
- label : string or int
Group column name
- features : list of int
Names of columns to be plotted
- randsel : bool
If True, downsamples from original dataset
- randcount : int
Size of downsampled dataset
- **kwargs : variable
- figsize : tuple of int or float
Size of output figure
- dotsize : float or int
Size of plotting markers
- alpha : float or int
Transparency factor
- xlim : list of float or int
X range of output plot
- ylim : list of float or int
Y range of output plot
- legendfontsize : float or int
Font size of legend
- labelfontsize : float or int
Font size of labels
- fname : string
Filename of output figure
Returns: - xy : list of lists
Coordinates of data on plot
-
diff_classifier.pca.
feature_plot_3D
(dataset, label, features=[0, 1, 2], randsel=True, randcount=200, **kwargs)
Plots three features against each other from feature dataset.
Parameters: - dataset : pandas.core.frame.DataFrame
Must contain a group column and numerical feature columns
- label : string or int
Group column name
- features : list of int
Names of columns to be plotted
- randsel : bool
If True, downsamples from original dataset
- randcount : int
Size of downsampled dataset
- **kwargs : variable
- figsize : tuple of int or float
Size of output figure
- dotsize : float or int
Size of plotting markers
- alpha : float or int
Transparency factor
- xlim : list of float or int
X range of output plot
- ylim : list of float or int
Y range of output plot
- zlim : list of float or int
Z range of output plot
- legendfontsize : float or int
Font size of legend
- labelfontsize : float or int
Font size of labels
- fname : string
Filename of output figure
Returns: - xy : list of lists
Coordinates of data on plot
-
diff_classifier.pca.
feature_violin
(df, label='label', lvals=['yes', 'no'], fsubset=3, **kwargs)
Creates a violin plot of the input feature dataset.
Designed to plot PCA components from pca_analysis.
Parameters: - df : pandas.core.frame.DataFrame
Must contain a group name column, and numerical feature columns.
- label : string or int
Name of group column.
- lvals : list of string or int
All values that group column can take
- fsubset : int or list of int
Features to be plotted. If integer, will plot range(fsubset). If list, will only plot features contained in fsubset.
- **kwargs : variable
- figsize : tuple of int or float
Dimensions of output figure
- yrange : list of int or float
Range of y axis
- xlabel : string
Label of x axis
- labelsize : int or float
Font size of x label
- ticksize : int or float
Font size of y tick labels
- fname : None or string
Name of output file
- legendfontsize : int or float
Font size of legend
- legendloc : int
Location of legend in plot e.g. 1, 2, 3, 4
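The two forms accepted by fsubset can be read as in the short sketch below. The helper name resolve_fsubset is hypothetical and is not part of diff_classifier; it only restates the rule given in the parameter description.

```python
def resolve_fsubset(fsubset):
    """Hypothetical helper: turn fsubset into an explicit list of feature indices."""
    if isinstance(fsubset, int):
        # An integer selects the first fsubset features, i.e. range(fsubset).
        return list(range(fsubset))
    # A list selects exactly the features it names.
    return list(fsubset)


print(resolve_fsubset(3))       # [0, 1, 2]
print(resolve_fsubset([0, 4]))  # [0, 4]
```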
-
diff_classifier.pca.
kmo
(dataset)
Calculates the Kaiser-Meyer-Olkin measure on an input dataset
Parameters: - dataset : array-like, shape (n, p)
Array containing n samples and p features. Must have no NaNs. Ideally scaled before performing test.
Returns: - kmostat : float
KMO test value
Notes
Based on calculations shown here:
http://www.statisticshowto.com/kaiser-meyer-olkin/
- 0.00-0.49 unacceptable
- 0.50-0.59 miserable
- 0.60-0.69 mediocre
- 0.70-0.79 middling
- 0.80-0.89 meritorious
- 0.90-1.00 marvelous
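As a rough illustration of the statistic, the sketch below computes KMO as the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations. It is an assumption-laden sketch and not the diff_classifier code: the name kmo_sketch is hypothetical, and the partial correlations are taken from the inverse of the correlation matrix as a shortcut.

```python
import numpy as np


def kmo_sketch(data):
    """Hypothetical KMO calculation for an (n, p) array with no NaNs."""
    corr = np.corrcoef(data, rowvar=False)      # (p, p) correlation matrix
    inv = np.linalg.inv(corr)                   # inverse correlation matrix
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)             # pairwise partial correlations
    off = ~np.eye(corr.shape[0], dtype=bool)    # mask of off-diagonal entries
    r2 = np.sum(corr[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)


# Toy data: four features driven by one shared factor plus noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))
data = shared + 0.5 * rng.normal(size=(500, 4))
kmo_value = kmo_sketch(data)
print(kmo_value)
```

Features that share a strong common factor have large correlations and small partial correlations, which pushes the value toward 1.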
-
diff_classifier.pca.
partial_corr
(mtrx)
Calculates linear partial correlation coefficients
Returns the sample linear partial correlation coefficients between pairs of variables in mtrx, controlling for the remaining variables in mtrx.
Parameters: - mtrx : array-like, shape (n, p)
Array with the different variables. Each column of mtrx is taken as a variable
Returns: - P : array-like, shape (p, p)
P[i, j] contains the partial correlation of mtrx[:, i] and mtrx[:, j] controlling for the remaining variables in mtrx.
Notes
Partial Correlation in Python (clone of Matlab’s partialcorr)
This uses the linear regression approach to compute the partial correlation (might be slow for a huge number of variables). The algorithm is detailed here:
http://en.wikipedia.org/wiki/Partial_correlation#Using_linear_regression
Taking X and Y as the two variables of interest and Z as the matrix of all variables except {X, Y}, the algorithm can be summarized as:
1. Perform a normal linear least-squares regression with X as the target and Z as the predictor.
2. Calculate the residuals from Step #1.
3. Perform a normal linear least-squares regression with Y as the target and Z as the predictor.
4. Calculate the residuals from Step #3.
5. Calculate the correlation coefficient between the residuals from Steps #2 and #4.
The result is the partial correlation between X and Y while controlling for the effect of Z
Adapted from code by Fabian Pedregosa-Izquierdo:
Date: Nov 2014
Author: Fabian Pedregosa-Izquierdo, f@bianp.net
Testing: Valentina Borghesani, valentinaborghesani@gmail.com
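The regression steps in the notes above can be sketched with NumPy alone. The function name partial_corr_sketch is hypothetical, and the intercept column added to Z is an assumption of this sketch (it makes the result independent of column means); the actual diff_classifier code may differ.

```python
import numpy as np


def partial_corr_sketch(mtrx):
    """Hypothetical regression-based partial correlation for an (n, p) array."""
    mtrx = np.asarray(mtrx, dtype=float)
    n, p = mtrx.shape
    P = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            keep = [k for k in range(p) if k not in (i, j)]
            # Z: all remaining variables, plus an intercept column (assumption).
            Z = np.column_stack([np.ones(n), mtrx[:, keep]])
            # Regress X on Z and take the residuals.
            beta_x = np.linalg.lstsq(Z, mtrx[:, i], rcond=None)[0]
            res_x = mtrx[:, i] - Z @ beta_x
            # Regress Y on Z and take the residuals.
            beta_y = np.linalg.lstsq(Z, mtrx[:, j], rcond=None)[0]
            res_y = mtrx[:, j] - Z @ beta_y
            # Correlate the two residual vectors.
            P[i, j] = P[j, i] = np.corrcoef(res_x, res_y)[0, 1]
    return P


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))
data[:, 1] += data[:, 0]          # make two columns correlated
P = partial_corr_sketch(data)
print(P.shape)
```

The double loop runs one pair of regressions per variable pair, which is why the notes warn that this approach can be slow for a large number of variables.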
-
diff_classifier.pca.
pca_analysis
(dataset, dropcols=[], imputenans=True, scale=True, rem_outliers=True, out_thresh=10, n_components=5)
Performs a principal component analysis on an input dataset
Parameters: - dataset : pandas.core.frame.DataFrame, shape (n, p)
Input dataset with n samples and p features
- dropcols : list
Columns to exclude from the PCA. At a minimum, the user must exclude non-numeric columns.
- imputenans : bool
If True, impute NaN values as column means.
- scale : bool
If True, columns will be scaled to a mean of zero and a standard deviation of 1.
- n_components : int
Desired number of components in principal component analysis.
Returns: - pcadataset : diff_classifier.pca.Bunch
Contains outputs of PCA analysis, including:
- scaled : numpy.ndarray, shape (n, p)
Scaled dataset with n samples and p features
- pcavals : pandas.core.frame.DataFrame, shape (n, n_components)
Output array of n_components features for each original sample
- final : pandas.core.frame.DataFrame, shape (n, p+n_components)
Output array with principal components appended to original array.
- prcomps : pandas.core.frame.DataFrame, shape (5, n_components)
Output array displaying the top 5 features contributing to each principal component.
- prvals : dict of list of str
Output dictionary of the PCA scores for the top 5 features contributing to each principal component.
- components : pandas.core.frame.DataFrame, shape (p, n_components)
Raw pca scores.
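The impute, scale, and decompose pipeline described by these parameters might look like the following scikit-learn sketch. It is illustrative only: pca_sketch, the PC0/PC1 column names, and the toy data are assumptions, and the outlier handling and top-5 feature summaries (prcomps, prvals) are left out.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_sketch(dataset, dropcols, n_components=2):
    """Hypothetical outline of the imputenans=True, scale=True path."""
    numeric = dataset.drop(columns=dropcols)
    numeric = numeric.fillna(numeric.mean())            # impute NaNs with column means
    scaled = StandardScaler().fit_transform(numeric)    # zero mean, unit standard deviation
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(scaled)
    names = [f"PC{i}" for i in range(n_components)]     # hypothetical column names
    pcavals = pd.DataFrame(scores, columns=names, index=dataset.index)
    final = pd.concat([dataset, pcavals], axis=1)       # components appended to the input
    components = pd.DataFrame(pca.components_.T, index=numeric.columns, columns=names)
    return scaled, pcavals, final, components


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 4)), columns=["a", "b", "c", "d"])
df.loc[3, "b"] = np.nan              # one missing value to impute
df["group"] = ["x", "y"] * 15        # non-numeric column, must be dropped

scaled, pcavals, final, components = pca_sketch(df, dropcols=["group"])
print(scaled.shape, pcavals.shape, final.shape, components.shape)
```

The non-numeric group column is passed in dropcols, matching the requirement that non-numeric columns be excluded before the decomposition.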
-
diff_classifier.pca.
plot_pca
(datasets, figsize=(8, 8), lwidth=8.0, labels=['Sample1', 'Sample2'], savefig=True, filename='test.png', rticks=array([-2., -1., 0., 1., 2.]))
Plots the average output features from a PCA analysis in polar coordinates
Parameters: - datasets : dict of numpy.ndarray
Dictionary with n samples and p features to plot.
- figsize : tuple
Dimensions of output figure e.g. (8, 8)
- lwidth : float
Width of plotted lines in figure
- labels : list of str
Labels to display in legend.
- savefig : bool
If True, saves figure
- filename : str
Desired output filename
-
diff_classifier.pca.
predict_KNN
(model, X, y)
Calculates fraction correctly predicted using input KNN model
Parameters: - model : sklearn.neighbors.classification.KNeighborsClassifier
KNN model
- X : numpy.ndarray
test input dataset to be classified by model
- y : numpy.ndarray
test output dataset containing the true values for X
Returns: - pcorrect : float
Fraction of correctly predicted outputs using the input KNN model and the input test dataset X and y
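The returned fraction can be understood through the small sketch below. The names fraction_correct and ThresholdModel are hypothetical; the toy model stands in for a fitted classifier and only needs a predict method.

```python
import numpy as np


def fraction_correct(model, X, y):
    """Hypothetical equivalent: share of rows where model.predict(X) matches y."""
    return float(np.mean(model.predict(X) == np.asarray(y)))


class ThresholdModel:
    """Toy stand-in for a fitted classifier."""

    def predict(self, X):
        # Label a row "yes" when its first feature is positive, otherwise "no".
        return np.where(np.asarray(X)[:, 0] > 0, "yes", "no")


X_test = np.array([[1.0], [-2.0], [3.0], [-0.5]])
y_test = np.array(["yes", "no", "no", "no"])
pcorrect = fraction_correct(ThresholdModel(), X_test, y_test)
print(pcorrect)  # 0.75
```

For a scikit-learn classifier, model.score(X, y) reports the same mean accuracy.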