diff_classifier.pca

Performs principal component analysis on input datasets.

This module performs principal component analysis on input datasets using functions from scikit-learn. It is optimized for the data formats used in diff_classifier, but could be extended to other applications.

class diff_classifier.pca.Bunch(**kwds)[source]

Bases: object

diff_classifier.pca.build_KNN_model(rawdata, feature, featvals, equal_sampling=True, tsize=20, n_neighbors=5, from_end=True, input_cols=6)[source]

Builds a K-nearest neighbor model using an input dataset.

Parameters:
rawdata : pandas.core.frame.DataFrame

Raw dataset of n samples and p features.

feature : string or int

Feature in rawdata containing output values on which KNN model is to be based.

featvals : string or int

All values that feature can take.

equal_sampling : bool

If True, the training dataset will contain an equal number of samples for each value in featvals. If False, each sample in the training dataset will be drawn randomly from rawdata.

tsize : int

Size of training dataset. If equal_sampling is False, the training dataset will contain exactly tsize samples. If True, it will contain N x tsize samples, where N is the number of unique values in featvals.

n_neighbors : int

Number of nearest neighbors to be used in KNN algorithm.

from_end : bool

If True, input_cols is an int and the last input_cols columns of rawdata are used as training features, e.g. rawdata[:, -6:]. If False, input_cols is read as a tuple of start and stop column indices, e.g. rawdata[:, 10:15].

input_cols : int or tuple

Defined in from_end above.

Returns:
clf : sklearn.neighbors.classification.KNeighborsClassifier

KNN model

X : numpy.ndarray

training input dataset used to create clf

y : numpy.ndarray

training output dataset used to create clf
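The sampling options above can be illustrated directly with pandas and scikit-learn. This is only a minimal sketch of the described behaviour, not the diff_classifier implementation: the helper name `sample_training_set`, the `seed` argument, and the toy column names are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def sample_training_set(rawdata, feature, featvals, equal_sampling=True,
                        tsize=20, input_cols=6, seed=0):
    """Hypothetical helper: draw a training set as the parameters describe."""
    if equal_sampling:
        # tsize rows for each value in featvals -> N x tsize rows in total
        parts = [rawdata[rawdata[feature] == val].sample(n=tsize, random_state=seed)
                 for val in featvals]
        train = pd.concat(parts)
    else:
        # exactly tsize rows drawn at random from the whole dataset
        train = rawdata.sample(n=tsize, random_state=seed)
    # from_end=True style selection: the last input_cols columns are the inputs
    X = train.iloc[:, -input_cols:].to_numpy()
    y = train[feature].to_numpy()
    return X, y


# Toy data (assumed layout): one label column followed by three numeric columns
rng = np.random.default_rng(0)
rawdata = pd.DataFrame({
    "group": ["a"] * 50 + ["b"] * 50,
    "f0": np.r_[rng.normal(0, 1, 50), rng.normal(5, 1, 50)],
    "f1": np.r_[rng.normal(0, 1, 50), rng.normal(5, 1, 50)],
    "f2": rng.normal(0, 1, 100),
})

X, y = sample_training_set(rawdata, "group", ["a", "b"], tsize=20, input_cols=3)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
```

With two values in featvals and tsize=20, X here has 40 rows, 20 per group.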

diff_classifier.pca.feature_plot_2D(dataset, label, features=[0, 1], randsel=True, randcount=200, **kwargs)[source]

Plots two features against each other from feature dataset.

Parameters:
dataset : pandas.core.frame.DataFrame

Must contain a group column and numerical feature columns

label : string or int

Group column name

features : list of int

Names of columns to be plotted

randsel : bool

If True, downsamples from original dataset

randcount : int

Size of downsampled dataset

**kwargs : variable
figsize : tuple of int or float

Size of output figure

dotsize : float or int

Size of plotting markers

alpha : float or int

Transparency factor

xlim : list of float or int

X range of output plot

ylim : list of float or int

Y range of output plot

legendfontsize : float or int

Font size of legend

labelfontsize : float or int

Font size of labels

fname : string

Filename of output figure

Returns:
xy : list of lists

Coordinates of data on plot
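As a rough illustration of the randsel and randcount options, the sketch below downsamples a dataset and collects the coordinates of two features per group. The helper name `downsample_xy`, the `seed` argument, and the exact layout of the returned list are assumptions made for this example; the plotting step itself is omitted.

```python
import numpy as np
import pandas as pd


def downsample_xy(dataset, label, features=(0, 1), randsel=True,
                  randcount=200, seed=0):
    """Hypothetical helper: optionally downsample, then gather x/y per group."""
    if randsel:
        # Never ask for more rows than the dataset holds
        n = min(randcount, len(dataset))
        dataset = dataset.sample(n=n, random_state=seed)
    fx, fy = features
    # One [x_values, y_values] pair per group (assumed layout of xy)
    return [[grp[fx].tolist(), grp[fy].tolist()]
            for _, grp in dataset.groupby(label)]


# Toy data: two numeric columns named 0 and 1, plus a group column
rng = np.random.default_rng(1)
dataset = pd.DataFrame({
    0: rng.normal(size=300),
    1: rng.normal(size=300),
    "group": ["a", "b", "c"] * 100,
})

xy = downsample_xy(dataset, "group", features=[0, 1], randcount=50)
```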

diff_classifier.pca.feature_plot_3D(dataset, label, features=[0, 1, 2], randsel=True, randcount=200, **kwargs)[source]

Plots three features against each other from feature dataset.

Parameters:
dataset : pandas.core.frame.DataFrame

Must contain a group column and numerical feature columns

label : string or int

Group column name

features : list of int

Names of columns to be plotted

randsel : bool

If True, downsamples from original dataset

randcount : int

Size of downsampled dataset

**kwargs : variable
figsize : tuple of int or float

Size of output figure

dotsize : float or int

Size of plotting markers

alpha : float or int

Transparency factor

xlim : list of float or int

X range of output plot

ylim : list of float or int

Y range of output plot

zlim : list of float or int

Z range of output plot

legendfontsize : float or int

Font size of legend

labelfontsize : float or int

Font size of labels

fname : string

Filename of output figure

Returns:
xy : list of lists

Coordinates of data on plot

diff_classifier.pca.feature_violin(df, label='label', lvals=['yes', 'no'], fsubset=3, **kwargs)[source]

Creates a violin plot of the input feature dataset.

Designed to plot PCA components from pca_analysis.

Parameters:
df : pandas.core.frame.DataFrame

Must contain a group name column, and numerical feature columns.

label : string or int

Name of group column.

lvals : list of string or int

All values that group column can take

fsubset : int or list of int

Features to be plotted. If integer, will plot range(fsubset). If list, will only plot features contained in fsubset.

**kwargs : variable
figsize : tuple of int or float

Dimensions of output figure

yrange : list of int or float

Range of y axis

xlabel : string

Label of x axis

labelsize : int or float

Font size of x label

ticksize : int or float

Font size of y tick labels

fname : None or string

Name of output file

legendfontsize : int or float

Font size of legend

legendloc : int

Location of legend in plot e.g. 1, 2, 3, 4
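The two accepted forms of fsubset can be illustrated with a small helper. The name `resolve_fsubset` is hypothetical and only mirrors the behaviour described for the fsubset parameter; it is not taken from diff_classifier.

```python
def resolve_fsubset(fsubset):
    """Hypothetical helper: expand fsubset into an explicit list of features."""
    if isinstance(fsubset, int):
        # An integer selects the first fsubset features: range(fsubset)
        return list(range(fsubset))
    # A list is used as given
    return list(fsubset)


print(resolve_fsubset(3))       # [0, 1, 2]
print(resolve_fsubset([0, 4]))  # [0, 4]
```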

diff_classifier.pca.kmo(dataset)[source]

Calculates the Kaiser-Meyer-Olkin measure on an input dataset

Parameters:
dataset : array-like, shape (n, p)

Array containing n samples and p features. Must have no NaNs. Ideally scaled before performing test.

Returns:
kmostat : float

KMO test value

Notes

Based on calculations shown here:

http://www.statisticshowto.com/kaiser-meyer-olkin/

Interpretation of KMO values:

  - 0.00-0.49: unacceptable
  - 0.50-0.59: miserable
  - 0.60-0.69: mediocre
  - 0.70-0.79: middling
  - 0.80-0.89: meritorious
  - 0.90-1.00: marvelous

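The overall KMO statistic is the sum of squared off-diagonal correlations divided by the sum of squared off-diagonal correlations plus squared off-diagonal partial correlations. The sketch below is a minimal NumPy illustration of that ratio. It assumes the partial correlations are taken from the inverse of the correlation matrix, which may differ from how diff_classifier computes them; the name `kmo_sketch` is hypothetical.

```python
import numpy as np


def kmo_sketch(dataset):
    """Hypothetical helper: overall KMO measure for an (n, p) array without NaNs."""
    data = np.asarray(dataset, dtype=float)
    corr = np.corrcoef(data, rowvar=False)          # (p, p) correlation matrix
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)                 # partial correlations
    off = ~np.eye(corr.shape[0], dtype=bool)        # mask for off-diagonal entries
    r2 = np.sum(corr[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)


# Toy data: four noisy copies of one shared factor, so KMO should be high
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
data = factor + 0.5 * rng.normal(size=(500, 4))
kmostat = kmo_sketch(data)
```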
diff_classifier.pca.partial_corr(mtrx)[source]

Calculates linear partial correlation coefficients

Returns the sample linear partial correlation coefficients between pairs of variables in mtrx, controlling for the remaining variables in mtrx.

Parameters:
mtrx : array-like, shape (n, p)

Array with the different variables. Each column of mtrx is taken as a variable

Returns:
P : array-like, shape (p, p)

P[i, j] contains the partial correlation of mtrx[:, i] and mtrx[:, j] controlling for the remaining variables in mtrx.

Notes

Partial Correlation in Python (clone of Matlab’s partialcorr)

This uses the linear regression approach to compute the partial correlation (might be slow for a huge number of variables). The algorithm is detailed here:

http://en.wikipedia.org/wiki/Partial_correlation#Using_linear_regression

Taking X and Y as the two variables of interest and Z as the matrix of all remaining variables (all variables minus {X, Y}), the algorithm can be summarized as:

  1. perform a normal linear least-squares regression with X as the target and Z as the predictor
  2. calculate the residuals in Step #1
  3. perform a normal linear least-squares regression with Y as the target and Z as the predictor
  4. calculate the residuals in Step #3
  5. calculate the correlation coefficient between the residuals from Steps #2 and #4

The result is the partial correlation between X and Y while controlling for the effect of Z
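The five steps above can be sketched with NumPy's least-squares solver. This is an illustrative version rather than the module's own code: the name `partial_corr_sketch` is hypothetical, and adding an intercept column to Z is an assumption of this sketch.

```python
import numpy as np


def partial_corr_sketch(mtrx):
    """Hypothetical helper: partial correlations via linear regression."""
    mtrx = np.asarray(mtrx, dtype=float)
    n, p = mtrx.shape
    P = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            rest = [k for k in range(p) if k not in (i, j)]
            # Z: all remaining variables plus an intercept (assumption of this sketch)
            Z = np.column_stack([np.ones(n), mtrx[:, rest]])
            # Steps 1-2: regress X on Z and keep the residuals
            beta_x = np.linalg.lstsq(Z, mtrx[:, i], rcond=None)[0]
            res_x = mtrx[:, i] - Z @ beta_x
            # Steps 3-4: regress Y on Z and keep the residuals
            beta_y = np.linalg.lstsq(Z, mtrx[:, j], rcond=None)[0]
            res_y = mtrx[:, j] - Z @ beta_y
            # Step 5: correlate the two sets of residuals
            P[i, j] = P[j, i] = np.corrcoef(res_x, res_y)[0, 1]
    return P


# Toy data: four correlated variables
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 4))
mtrx = base @ np.array([[1.0, 0.5, 0.2, 0.0],
                        [0.0, 1.0, 0.5, 0.2],
                        [0.0, 0.0, 1.0, 0.5],
                        [0.0, 0.0, 0.0, 1.0]])
P = partial_corr_sketch(mtrx)
```

Because each pair requires two regressions, the cost grows quickly with the number of variables, which is the slowness the note above warns about.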

Adapted from code by Fabian Pedregosa-Izquierdo:

Date: Nov 2014
Author: Fabian Pedregosa-Izquierdo, f@bianp.net
Testing: Valentina Borghesani, valentinaborghesani@gmail.com

diff_classifier.pca.pca_analysis(dataset, dropcols=[], imputenans=True, scale=True, rem_outliers=True, out_thresh=10, n_components=5)[source]

Performs a principal component analysis on an input dataset

Parameters:
dataset : pandas.core.frame.DataFrame, shape (n, p)

Input dataset with n samples and p features

dropcols : list

Columns to exclude from pca analysis. At a minimum, user must exclude non-numeric columns.

imputenans : bool

If True, impute NaN values as column means.

scale : bool

If True, columns will be scaled to a mean of zero and a standard deviation of 1.

n_components : int

Desired number of components in principal component analysis.

Returns:
pcadataset : diff_classifier.pca.Bunch

Contains outputs of PCA analysis, including:

scaled : numpy.ndarray, shape (n, p)

Scaled dataset with n samples and p features

pcavals : pandas.core.frame.DataFrame, shape (n, n_components)

Output array of n_component features of each original sample

final : pandas.core.frame.DataFrame, shape (n, p+n_components)

Output array with principal components appended to the original array.

prcomps : pandas.core.frame.DataFrame, shape (5, n_components)

Output array displaying the top 5 features contributing to each principal component.

prvals : dict of list of str

Output dictionary of the pca scores for the top 5 features contributing to each principal component.

components : pandas.core.frame.DataFrame, shape (p, n_components)

Raw pca scores.
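The imputation, scaling, and decomposition steps described above can be approximated with standard scikit-learn classes. This is a minimal sketch under stated assumptions, not the body of pca_analysis: the name `pca_sketch` and the toy column names are hypothetical, and outlier removal is left out.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def pca_sketch(dataset, dropcols=(), n_components=5):
    """Hypothetical helper: impute, scale, and run PCA on numeric columns."""
    numeric = dataset.drop(columns=list(dropcols))
    # Replace NaNs with column means (imputenans=True behaviour)
    filled = SimpleImputer(strategy="mean").fit_transform(numeric)
    # Scale each column to zero mean and unit standard deviation (scale=True)
    scaled = StandardScaler().fit_transform(filled)
    model = PCA(n_components=n_components).fit(scaled)
    pcavals = pd.DataFrame(model.transform(scaled), index=dataset.index)
    # Principal components appended to the original columns
    final = pd.concat([dataset, pcavals], axis=1)
    # Loadings: one row per input feature, one column per component
    components = pd.DataFrame(model.components_.T, index=numeric.columns)
    return scaled, pcavals, final, components


# Toy data: six numeric features, one non-numeric column, one missing value
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.normal(size=(40, 6)),
                       columns=[f"feat{i}" for i in range(6)])
dataset["group"] = ["a", "b"] * 20
dataset.loc[3, "feat2"] = np.nan

scaled, pcavals, final, components = pca_sketch(
    dataset, dropcols=["group"], n_components=3)

# Five features with the largest absolute loading on the first component
top5 = components[0].abs().nlargest(5).index.tolist()
```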

diff_classifier.pca.plot_pca(datasets, figsize=(8, 8), lwidth=8.0, labels=['Sample1', 'Sample2'], savefig=True, filename='test.png', rticks=array([-2., -1., 0., 1., 2.]))[source]

Plots the average output features from a PCA analysis in polar coordinates

Parameters:
datasets : dict of numpy.ndarray

Dictionary with n samples and p features to plot.

figsize : tuple

Dimensions of output figure e.g. (8, 8)

lwidth : float

Width of plotted lines in figure

labels : list of str

Labels to display in legend.

savefig : bool

If True, saves figure

filename : str

Desired output filename

diff_classifier.pca.predict_KNN(model, X, y)[source]

Calculates fraction correctly predicted using input KNN model

Parameters:
model : sklearn.neighbors.classification.KNeighborsClassifier

KNN model

X : numpy.ndarray

Test input dataset on which predictions are made

y : numpy.ndarray

Test output dataset containing the expected values for X

Returns:
pcorrect : float

Fraction of correctly predicted outputs using the input KNN model and the input test dataset X and y
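The fraction of correct predictions can be computed by comparing the model's predictions with the expected outputs. The sketch below uses a tiny made-up dataset; the helper name `fraction_correct` is hypothetical and is not the module's implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def fraction_correct(model, X, y):
    """Hypothetical helper: share of samples in X predicted as labelled in y."""
    return float(np.mean(model.predict(X) == np.asarray(y)))


# Two well-separated clusters in one dimension
X_train = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

X_test = np.array([[0.05], [5.05], [0.15], [4.9]])
y_test = np.array([0, 1, 0, 0])  # last label deliberately mismatched
pcorrect = fraction_correct(model, X_test, y_test)
print(pcorrect)  # 0.75
```

For scikit-learn classifiers, `model.score(X_test, y_test)` returns the same mean accuracy.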

diff_classifier.pca.recycle_pcamodel(pcamodel, df, imputenans=True, scale=True)[source]