Some Data Science/Machine Learning References

Additional Current Resources:
Machine Learning Resources: https://www.sethi.org/classes/pub/resources/ml-resources.html
Project Topics/Datasets: https://www.sethi.org/classes/pub/resources/project_topics_datasets.html

Videos and Courses

Data Science Videos and Courses

Data Science

Data Science Essentials

Data Science Projects/Exercises

Data Science vs. Data Analysis

Miscellaneous Data Science Links

Data Science Resources

Data Science Competitions and Job Sites

Data Science Interview Questions/Hints

EDA and Visualization

Visualization

Exploratory Data Analysis (EDA)

Feature Engineering

General Feature Engineering

Categorical Variables

Pre-processing: Part of pre-processing might involve converting non-numeric, categorical features into numeric variables: you can use methods like label encoding and one-hot encoding, while related pre-processing steps such as mean/median/mode imputation or k-nearest-neighbors imputation handle missing values. Here are some links to some helpful tutorials for converting categorical features to numeric encodings:
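
As a quick illustration (not tied to any specific tutorial linked here), here is a minimal sketch of the two most common encodings using pandas and scikit-learn; the column names and values are made up for the example:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Hypothetical data with two categorical columns, invented for illustration.
    df = pd.DataFrame({
        "color": ["red", "green", "blue", "green"],
        "size":  ["S", "M", "L", "M"],
    })

    # Label encoding: each category is mapped to an integer
    # (note: this implies an arbitrary ordering of the categories).
    le = LabelEncoder()
    df["size_label"] = le.fit_transform(df["size"])

    # One-hot encoding: each category becomes its own 0/1 indicator column.
    onehot = pd.get_dummies(df["color"], prefix="color")
    df = pd.concat([df, onehot], axis=1)

    print(df)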

String Value Variables

Converting non-categorical, general string variables to numeric features can be a bit more challenging than converting categorical variables, as there is no single "best" way to do it. The best approach depends on the specific data you are working with and the task at hand, but here are some general strategies, followed by links to relevant tutorials:

Tokenization: breaking the string into individual words or tokens, which can then be converted into numeric vectors using techniques like word embeddings.

N-grams: creating features based on n-grams, which are sequences of n consecutive words or tokens; for example, you could create features based on bigrams (sequences of two words) or trigrams (sequences of three words).

Entity extraction: identifying and extracting named entities from the text, such as people, places, and organizations, which can then be converted into numeric vectors using techniques like entity embeddings.

Bag-of-words: creating a feature for each unique word in the text, where the value of each feature is the number of times the word appears in the document.

TF-IDF: a variation of bag-of-words that weights each feature by its term frequency (TF), the number of times the word appears in the document, and its inverse document frequency (IDF), a measure of how rare the word is across all documents.

Other, more abstract representations, such as distributed text representations and embeddings like Word2Vec and GloVe.
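
As a rough illustration of the bag-of-words, n-gram, and TF-IDF strategies above, here is a minimal scikit-learn sketch; the tiny corpus is made up for the example:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # A made-up corpus of three short "documents".
    corpus = [
        "data science is fun",
        "machine learning is part of data science",
        "deep learning extends machine learning",
    ]

    # Bag-of-words with unigrams and bigrams: one column per token/bigram,
    # values are raw counts.
    bow = CountVectorizer(ngram_range=(1, 2))
    X_counts = bow.fit_transform(corpus)

    # TF-IDF: the same columns, but the counts are re-weighted by how rare
    # each term is across the corpus (inverse document frequency).
    tfidf = TfidfVectorizer(ngram_range=(1, 2))
    X_tfidf = tfidf.fit_transform(corpus)

    print(X_counts.shape, X_tfidf.shape)
    print(bow.get_feature_names_out()[:10])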

Machine Learning

Choosing Machine Learning Algorithms

Machine Learning Tutorials/Guides

kNN

Neural Network Links

Regression

Ensemble Methods

Probability and Statistics

Probability

Distributions

Statistics Courses

Programming Languages for Data Science

Python Links

R Links

General Reference Sites

Cybersecurity and Data Security

Artificial Intelligence Links

Cybersecurity Links

Miscellaneous ML Notes

Covariance and correlation are closely related concepts; the correlation between two variables is their covariance divided by the product of their standard deviations, as explained at http://mccormickml.com/2014/07/22/mahalanobis-distance/
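
A quick numeric sanity check of that relationship using NumPy (the data are made up):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    cov_xy = np.cov(x, y)[0, 1]                      # sample covariance
    corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
    corr_numpy = np.corrcoef(x, y)[0, 1]

    print(corr_manual, corr_numpy)                   # the two values should match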

We can use the Mahalanobis distance to find outliers in multivariate data; it measures the separation between two groups of objects. There is a nice intuitive explanation here: https://www.theinformationlab.co.uk/2017/05/26/mahalanobis-distance/ The covariance matrix provides the covariances among the variables (covariance is used in order to capture the effect of two or more variables varying together).

It is primarily used in classification and clustering problems where there is a need to account for the correlation within different groups/clusters of data. Euclidean distance only makes sense when all the dimensions have the same units (like meters), since it involves summing their squared values.

When you are dealing with probabilities, the features often have different units. For example, we might have a model for men and a model for women, where both models are based on weight [kg] and height [m], and we know the mean and covariance for each model. Now, if we get a new measurement vector (an ordered pair of weight and height), we have to decide whether it came from a man or a woman. We can compute the Mahalanobis distance to the models for both men and women and decide which is closer, meaning which is more probable. The Mahalanobis distance effectively transforms the random vector into a zero-mean vector with an identity covariance matrix; in that space, the Euclidean distance can be safely applied.
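
Here is a minimal sketch of that weight/height example using SciPy; the group means, covariance matrices, and the new measurement are all invented purely for illustration:

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    # Hypothetical per-group statistics (weight in kg, height in m).
    mean_men,   cov_men   = np.array([80.0, 1.78]), np.array([[90.0, 0.6], [0.6, 0.006]])
    mean_women, cov_women = np.array([65.0, 1.65]), np.array([[70.0, 0.5], [0.5, 0.005]])

    x = np.array([72.0, 1.70])   # new measurement: 72 kg, 1.70 m

    # The Mahalanobis distance needs the inverse covariance matrix of each group.
    d_men   = mahalanobis(x, mean_men,   np.linalg.inv(cov_men))
    d_women = mahalanobis(x, mean_women, np.linalg.inv(cov_women))

    print("closer to:", "men" if d_men < d_women else "women")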

Linear Discriminant Analysis (LDA) is used to classify multiple classes and, like Principal Component Analysis (PCA), also performs dimensionality reduction. For two classes, you can just use Logistic Regression. For each input variable, you need to calculate the mean and variance of that variable for each class. "Predictions are made by calculating a discriminate value for each class and making a prediction for the class with the largest value." https://www.kdnuggets.com/2018/02/tour-top-10-algorithms-machine-learning-newbies.html

This follows from the Curse of Dimensionality: as we add higher and higher dimensions to our feature vector, we need more computational power and more data to train the model effectively. If you add more features, you need more data, as seen here: https://towardsdatascience.com/curse-of-dimensionality-2092410f3d27

Thus, the goal of LDA is to reduce the dimension of the feature vectors without loss of information while maximizing class separability; discrimination here means coming up with a rule that accurately assigns a new measurement/vector to one of several classes. http://www.cs.uml.edu/~ycao/teaching/fall_2013/downloads/05_MC_2_LDA.pdf

The rule is a discriminant function, a linear combination of the X variables that provides the best separation between the categories of the Y variable. It checks whether there are significant inter-group differences in terms of the X variables, and it identifies the X variables that contribute most to the inter-group separation. https://www.datascience.com/blog/predicting-customer-churn-with-a-discriminant-analysis
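
As a concrete illustration (a sketch using scikit-learn's LinearDiscriminantAnalysis on the standard Iris data, which is not part of the linked articles), LDA both classifies the three classes and projects the four features down to at most n_classes - 1 discriminant directions:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # LDA as a classifier and as a supervised dimensionality reducer
    # (n_components can be at most n_classes - 1 = 2 here).
    lda = LinearDiscriminantAnalysis(n_components=2)
    lda.fit(X_train, y_train)

    print("test accuracy:", lda.score(X_test, y_test))
    print("reduced shape:", lda.transform(X_train).shape)   # (n_samples, 2)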


Ricky J. Sethi, PhD <rickys@sethi.org>
Last updated: Wednesday, November 29 2023
(sethi.org/tutorials/references_data_science.shtml)