Additional Current Resources:
Machine Learning Resources: https://www.sethi.org/classes/pub/resources/ml-resources.html
Project Topics/Datasets: https://www.sethi.org/classes/pub/resources/project_topics_datasets.html
Computer Science emphasizes theoretical and empirical approaches to manipulating data via computation or algorithmic processes.
Informatics deals with a broader study of data and its manipulation in information processes and systems, including social and cognitive models.
Data Science tackles structured and unstructured data and uses both computational methods and cognitive or social methods, especially when visualizing complicated data analytics and business analytics.
"In general, while Data Analysts tend to be more business focused, Data Scientists are often more mathematically focused."
For Data Analysts, a greater emphasis might be on the ability to examine a database's records and the overall behavior of its objects.
Also, Data Analysts should set clear measurement priorities; decide what to measure and how to measure it.
Finally, Data Analysts organize the data efficiently and customize reports and dashboards based on business rules and requirements to make informed business decisions.
"Business Intelligence (BI) is about reporting what happened and Analytics is about answering why. BI includes collecting, transforming and loading (ETL), performing analysis (running queries etc) and presenting (visualize) the results. Power BI, Pentaho are the well known BI tools." You could think of BI as answering what happened, Data Analytics as answering why, and Data Science as predicting what will happen.
Business Intelligence provides retrospective reports to help businesses monitor the current state and historical business performance. Data Science uses past data to make future predictions.
Trackers can embed image (tracking) pixels in pages, share data on the backend, work through industry groups like the Digital Advertising Alliance (DAA), etc. Use browser extensions like Ghostery to block these.
Pre-processing: Part of pre-processing might involve converting non-numeric, categorical features into numeric variables, using methods like label encoding or one-hot encoding, as well as handling missing values with methods like mean/median/mode imputation, k-nearest-neighbors imputation, etc. Here are some links to some helpful tutorials for converting categorical features to numeric encodings:
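As a rough sketch of the two encoding methods mentioned above, here is a minimal pure-Python version of label encoding and one-hot encoding (the "color" feature and its values are invented for illustration; in practice you would use a library like pandas or scikit-learn):

```python
def label_encode(values):
    """Map each distinct category to an integer (sorted for determinism)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Turn each category into a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values], categories

colors = ["red", "green", "blue", "green"]
labels, mapping = label_encode(colors)     # e.g. blue=0, green=1, red=2
one_hot, cats = one_hot_encode(colors)
print(labels)
print(one_hot)
```

Note that label encoding imposes an arbitrary ordering on the categories, which can mislead distance-based models; one-hot encoding avoids that at the cost of more columns.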
Converting non-categorical, general string variables to numeric features can be a bit more challenging than converting categorical variables, as there is no single "best" way to do it. The best approach will depend on the specific data you are working with and the task at hand, but here are some general strategies, followed by links to tutorials:
- Tokenization: breaking the string into individual words or tokens; you can then use techniques like word embeddings to convert the tokens into numeric vectors.
- N-grams: creating features based on n-grams, sequences of n consecutive words or tokens. For example, you could create features based on bigrams (sequences of two words) or trigrams (sequences of three words).
- Entity extraction: identifying and extracting named entities from the text, such as people, places, and organizations; you can then use techniques like entity embeddings to convert the entities into numeric vectors.
- Bag-of-words: creating a feature for each unique word in the text; the value of each feature is the number of times the word appears in the document.
- TF-IDF: a variation of bag-of-words that weights the features by their term frequency (TF) and inverse document frequency (IDF). TF is the number of times the word appears in the document, and IDF is a measure of how rare the word is across all documents.
- Other, more abstract distributed text representations and embeddings like Word2Vec and GloVe, etc.:
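Bag-of-words and TF-IDF can be sketched in a few lines of plain Python; the toy corpus below is invented, and real libraries (e.g. scikit-learn's TfidfVectorizer) differ in smoothing and normalization details:

```python
import math

def bag_of_words(docs):
    """Count occurrences of each vocabulary word in each document."""
    vocab = sorted({w for d in docs for w in d.split()})
    return [[d.split().count(w) for w in vocab] for d in docs], vocab

def tf_idf(docs):
    """Weight raw counts by inverse document frequency: tf * log(N / df)."""
    counts, vocab = bag_of_words(docs)
    n = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / d) for d in df]
    return [[tf * w for tf, w in zip(row, idf)] for row in counts], vocab

docs = ["the cat sat", "the dog sat", "the cat ran"]
mat, vocab = tf_idf(docs)
print(vocab)
print(mat)
```

A useful sanity check: "the" appears in every document, so its IDF is log(1) = 0 and its TF-IDF weight is zero everywhere, which is exactly the down-weighting of common words that TF-IDF is meant to provide.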
Most methods in machine learning are based on finding parameters that minimize some objective/loss/cost function. ML is mainly about generalization.
A generative model only applies to probabilistic methods. A generative model learns the joint probability distribution p(x,y) and a discriminative model learns the conditional probability distribution p(y|x)
Gradient descent is a local optimizer. Stochastic gradient descent is also only guaranteed to find a local optimum, though the noise in its updates can help it escape shallow local minima. Approaches like simulated annealing can approximate the global minimum for many types of functions.
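A minimal simulated-annealing sketch on a 1-D function with several local minima; the cooling schedule, proposal width, and the test function are arbitrary illustrative choices, not tuned values:

```python
import math
import random

def f(x):
    # A multimodal function: global minimum near x ~ -1.3, plus local minima.
    return x ** 2 + 10 * math.sin(x)

def simulated_annealing(f, x0, temp=10.0, cooling=0.99, steps=5000, seed=0):
    rng = random.Random(seed)
    x, best = x0, x0
    for _ in range(steps):
        candidate = x + rng.gauss(0, 1)
        delta = f(candidate) - f(x)
        # Always accept improvements; accept worse moves with probability
        # exp(-delta / temp), which lets the search escape local minima
        # early on, while the shrinking temperature makes it greedy later.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
        if f(x) < f(best):
            best = x
        temp *= cooling
    return best

best = simulated_annealing(f, x0=4.0)   # start inside a local basin
print(best, f(best))
```

Started at x = 4.0 (near a local minimum with f > 8), the accept-worse-moves rule lets the search cross the barrier into the global basin, which plain gradient descent from the same start would not do.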
ML methods can be categorized as being generative/discriminative or parametric/nonparametric or supervised/unsupervised, etc. Non-parametric just means the number of parameters depends on (grows with) the data. Parametric methods are subject to optimization, but nonparametric methods might not be in the same way.
... the input neurons are basically the pixel intensities of an input image, and on the right is one hidden neuron out of the many neurons in the first hidden layer. Each neuron will be connected to only a region of the input layer; that region in the input image is called the local receptive field for the hidden neuron. It's a little window onto the input pixels. The terms receptive field, kernel, and filter are often used interchangeably. More details and source pages
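The sliding-window behavior described above can be sketched as a plain-Python 2-D convolution (technically cross-correlation, as in most deep-learning libraries); each output value depends only on the pixels inside the kernel-sized receptive field. Image and kernel values are invented toy numbers:

```python
def conv2d(image, kernel):
    """'Valid' convolution: slide the kernel over the image, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # This sum covers exactly the kernel-sized receptive field
            # anchored at (i, j) -- nothing outside that little window.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
edge = [[1, -1]]   # 1x2 horizontal-difference filter
print(conv2d(image, edge))
```

In a convolutional layer, the same kernel weights are shared by every hidden neuron, so one filter detects the same local pattern at every position in the image.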
Bagging (Bootstrap Aggregating): the results of multiple classifiers are combined using averaging or majority voting. Boosting (AdaBoost, XGBoost, Gradient Tree Boost): provides sequential learning of the predictors: the first predictor is learned on the whole data set with equal weights on all samples; each subsequent learner assigns a higher weight to misclassified samples before learning.
Stacking: uses multiple base classifiers for prediction; learner is used to combine their predictions.
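The bagging idea above can be sketched in pure Python: train several base classifiers (here a trivial 1-nearest-neighbor rule, chosen only to keep the sketch short) on bootstrap resamples and combine them by majority vote. The 1-D dataset is invented for illustration:

```python
import random
from collections import Counter

def nn_predict(train, x):
    """1-NN base classifier: label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagging_predict(data, x, n_models=15, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap resample: draw len(data) points with replacement.
        boot = [rng.choice(data) for _ in data]
        votes.append(nn_predict(boot, x))
    # Majority voting over the ensemble's predictions.
    return Counter(votes).most_common(1)[0][0]

data = [(0.0, "a"), (0.5, "a"), (1.0, "a"),
        (4.0, "b"), (4.5, "b"), (5.0, "b")]
print(bagging_predict(data, 0.7))
print(bagging_predict(data, 4.8))
```

Because each model sees a different resample, their individual errors are partly decorrelated, and voting averages them out; boosting differs in that the resampling/weighting is sequential and error-driven rather than independent.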
Covariance and correlation are similar concepts; the correlation between two variables is equal to their covariance divided by the product of their standard deviations, as explained at http://mccormickml.com/2014/07/22/mahalanobis-distance/
We can use the Mahalanobis distance to find outliers in multivariate data. It measures the separation of two groups of objects. Nice intuitive explanation here: https://www.theinformationlab.co.uk/2017/05/26/mahalanobis-distance/ The covariance matrix provides the covariance associated with the variables (the reason covariance is used is to establish the effect of two or more variables together).
It is primarily used in classification and clustering problems where there is a need to establish correlation between different groups/clusters of data. Euclidean distance only makes sense when all the dimensions have the same units (like meters), since it involves adding the squared value of them.
When you are dealing with probabilities, a lot of the time the features have different units. For example: we might have a model for men and a model for women, where both models are based on weight [kg] and height [m]. We also know the mean and covariance for each model. Now if we get a new measurement vector, an ordered pair composed of weight and height, we have to decide if it's a man or a woman. We can use the Mahalanobis distance from the models of both men and women to decide which is closer, meaning which is more probable. The Mahalanobis distance transforms the random vector into a zero-mean vector with an identity covariance matrix. In that space, the Euclidean distance is safely applied.
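A minimal sketch of the Mahalanobis distance in 2-D, d(x, mu) = sqrt((x - mu)^T S^{-1} (x - mu)), where S is the covariance matrix. The 2x2 inverse is done by hand to stay dependency-free (use numpy/scipy in practice), and the "men" model numbers are invented for illustration:

```python
import math

def inv2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis(x, mu, cov):
    dx = [x[0] - mu[0], x[1] - mu[1]]
    s_inv = inv2x2(cov)
    # Quadratic form dx^T S^-1 dx.
    q = sum(dx[i] * s_inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

mu = [75.0, 1.80]                  # hypothetical mean weight [kg], height [m]
cov = [[60.0, 0.5], [0.5, 0.01]]   # hypothetical covariance matrix
print(mahalanobis([80.0, 1.85], mu, cov))
```

With an identity covariance matrix the formula reduces to the ordinary Euclidean distance, which is exactly the "transform to zero mean and identity covariance" intuition above: Mahalanobis distance is Euclidean distance in the whitened space.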
Linear Discriminant Analysis (LDA) is used to classify multiple classes and, like Principal Component Analysis (PCA), performs dimensionality reduction. For two classes, you can just use Logistic Regression. For each input variable, you need to calculate the mean value of that variable for each class as well as the variance of that variable for each class. "Predictions are made by calculating a discriminate value for each class and making a prediction for the class with the largest value." https://www.kdnuggets.com/2018/02/tour-top-10-algorithms-machine-learning-newbies.html
This follows from the Curse of Dimensionality: as we add higher and higher dimensions to our feature vector, we need more computational power and data to effectively train the model. If you add in more features, you need more data, as seen here: https://towardsdatascience.com/curse-of-dimensionality-2092410f3d27
Thus, the goal of LDA is to reduce the dimension of the feature vectors without loss of information and to maximize class separability; discrimination here means coming up with a rule that accurately assigns a new measurement/vector to one of several classes. http://www.cs.uml.edu/~ycao/teaching/fall_2013/downloads/05_MC_2_LDA.pdf
The rule is a discriminant function, a linear equation of the X variables that will provide best separation between the categorical Y variable. This checks to see if there are significant intra-group differences in terms of the X variables. It also identifies the X variables that contribute most to the inter-group separation. https://www.datascience.com/blog/predicting-customer-churn-with-a-discriminant-analysis
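The "calculate a discriminant value for each class and predict the class with the largest value" rule quoted above can be sketched for a single input variable: estimate each class's mean and a pooled variance, then score each class. This is a hedged, 1-D simplification (real LDA handles full covariance matrices; scikit-learn's LinearDiscriminantAnalysis does this properly), and the data is invented:

```python
import math

def fit_lda_1d(xs, ys):
    """Per-class means and priors, plus a pooled within-class variance."""
    classes = sorted(set(ys))
    means = {c: sum(x for x, y in zip(xs, ys) if y == c) /
                sum(1 for y in ys if y == c) for c in classes}
    var = sum((x - means[y]) ** 2 for x, y in zip(xs, ys)) / (len(xs) - len(classes))
    model = {c: (means[c], sum(1 for y in ys if y == c) / len(ys)) for c in classes}
    return model, var

def predict_lda_1d(model, var, x):
    def score(c):
        mu, prior = model[c]
        # Linear discriminant value: x*mu/var - mu^2/(2*var) + log(prior).
        return x * mu / var - mu * mu / (2 * var) + math.log(prior)
    # Predict the class with the largest discriminant value.
    return max(model, key=score)

xs = [1.0, 1.2, 0.8, 4.0, 4.2, 3.8]
ys = ["a", "a", "a", "b", "b", "b"]
model, var = fit_lda_1d(xs, ys)
print(predict_lda_1d(model, var, 1.1))
print(predict_lda_1d(model, var, 3.9))
```

Because the variance is pooled (shared across classes), the resulting decision boundary is linear in x, which is what distinguishes LDA from quadratic discriminant analysis.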