Machine Learning Terms and Methods
A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.
AlexNet is the name of a convolutional neural network (CNN) architecture.
Attention Mechanisms are inspired by human visual attention, the ability to focus on specific parts of an image. Attention mechanisms can be incorporated in both Language Processing and Image Recognition architectures to help the network learn what to “focus” on when making predictions.
Any of a wide range of architecture mechanisms that aggregate information from a set of inputs in a data-dependent manner. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.
Refer also to multi-head self-attention and self-attention, which are the building blocks of Transformers.
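A minimal sketch of the weighted-sum idea described above, assuming only NumPy is available; the "other part of the network" that computes the weights is reduced here to a single hypothetical scoring vector purely for illustration.

```python
import numpy as np

def attention_pool(inputs, score_weights):
    """Aggregate a set of input vectors with data-dependent weights.

    inputs:        (n, d) array of n input vectors
    score_weights: (d,) illustrative parameters of a tiny scoring network
    """
    scores = inputs @ score_weights            # one relevance score per input
    weights = np.exp(scores - scores.max())    # softmax turns the scores into
    weights = weights / weights.sum()          # weights that sum to 1
    return weights @ inputs                    # weighted sum over the inputs

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 inputs of dimension 8
w = rng.normal(size=8)             # stand-in for learned parameters
print(attention_pool(x, w).shape)  # (8,)
```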
Among other benefits, capsnets address the "Picasso problem" in image recognition: images that have all the right parts but that are not in the correct spatial relationship (e.g., in a "face", the positions of the mouth and one eye are switched). For image recognition, capsnets exploit the fact that while viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part/object level.
Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.
A model represented as a sequence of branching statements. For example, the following over-simplified decision tree branches a few times to predict the price of a house (in thousands of USD). According to this decision tree, a house larger than 160 square meters, having more than three bedrooms, and built less than 10 years ago would have a predicted price of 510 thousand USD.
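The branching described above can be written as nested conditionals. This is only a toy sketch: the 510 leaf follows the path spelled out in the text, while the other thresholds and leaf values are hypothetical placeholders.

```python
def predict_house_price(size_sqm, num_bedrooms, age_years):
    """Toy decision tree; returns a price in thousands of USD."""
    if size_sqm > 160:
        if num_bedrooms > 3:
            if age_years < 10:
                return 510   # the path described in the example above
            return 470       # hypothetical leaf
        return 430           # hypothetical leaf
    return 320               # hypothetical leaf

print(predict_house_price(200, 4, 5))  # 510
```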
In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.
A deep CNN developed for Semantic Image Segmentation.
An efficient deep learning model that estimates a residual image from an input image corrupted with Gaussian noise.
It is a branch of signal processing and machine learning that aims at finding a frame (called dictionary) in which some training data admits a sparse representation. The sparser the representation, the better the dictionary.
In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.
Some systems use the encoder's output as the input to a classification or regression network.
different initializations
different overall structure
Deep and wide models are a kind of ensemble.
Overloaded term having either of the following definitions:
Since the training examples are never uploaded, federated learning follows the privacy principles of focused data collection and data minimization.
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.
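A short sketch of fitting a Gaussian mixture model, assuming scikit-learn is available; the two elongated blobs are synthetic and chosen only to show how the covariance structure (not just the centers) shapes the clusters.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated blobs; their covariance matters, not just their centers.
a = rng.normal([0.0, 0.0], [3.0, 0.5], size=(200, 2))
b = rng.normal([0.0, 4.0], [3.0, 0.5], size=(200, 2))
X = np.vstack([a, b])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)          # hard cluster assignments
probs = gmm.predict_proba(X[:3])     # soft (probabilistic) memberships
print(gmm.means_.round(2), probs.round(2))
```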
Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.
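The agglomerative (bottom-up) variant described above can be run in a few lines, assuming scikit-learn is available; the one-dimensional points are illustrative only.

```python
from sklearn.cluster import AgglomerativeClustering

X = [[1.0], [1.1], [5.0], [5.2], [9.0]]

# Bottom-up: start with five singleton clusters and iteratively merge the
# closest clusters until only two remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="average")
print(agg.fit_predict(X))   # two groups; the label numbering is arbitrary
```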
Inception modules are used in convolutional neural networks to allow for more efficient computation and deeper networks through a dimensionality reduction with stacked 1×1 convolutions.
The ability to explain or to present an ML model's reasoning in understandable terms to a human.
Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.
The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.
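A compact NumPy sketch of the two steps just described (assign each example to its closest centroid, then move each centroid); it omits empty-cluster handling and convergence checks, so it is illustrative rather than production code.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each example goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its examples.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids.round(1))
```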
Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.
The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
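The 1/d weighting scheme described above corresponds to scikit-learn's `weights="distance"` option; the tiny one-dimensional training set below is made up purely for illustration, assuming scikit-learn is available.

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]]
y_train = ["a", "a", "a", "b", "b", "b"]

# weights="distance" gives each neighbor a weight of 1/d, so nearer
# neighbors contribute more to the vote than distant ones.
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X_train, y_train)
print(knn.predict([[0.8], [4.9]]))   # ['a' 'b']
```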
LeNet is a convolutional neural network structure proposed by Yann LeCun et al. ... Convolutional neural networks are a kind of feed-forward neural network whose artificial neurons can respond to a part of the surrounding cells in the coverage range and perform well in large-scale image processing.
Imagine an example that produces a raw prediction (y′) of 2.6. The sigmoid of 2.6 is 0.93. Since 0.93 is greater than 0.82, the system classifies this example as the positive class.
Imagine a different example that produces a raw prediction of 1.3. The sigmoid of 1.3 is 0.79. Since 0.79 is less than 0.82, the system classifies that example as the negative class.
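The two numeric examples above reduce to a few lines of plain Python: apply the sigmoid to the raw prediction and compare against the 0.82 threshold.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

threshold = 0.82
for raw in (2.6, 1.3):
    p = sigmoid(raw)
    label = "positive" if p >= threshold else "negative"
    print(f"raw={raw}  sigmoid={p:.2f}  -> {label} class")
# raw=2.6  sigmoid=0.93  -> positive class
# raw=1.3  sigmoid=0.79  -> negative class
```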
A subset of machine learning that discovers or improves a learning algorithm. A meta-learning system can also aim to train a model to quickly learn a new task from a small amount of data or from experience gained in previous tasks. Meta-learning algorithms generally try to achieve the following:
Improve/learn hand-engineered features (such as an initializer or an optimizer).
Be more data-efficient and compute-efficient.
Improve generalization.
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.
Overloaded term with the following meanings:
The traditional meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
Region Based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision and specifically object detection.
A RNN models sequential interactions through a hidden state, or memory. It can take up to N inputs and produce up to N outputs. For example, an input sequence may be a sentence with the outputs being the part-of-speech tag for each word (N-to-N). An input could be a sentence, and the output a sentiment classification of the sentence (N-to-1). An input could be a single image, and the output could be a sequence of words corresponding to the description of an image (1-to-N). At each time step, an RNN calculates a new hidden state (“memory”) based on the current input and the previous hidden state. The “recurrent” stems from the facts that at each step the same parameters are used and the network performs the same calculations based on different inputs.
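The per-time-step update described above, sketched in NumPy under the assumption of a plain (vanilla) RNN cell; the dimensions and random parameters are placeholders. Note that the same parameters are reused at every step, which is the "recurrent" part.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence, reusing the same parameters
    at every time step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:                        # one input per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # new memory from input + old memory
        states.append(h)
    return np.stack(states)                   # one hidden state per step (N-to-N)

rng = np.random.default_rng(0)
seq = rng.normal(size=(7, 4))                 # sequence of 7 inputs, dim 4
W_x = rng.normal(size=(16, 4)) * 0.1
W_h = rng.normal(size=(16, 16)) * 0.1
b = np.zeros(16)
print(rnn_forward(seq, W_x, W_h, b).shape)    # (7, 16)
```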
Similar to U-Net, but the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling, unlike U-Net, which passes the entire feature maps from the lower-resolution layers to the higher-resolution layers.
Training a model on data where some of the training examples have labels but others don't. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.
A model whose inputs have a sequential dependence. For example, predicting the next video watched from a sequence of previously watched videos.
SqueezeNet is the name of a deep neural network for computer vision that was released in 2016.
For example, the English language consists of about a million words. Consider two ways to represent a count of the words used in one English sentence:
A dense representation of this sentence must set an integer for all one million cells, placing a 0 in most of them, and a low integer into a few of them.
A sparse representation of this sentence stores only those cells symbolizing a word actually in the sentence. So, if the sentence contained only 20 unique words, then the sparse representation for the sentence would store an integer in only 20 cells.
The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 10x10 matrix in which 98 cells contain zero. The calculation of sparsity is as follows: sparsity = 98/100 = 0.98
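The same calculation in NumPy (assumed available), for a 10x10 matrix with only two nonzero cells:

```python
import numpy as np

m = np.zeros((10, 10))
m[0, 0], m[3, 7] = 4.0, 1.5           # only 2 of the 100 cells are nonzero

sparsity = (m == 0).sum() / m.size
print(sparsity)                        # 0.98, matching the calculation above
```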
Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.
Temporal Convolutional Networks (TCNs) are convolutional neural networks with dilated convolutions, used particularly for time series data modeling.
Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in which a single program can solve multiple tasks.
A Transformer can include any of the following:
both an encoder and decoder
An encoder transforms a sequence of embeddings into a new sequence of the same length. An encoder includes N identical layers, each of which contains two sub-layers. These two sub-layers are applied at each position of the input embedding sequence, transforming each element of the sequence into a new embedding. The first encoder sub-layer aggregates information from across the input sequence. The second encoder sub-layer transforms the aggregated information into an output embedding.
In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.
Producing a model with poor predictive ability because the model hasn't captured the complexity of the training data. Many problems can cause underfitting, including:
Training on the wrong set of features.
Training for too few epochs or at too low a learning rate.
Training with too high a regularization rate.
Providing too few hidden layers in a deep neural network.
The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs together based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.
VGG refers to a convolutional neural network model that secured the first and second place in the 2014 ImageNet localization and classification tracks, respectively. The VGG model consists of 16–19 weight layers and uses small convolutional filters of size 3×3 and 1×1.
A deep generative model of raw audio waveforms. WaveNets are able to generate speech that mimics any human voice and sounds more natural than the best existing text-to-speech systems, reducing the gap with human performance by over 50%.
A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities.
An Autoencoder is a Neural Network model whose goal is to predict the input itself, typically through a “bottleneck” somewhere in the network. By introducing a bottleneck, we force the network to learn a lower-dimensional representation of the input, effectively compressing the input into a good representation. Autoencoders are related to PCA and other dimensionality reduction techniques, but can learn more complex mappings due to their nonlinear nature. A wide range of autoencoder architectures exist, including variational, denoising, and sequence autoencoders.
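A minimal bottleneck autoencoder sketch, assuming TensorFlow/Keras is available; the 784→32→784 sizes are arbitrary illustrative choices (e.g. a flattened 28×28 image).

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(784,))                # e.g. a flattened image
code = layers.Dense(32, activation="relu")(inputs)   # the "bottleneck"
outputs = layers.Dense(784, activation="sigmoid")(code)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Trained to reproduce its own input: autoencoder.fit(X, X, epochs=...)
```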
Backpropagation is an algorithm to efficiently calculate the gradients in a Neural Network, or more generally, a feedforward computational graph. It boils down to applying the chain rule of differentiation starting from the network output and propagating the gradients backward. The first uses of backpropagation go back to Vapnik in the 1960s, but the 1986 paper “Learning representations by back-propagating errors” by Rumelhart, Hinton, and Williams is often cited as the source.
The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.
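A hand-worked sketch of the forward pass plus the chain rule applied backward, for a tiny one-hidden-layer network in NumPy; the sizes and random values are placeholders, and real frameworks automate exactly this bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # a single example
y = 1.0                           # its target
W1, b1 = rng.normal(size=(4, 3)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=4) * 0.5, 0.0

# Forward pass: compute and cache every intermediate value.
z1 = W1 @ x + b1
a1 = np.tanh(z1)
y_hat = W2 @ a1 + b2
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: chain rule from the output back to each parameter.
d_yhat = y_hat - y                # dL/dy_hat
dW2 = d_yhat * a1                 # dL/dW2
db2 = d_yhat
da1 = d_yhat * W2                 # dL/da1
dz1 = da1 * (1 - a1 ** 2)         # back through tanh
dW1 = np.outer(dz1, x)            # dL/dW1
db1 = dz1
print(dW1.shape, dW2.shape)       # (4, 3) (4,)
```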
A Capsule Neural Network (CapsNet) is a machine learning system that is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships.
The idea is to add structures called “capsules” to a convolutional neural network (CNN), and to reuse output from several of those capsules to form more stable (with respect to various perturbations) representations for higher capsules. The output is a vector consisting of the probability of an observation and a pose for that observation. This vector is similar to what is done for example when doing classification with localization in CNNs.
Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.
Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid.
A layer of a deep neural network in which a convolutional filter passes along an input matrix.
A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of convolutional, pooling, and dense layers.
A CNN uses convolutions to extract features from local regions of an input. Most CNNs contain a combination of convolutional, pooling, and affine layers. CNNs have gained popularity particularly through their excellent performance on visual recognition tasks, where they have been setting the state of the art for several years.
Artificially boosting the range and number of examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
Decoders are often a component of a larger model, where they are frequently paired with an encoder.
In sequence-to-sequence tasks, a decoder starts with the internal state generated by the encoder to predict the next sequence.
Refer to Transformer for the definition of a decoder within the Transformer architecture.
Dropout is a regularization technique for Neural Networks that prevents overfitting. It prevents neurons from co-adapting by randomly setting a fraction of them to 0 at each training iteration. Dropout can be interpreted in various ways, such as randomly sampling from an exponential number of different networks. Dropout layers first gained popularity through their use in CNNs, but have since been applied to other layers, including input embeddings or recurrent networks.
A form of regularization useful in training neural networks. Dropout regularization works by removing a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” (Srivastava et al., 2014).
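A bare-bones "inverted dropout" mask in NumPy, mirroring the description above; in practice frameworks provide this as a built-in layer, so this is only a sketch of the mechanism.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero out roughly `rate` of the units and rescale
    the survivors so the expected activation stays the same."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8))           # activations of one hidden layer
print(dropout(h, rate=0.5, rng=rng))  # roughly half the entries are now 0
```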
A Dynamic Bayesian Network (DBN) is a Bayesian network (BN) which relates variables to each other over adjacent time steps.
Encoders are often a component of a larger model, where they are frequently paired with a decoder. Some Transformers pair encoders with decoders, though other Transformers use only the encoder or only the decoder.
In sequence-to-sequence tasks, an encoder takes an input sequence and returns an internal state (a vector). Then, the decoder uses that internal state to predict the next sequence.
Refer to Transformer for the definition of an encoder in the Transformer architecture.
A merger of the predictions of multiple models. You can create an ensemble via one or more of the following:
different hyperparameters
An input variable used in making predictions.
Retrieving intermediate feature representations calculated by an unsupervised or pretrained model (for example, hidden layer values in a neural network) for use in another model as input.
Synonym for .
A distributed machine learning approach that trains machine learning models using decentralized examples residing on devices such as smartphones. In federated learning, a subset of devices downloads the current model from a central coordinating server. The devices use the examples stored on the devices to make improvements to the model. The devices then upload the model improvements (but not the training examples) to the coordinating server, where they are aggregated with other updates to yield an improved global model. After the aggregation, the model updates computed by devices are no longer needed, and can be discarded.
For more information about federated learning, see .
Perform a secondary optimization to adjust the parameters of an already trained model to fit a new problem. Fine tuning often refers to refitting the weights of a trained unsupervised model to a supervised model.
Abbreviation for generative adversarial network.
A Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.
A system to create new data in which a generator creates data and a discriminator determines whether that created data is valid or invalid.
The name of the convolutional neural network architecture that won the ILSVRC 2014 challenge. The network uses Inception modules to reduce the parameters and improve the utilization of the computing resources inside the network.
The Gated Recurrent Unit is a simplified version of an LSTM unit with fewer parameters. Just like an LSTM cell, it uses a gating mechanism to allow RNNs to efficiently learn long-range dependencies by preventing the vanishing gradient problem. The GRU consists of a reset and update gate that determine which part of the old memory to keep vs. update with new values at the current time step.
It is a machine learning technique for regression, classification, and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient boosted trees, which usually outperforms random forest. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
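A small gradient boosted trees example, assuming scikit-learn is available; the synthetic dataset and hyperparameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A stage-wise ensemble of shallow decision trees (the weak learners).
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
gbt.fit(X_tr, y_tr)
print(round(gbt.score(X_te, y_te), 3))
```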
The correct answer. Reality. Since reality is often subjective, expert raters typically are the proxy for ground truth.
A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:
Contrast with centroid-based clustering.
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobservable ("hidden") states.
The "knobs" that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.
Contrast with parameter.
In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference.)
A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:
Iteratively determines the best k center points (known as centroids).
It is used for classification and regression. In both cases, the input consists of the k closest training examples in the data set. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales, then normalizing the training data can improve its accuracy dramatically.
In supervised learning, the "answer" or "result" portion of an . Each example in a labeled dataset consists of one or more features and a label. For instance, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house's price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label would probably be either "spam" or "not spam."
A set of neurons in a neural network that process a set of input features, or the output of those neurons.
Also, an abstraction in TensorFlow. Layers are Python functions that take Tensors and configuration options as input and produce other tensors as output.
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
A classification model that uses a sigmoid function to convert a raw prediction (y′) into a value between 0 and 1. You can interpret the value between 0 and 1 in either of the following two ways:
As a probability that the example belongs to the positive class in a binary classification problem.
As a value to be compared against a classification threshold. If the value is equal to or above the classification threshold, the system classifies the example as the positive class. Conversely, if the value is below the given threshold, the system classifies the example as the negative class. For example, suppose the classification threshold is 0.82:
Although logistic regression is often used in binary classification problems, logistic regression can also be used in multi-class classification problems (where it becomes called multi-class logistic regression or multinomial regression).
A type of cell in a recurrent neural network used to process sequences of data in applications such as handwriting recognition, machine translation, and image captioning. LSTMs address the vanishing gradient problem that occurs when training RNNs due to long data sequences by maintaining history in an internal memory state based on new input and context from previous cells in the RNN.
Long Short-Term Memory networks were invented to prevent the vanishing gradient problem in Recurrent Neural Networks by using a memory gating mechanism. Using LSTM units to calculate the hidden state in an RNN, we help the network to efficiently propagate gradients and learn long-range dependencies.
A measure of how far a model's predictions are from its labels. Or, to phrase it more pessimistically, a measure of how bad the model is. To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use Log Loss.
Abbreviation for .
Meta-learning is related to .
The process of determining the best model.
A Multilayer Perceptron is a Feedforward Neural Network with multiple fully-connected layers that use nonlinear activation functions to deal with data which is not linearly separable. An MLP is the most basic form of a multilayer Neural Network, or a deep Neural Network if it has more than 2 layers.
The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually squared loss. Therefore, when training a linear regression model, the goal is to minimize squared loss.
See also .
The meaning within machine learning. Here, performance answers the following question: How correct is this model? That is, how good are the model's predictions?
Physics-informed neural networks (PINNs) are neural networks that are trained to solve supervised learning tasks while respecting any given law of physics described by general nonlinear partial differential equations.
The principal components of a collection of points in a real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i−1 vectors. Here, a best-fitting line is defined as one that minimizes the average squared distance from the points to the line. These directions constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i-th principal component can be taken as a direction orthogonal to the first i−1 principal components that maximizes the variance of the projected data.
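A brief dimensionality-reduction sketch with scikit-learn's PCA (assumed available); the synthetic data deliberately contains correlated dimensions so the first components capture most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]    # make two dimensions correlated

pca = PCA(n_components=2)                # keep only the first two components
X_low = pca.fit_transform(X)             # lower-dimensional projection
print(X_low.shape, pca.explained_variance_ratio_.round(2))
```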
In reinforcement learning, an algorithm that allows an agent to learn the optimal Q-function of a Markov decision process by applying the Bellman equation. The Markov decision process models an environment.
An ensemble approach to finding the decision tree that best fits the training data by creating many decision trees and then determining the "average" one. The "random" part of the term refers to building each of the decision trees from a random selection of features; the "forest" refers to the set of decision trees.
A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.
A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment. For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses.
The process of mapping data to useful features.
Deep Residual Networks won the ILSVRC 2015 challenge. These networks work by introducing shortcut connections across stacks of layers, allowing the optimizer to learn “easier” residual mappings instead of the more complicated original mappings. These shortcut connections are similar to Highway Layers, but they are data-independent and don’t introduce additional parameters or training complexity. ResNets achieved a 3.57% error rate on the ImageNet test set.
Abbreviation for .
An SOM is a type of artificial neural network but is trained using competitive learning rather than the error-correction learning (e.g., backpropagation with gradient descent) used by other artificial neural networks. The SOM is a dimensionality reduction technique used to produce a low-dimensional (typically two-dimensional) representation of a higher dimensional data set while preserving the topological structure of the data. For example, a data set with p variables measured in n observations could be represented as clusters of observations with similar values for the variables. These clusters then could be visualized as a two-dimensional "map" such that observations in proximal clusters have more similar values than observations in distal clusters. This can make high-dimensional data easier to visualize and analyze.
A family of techniques for converting an unsupervised machine learning problem into a supervised machine learning problem by creating surrogate labels from unlabeled examples.
Some Transformer-based models such as BERT use self-supervised learning.
Self-supervised training is a semi-supervised learning approach.
Self-training is one technique for semi-supervised learning.
A representation of a tensor that only stores nonzero elements.
A Siamese neural network (sometimes called a twin neural network) is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.
Stochastic Configuration Networks (SCNs) that employ a supervisory mechanism to automatically and rapidly construct universal approximators can achieve promising performance for resolving regression problems.
Supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. SVMs are one of the most robust prediction methods, being based on statistical learning frameworks. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
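A quick illustration of the kernel trick using scikit-learn (assumed available): concentric circles are not linearly separable in the input space, yet an RBF-kernel SVM separates them.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps inputs into a high-dimensional feature
# space where a maximum-margin separator can be found.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(round(clf.score(X, y), 3))
```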
Training a model from input data and its corresponding labels. Supervised machine learning is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new (never-before-seen) questions on the same topic. Compare with unsupervised machine learning.
Transferring information from one machine learning task to another. For example, in multi-task learning, a single model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data.
A neural network architecture developed at Google that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks. A Transformer can be viewed as a stack of self-attention layers.
an encoder
a decoder
A decoder transforms a sequence of input embeddings into a sequence of output embeddings, possibly with a different length. A decoder also includes N identical layers with three sub-layers, two of which are similar to the encoder sub-layers. The third decoder sub-layer takes the output of the encoder and applies the self-attention mechanism to gather information from it.
The blog post “Transformer: A Novel Neural Network Architecture for Language Understanding” provides a good introduction to Transformers.
See also and .
U-Net is a convolutional neural network that was developed for biomedical image segmentation. The network is based on the fully convolutional network and its architecture was modified and extended to work with fewer training images and to yield more precise segmentations.
Training a model to find patterns in a dataset, typically an unlabeled dataset.
Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.
Compare with supervised machine learning.
The tendency for the gradients of early hidden layers of some deep neural networks to become surprisingly flat (low). Increasingly lower gradients result in increasingly smaller changes to the weights on nodes in a deep neural network, leading to little or no learning. Models suffering from the vanishing gradient problem become difficult or impossible to train. Long Short-Term Memory cells address this issue.
Compare to the exploding gradient problem.
One of the loss functions commonly used in generative adversarial networks, based on the earth mover's distance between the distribution of generated data and real data.
A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model.
The number of neurons in a particular layer of a neural network.
The “You Only Look Once,” or YOLO, family of models are a series of end-to-end deep learning models designed for fast object detection, developed by Joseph Redmon, et al. and first described in the 2015 paper titled “You Only Look Once: Unified, Real-Time Object Detection.”