Machine Learning Terms and Methods

Active Learning

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

AlexNet

AlexNet is the name of a convolutional neural network (CNN) architecture.

AlexNet: The Architecture that Challenged CNNs

Attention

Attention Mechanisms are inspired by human visual attention, the ability to focus on specific parts of an image. Attention mechanisms can be incorporated in both Language Processing and Image Recognition architectures to help the network learn what to “focus” on when making predictions.

Any of a wide range of neural network architecture mechanisms that aggregate information from a set of inputs in a data-dependent manner. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.
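The weighted sum described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights come from a softmax over dot products between a query and a set of keys, which is one common choice rather than the only one, and the names attend, keys, and values are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Assumption: dot-product scoring; other scoring functions are possible.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is a weighted sum over the input values.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([1.0, 0.0], keys, values)
```

The weights always sum to 1, and inputs whose keys match the query more closely contribute more to the output.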

Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.

Artificial Neural Networks (ANN)

A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities.

Autoencoder

An Autoencoder is a Neural Network model whose goal is to predict the input itself, typically through a “bottleneck” somewhere in the network. By introducing a bottleneck, we force the network to learn a lower-dimensional representation of the input, effectively compressing the input into a good representation. Autoencoders are related to PCA and other dimensionality reduction techniques, but can learn more complex mappings due to their nonlinear nature. A wide range of autoencoder architectures exist, including Denoising Autoencoders, Variational Autoencoders, and Sequence Autoencoders.

Backpropagation

Backpropagation is an algorithm to efficiently calculate the gradients in a Neural Network, or more generally, a feedforward computational graph. It boils down to applying the chain rule of differentiation starting from the network output and propagating the gradients backward. The first uses of backpropagation go back to Vapnik in the 1960s, but the paper "Learning representations by back-propagating errors" is often cited as the source.

The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.
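The forward and backward passes can be illustrated on a deliberately tiny network. The sketch below assumes a hypothetical two-parameter model y = w2 * tanh(w1 * x) with a squared-error loss; the model, the parameter names, and the starting values are illustrative choices, not part of the definition.

```python
import math

def forward(x, w1, w2):
    # Forward pass: compute each node's output and cache what the
    # backward pass will need.
    h = math.tanh(w1 * x)
    y = w2 * h
    return y, {"x": x, "h": h}

def loss(y, target):
    return (y - target) ** 2

def backward(y, target, w2, cache):
    # Backward pass: apply the chain rule from the output back to
    # each parameter.
    dy = 2.0 * (y - target)               # dL/dy
    dw2 = dy * cache["h"]                 # dL/dw2
    dh = dy * w2                          # dL/dh
    dz = dh * (1.0 - cache["h"] ** 2)     # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * cache["x"]                 # dL/dw1
    return dw1, dw2

x, target, w1, w2 = 0.5, 1.0, 0.8, -0.3
y, cache = forward(x, w1, w2)
grad_w1, grad_w2 = backward(y, target, w2, cache)
```

A gradient descent step would then subtract a small multiple of grad_w1 and grad_w2 from w1 and w2.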

CapsNet

A Capsule Neural Network (CapsNet) is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships.

The idea is to add structures called “capsules” to a convolutional neural network (CNN), and to reuse output from several of those capsules to form more stable (with respect to various perturbations) representations for higher capsules. The output is a vector consisting of the probability of an observation, and a pose for that observation. This vector is similar to what is done, for example, when doing classification with localization in CNNs.

Among other benefits, CapsNets address the "Picasso problem" in image recognition: images that have all the right parts but that are not in the correct spatial relationship (e.g., in a "face", the positions of the mouth and one eye are switched). For image recognition, CapsNets exploit the fact that while viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part/object level.

Hinton and Google Brain - Capsule Networks

Clustering

Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid.

Convolutional Layer

A layer of a deep neural network in which a convolutional filter passes along an input matrix.
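As a rough sketch of a filter passing along an input matrix, the function below slides a small kernel over a 2D list with stride 1 and no padding. Those two settings, and the name convolve2d, are assumptions made for illustration. As in most deep learning libraries, the kernel is applied without flipping.

```python
def convolve2d(matrix, kernel):
    # Slide the kernel over every position where it fully fits
    # (stride 1, no padding) and sum the element-wise products.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += matrix[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = convolve2d(image, kernel)  # [[-4, -4], [-4, -4]]
```

In a real convolutional layer the kernel values are learned during training rather than fixed by hand.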

Convolutional Neural Network (CNN)

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of convolutional layers, pooling layers, and dense layers.

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

A CNN uses convolutions to extract features from local regions of an input. Most CNNs contain a combination of convolutional, pooling and affine layers. CNNs have gained popularity particularly through their excellent performance on visual recognition tasks, where they have been setting the state of the art for several years.

Data Augmentation

Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
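A minimal sketch of the idea, assuming an image is stored as a nested list of pixel values. The helper names and the choice of transformations (a horizontal reflection and a 90-degree rotation) are illustrative; each variant keeps the label of the original image.

```python
def flip_horizontal(image):
    # Reflect the image left to right.
    return [list(reversed(row)) for row in image]

def rotate_90_clockwise(image):
    # Reverse the row order, then transpose.
    return [list(row) for row in zip(*image[::-1])]

def augment(image, label):
    # Each transformed copy is a new labeled example.
    variants = [image, flip_horizontal(image), rotate_90_clockwise(image)]
    return [(variant, label) for variant in variants]

image = [[1, 2],
         [3, 4]]
examples = augment(image, "cat")
```

Here one labeled image yields three labeled examples.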

Decision Tree

A model represented as a sequence of branching statements. For example, an over-simplified decision tree might branch a few times to predict the price of a house (in thousands of USD). According to such a tree, a house larger than 160 square meters, having more than three bedrooms, and built less than 10 years ago would have a predicted price of 510 thousand USD.
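The branching can be written directly as nested conditionals. In the sketch below only the path ending in 510 comes from the description above; the other leaf values are hypothetical placeholders added so that the function returns a price on every branch.

```python
def predict_price(area_sqm, bedrooms, age_years):
    # Returns a predicted price in thousands of USD.
    if area_sqm > 160:
        if bedrooms > 3:
            if age_years < 10:
                return 510  # the path described in the text
            return 460      # hypothetical leaf value
        return 400          # hypothetical leaf value
    return 250              # hypothetical leaf value

price = predict_price(area_sqm=200, bedrooms=4, age_years=5)  # 510
```

A learned decision tree has the same shape, except that the split features, thresholds, and leaf values are chosen from training data.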

Decoder

In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.

Decoders are often a component of a larger model, where they are frequently paired with an encoder.

In sequence-to-sequence tasks, a decoder starts with the internal state generated by the encoder to predict the next sequence.

Refer to Transformer for the definition of a decoder within the Transformer architecture.

DenseNet

Densely Connected Convolutional Networks

DeepLab

A deep CNN developed for Semantic Image Segmentation.

DeepLab

Semantic Image Segmentation with DeepLab in TensorFlow

DnCNN

DnCNN is an efficient deep learning model that estimates a residual image from an input image corrupted by Gaussian noise.

Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Dictionary Learning

Dictionary learning is a branch of signal processing and machine learning that aims at finding a frame (called a dictionary) in which some training data admits a sparse representation. The sparser the representation, the better the dictionary.

Dropout

Dropout is a regularization technique for Neural Networks that prevents overfitting. It prevents neurons from co-adapting by randomly setting a fraction of them to 0 at each training iteration. Dropout can be interpreted in various ways, such as randomly sampling from an exponential number of different networks. Dropout layers first gained popularity through their use in CNNs, but have since been applied to other layers, including input embeddings or recurrent networks.

A form of regularization useful in training neural networks. Dropout regularization works by removing a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
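A rough sketch of dropping a fixed number of units from one layer's output for a single training step. Rescaling the surviving units so that the expected total activation stays the same is an assumption borrowed from common practice, not something stated above, and the function name and rate parameter are hypothetical.

```python
import random

def dropout(activations, rate, rng):
    # Choose a fixed number of units to drop for this step.
    n = len(activations)
    drop_count = round(rate * n)
    dropped = set(rng.sample(range(n), drop_count))
    # Assumption: rescale the survivors to compensate for the dropped units.
    keep_scale = n / (n - drop_count) if drop_count < n else 0.0
    return [0.0 if i in dropped else a * keep_scale
            for i, a in enumerate(activations)]

rng = random.Random(0)
layer_output = [0.5, 1.0, 1.5, 2.0]
dropped_output = dropout(layer_output, rate=0.5, rng=rng)
```

A different random subset is drawn at every training step, and dropout is switched off when the trained model makes predictions.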

Dynamic Bayesian Networks

A Dynamic Bayesian Network (DBN) is a Bayesian network (BN) which relates variables to each other over adjacent time steps.

Encoder

In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.

Encoders are often a component of a larger model, where they are frequently paired with a decoder. Some Transformers pair encoders with decoders, though other Transformers use only the encoder or only the decoder.

Some systems use the encoder's output as the input to a classification or regression network.

In sequence-to-sequence tasks, an encoder takes an input sequence and returns an internal state (a vector). Then, the decoder uses that internal state to predict the next sequence.

Refer to Transformer for the definition of an encoder in the Transformer architecture.

Ensemble

A merger of the predictions of multiple models. You can create an ensemble via one or more of the following:

Deep and wide models are a kind of ensemble.

Feature

An input variable used in making predictions.

Feature Extraction

Overloaded term having either of the following definitions:

Federated Learning

A distributed machine learning approach that trains machine learning models using decentralized examples residing on devices such as smartphones. In federated learning, a subset of devices downloads the current model from a central coordinating server. The devices use the examples stored on the devices to make improvements to the model. The devices then upload the model improvements (but not the training examples) to the coordinating server, where they are aggregated with other updates to yield an improved global model. After the aggregation, the model updates computed by devices are no longer needed, and can be discarded.

Since the training examples are never uploaded, federated learning follows the privacy principles of focused data collection and data minimization.

For more information about federated learning, see this tutorial.

Fine Tuning

Perform a secondary optimization to adjust the parameters of an already trained model to fit a new problem. Fine tuning often refers to refitting the weights of a trained unsupervised model to a supervised model.

GAN

Abbreviation for generative adversarial network.

Gaussian Mixture Model (GMM)

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.

Gaussian Process

A Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

Generative Adversarial Network (GAN)

A system to create new data in which a generator creates data and a discriminator determines whether that created data is valid or invalid.

GoogleNet

The name of the Convolutional Neural Network architecture that won the ILSVRC 2014 challenge. The network uses Inception modules to reduce the parameters and improve the utilization of the computing resources inside the network.

GRU

The Gated Recurrent Unit is a simplified version of an LSTM unit with fewer parameters. Just like an LSTM cell, it uses a gating mechanism to allow RNNs to efficiently learn long-range dependencies by preventing the vanishing gradient problem. The GRU consists of a reset gate and an update gate that determine which part of the old memory to keep vs. update with new values at the current time step.

Gradient Boosting Machine

Gradient boosting is a machine learning technique for regression, classification and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient boosted trees, which usually outperforms random forest. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

Ground Truth

The correct answer. Reality. Since reality is often subjective, expert raters typically are the proxy for ground truth.

Hierarchical Clustering

A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

  • Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.

  • Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.
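The agglomerative variant can be sketched on one-dimensional points. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members), which is one of several possible linkage rules; the function name and the sample points are illustrative.

```python
def agglomerative(points):
    # Start with every example in its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumption: single linkage between clusters.
                dist = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        # Record the merge, then replace the two clusters with their union.
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0])
```

The recorded sequence of merges is the hierarchical tree, read from the leaves up to the root.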

Contrast with centroid-based clustering.

Hidden Markov Model (HMM)

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobservable (hidden) states.

Hyperparameter

The "knobs" that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Contrast with parameter.

Inception

Inception Modules are used in Convolutional Neural Networks to allow for more efficient computation and deeper Networks through a dimensionality reduction with stacked 1×1 convolutions.

Inference

In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference.)

Interpretability

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

K-Means

A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:

  • Iteratively determines the best k center points (known as centroids).

  • Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.
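The two steps above can be sketched as a short loop that alternates between assigning examples and moving centroids. The fixed iteration count, the hand-picked starting centroids, and the function name are assumptions made to keep the sketch deterministic; practical implementations usually choose the starting centroids randomly and stop when assignments no longer change.

```python
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assign each example to its closest centroid (squared distance).
        groups = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            groups[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, groups = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

With these points the two centroids settle near the centers of the two visible groups.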

K-Nearest Neighbors (KNN)

k-NN is used for classification and regression. In both cases, the input consists of the k closest training examples in the data set. k-NN is a type of classification where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales, then normalizing the training data can improve its accuracy dramatically.

Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
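A minimal sketch of k-NN classification with the 1/d weighting scheme mentioned above. The Euclidean distance, the function name, and the toy data are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def knn_classify(query, examples, k):
    # No training step: just sort the stored examples by distance to the query.
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = defaultdict(float)
    for features, label in by_distance[:k]:
        d = math.dist(query, features)
        # Nearer neighbors get larger weights (1/d); an exact match dominates.
        votes[label] += 1.0 / d if d > 0 else float("inf")
    return max(votes, key=votes.get)

examples = [((1.0, 1.0), "a"), ((2.0, 2.0), "b"), ((3.0, 3.0), "b")]
prediction = knn_classify((1.2, 1.2), examples, k=3)  # "a"
```

With unweighted voting the two more distant "b" examples would outvote the single nearby "a" example; the 1/d weights reverse that outcome.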

Label

In supervised learning, the "answer" or "result" portion of an example. Each example in a labeled dataset consists of one or more features and a label. For instance, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house's price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label would probably be either "spam" or "not spam."

Layer

A set of neurons in a neural network that process a set of input features, or the output of those neurons.

Also, an abstraction in TensorFlow. Layers are Python functions that take Tensors and configuration options as input and produce other tensors as output.

LeNet

LeNet is a convolutional neural network structure proposed by Yann LeCun et al. ... Convolutional neural networks are a kind of feed-forward neural network whose artificial neurons can respond to a part of the surrounding cells in the coverage range and perform well in large-scale image processing.

Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

Logistic Regression (LR)

A classification model that uses a sigmoid function to convert a linear model's raw prediction (y′) into a value between 0 and 1. You can interpret the value between 0 and 1 in either of the following two ways:

  • As a probability that the example belongs to the positive class in a binary classification problem.

  • As a value to be compared against a classification threshold. If the value is equal to or above the classification threshold, the system classifies the example as the positive class. Conversely, if the value is below the given threshold, the system classifies the example as the negative class. For example, suppose the classification threshold is 0.82:

    • Imagine an example that produces a raw prediction (y′) of 2.6. The sigmoid of 2.6 is 0.93. Since 0.93 is greater than 0.82, the system classifies this example as the positive class.

    • Imagine a different example that produces a raw prediction of 1.3. The sigmoid of 1.3 is 0.79. Since 0.79 is less than 0.82, the system classifies that example as the negative class.
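The two worked examples above can be checked with a few lines of Python. The function names are illustrative; the threshold of 0.82 and the raw predictions 2.6 and 1.3 are the values used in the examples.

```python
import math

def sigmoid(raw_prediction):
    # Squash a raw prediction into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-raw_prediction))

def classify(raw_prediction, threshold=0.82):
    # Equal to or above the threshold means the positive class.
    return "positive" if sigmoid(raw_prediction) >= threshold else "negative"

first = round(sigmoid(2.6), 2)    # 0.93
second = round(sigmoid(1.3), 2)   # 0.79
labels = [classify(2.6), classify(1.3)]  # ["positive", "negative"]
```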

Although logistic regression is often used in binary classification problems, logistic regression can also be used in multi-class classification problems (where it is called multi-class logistic regression or multinomial regression).

Long Short-Term Memory (LSTM)

A type of cell in a recurrent neural network used to process sequences of data in applications such as handwriting recognition, machine translation, and image captioning. LSTMs address the vanishing gradient problem that occurs when training RNNs due to long data sequences by maintaining history in an internal memory state based on new input and context from previous cells in the RNN.

Long Short-Term Memory networks were invented to prevent the vanishing gradient problem in Recurrent Neural Networks by using a memory gating mechanism. Using LSTM units to calculate the hidden state in an RNN helps the network to efficiently propagate gradients and learn long-range dependencies.

Loss

A measure of how far a model's predictions are from its label. Or, to phrase it more pessimistically, a measure of how bad the model is. To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use Log Loss.
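Both loss functions named above are short enough to write out. This is a sketch with hypothetical function names; the log loss version assumes binary labels (0 or 1) and predicted probabilities strictly between 0 and 1.

```python
import math

def mean_squared_error(labels, predictions):
    # Average of the squared differences between labels and predictions.
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

def log_loss(labels, probabilities):
    # Assumption: labels are 0 or 1 and probabilities lie strictly in (0, 1).
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)

mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # 4/3
ll = log_loss([1, 0], [0.5, 0.5])                           # ln 2
```

In both cases a lower value means the predictions are closer to the labels.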

LSTM

Abbreviation for Long Short-Term Memory.

Meta-Learning

A subset of machine learning that discovers or improves a learning algorithm. A meta-learning system can also aim to train a model to quickly learn a new task from a small amount of data or from experience gained in previous tasks. Meta-learning algorithms generally try to achieve the following:

  • Improve/learn hand-engineered features (such as an initializer or an optimizer).

  • Be more data-efficient and compute-efficient.

  • Improve generalization.

Meta-learning is related to few-shot learning.

Model Training

The process of determining the best model.

Multilayer Perceptron (MLP)

A Multilayer Perceptron is a Feedforward Neural Network with multiple fully-connected layers that use nonlinear activation functions to deal with data which is not linearly separable. An MLP is the most basic form of a multilayer Neural Network, or a deep Neural Network if it has more than 2 layers.

Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

Objective Function

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually squared loss. Therefore, when training a linear regression model, the goal is to minimize squared loss.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

See also loss.

Performance

Overloaded term with the following meanings:

  • The traditional meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?

  • The meaning within machine learning. Here, performance answers the following question: How correct is this model? That is, how good are the model's predictions?

PINN

Physics-informed neural networks (PINNs) are neural networks that are trained to solve supervised learning tasks while respecting any given law of physics described by general nonlinear partial differential equations.

Data-driven solutions and discovery of Nonlinear Partial Differential Equations

Principal Component Analysis (PCA)

The principal components of a collection of points in a real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i-1 vectors. Here, a best-fitting line is defined as one that minimizes the average squared distance from the points to the line. These directions constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.

PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i-th principal component can be taken as a direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.
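One way to find the first principal component without a linear algebra library is power iteration on the covariance matrix. This is a sketch of that particular technique, not the only way to compute PCA; it assumes the starting vector is not orthogonal to the leading direction, and the function name and sample points are illustrative.

```python
import math

def first_principal_component(points, iterations=100):
    n, dim = len(points), len(points[0])
    # Center the data so that variance is measured around the mean.
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix.
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    # Power iteration: repeatedly multiply by the covariance and renormalize.
    v = [1.0] + [0.0] * (dim - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
direction = first_principal_component(points)
```

Because the sample points lie close to the line y = x, the returned unit vector points roughly along that diagonal, which is the direction of maximum variance.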

Q-learning

In reinforcement learning, an algorithm that allows an agent to learn the optimal Q-function of a Markov decision process by applying the Bellman equation. The Markov decision process models an environment.
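A single tabular Q-learning update, written as a sketch. The learning rate alpha, the discount factor gamma, the state and action names, and the dictionary-based table are all illustrative assumptions; the update itself moves the stored value toward the Bellman target of reward plus discounted best next value.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    # Bellman target: immediate reward plus the discounted value of the
    # best action available in the next state.
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    # Move the current estimate part of the way toward the target.
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]

q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
new_value = q_update(q_table, "s0", "right", reward=1.0, next_state="s1")  # 1.4
```

Repeating this update over many interactions with the environment is what lets the table approach the optimal Q-function.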

R-CNN

Region Based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision and specifically object detection.

Random Forest (RF)

An ensemble approach to finding the decision tree that best fits the training data by creating many decision trees and then determining the "average" one. The "random" part of the term refers to building each of the decision trees from a random selection of features; the "forest" refers to the set of decision trees.

Recurrent Neural Network (RNN)

A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.

An RNN models sequential interactions through a hidden state, or memory. It can take up to N inputs and produce up to N outputs. For example, an input sequence may be a sentence with the outputs being the part-of-speech tag for each word (N-to-N). An input could be a sentence, and the output a sentiment classification of the sentence (N-to-1). An input could be a single image, and the output could be a sequence of words corresponding to the description of an image (1-to-N). At each time step, an RNN calculates a new hidden state (“memory”) based on the current input and the previous hidden state. The “recurrent” stems from the fact that at each step the same parameters are used and the network performs the same calculations based on different inputs.
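The hidden state update can be sketched with scalar values. The tanh nonlinearity, the scalar weights w_x, w_h, and b, and the function name are assumptions made to keep the example small; real RNNs use weight matrices and vector-valued states.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    # The same parameters are reused at every time step.
    h = h0
    states = []
    for x in inputs:
        # New hidden state from the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.5, -1.0], w_x=0.5, w_h=0.8, b=0.0)
```

Keeping every state gives an N-to-N mapping, while using only the last state gives an N-to-1 mapping.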

Reinforcement Learning (RL)

A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment. For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses.

Representation

The process of mapping data to useful features.

ResNet

Deep Residual Networks won the ILSVRC 2015 challenge. These networks work by introducing shortcut connections across stacks of layers, allowing the optimizer to learn “easier” residual mappings instead of the more complicated original mappings. These shortcut connections are similar to Highway Layers, but they are data-independent and don’t introduce additional parameters or training complexity. ResNets achieved a 3.57% top-5 error rate on the ImageNet test set.

RNN

Abbreviation for recurrent neural networks.

SegNet

Similar to U-Net, but the decoder uses the pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling, unlike U-Net, in which the entire feature maps from the lower-resolution layers are passed to the higher-resolution layers.

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

SENET

Squeeze-and-Excitation Networks

Self Organized Mapping (SOM)

An SOM is a type of artificial neural network but is trained using competitive learning rather than the error-correction learning (e.g., backpropagation with gradient descent) used by other artificial neural networks. The SOM is an unsupervised machine learning technique used to produce a low-dimensional (typically two-dimensional) representation of a higher dimensional data set while preserving the topological structure of the data. For example, a data set with p variables measured in n observations could be represented as clusters of observations with similar values for the variables. These clusters then could be visualized as a two-dimensional "map" such that observations in proximal clusters have more similar values than observations in distal clusters. This can make high-dimensional data easier to visualize and analyze.

Self-Supervised Learning

A family of techniques for converting an unsupervised machine learning problem into a supervised machine learning problem by creating surrogate labels from unlabeled examples.

Some Transformer-based models such as BERT use self-supervised learning.

Self-supervised training is a semi-supervised learning approach.

Semi-Supervised Learning

Training a model on data where some of the training examples have labels but others don't. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.

Self-training is one technique for semi-supervised learning.

Sequence Model

A model whose inputs have a sequential dependence. For example, predicting the next video watched from a sequence of previously watched videos.

SqueezeNet

SqueezeNet is the name of a deep neural network for computer vision that was released in 2016.

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

Sparse Representation

A representation of a tensor that only stores nonzero elements.

For example, the English language consists of about a million words. Consider two ways to represent a count of the words used in one English sentence:

  • A dense representation of this sentence must set an integer for all one million cells, placing a 0 in most of them, and a low integer into a few of them.

  • A sparse representation of this sentence stores only those cells symbolizing a word actually in the sentence. So, if the sentence contained only 20 unique words, then the sparse representation for the sentence would store an integer in only 20 cells.
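The contrast between the two representations can be sketched with a deliberately tiny vocabulary. The seven-word vocabulary and the sample sentence are hypothetical stand-ins for the million-word case described above.

```python
from collections import Counter

# Hypothetical seven-word vocabulary standing in for the full language.
vocabulary = ["a", "cat", "dog", "sat", "the", "mat", "on"]
sentence = "the cat sat on the mat".split()

# Dense: one cell per vocabulary word, most of them zero in the real case.
dense = [sentence.count(word) for word in vocabulary]

# Sparse: store only the words that actually occur.
sparse = dict(Counter(sentence))
```

Here dense is [0, 1, 0, 1, 2, 1, 1], while sparse holds only five entries, one per unique word in the sentence. The gap grows with vocabulary size, since the dense list always has one cell per vocabulary word.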

Siamese

A Siamese neural network (sometimes called a twin neural network) is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.

Sparsity

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 10x10 matrix in which 98 cells contain zero. The calculation of sparsity is as follows:

$$\text{sparsity} = \frac{98}{100} = 0.98$$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.
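The calculation from the 10x10 example can be reproduced directly. The function name `sparsity` and the positions and values of the two nonzero cells are arbitrary choices for illustration.

```python
def sparsity(matrix):
    """Fraction of entries that are zero (or None) in a list-of-lists matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value is None or value == 0)
    return zeros / total


# A 10x10 matrix with exactly two nonzero cells, so 98 cells contain zero.
matrix = [[0] * 10 for _ in range(10)]
matrix[2][3] = 7
matrix[8][1] = 4

print(sparsity(matrix))  # 0.98
```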

Stochastic Configuration Networks

Stochastic configuration networks (SCNs) employ a supervisory mechanism to construct universal approximators automatically and quickly, and can achieve promising performance on regression problems.

Support Vector Machine (SVM)

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. SVMs are one of the most robust prediction methods, being based on statistical learning frameworks. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
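The idea behind the kernel trick can be checked numerically. The sketch assumes a degree-2 polynomial kernel on two-dimensional inputs, chosen because its explicit feature map is small enough to write out; the function names are illustrative. The kernel, computed in the original space, gives the same value as a dot product in the higher-dimensional feature space, so that space never has to be constructed.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel (x . y)^2, computed in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x, y = [1.0, 2.0], [3.0, 0.5]
implicit = poly_kernel(x, y)                     # no feature vectors built
explicit = dot(feature_map(x), feature_map(y))   # feature vectors built explicitly

print(implicit, explicit)  # both are 16 up to floating-point rounding
```

For kernels whose feature space is very large or infinite-dimensional, only the implicit computation is practical, which is why the trick matters.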

Temporal Convolutional Networks (TCN)

Temporal Convolutional Networks (TCNs) are convolutional neural networks with dilated convolutions, used particularly for modeling time series data.

Temporal Convolutional Networks, The Next Revolution for Time-Series?
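A rough sketch of the dilated convolution such a network is built from. It assumes the convolution is also causal (each output depends only on current and past time steps), which is common for time series but is not stated in the definition above. The function name, kernel values, and input series are hypothetical.

```python
def dilated_causal_conv(series, kernel, dilation):
    """1-D convolution where kernel tap k reads the input k * dilation steps back.

    Positions before the start of the series are treated as zero.
    """
    out = []
    for t in range(len(series)):
        total = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                total += weight * series[idx]
        out.append(total)
    return out


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Stacking layers with growing dilation widens the span of inputs each output sees.
layer1 = dilated_causal_conv(series, kernel=[1.0, 1.0], dilation=1)
layer2 = dilated_causal_conv(layer1, kernel=[1.0, 1.0], dilation=2)

print(layer1)  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
print(layer2)  # [1.0, 3.0, 6.0, 10.0, 14.0, 18.0]
```

With two taps per layer and dilations 1 and 2, each value in `layer2` summarizes up to four consecutive inputs, even though each layer only performs two multiplications per output.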

Supervised Learning

Training a model from input data and its corresponding labels. Supervised machine learning is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new (never-before-seen) questions on the same topic. Compare with unsupervised machine learning.

Transfer Learning

Transferring information from one machine learning task to another. For example, in multi-task learning, a single model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data.

Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in which a single program can solve multiple tasks.

Transformer

A neural network architecture developed at Google that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks. A Transformer can be viewed as a stack of self-attention layers.

A Transformer can include an encoder, a decoder, or both.

An encoder transforms a sequence of embeddings into a new sequence of the same length. An encoder includes N identical layers, each of which contains two sub-layers. These two sub-layers are applied at each position of the input embedding sequence, transforming each element of the sequence into a new embedding. The first encoder sub-layer aggregates information from across the input sequence. The second encoder sub-layer transforms the aggregated information into an output embedding.

A decoder transforms a sequence of input embeddings into a sequence of output embeddings, possibly with a different length. A decoder also includes N identical layers with three sub-layers, two of which are similar to the encoder sub-layers. The third decoder sub-layer takes the output of the encoder and applies an attention mechanism (often called cross-attention) to gather information from it.

The blog post Transformer: A Novel Neural Network Architecture for Language Understanding provides a good introduction to Transformers.
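The aggregation step performed by a self-attention sub-layer can be sketched as scaled dot-product attention. To stay short, the sketch assumes a single head and uses the embeddings themselves as queries, keys, and values; a real Transformer applies learned projections first. All names and input values are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Replace each embedding with a weighted sum over the whole sequence.

    Simplification: no learned query/key/value projections, single head.
    """
    d = len(embeddings[0])
    outputs = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in embeddings
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, embeddings)) for j in range(d)]
        )
    return outputs


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(sequence)
print(attended)
```

The output has the same length as the input, matching the description of the encoder: each position receives a new embedding that mixes in information from every other position.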

Translational Invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.

See also size invariance and rotational invariance.

U-Net

U-Net is a convolutional neural network that was developed for biomedical image segmentation. The network is based on the fully convolutional network and its architecture was modified and extended to work with fewer training images and to yield more precise segmentations.

U-Net: Convolutional Networks for Biomedical Image Segmentation

Underfitting

Producing a model with poor predictive ability because the model hasn't captured the complexity of the training data. Many problems can cause underfitting, including:

  • Training on the wrong set of features.

  • Training for too few epochs or at too low a learning rate.

  • Training with too high a regularization rate.

  • Providing too few hidden layers in a deep neural network.

Unsupervised Learning

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs together based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.

Compare with supervised machine learning.
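The clustering use case can be sketched with k-means, one common clustering algorithm. The entry above does not name a specific algorithm, so k-means is an illustrative choice. The example groups songs by a single hypothetical property (tempo in beats per minute), and the starting centers are picked by hand.

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points by alternating assignment and center updates."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical song tempos; no labels are provided.
tempos = [60.0, 62.0, 65.0, 120.0, 125.0, 128.0]
centers, clusters = kmeans_1d(tempos, centers=[60.0, 128.0])
print(clusters)  # [[60.0, 62.0, 65.0], [120.0, 125.0, 128.0]]
```

The algorithm is never told which songs are slow or fast; the two groups emerge from the data alone.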

Vanishing Gradient

The tendency for the gradients of early hidden layers of some deep neural networks to become surprisingly flat (low). Increasingly lower gradients result in increasingly smaller changes to the weights on nodes in a deep neural network, leading to little or no learning. Models suffering from the vanishing gradient problem become difficult or impossible to train. Long Short-Term Memory cells address this issue.

Compare to exploding gradient problem.
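A back-of-the-envelope sketch of why early-layer gradients shrink. It assumes a chain of sigmoid units with one unit per layer, where the chain rule multiplies one factor of weight times sigmoid derivative per layer. Since the sigmoid derivative is at most 0.25, the product decays quickly as depth grows. The function names and default values are hypothetical.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the sigmoid function; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(num_layers, z=0.0, weight=1.0):
    """Product of the per-layer factors that the chain rule accumulates."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_derivative(z)
    return scale


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this best case for the sigmoid (z = 0), ten layers shrink the gradient by a factor of about one million, so the earliest layers receive almost no learning signal.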

VGG

VGG refers to a convolutional neural network model that secured first and second place in the 2014 ImageNet localization and classification tracks, respectively. The VGG model consists of 16–19 weight layers and uses small convolutional filters of size 3×3 and 1×1.

WaveNet

A deep generative model of raw audio waveforms. Its authors report that WaveNets can generate speech that mimics any human voice and sounds more natural than the best Text-to-Speech systems existing at the time, reducing the gap with human performance by over 50%.

WaveNet: A generative model for raw audio

Wasserstein Loss

One of the loss functions commonly used in generative adversarial networks, based on the earth mover's distance between the distribution of generated data and real data.
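A minimal sketch of how this loss is commonly written in Wasserstein GAN training, assuming a critic that outputs an unbounded score for each example. The function names and score values are hypothetical, and the constraint normally placed on the critic (for example weight clipping or a gradient penalty) is omitted.

```python
def mean(values):
    return sum(values) / len(values)


def critic_loss(real_scores, fake_scores):
    """Loss the critic minimizes: it wants high scores on real data, low on fakes."""
    return mean(fake_scores) - mean(real_scores)


def generator_loss(fake_scores):
    """Loss the generator minimizes: it wants the critic to score fakes highly."""
    return -mean(fake_scores)


real_scores = [2.0, 4.0]  # critic outputs on real examples (made-up numbers)
fake_scores = [1.0, 1.0]  # critic outputs on generated examples (made-up numbers)

print(critic_loss(real_scores, fake_scores))  # -2.0
print(generator_loss(fake_scores))            # -1.0
```

The gap between the mean scores serves as an estimate of the earth mover's distance between the two distributions, which is why the scores are raw numbers and not probabilities.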

Weight

A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model.
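The effect of a zero weight is easy to verify in a small linear model. The weights and feature values below are made up for illustration.

```python
def linear_predict(features, weights, bias=0.0):
    """Prediction of a linear model: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))


weights = [2.0, 0.0, -1.0]  # the second feature has weight 0

a = linear_predict([1.0, 5.0, 3.0], weights)
b = linear_predict([1.0, 500.0, 3.0], weights)  # only the second feature differs

print(a, b)  # -1.0 -1.0
```

Changing the second feature from 5 to 500 leaves the prediction unchanged, because its weight of 0 removes it from the sum.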

Width

The number of neurons in a particular layer of a neural network.

YOLO

The “You Only Look Once,” or YOLO, family of models are a series of end-to-end deep learning models designed for fast object detection, developed by Joseph Redmon, et al. and first described in the 2015 paper titled “You Only Look Once: Unified, Real-Time Object Detection.”
