
The DagsHub Glossary

a

Active learning in Machine Learning

Active learning is a concept in machine learning that involves selecting the most informative data points for labeling to train a model efficiently. In traditional supervised learning, a model is trained on a large labeled dataset to learn the underlying patterns and relationships. However, obtaining labeled data can be a time-consuming and costly process, especially […]

Auto Labeling

What is Auto Labeling? Auto labeling, also known as automated labeling or automated data labeling, is the process of assigning labels or annotations to data automatically using machine learning techniques. It involves training machine learning models to recognize patterns and make accurate predictions on unlabeled data, thereby reducing the need for manual labeling by humans. […]

b

BERT

What is BERT? BERT, or Bidirectional Encoder Representations from Transformers, is a powerful NLP model that has revolutionized the field of natural language processing. BERT is built upon the Transformer architecture, which is a neural network architecture designed to process sequential data such as natural language text. In this section, we will dive deeper into […]

c

CI/CD for Machine Learning

What is CI/CD for ML? CI/CD (Continuous Integration and Continuous Delivery) is a set of software engineering practices that enable teams to deliver code changes to production rapidly and reliably. In the context of machine learning, CI/CD for ML is a set of principles and methods for continuously integrating and delivering code changes to production […]

COCO Dataset

Datasets play a fundamental role in computer vision and AI. Take, for instance, the COCO dataset, a name widely known within the community. Created to address the challenges of object recognition and segmentation, COCO has established itself as a cornerstone in […]

Computer vision

What is Computer Vision? Computer vision is an interdisciplinary field that focuses on enabling computers to interpret and understand visual data from the world around them. The primary goal of computer vision is to create algorithms and tools that enable computers to extract meaningful information from images, videos, and other types of visual data. Computer […]

Convolutional Neural Network

What is a Convolutional Neural Network? A Convolutional Neural Network (CNN) is a specialized type of artificial neural network that is primarily designed for processing and analyzing structured grid-like data, such as images and videos. CNNs have revolutionized the field of computer vision and are widely used for tasks such as image classification, object detection, […]

Cross Validation

What is Cross Validation? Cross validation is a statistical method used in machine learning to assess how well a predictive model generalizes to an independent data set. It helps to ensure the model’s effectiveness and accuracy by testing it on data that was not used to train it. The term “cross validation” […]
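
As an illustration of testing on data the model has not seen, here is a minimal sketch of k-fold splitting, one common form of cross validation. The excerpt does not name a specific variant, so the choice of k-fold and the function name `k_fold_indices` are assumptions made for this example.

```python
def k_fold_indices(n_samples, k):
    """Split sample indices 0..n_samples-1 into k (train, validation) pairs.

    Hypothetical helper: each sample appears in exactly one validation fold.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


folds = k_fold_indices(5, 2)
```

In practice the indices are usually shuffled before splitting, and the model is trained and scored once per fold, with the scores averaged at the end.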

d

DVC

What is DVC? Data Version Control (DVC) is a version control system designed for machine learning (ML) projects. It provides data versioning tools that help manage and track changes to ML models and datasets, making it easy to reproduce results and collaborate with others. DVC is based on Git, which is a popular version control […]

Data Registry

What is Data Registry? A data registry, also known as a data catalog or data inventory, is a centralized system or repository that stores and organizes metadata about various data assets within an organization. It serves as a comprehensive inventory of data sources, datasets, and data-related artifacts, providing a holistic view of an organization’s data […]

Data Versioning

What is Data Versioning? Data versioning is the practice of keeping track of changes made to a dataset over time. This means that each version of the dataset is saved and can be accessed at any time, even after subsequent changes have been made. Data versioning is commonly used in industries that require the management […]

Data-Centric AI

What is Data-Centric AI? Data-Centric AI (Artificial Intelligence) is an approach to AI that focuses on the data as the main driver of the machine learning (ML) process. In this approach, data is at the center of the AI pipeline, and all decisions and actions revolve around it. This means that data is not only […]

Decision Trees in Machine Learning

What are Decision Trees in Machine Learning? A Decision Tree is a supervised machine-learning algorithm predominantly used for classification problems. It is a tree-structured model of decisions where each node represents a feature (attribute), each link (branch) represents a decision rule, and each leaf represents an outcome (categorical or continuous value). The topmost node in a Decision Tree […]

e

Elo Rating System

The Elo Rating System is a method primarily devised for two-player games such as chess to calculate the relative skill levels of players. Conceived by Arpad Elo, this system assigns a numeric rating representing a player’s skill level. Beyond its traditional board game application, Elo’s principles have been extrapolated to evaluate and rank models, especially […]
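
To make the rating idea concrete, here is a minimal sketch of the standard Elo update for a single game. The excerpt does not give the formula, so the logistic expected-score form with a scale of 400 and the K-factor of 32 are assumptions taken from the commonly used chess convention; the function names are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=32.0):
    """Return new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


new_a, new_b = update_ratings(1500, 1500, 1.0)
```

The same update can be applied to pairwise comparisons between models by treating each comparison as one game.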

Epoch in Machine Learning

What is Epoch in Machine Learning? Epoch is a fundamental concept in machine learning, particularly in the field of deep learning, which refers to a single iteration or pass over the entire training dataset during the training phase of a machine learning model. In simpler terms, an epoch represents one complete cycle of the model […]

Experiment Tracking

Experiment tracking is a crucial aspect of machine learning (ML) and data science workflows. It involves systematically recording and organizing information about experiments conducted during model development, including hyperparameters, datasets, code versions, and results. Experiment tracking enables reproducibility, collaboration, and efficient management of ML experiments. In this glossary term, we will explore the key components […]

f

F1 Score

In classification problems, the F1 Score is a crucial metric for evaluating binary classification models, especially when data is imbalanced. Let’s delve deeper into understanding what it signifies and why it’s pivotal in the world of machine learning. What is the F1 Score? The F1 Score is the harmonic mean of […]

Feature Store

What is Feature Store? A feature store is a crucial component in the field of machine learning that serves as a repository for storing, managing, and serving machine learning features. It is designed to handle the entire lifecycle of features, from their creation and storage to their retrieval for model training and prediction purposes. The […]

g

GPU

Graphics Processing Unit (GPU): A Graphics Processing Unit (GPU), sometimes called a Graphical Processing Unit, is a specialized electronic circuit designed to rapidly manipulate and render images, animations, and videos. Originally developed for accelerating graphics rendering in computer games, GPUs have evolved into powerful parallel processing units capable of handling complex computations. In this […]

Gaussian Distribution

What is Gaussian Distribution? The Gaussian distribution, also known as the normal distribution, is defined by its probability density function, which gives the relative likelihood of a random variable taking a value near x. The formula is: f(x) = (1/√(2πσ²)) * e^(-(x-μ)²/(2σ²)), where e is the base of the natural logarithm, μ the mean, and σ the standard deviation, encapsulates […]
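
The density formula can be evaluated directly. The sketch below is a plain-Python transcription of it; the function name `gaussian_pdf` and the default values for the mean and standard deviation are assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the Gaussian probability density at x.

    mu is the mean and sigma the standard deviation (must be positive).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


peak = gaussian_pdf(0.0)  # density at the mean of the standard normal
```

Note that the returned value is a density, not a probability: probabilities come from integrating the density over an interval.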

Generative AI

What is Generative AI? Generative AI, or generative artificial intelligence, refers to a type of AI system that can create new and original content such as images, videos, music, and text. Unlike other types of AI systems that are designed to recognize patterns in existing data, generative AI models are trained on large datasets and […]

Generative Adversarial Network

In the dynamic landscape of artificial intelligence and machine learning, few innovations have captivated the imagination of researchers and practitioners quite like Generative Adversarial Networks (GANs). Conceived by Ian Goodfellow and his collaborators in 2014, GANs have emerged as a cornerstone in the domain of generative modeling, revolutionizing the synthesis of realistic data samples that […]

h

Holdout Set

In machine learning and data science, a holdout set, also known as holdout data or a holdout validation set, is a subset of data that is intentionally withheld from the model during the training process. It is used to evaluate the performance and generalization ability of the trained model on unseen data. […]

Human in the Loop Machine Learning

What is Human in the Loop Machine Learning? Human in the Loop Machine Learning, also called Human-Guided Machine Learning (HGML), combines human judgment with the operational efficiency of machines. This approach acknowledges that machines, despite their adeptness at processing vast data, often lack the finesse to fully comprehend the nuanced and contextual layers that the human […]

Hyperparameter Tuning

What is Hyperparameter Tuning? Hyperparameter tuning is the process of selecting the best hyperparameters for a machine learning model. Hyperparameters are parameters that are not learned by the model during training, but rather are set by the data scientist or machine learning engineer prior to training. Examples of hyperparameters include learning rate, batch size, and […]

i

Internal Covariate Shift

What is Internal Covariate Shift? Internal Covariate Shift refers to the phenomenon where the distribution of input values to a learning algorithm changes as the network’s parameters are updated during training. In other words, it occurs when the distribution of the input data to each layer of a neural network changes as the parameters of […]

j

JAX

JAX is an open-source numerical computing library that extends the capabilities of NumPy by enabling automatic differentiation. Born from the need to address modern machine learning problems, JAX provides the tools for high-performance machine learning research, especially in areas requiring gradients like optimization problems. What is JAX? At its core, JAX is like NumPy but […]

k

Kullback-Leibler (KL) Divergence

Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution diverges from a second expected probability distribution. Originating from information theory, KL divergence quantifies the difference between two probability distributions in terms of the number of extra bits required to code samples from one distribution using a code optimized […]
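
For discrete distributions, the definition above can be sketched in a few lines. The function name `kl_divergence` is hypothetical, and the base-2 logarithm is an assumption chosen so that the result is measured in bits, matching the "extra bits" interpretation.

```python
import math


def kl_divergence(p, q, base=2.0):
    """KL divergence D(p || q) between two discrete distributions.

    p and q are sequences of probabilities over the same outcomes.
    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i, base)
    return total


extra_bits = kl_divergence([1.0, 0.0], [0.5, 0.5])
```

KL divergence is not symmetric: swapping the two arguments generally gives a different value, so it is not a distance in the metric sense.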

l

Label Studio

What is Label Studio? Label Studio is an open-source data labeling and annotation tool developed by Heartex. It is designed to simplify the process of creating high-quality training data sets for machine learning models. Label Studio supports a wide range of annotation types, including text, image, video, and audio. One of the key features of […]

Learning Rate

What is Learning Rate? The learning rate is a pivotal factor in machine learning and a cornerstone of how models learn. It is a hyperparameter that adjusts the extent to which an algorithm modifies the […]

Linear Regression

Linear regression is a statistical modeling technique used to analyze the relationship between a dependent variable and one or more independent variables. It is a fundamental and widely used algorithm in machine learning and statistical analysis. Linear regression aims to find the best-fitting linear relationship between the variables, allowing for prediction and inference. What is […]

Low-Rank Adaptation (LoRA)

LoRA is a technique used to reduce the cost of fine-tuning large neural networks (NNs). Instead of updating the full weight matrices of a NN, LoRA keeps the pretrained weights frozen and learns a low-rank update to them, which greatly reduces the number of trainable parameters and the memory required for fine-tuning […]

m

ML Model Deployment

What is ML Model Deployment? In machine learning, deploying an ML model is a critical juncture. It’s not merely about building a model but about moving it into an operational environment where it interacts with other software systems and serves predictions. Picture this as the crescendo in a machine learning […]

MLOps

What is MLOps? MLOps, short for Machine Learning Operations, is an emerging practice that combines machine learning (ML) with DevOps principles to effectively manage and operationalize ML workflows. It focuses on streamlining the development, deployment, and maintenance of ML models in production environments. MLOps aims to bridge the gap between data science teams, responsible for […]

MLOps Monitoring

What is MLOps Monitoring? MLOps Monitoring refers to the practice of monitoring machine learning (ML) operations (MLOps) in order to ensure the performance, reliability, and compliance of ML models in production environments. It involves continuously monitoring various aspects of ML models, including data quality, model performance, and system behavior. MLOps monitoring provides valuable insights into […]

MLOps Platform

MLOps (Machine Learning Operations) has emerged as a critical discipline in the field of data science and machine learning. It focuses on the efficient and reliable deployment, monitoring, and management of machine learning models in production environments. To facilitate the MLOps process, organizations often rely on MLOps platforms, which provide a comprehensive set of tools […]

MLflow

What is MLflow? MLflow is an open-source platform designed to simplify the machine learning lifecycle. It provides a comprehensive set of tools and frameworks to manage and track the end-to-end ML development process, including experimentation, reproducibility, deployment, and collaboration. MLflow enables data scientists and ML engineers to focus on building and deploying models while maintaining […]

Machine Learning Reproducibility

What is Machine Learning Reproducibility? Machine Learning (ML) reproducibility refers to the ability to obtain consistent and reliable results when running ML experiments or workflows. It involves ensuring that the results obtained from a particular ML model or experiment can be replicated by others using the same data, code, and computational resources. Reproducibility is a […]

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is a commonly used metric in machine learning and statistics to measure the average magnitude of errors between predicted and actual values. It provides a straightforward and intuitive measure of the model’s accuracy and is particularly useful when dealing with continuous numerical data. This glossary term will explain what MAE is, […]

Mean Squared Error (MSE)

What is Mean Squared Error (MSE)? Mean Squared Error (MSE) is a commonly used statistical metric that measures the average squared difference between the predicted values and the actual values in a dataset. It is widely employed in various domains, including statistics, machine learning, and data analysis. MSE provides a quantitative measure of the accuracy […]
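
The definition translates directly into code. This is a minimal plain-Python sketch; the function name `mean_squared_error` and the sample values are chosen for illustration only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Only the third prediction is off (by 2), so the MSE is 2**2 / 3.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because errors are squared, a single large error raises the MSE far more than several small ones.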

Model Accuracy

What is Model Accuracy in Machine Learning? Model accuracy in machine learning refers to the degree to which the predictions made by a machine learning model align with the actual outcomes. It is a key metric used to evaluate the performance of a model, particularly in supervised learning scenarios where the true outcomes are known. […]

Model Fairness

Model fairness is a crucial aspect of machine learning and artificial intelligence (AI) that focuses on ensuring equitable and unbiased outcomes in predictive models. With the increasing adoption of AI technologies across various domains, it is essential to address the potential biases and discrimination that can arise from machine learning models. Model fairness aims to […]

Model Registry

What is a Model Registry? A model registry is a central repository that stores and manages machine learning models and their associated metadata throughout their lifecycle. It serves as a catalog and control center for organizing, versioning, and tracking ML models, enabling efficient collaboration, reproducibility, and governance within the machine learning operations (MLOps) workflow. A […]

Model Serving 

Model serving is an often overlooked but essential part of machine learning: it brings trained models into real-world use. It is the process by which a trained model is placed in a live environment, where it makes predictions from fresh input data. Picture model serving […]

n

NumPy

NumPy, short for Numerical Python, is the cornerstone of numerical computing in Python. Offering an array object called ndarray, it provides a set of fast mathematical operations that can be performed on arrays. NumPy’s simplicity and efficiency have cemented its position as the fundamental package for scientific computing with Python. What is NumPy? NumPy is […]

o

Object Detection

Object detection is a field of computer vision that has made remarkable progress with advances in machine learning and AI. It covers a computer system’s ability to identify and locate objects within images or video sequences. This innovation finds its utility in a spectrum […]

One-hot Encoding

One-hot encoding is a process used in machine learning and data science to convert categorical data into a format that can be provided to machine learning algorithms to improve predictions. While machines understand numbers, they do not comprehend categories or labels in the raw form. One-hot encoding is a technique that transforms these categories into […]
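
As a small illustration of the transformation, the sketch below encodes a list of category labels as binary vectors. The function name `one_hot_encode`, the alphabetical ordering of columns, and the color labels are assumptions made for this example.

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row has a single 1.

    Columns follow the sorted order of the distinct categories.
    """
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1  # mark the column for this value's category
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["red", "green", "red", "blue"])
# categories is ["blue", "green", "red"]; "red" becomes [0, 0, 1]
```

Each category gets its own column, so no artificial ordering is implied between categories, at the cost of one extra column per distinct value.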

Optuna

In the world of machine learning and deep learning, the right set of hyperparameters can make the difference between an average model and a state-of-the-art one. Optuna, a hyperparameter optimization framework, has rapidly gained popularity by making the task of finding these optimal hyperparameters efficient and user-friendly. What is Optuna? Optuna is an open-source Python […]

p

Parameter Efficient Fine Tuning (PEFT)

PEFT stands for Parameter Efficient Fine-Tuning, which is a technique used in deep learning to fine-tune pre-trained language models (PLMs) for downstream natural language processing tasks. The main goal of PEFT is to improve the performance of PLMs on specific tasks while reducing the number of task-specific parameters and computation required for fine-tuning. Traditional fine-tuning […]

Precision

Precision is a foundational term in the realm of machine learning and statistics, especially within classification tasks. When you’re evaluating a model’s performance, particularly in situations where false positives carry significant implications, precision becomes a vital metric. Let’s dive deep into the world of precision. What is Precision? Precision is one of the fundamental metrics […]

PyTorch

What is PyTorch? PyTorch is an open-source machine learning library that is widely used for developing and training deep learning models. It is based on the Torch library, which is a scientific computing framework with a focus on machine learning algorithms. PyTorch provides a flexible and intuitive interface that enables researchers and developers to build […]

q

Quantized Low-Rank Adaptation (QLoRA)

QLoRA is a technique used to reduce the memory needed to fine-tune neural networks while preserving accuracy. QLoRA combines the concepts of low-rank adaptation and quantization: the pretrained weights are stored in low precision and kept frozen, while small low-rank adapter matrices are trained on top of them. When is QLoRA Useful? QLoRA is useful in scenarios where neural networks need to be deployed on […]

r

Random Forest

What is Random Forest? Random Forest is a versatile and powerful ensemble learning method used in machine learning for both classification and regression tasks. It is based on the principle of combining the predictions of multiple individual decision trees to make more accurate and robust predictions. The term “Random Forest” was coined by Leo Breiman […]

Recall

Recall, often referred to as sensitivity or the true positive rate, holds a prime position in the evaluation metrics used for classification models. Primarily, it answers a critical question: “Of all the actual positive instances, how many were correctly predicted by the model?” This page provides an in-depth understanding of the recall metric and its […]
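
The question above corresponds to the ratio of true positives to all actual positives. Here is a minimal sketch for binary labels; the function name `recall` and the convention of returning 0.0 when there are no actual positives are assumptions made for this example.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positive instances that were predicted positive."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    false_negatives = sum(1 for t, p in pairs if t == positive and p != positive)
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return 0.0  # assumed convention: recall is undefined without positives
    return true_positives / actual_positives


# Three actual positives, two of them found by the model.
score = recall([1, 1, 1, 0, 0], [1, 0, 1, 1, 0])
```

False positives (the fourth example here) do not affect recall; they are what precision measures.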

Reinforcement Learning

Reinforcement Learning (RL) is a subfield of machine learning that focuses on developing algorithms and models capable of learning optimal behaviors through trial and error interactions with an environment. Inspired by how humans and animals learn from the consequences of their actions, reinforcement learning enables machines to learn by receiving feedback in the form of […]

Retrieval Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique used in natural language processing (NLP) and machine learning to improve the performance of language models by augmenting their input with information retrieved from external sources. The goal of RAG is to enable language models to generate more accurate, informative, and diverse responses by incorporating knowledge from various […]

s

Scale Machine Learning

What is Scale Machine Learning? Scale machine learning refers to the ability to apply machine learning techniques and algorithms to large volumes of data, complex models, and high-performance computing systems. It involves the process of handling and processing massive amounts of data efficiently, training and deploying complex models, and managing the computational resources required to […]

Supervised Learning

What is Supervised Learning? Supervised learning is a popular and widely used machine learning approach that involves training a model to make predictions or classify data based on labeled examples. In supervised learning, the algorithm learns from a given set of input-output pairs, where the inputs are the features or attributes of the data, and […]

Support Vector Machines (SVM)

What is a Support Vector Machine? A Support Vector Machine (SVM), also referred to as a support vector machine algorithm or a vector support machine, is a powerful supervised machine learning algorithm used for both classification and regression tasks. It is particularly effective in solving complex classification problems where the data is not linearly separable. […]

t

Training Data Sets in Machine Learning

What is a Training Data Set? A training data set, also known as a train set, is a fundamental component of machine learning. It is a collection of labeled examples that is used to train a machine learning model. The training data set consists of input data, which represents the features or attributes of the […]

Triplet Loss

Triplet loss is a loss function commonly used in machine learning for tasks like face recognition, image retrieval, and similarity learning. Its primary purpose is to train a model to learn embeddings (vector representations) of data points such that similar data points are closer together in the embedding space, while dissimilar ones are farther apart. […]
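
The idea of pulling similar points together and pushing dissimilar ones apart can be sketched for a single triplet. This example assumes Euclidean distance and a margin of 1.0; some formulations use squared distances instead, and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Positive is close (distance 1) and negative is far (distance 5): no loss.
good = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the embedding is penalized (5 - 1 + 1).
bad = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

During training this loss would be averaged over many triplets and minimized with respect to the embedding model's parameters.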

u

Unstructured Data

What is Unstructured Data? Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured data is typically non-textual, like images, audio, video and multi-modal data, but can also be textual data, for example in the case of LLM applications. This results in […]

Unsupervised Learning

What is Unsupervised Learning? Unsupervised learning is a machine learning approach in which a model learns patterns and relationships in data without explicit supervision or labeled examples. Unlike supervised learning, where the model learns from labeled data to make predictions or classify new instances, unsupervised learning focuses on extracting meaningful information and structures from unlabeled […]

v

Validation Set

What is a Validation Set? In machine learning, a validation set refers to a subset of data that is used to assess the performance of a trained model. It is distinct from the training set, which is used to train the model, and the test set, which is used to evaluate the final performance of […]

x

XGBoost

What is XGBoost? XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm known for its efficiency and accuracy in handling structured data. It is an optimized implementation of the gradient boosting algorithm, which is an ensemble learning technique that combines multiple weak predictive models to create a stronger model. XGBoost has gained significant popularity […]

y

YOLO Object Detection

What is YOLO Object Detection? YOLO, short for “You Only Look Once,” is an object detection algorithm that aims to detect and classify objects within an image or video in real-time. Unlike traditional object detection methods that rely on region proposal algorithms followed by classification, YOLO takes a different approach by performing both tasks simultaneously […]

z

Zero-Shot Learning

Zero-shot learning is an exciting and emerging field within machine learning that enables models to generalize and make predictions on unseen classes or tasks. Traditional machine learning approaches require a large amount of labeled data for training and are limited to making predictions only on classes or tasks seen during training. However, zero-shot learning pushes […]