What are ETL technologies?
Basic transformations improve data quality by removing errors, emptying data fields, or simplifying data. Examples of these transformations follow.
Data cleansing removes errors and maps source data to the target data format. For example, you can map empty data fields to the number 0, map the data value “Parent” to “P,” or map “Child” to “C.”
Deduplication in data cleansing identifies and removes duplicate records.
Format revision converts data, such as character sets, measurement units, and date/time values, into a consistent format. For example, a food company might have different recipe databases with ingredients measured in kilograms and pounds. ETL will convert everything to pounds.
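As a rough illustration, here is how these basic transformations might look in a short Python (pandas) sketch. The column names, mappings, and values are invented for the example:

```python
import pandas as pd

# Hypothetical source extract: column names and values are illustrative only.
source = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "role":        ["Parent", "Parent", "Child", "Child"],
    "balance":     [250.0, 250.0, None, 80.0],
    "weight_kg":   [1.0, 1.0, 0.5, 2.2],
})

# Data cleansing: map empty fields to 0 and map "Parent"/"Child" to "P"/"C".
cleansed = source.copy()
cleansed["balance"] = cleansed["balance"].fillna(0)
cleansed["role"] = cleansed["role"].map({"Parent": "P", "Child": "C"})

# Deduplication: identify and remove duplicate records.
deduped = cleansed.drop_duplicates().copy()

# Format revision: convert kilograms to pounds so every row uses one unit.
deduped["weight_lb"] = (deduped["weight_kg"] * 2.20462).round(2)
deduped = deduped.drop(columns=["weight_kg"])

print(deduped)
```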
Advanced transformations use business rules to optimize the data for easier analysis. Examples of these transformations follow.
Derivation applies business rules to your data to calculate new values from existing values. For example, you can convert revenue to profit by subtracting expenses or calculating the total cost of a purchase by multiplying the price of each item by the number of items ordered.
In data preparation, joining links the same data from different data sources. For example, you can find the total purchase cost of one item by adding the purchase value from different vendors and storing only the final total in the target system.
You can divide a column or data attribute into multiple columns in the target system. For example, if the data source saves the customer name as “Jane John Doe,” you can split it into a first, middle, and last name.
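A similar sketch for the advanced transformations, again with invented column names and values, might look like this:

```python
import pandas as pd

# Hypothetical order and vendor-price extracts; all names are illustrative.
orders = pd.DataFrame({
    "order_id":   [1, 2],
    "customer":   ["Jane John Doe", "Sam Lee Smith"],
    "unit_price": [19.99, 5.50],
    "quantity":   [3, 10],
    "expenses":   [20.00, 12.00],
})
vendor_prices = pd.DataFrame({
    "order_id":    [1, 1, 2],
    "vendor_cost": [10.0, 12.5, 30.0],
})

# Derivation: calculate new values from existing values.
orders["revenue"] = orders["unit_price"] * orders["quantity"]
orders["profit"] = orders["revenue"] - orders["expenses"]

# Joining: combine related data from another source, keeping only the total.
totals = vendor_prices.groupby("order_id")["vendor_cost"].sum()
orders = orders.join(totals.rename("total_vendor_cost"), on="order_id")

# Splitting: divide one attribute into several target columns.
orders[["first_name", "middle_name", "last_name"]] = (
    orders["customer"].str.split(" ", n=2, expand=True)
)

print(orders)
```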
Quick answer? ETL stands for "Extract, Transform, and Load."
In the world of data warehousing, if you need to bring data from multiple different data sources into one, centralized database, you must first extract the data from each source, then transform it into a consistent format, and finally load it into that central database.
ETL tools enable data integration strategies by allowing companies to gather data from multiple data sources and consolidate it into a single, centralized location. ETL tools also make it possible for different types of data to work together.
A typical ETL process collects and refines different types of data, then delivers the data to a data lake or data warehouse such as Amazon Redshift, Azure Synapse, or Google BigQuery.
ETL tools also make it possible to migrate data between a variety of sources, destinations, and analysis tools. As a result, the ETL process plays a critical role in producing business intelligence and executing broader data management strategies. We are also seeing the process of Reverse ETL become more common, where cleaned and transformed data is sent from the data warehouse back into the business application.
The ETL process consists of three steps that enable data integration from source to destination: data extraction, data transformation, and data loading.
Most businesses manage data from a variety of data sources and use a number of data analysis tools to produce business intelligence. To execute such a complex data strategy, the data must be able to travel freely between systems and apps.
Before data can be moved to a new destination, such as a data warehouse or data lake, it must first be extracted from its source. In this first step of the ETL process, structured and unstructured data is imported and consolidated into a single repository. Volumes of data can be extracted from a wide range of data sources, including existing databases and legacy systems, CRM and marketing platforms, analytics tools, and flat files.
Although it can be done manually with a team of data engineers, hand-coded data extraction can be time-intensive and prone to errors. ETL tools automate the extraction process and create a more efficient and reliable workflow.
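For illustration only, a hand-rolled extraction step might look something like the following Python sketch. The database file, CSV export, and API URL are placeholders, not real endpoints:

```python
import csv
import json
import sqlite3
import urllib.request

def extract_from_database(db_path: str) -> list[dict]:
    """Pull rows from a relational source (here a local SQLite file)."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute("SELECT * FROM customers").fetchall()
    return [dict(r) for r in rows]

def extract_from_csv(path: str) -> list[dict]:
    """Pull records from a flat-file export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_from_api(url: str) -> list[dict]:
    """Pull records from a JSON HTTP endpoint (e.g. a SaaS tool's export API)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def extract_all() -> list[dict]:
    # Consolidate everything into one staging list before transformation.
    # The file names and URL below are placeholders for real sources.
    return (
        extract_from_database("crm.db")
        + extract_from_csv("sales_export.csv")
        + extract_from_api("https://example.com/api/orders")
    )
```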
During this phase of the ETL process, rules and regulations can be applied that ensure data quality and accessibility. You can also apply rules to help your company meet reporting requirements. The process of data transformation comprises several sub-processes, including the data cleansing, deduplication, format revision, derivation, joining, and splitting transformations described above.
Transformation is generally considered to be the most important part of the ETL process. Data transformation improves data integrity — removing duplicates and ensuring that raw data arrives at its new destination fully compatible and ready to use.
The final step in the ETL process is to load the newly transformed data into a new destination (data lake or data warehouse). Data can be loaded all at once (full load) or at scheduled intervals (incremental load).
Full loading — In an ETL full loading scenario, everything that comes from the transformation assembly line goes into new, unique records in the data warehouse or data repository. Though there may be times this is useful for research purposes, full loading produces datasets that grow exponentially and can quickly become difficult to maintain.
Incremental loading — A less comprehensive but more manageable approach is incremental loading. Incremental loading compares incoming data with what’s already on hand, and only produces additional records if new and unique information is found. This architecture allows smaller, less expensive data warehouses to maintain and manage business intelligence.
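To make the difference concrete, here is a minimal Python sketch that contrasts a full load with an incremental load, using SQLite as a stand-in target and an invented orders table:

```python
import sqlite3

def full_load(conn, records):
    """Full load: append every transformed record as a new row."""
    conn.executemany(
        "INSERT INTO orders (order_id, total) VALUES (:order_id, :total)",
        records,
    )

def incremental_load(conn, records):
    """Incremental load: insert only records whose key is not already present."""
    existing = {row[0] for row in conn.execute("SELECT order_id FROM orders")}
    new_records = [r for r in records if r["order_id"] not in existing]
    conn.executemany(
        "INSERT INTO orders (order_id, total) VALUES (:order_id, :total)",
        new_records,
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, total REAL)")
batch = [{"order_id": 1, "total": 59.97}, {"order_id": 2, "total": 55.00}]
incremental_load(conn, batch)
incremental_load(conn, batch)  # second run adds nothing: no new keys found
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # -> 2
```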
Data strategies are more complex than they’ve ever been; SaaS gives companies access to data from more data sources than ever before. ETL tools make it possible to transform vast quantities of data into actionable business intelligence.
Consider the amount of raw data available to a manufacturer. In addition to the data generated by sensors in the facility and the machines on an assembly line, the company also collects marketing, sales, logistics, and financial data (often using a SaaS tool).
All of that data must be extracted, transformed, and loaded into a new destination for analysis. ETL enables data management, business intelligence, data analytics, and machine learning capabilities by:
Managing multiple data sets in a world of enterprise data demands time and coordination, and can result in inefficiencies and delays. ETL combines databases and various forms of data into a single, unified view. This makes it easier to aggregate, analyze, visualize, and make sense of large datasets.
ETL allows the combination of legacy enterprise data with data collected from new platforms and applications. This produces a long-term view of data so that older datasets can be viewed alongside more recent information.
ETL software automates the process of hand-coded data migration and ingestion, making it self-service. As a result, developers and their teams can spend more time on innovation and less time managing the painstaking task of writing code to move and format data.
ETL can be accomplished in one of two ways. In some cases, businesses may task developers with building their own ETL. However, this process can be time-intensive, prone to delays, and expensive.
Most companies today rely on an ETL tool as part of their data integration process. ETL tools are known for their speed, reliability, and cost-effectiveness, as well as their compatibility with broader data management strategies. ETL tools also incorporate a broad range of data quality and data governance features.
When choosing which ETL tool to use, you’ll want to consider the number and variety of connectors you’ll need as well as its portability and ease of use. You’ll also need to determine if an open-source tool is right for your business since these typically provide more flexibility and help users avoid vendor lock-in.
ELT is a modern take on the older process of extract, transform, and load, in which transformations take place before the data is loaded. Over time, running transformations before the load phase has been found to result in a more complex data replication process. While ELT serves the same purpose as ETL, the method has evolved for better processing.
Traditional ETL software extracts and transforms data from different sources before loading it into a data warehouse or data lake. With the introduction of the cloud data warehouse, there is no longer a need for data cleanup on dedicated ETL hardware before loading into your data warehouse or data lake. The cloud enables a push-down ELT architecture, in which the transform and load steps of the ETL pipeline are reordered.
If you are still on premises and your data isn't coming from several different sources, ETL tools still fit your data analytics needs. But as more businesses move to a cloud (or hybrid) data architecture, ELT processes are more adaptable and scale better to the evolving needs of cloud-based businesses.
ETL tools require processing engines for running transformations prior to loading data into a destination. On the other hand, with ELT, businesses use the processing engines in the destinations to efficiently transform data within the target system itself. This removal of an intermediate step streamlines the data loading process.
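As a rough sketch of the push-down idea, the following Python example uses SQLite as a stand-in for a cloud warehouse: raw data is loaded first, and the transformation is then expressed as SQL that the destination engine runs itself. The table and column names are invented:

```python
import sqlite3

# SQLite stands in for the cloud warehouse here; the pattern is the same
# with a real warehouse connector. Table and column names are illustrative.
warehouse = sqlite3.connect(":memory:")

# E + L: load the raw, untransformed data straight into the warehouse.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, unit_price REAL, quantity INTEGER)"
)
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 19.99, 3), (2, 5.50, 10)],
)

# T: push the transformation down to the destination's own SQL engine.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT order_id,
           unit_price * quantity AS revenue
    FROM raw_orders
""")
print(warehouse.execute("SELECT * FROM orders").fetchall())
```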
In computing, extract, transform, load (ETL) is a three-phase process where data is extracted, transformed (cleaned, sanitized, scrubbed) and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on recurring schedules either as single jobs or aggregated into a batch of jobs.
A properly designed ETL system extracts data from source systems and enforces data type and data validity standards and ensures it conforms structurally to the requirements of the output. Some ETL systems can also deliver data in a presentation-ready format so that application developers can build applications and end users can make decisions.[1]
The ETL process became a popular concept in the 1970s and is often used in data warehousing.[2] ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware. The separate systems containing the original data are frequently managed and operated by different stakeholders. For example, a cost accounting system may combine data from payroll, sales, and purchasing.
Data extraction involves extracting data from homogeneous or heterogeneous sources; data transformation processes the data by cleaning it and transforming it into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database such as an operational data store, a data mart, data lake or a data warehouse.[3][4]
ETL processing involves extracting the data from the source system(s). In many cases, this represents the most important aspect of ETL, since extracting data correctly sets the stage for the success of subsequent processes. Most data-warehousing projects combine data from different source systems. Each separate system may also use a different data organization and/or format. Common data-source formats include relational databases, XML, JSON and flat files, but may also include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even formats fetched from outside sources by means such as web spidering or screen-scraping. The streaming of the extracted data source and loading on-the-fly to the destination database is another way of performing ETL when no intermediate data storage is required.
An intrinsic part of the extraction involves data validation to confirm whether the data pulled from the sources has the correct/expected values in a given domain (such as a pattern/default or list of values). If the data fails the validation rules, it is rejected entirely or in part. The rejected data is ideally reported back to the source system for further analysis to identify and to rectify the incorrect records.
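A minimal validation step of this kind might look like the following Python sketch; the allowed domain values and the email pattern are invented rules for the example:

```python
import re

# Illustrative validation rules: an allowed value list and a simple pattern check.
VALID_ROLES = {"P", "C"}
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record: dict) -> list[str]:
    """Return a list of rule violations for one extracted record."""
    problems = []
    if record.get("role") not in VALID_ROLES:
        problems.append(f"role {record.get('role')!r} not in {sorted(VALID_ROLES)}")
    if not EMAIL_PATTERN.match(record.get("email", "")):
        problems.append(f"email {record.get('email')!r} fails pattern check")
    return problems

extracted = [
    {"id": 1, "role": "P", "email": "jane@example.com"},
    {"id": 2, "role": "X", "email": "not-an-email"},
]

accepted, rejected = [], []
for record in extracted:
    problems = validate(record)
    (rejected if problems else accepted).append((record, problems))

# Rejected rows would be reported back to the source system for correction.
for record, problems in rejected:
    print(f"rejected record {record['id']}: {problems}")
```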
In the data transformation stage, a series of rules or functions are applied to the extracted data in order to prepare it for loading into the end target.
An important function of transformation is data cleansing, which aims to pass only "proper" data to the target. The challenge when different systems interact lies in getting the relevant systems to interface and communicate. Character sets that may be available in one system may not be so in others.
In other cases, one or more transformation types may be required to meet the business and technical needs of the server or data warehouse, such as selecting only certain columns, translating coded values, deriving new calculated values, sorting, joining data from multiple sources, aggregating, or splitting a column into multiple columns.
The load phase loads the data into the end target, which can be any data store including a simple delimited flat file or a data warehouse.[5] Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals — for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse overwrites any data older than a year with newer data. However, the entry of data for any one year window is made in a historical manner. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the data warehouse. As the load phase interacts with a database, the constraints defined in the database schema — as well as in triggers activated upon data load — apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.
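As an illustration of the rolling one-year window described above, here is a small Python/SQLite sketch in which each batch is appended in historical form and anything older than a year is then overwritten (deleted). The table layout is invented:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL, loaded_at TEXT)")

def load_batch(conn, batch):
    """Append the new batch in historical form, then enforce the one-year window."""
    today = date.today().isoformat()
    conn.executemany(
        "INSERT INTO sales (sale_date, amount, loaded_at) VALUES (?, ?, ?)",
        [(sale_date, amount, today) for sale_date, amount in batch],
    )
    # Overwrite policy: anything older than a year is dropped from the warehouse.
    cutoff = (date.today() - timedelta(days=365)).isoformat()
    conn.execute("DELETE FROM sales WHERE sale_date < ?", (cutoff,))

load_batch(conn, [
    (date.today().isoformat(), 120.0),
    ((date.today() - timedelta(days=400)).isoformat(), 75.0),
])
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # old row pruned -> 1
```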
The typical real-life ETL cycle consists of a sequence of execution steps: the cycle is initiated, reference data is built, data is extracted from the sources, validated, transformed, and staged, audit reports are produced, the results are published to the target tables, and finally the run is archived and cleaned up.
ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems.
The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that must be managed by transform rules specifications, leading to an amendment of validation rules explicitly and implicitly implemented in the ETL process.
Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process to bring all the data together in a standard, homogeneous environment.
Design analysis[6] should establish the scalability of an ETL system across the lifetime of its usage — including understanding the volumes of data that must be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to multiple-day micro batch to integration with message queues or real-time change-data-capture for continuous transformation and update.
ETL vendors benchmark their record-systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit-network connections, and much memory.
In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indices. Thus, for better performance, it may make sense to employ a direct-path extract or bulk unload whenever possible (instead of querying the database row by row), to perform most of the transformation processing outside of the database, and to use bulk load operations whenever possible.
Still, even using bulk operations, database access is usually the bottleneck in the ETL process. Common methods used to increase performance include partitioning tables and their indices, doing all validation in the ETL layer before the load, disabling integrity checking and triggers in the target tables during the load (and re-enabling them afterwards), and dropping indexes before the load and recreating them once it completes.
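The effect of bulk operations and deferred index creation shows up even in a toy Python/SQLite example like the one below; real warehouses differ, but the pattern is the same:

```python
import sqlite3
import time

rows = [(i, float(i)) for i in range(200_000)]

# Load 1: row-by-row inserts into a table whose index must be maintained throughout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER, value REAL)")
conn.execute("CREATE INDEX idx_facts_id ON facts (id)")
start = time.perf_counter()
for row in rows:
    conn.execute("INSERT INTO facts VALUES (?, ?)", row)
print("row-by-row with live index:", round(time.perf_counter() - start, 2), "s")

# Load 2: bulk insert first, then build the index only after the data is in place.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER, value REAL)")
start = time.perf_counter()
conn.executemany("INSERT INTO facts VALUES (?, ?)", rows)
conn.execute("CREATE INDEX idx_facts_id ON facts (id)")
print("bulk load, index built afterwards:", round(time.perf_counter() - start, 2), "s")
```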
Whether to do certain operations in the database or outside may involve a trade-off. For example, removing duplicates using distinct may be slow in the database; thus, it makes sense to do it outside. On the other side, if using distinct significantly (x100) decreases the number of rows to be extracted, then it makes sense to remove duplications as early as possible in the database before unloading data.
A common source of problems in ETL is a large number of dependencies among ETL jobs. For example, job "B" cannot start while job "A" is not finished. One can usually achieve better performance by visualizing all processes on a graph and trying to reduce that graph while making maximum use of parallelism, keeping "chains" of consecutive processing as short as possible. Again, partitioning of big tables and their indices can really help.
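One way to sketch this in Python is to express the job dependencies as a graph, walk it in topological order, and run every job whose dependencies are already satisfied in parallel. The job names and dependencies below are invented:

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run_job(name: str) -> None:
    print(f"running ETL job {name}")

# Hypothetical dependency graph: job "B" cannot start until job "A" finishes.
dependencies = {
    "B": {"A"},        # B depends on A
    "C": {"A"},        # C depends on A
    "D": {"B", "C"},   # D depends on both B and C
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor() as pool:
    while sorter.is_active():
        ready = sorter.get_ready()                       # jobs whose dependencies are done
        futures = [pool.submit(run_job, job) for job in ready]
        for future, job in zip(futures, ready):
            future.result()                              # wait for the job to finish
            sorter.done(job)                             # unlock its dependents
```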
Another common issue occurs when the data are spread among several databases, and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases, and it can significantly slow down the whole process. The common solution is to reduce the processing graph to only three layers: sources, a central ETL layer, and targets.
This approach allows processing to take maximum advantage of parallelism. For example, if you need to load data into two databases, you can run the loads in parallel (instead of loading into the first — and then replicating into the second).
Sometimes processing must take place sequentially. For example, dimensional (reference) data are needed before one can get and validate the rows for main "fact" tables.
A recent development in ETL software is the implementation of parallel processing. It has enabled a number of methods to improve overall performance of ETL when dealing with large volumes of data.
ETL applications implement three main types of parallelism: data parallelism, which splits a single large file into smaller files so they can be processed concurrently; pipeline parallelism, which runs several components simultaneously on the same data stream; and component parallelism, which runs multiple processes simultaneously on different data streams within the same job.
All three types of parallelism usually operate combined in a single job or task.
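As a small illustration of data parallelism, the sketch below splits one input into chunks and transforms them on separate worker processes; the transformation itself is invented:

```python
from multiprocessing import Pool

def transform_chunk(chunk: list[dict]) -> list[dict]:
    """Apply the same transformation logic to one slice of the data."""
    return [{**row, "revenue": row["unit_price"] * row["quantity"]} for row in chunk]

if __name__ == "__main__":
    rows = [{"unit_price": 10.0, "quantity": i} for i in range(1_000)]
    # Data parallelism: split one large input into smaller chunks and
    # transform the chunks concurrently on separate worker processes.
    chunks = [rows[i:i + 250] for i in range(0, len(rows), 250)]
    with Pool() as pool:
        transformed = [row for chunk in pool.map(transform_chunk, chunks) for row in chunk]
    print(len(transformed))
```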
An additional difficulty comes with making sure that the data being uploaded is relatively consistent. Because multiple source databases may have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points becomes necessary.
Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with a "row_id" and each piece of the process with a "run_id". In case of a failure, having these IDs helps to roll back and rerun the failed piece.
Best practice also calls for checkpoints, which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, etc.
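A toy version of this bookkeeping in Python might tag rows and record a checkpoint like this; the checkpoint file location and phase names are placeholders:

```python
import json
import uuid
from pathlib import Path

CHECKPOINT_FILE = Path("etl_checkpoint.json")  # illustrative location

def tag_rows(rows: list[dict], run_id: str) -> list[dict]:
    """Tag every row with a row_id and the run_id of the piece that produced it."""
    return [{**row, "row_id": str(uuid.uuid4()), "run_id": run_id} for row in rows]

def write_checkpoint(run_id: str, phase: str) -> None:
    """Record that a phase completed, so a failed run can restart from this point."""
    CHECKPOINT_FILE.write_text(json.dumps({"run_id": run_id, "completed_phase": phase}))

run_id = str(uuid.uuid4())
rows = tag_rows([{"order_id": 1}, {"order_id": 2}], run_id)
write_checkpoint(run_id, "transform")
print(rows)
print(CHECKPOINT_FILE.read_text())
```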
As of 2010, data virtualization had begun to advance ETL processing. The application of data virtualization to ETL made it possible to solve the most common ETL tasks of data migration and application integration for multiple dispersed data sources. Virtual ETL operates with the abstracted representation of the objects or entities gathered from the variety of relational, semi-structured, and unstructured data sources. ETL tools can leverage object-oriented modeling and work with entities' representations persistently stored in a centrally located hub-and-spoke architecture. Such a collection that contains representations of the entities or objects gathered from the data sources for ETL processing is called a metadata repository and it can reside in memory[7] or be made persistent. By using a persistent metadata repository, ETL tools can transition from one-time projects to persistent middleware, performing data harmonization and data profiling consistently and in near-real time.[8]
Unique keys play an important part in all relational databases, as they tie everything together. A unique key is a column that identifies a given entity, whereas a foreign key is a column in another table that refers to a primary key. Keys can comprise several columns, in which case they are composite keys. In many cases, the primary key is an auto-generated integer that has no meaning for the business entity being represented, but solely exists for the purpose of the relational database; such a key is commonly referred to as a surrogate key.
As there is usually more than one data source getting loaded into the warehouse, the keys are an important concern to be addressed. For example: customers might be represented in several data sources, with their Social Security Number as the primary key in one source, their phone number in another, and a surrogate in the third. Yet a data warehouse may require the consolidation of all the customer information into one dimension.
A recommended way to deal with the concern involves adding a warehouse surrogate key, which is used as a foreign key from the fact table.[9]
Usually, updates occur to a dimension's source data, which obviously must be reflected in the data warehouse.
If the primary key of the source data is required for reporting, the dimension already contains that piece of information for each row. If the source data uses a surrogate key, the warehouse must keep track of it even though it is never used in queries or reports; it is done by creating a lookup table that contains the warehouse surrogate key and the originating key.[10] This way, the dimension is not polluted with surrogates from various source systems, while the ability to update is preserved.
The lookup table is used in different ways depending on the nature of the source data. There are five types to consider;[10] three are included here: type 1, where the dimension row is simply overwritten to match the current state of the source system and no history is kept; type 2, where a new dimension row with a new surrogate key is added for each change, preserving history; and fully logged, where a new row is added and the previous row is updated to record that it is no longer active and when it was deactivated.
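A highly simplified Python sketch of the surrogate-key lookup and of type 1 versus type 2 handling might look as follows; real implementations also track effective dates and active flags, which are omitted here, and all names are invented:

```python
# Toy dimension handling: a lookup table maps (source system, source key) to the
# warehouse surrogate key, and updates are applied as type 1 or type 2 changes.
lookup = {}          # (source_system, source_key) -> surrogate_key
dimension = {}       # surrogate_key -> customer attributes
next_surrogate = 1

def upsert_customer(source_system, source_key, attributes, scd_type=1):
    global next_surrogate
    key = (source_system, source_key)
    if key not in lookup or scd_type == 2:
        # New customer, or type 2: add a fresh row under a new surrogate key.
        # (A real type 2 implementation would also keep the old row and flag it inactive.)
        surrogate = next_surrogate
        next_surrogate += 1
        lookup[key] = surrogate
        dimension[surrogate] = dict(attributes)
    else:
        # Type 1: overwrite the existing row in place; no history is kept.
        dimension[lookup[key]].update(attributes)
    return lookup[key]

upsert_customer("crm", "SSN-123", {"name": "Jane Doe", "city": "Austin"})
upsert_customer("billing", "555-0100", {"name": "Jane Doe", "city": "Austin"})
upsert_customer("crm", "SSN-123", {"city": "Dallas"}, scd_type=1)  # overwrite in place
print(lookup)
print(dimension)
```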
An established ETL framework may improve connectivity and scalability. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities. A common use case for ETL tools is converting CSV files to formats readable by relational databases. A typical translation of millions of records is facilitated by ETL tools that enable users to input CSV-like data feeds/files and import them into a database with as little code as possible.
ETL tools are typically used by a broad range of professionals, from students in computer science looking to quickly import large data sets to database architects in charge of company account management. They have become a convenient way to get maximum performance, and in most cases they include a GUI that helps users transform data with a visual data mapper, as opposed to writing large programs to parse files and modify data types.
While ETL tools have traditionally been for developers and IT staff, research firm Gartner wrote that the new trend is to provide these capabilities to business users so they can themselves create connections and data integrations when needed, rather than going to the IT staff.[11] Gartner refers to these non-technical users as Citizen Integrators.[12]
Extract, load, transform (ELT) is a variant of ETL where the extracted data is loaded into the target system first.[13] The architecture for the analytics pipeline must also consider where to cleanse and enrich data[13] as well as how to conform dimensions.[1]
Ralph Kimball and Joe Caserta's book The Data Warehouse ETL Toolkit, (Wiley, 2004), which is used as a textbook for courses teaching ETL processes in data warehousing, addressed this issue.[14]
Cloud-based data warehouses like Amazon Redshift, Google BigQuery, and Snowflake Computing have been able to provide highly scalable computing power. This lets businesses forgo preload transformations and replicate raw data into their data warehouses, where they can transform it as needed using SQL.
After having used ELT, data may be processed further and stored in a data mart.[15]
There are pros and cons to each approach.[16] Most data integration tools skew towards ETL, while ELT is popular in database and data warehouse appliances. Similarly, it is possible to perform TEL (Transform, Extract, Load) where data is first transformed on a blockchain (as a way of recording changes to data, e.g., token burning) before extracting and loading into another data store.[17]
Organizations leverage data at scale: they have data on customers, employees, products, and services that must all be standardized and shared across various teams and systems. This information may even be made available to external partners and vendors.
To achieve this highly scaled information sharing and avoid data silos, organizations turn to the extract, transform, and load (ETL) practice for formatting, passing, and storing data between systems. With the large volumes of data organizations are handling between all their business processes, ETL tools can standardize and scale their data pipelines.
ETL tools are software designed to support ETL processes: extracting data from disparate sources, scrubbing data for consistency and quality, and consolidating this information into data warehouses. If implemented correctly, ETL tools simplify data management strategies and improve data quality by providing a standardized approach to intake, sharing, and storage.
ETL tools support data-driven organizations and platforms. For example, customer-relationship management (CRM) platforms' central advantage is that all business activities are conducted through the same interface. This allows CRM data to be easily shared between teams to provide a more holistic view of business performance and progress toward goals.
Next, let's examine the four types of ETL tools available.
ETL tools can be grouped into four categories based on their infrastructure and supporting organization or vendor. These categories — enterprise-grade, open-source, cloud-based, and custom ETL tools — are defined below.
Enterprise software ETL tools are developed and supported by commercial organizations. These solutions tend to be the most robust and mature in the marketplace since these companies were the first to champion ETL tools. This includes offering graphical user interfaces (GUIs) for architecting ETL pipelines, support for most relational and non-relational databases, and extensive documentation and user groups.
Though they offer more functionality, enterprise software ETL tools will typically have a larger price tag and require more employee training and integration services to onboard due to their complexity.
With the rise of the open-source movement, it’s no surprise that open-source ETL tools have entered the marketplace. Many of these tools are free and offer GUIs for designing data-sharing processes and monitoring the flow of information. A distinct advantage of open-source solutions is that organizations can access the source code to study the tool's infrastructure and extend its capabilities.
However, open-source ETL tools can vary in upkeep, documentation, ease of use, and functionality since they are not usually supported by commercial organizations.
Following the widespread adoption of cloud and integration-platform-as-a-service technologies, cloud service providers (CSPs) now offer ETL tools built on their infrastructure.
A specific advantage of cloud-based ETL tools is efficiency. Cloud technology provides low latency, high availability, and elasticity so that computing resources scale to meet the data processing demands at that time. If the organization also stores its data using the same CSP, then the pipeline is further optimized because all processes take place within a shared infrastructure.
A drawback of cloud-based ETL tools is that they only work within the CSP's environment. They do not support data stored in other clouds or on-premise data centers without first being shifted onto the provider's cloud storage.
Companies with development resources may produce proprietary ETL tools using general programming languages. The key advantage of this approach is the flexibility to build a solution customized to the organization's priorities and workflows. Popular languages for building ETL tools include SQL, Python, and Java.
The largest drawback of this approach is the internal resources required to build out a custom ETL tool, including testing, maintenance, and updates. An additional consideration is the training and documentation to onboard new users and developers who will all be new to the platform.
Now that you understand what ETL tools are and the categories of tools available, let's examine how to evaluate these solutions for the ideal fit for your organization's data practices and use cases.
Every organization has a unique business model and culture, and the data that a company collects and values will reflect this. However, there are common criteria you can measure ETL tools against that are relevant to any organization, such as the number and variety of connectors, portability, ease of use, and cost.
Next, let’s examine individual tools to power your ETL pipelines and group them by the types discussed above.
Price: Free 14-day trial & flexible paid plans available
Type: Cloud
Integrate.io is a leading low-code data integration platform with a robust offering (ETL, ELT, API Generation, Observability, Data Warehouse Insights) and hundreds of connectors to build and manage automated, secure pipelines in minutes. It delivers constantly refreshed data to help you produce actionable, data-backed insights, such as lowering your customer acquisition cost (CAC), increasing your return on ad spend (ROAS), and driving go-to-market success.
The platform is highly scalable with any data volume or use case, while enabling you to easily aggregate data to warehouses, databases, data stores, and operational systems.
Price: Free trial with paid plans available
Type: Enterprise
IBM DataStage is a data integration tool built around a client-server design. From a Windows client, tasks are created and executed against a central data repository on a server. The tool is designed to support ETL and extract, load, and transform (ELT) models and supports data integrations across multiple sources and applications while maintaining high performance.
IBM DataStage is built for on-premise deployment and is also available in a cloud-enabled version: DataStage for IBM Cloud Pak for Data.
Price: Pricing available on request
Type: Enterprise
Oracle Data Integrator (ODI) is a platform designed to build, manage, and maintain data integration workflows across organizations. ODI supports the full spectrum of data integration requests from high-volume batch loads to service-oriented architecture data services. It also supports parallel task execution for faster data processing and offers built-in integrations with Oracle GoldenGate and Oracle Warehouse Builder.
ODI and other Oracle solutions can be monitored through the Oracle Enterprise Manager for greater visibility across the toolstack.
Price: $60/month for Standard Select plan; $120/month for Starter plan; $180/month for Standard plan; $240/month for Enterprise plan
Type: Enterprise
Fivetran aims to add convenience to your data management process with its platform of handy tools. The easy-to-use software keeps up with API updates and pulls the latest data from your database in minutes.
In addition to ETL tools, Fivetran offers data security services, database replication, and 24/7 support. Fivetran prides itself on its nearly perfect uptime, giving you access to its team of engineers at a moment's notice.
Price: Free 14-day trial with paid plans available
Type: Cloud
Coupler.io is an all-in-one data analytics and automation platform that enables businesses to fully leverage their data. In a nutshell, it helps to gather, transform, and analyze data flows. The foundation of the platform is the no-code ETL solution that can be used without technical skills. You can export and blend data from various business applications to data warehouses or spreadsheets. It can also help to automate your reporting by refreshing data on a schedule. Organizations can use this tool to collect, track, and streamline business metrics by creating live dashboards.
In addition, Coupler.io offers data analytics services and can build custom connectors on request. Coupler.io even comes with an integration with HubSpot that allows you to automatically export data from HubSpot and other apps to Google Sheets, Excel, Google BigQuery, and other destinations on a set schedule.
Price: Pricing available on request
Type: Enterprise
SAS Data Management is a data integration platform built to connect with data wherever it exists, including cloud, legacy systems, and data lakes. These integrations provide a holistic view of the organization's business processes. The tool optimizes workflows by reusing data management rules and empowering non-IT stakeholders to pull and analyze information within the platform.
SAS Data Management is also flexible, working in a variety of computing environments and databases and integrating with third-party data modeling tools to produce compelling visualizations.
Price: Free
Type: Open Source
Talend Open Studio is an open-source tool designed to rapidly build data pipelines. Data components can be connected to run jobs through Open Studio's drag-and-drop GUI from Excel, Dropbox, Oracle, Salesforce, Microsoft Dynamics, and other data sources. Talend Open Studio has built-in connectors to pull information from diverse environments, including relational database management systems, software-as-a-service platforms, and packaged applications.
Price: Pricing available on request
Type: Open Source
Pentaho Data Integration (PDI) manages data integration processes, including capturing, cleansing, and storing data in a standardized and consistent format. The tool also shares this information with end users for analysis, and it supports data access for IoT technologies to facilitate machine learning.
PDI also offers the Spoon desktop client for building transformations, scheduling jobs, and manually initiating processing tasks when needed.
Price: Free
Type: Open Source
Singer is an open-source scripting technology built to enhance data transfer between an organization's applications and storage. Singer defines the relationship between data extraction and data loading scripts, allowing information to be pulled from any source and loaded to any destination. The scripts use JSON so that they are accessible in any programming language and also support rich data types and enforce data structures through JSON Schema.
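For a rough idea of what a Singer-style tap emits, the sketch below writes SCHEMA, RECORD, and STATE messages as JSON lines to stdout; the stream name and schema are invented for the example:

```python
import json
import sys

# A Singer-style tap writes a stream of JSON messages to stdout; a target
# reads them and loads the records. Stream name and schema are illustrative.
def emit(message: dict) -> None:
    sys.stdout.write(json.dumps(message) + "\n")

emit({
    "type": "SCHEMA",
    "stream": "users",
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
})
emit({"type": "RECORD", "stream": "users", "record": {"id": 1, "email": "jane@example.com"}})
emit({"type": "STATE", "value": {"users": {"last_id": 1}}})
```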
Price: Free
Type: Open Source
The Apache Hadoop software library is a framework designed to support processing large data sets by distributing the computational load across clusters of computers. The library is designed to detect and handle failures at the application layer versus the hardware layer, providing high availability while combining the computing power of multiple machines. Through the Hadoop YARN module, the framework also supports job scheduling and cluster resource administration.
Price: Free with paid plans available
Type: Cloud
Dataddo is a no-code, cloud-based ETL platform that enables technical and non-technical users to flexibly integrate data. It offers a wide range of connectors, fully customizable metrics, a central system for simultaneous management of all data pipelines, and can be seamlessly incorporated into existing technology architecture.
Users can deploy pipelines within minutes of account creation and all API changes are managed by the Dataddo team, so pipelines require no maintenance. New connectors can be added within 10 business days upon request. The platform is GDPR, SOC2, and ISO 27001 compliant.
Price: Free with paid plans available
Type: Cloud
AWS Glue is a cloud-based data integration service that offers both visual and code-based clients to serve technical and non-technical business users. The serverless platform provides additional functions such as the AWS Glue Data Catalog for finding data across the organization and AWS Glue Studio for visually designing, executing, and maintaining ETL pipelines.
AWS Glue also supports custom SQL queries for more hands-on data interactions.
Price: Free trial with paid plans available
Type: Cloud
Azure Data Factory is a serverless data integration service built on a pay-as-you-go model that scales to meet computing demands. The service offers both no-code and code-based interfaces and can pull data from more than 90 built-in connectors. In addition, Azure Data Factory integrates with Azure Synapse Analytics to provide advanced data analysis and visualization.
The platform also supports Git for version control and continuous integration/continuous deployment workflows for DevOps teams.
Price: Free trial with paid plans available
Type: Cloud
Google Cloud Dataflow is a fully managed data processing service built to optimize computing power and automate resource management. The service is focused on reducing processing costs through flexible scheduling and automatic resource scaling to ensure usage matches needs. In addition, Google Cloud Dataflow offers AI capabilities to power predictive analysis and real-time anomaly detection as the data is transformed.
Price: Free trial with paid plans available
Type: Cloud
Stitch is a data integration service designed to source data from 130+ platforms, services, and applications. The tool centralizes this information in a data warehouse without requiring any manual coding. Stitch is open source, allowing development teams to extend the tool to support additional sources and features. In addition, Stitch focuses on compliance, providing the power to analyze and govern data to meet internal and external requirements.
Price: Free trial with paid plans available
Type: Enterprise
Informatica PowerCenter is a metadata-driven platform focused on improving collaboration between the business and IT teams and streamlining data pipelines. PowerCenter parses advanced data formats, including JSON, XML, PDF, and Internet of Things machine data, and automatically validates transformed data to enforce defined standards.
The platform also has pre-built transformations for ease of use, and it offers high availability and optimized performance to scale to meet computing demands.
Price: Starts free; $15/month for Basic plan; $79/month for Standard plan; $399/month for Professional Plan
Type: Open Source
Skyvia creates a data sync that is fully customizable. You decide exactly what you want to extract, including custom fields and objects. There's also no need to customize your data's structure as Skyvia operates on autogenerated primary keys.
Skyvia also allows users to import data to cloud apps and databases, replicate cloud data, and export data to CSV for sharable access.
Price: Free for manual syncs with unlimited data volumes; $200/month for scheduled transfers
Type: Enterprise
Portable is built to serve the long-tail connectors that Fivetran doesn't support. It offers 300+ ETL connectors and creates custom integrations on request, focusing exclusively on the long-tail business applications that other data integration tools don't cover.
For teams without the resources to create and maintain hard-to-find connectors, Portable creates an easy way to integrate all of your business data.
Extract, transform, and load (ETL) is the process of combining data from multiple sources into a large, central repository called a data warehouse. ETL uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning (ML).