![Pentaho Data Integration transformation example](https://miro.medium.com/max/1838/1*fENQN5AZrX_QI3Xl6GtBJg.jpeg)
![Pentaho Data Integration Spoon screenshot](https://i.ytimg.com/vi/fqMzUVs3DuI/maxresdefault.jpg)
The “E” and “L” in ETL stand for Extracting data from sources and Loading data to the target. In a typical data warehouse environment we’ll probably want to extract data from various sources and load it into one target – the data warehouse. Spoon has a large number of connectors included, and you can also create custom connectors. Some of the main data connectors are: database tables (almost any relational database you can think of is supported), CSV files, spreadsheets, Google Analytics, AWS S3, email messages and XML files.

A classic case in an ELT approach is to load a CSV file, as-is, into a database table. In such a case all we need to do is define the CSV file as the source and a database table as the target. Spoon can detect field names and try to create the mapping between source and target automatically, or you can do it yourself. Furthermore, if the target table does not exist, Spoon can generate a SQL script for you to run and create it.

In other cases our data flow may not be that simple. You may want to transform the data before loading it. There are many transformations included in the product. For example, you can sort your data using the “sort rows” transformation. You may want to split a source field into several fields, for instance splitting a date field into day, month and year fields; you can do that using the “Split Fields” transformation. Another example is adding a calculated field using the “calculator” transformation. These are just a few of the available transformations.

When you build transformations, you’ll create hops, which connect your sources, targets and transformations. Hops are the arrows in the screenshots above, where you can see a transformation that takes two CSV files, one holding incoming flight records and the other outgoing flights. Both files are sorted using the sort rows transformation and merged together using a merge join transformation.
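The sort-then-merge-join pattern isn’t unique to PDI. As a rough command-line analogy (not PDI itself), the Unix `sort` and `join` utilities do the same thing to two CSV files keyed on their first column; the file names and flight counts below are made up for illustration:

```shell
# Two tiny sample CSV files keyed by airport code (illustrative data).
printf 'TLV,12\nJFK,30\nLHR,25\n' > incoming.csv
printf 'JFK,28\nTLV,14\nCDG,9\n'  > outgoing.csv

# "sort rows": both inputs must be sorted on the join key first.
sort -t, -k1,1 incoming.csv > incoming_sorted.csv
sort -t, -k1,1 outgoing.csv > outgoing_sorted.csv

# "merge join": inner-join the two sorted streams on column 1.
join -t, -1 1 -2 1 incoming_sorted.csv outgoing_sorted.csv
# JFK,30,28
# TLV,12,14
```

Like PDI’s merge join step, `join` assumes sorted input, which is why the explicit sort comes first.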
It is not just one product, but rather a group of programs responsible for different parts of the ETL solution. PDI is written in Java and therefore can run on Unix/Linux/Mac and Windows. The main program, where ETL development is done, is called “Spoon”. This is the development environment for your ETL processes. It comes with a simple-to-use drag-and-drop UI which enables the creation of data and control flows, or “transformations” and “jobs”, which will be explained shortly. Spoon is used to define data connections, create data transformations and control flows. Once transformations and jobs are designed using Spoon, “Pan” can be used to execute transformations from the command line, and “Kitchen” does the same with jobs.
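For example, once a transformation and a job have been saved from Spoon, they might be run from a terminal like this (the file paths and parameter name are made up for illustration, and the exact flags can vary between PDI versions, so check the documentation for yours):

```shell
# Run a transformation with Pan (path is illustrative).
./pan.sh -file=/pdi/transformations/load_flights.ktr -level=Basic

# Run a job with Kitchen, passing a named parameter (name is illustrative).
./kitchen.sh -file=/pdi/jobs/nightly_load.kjb -param:LOAD_DATE=2020-01-31
```

This is what makes PDI easy to schedule: a cron entry or any scheduler that can run a shell command can run your jobs.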
![pdi pentaho data integration pdi pentaho data integration](https://i.stack.imgur.com/p89kG.png)
Once you’ve configured it correctly though, these are very strong products. I haven’t yet had the chance to work with Talend Open Studio, but I’ve worked with PDI on three different projects, and I’m very impressed. In fact, I’m much more impressed with PDI than I am with Microsoft’s SSIS, which I’ve used a lot and which costs much more. The company has built a Business Intelligence product suite. I’m not a technology historian, but my understanding is that at the beginning they were focusing on open-source offerings and then started to build premium services on top. I first heard of Pentaho when I interviewed for a BI manager role in 2010, and the company I interviewed with was already using the community edition. I ended up starting a new role elsewhere and didn’t look at Pentaho again until 2016, but that’s beside the point. In 2015, Pentaho was acquired by Hitachi, and since then it seems like the community edition is being more and more hidden, which is understandable, as Hitachi is trying to maximize revenue. In the next few sections I’ll briefly go over the product components.
The problem with both is that sometimes it isn’t easy to configure them and start using them quickly. It’s not bad, but I’m sure it drives people away in some cases. Both products ask you to install a specific Java version and to configure it properly.
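For instance, on Linux/Mac the PDI launch scripts look for a JVM through environment variables before starting, so pointing them at the right Java is usually the first hurdle. A minimal sketch (the JVM path is an example; adjust it to your install):

```shell
# Point PDI at a specific JVM; its startup scripts check
# PENTAHO_JAVA_HOME first and fall back to JAVA_HOME.
export PENTAHO_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JAVA_HOME="$PENTAHO_JAVA_HOME"

# Then launch the Spoon UI from the PDI installation directory.
./spoon.sh
```

Getting this right once, ideally in a shell profile, removes most of the setup friction described above.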
This article assumes basic knowledge about what a data warehouse is and what ETL and ELT are. If you are not familiar with these terms, please start by reading this and this. In my previous article I talked about ETL and ELT, and today I want to talk briefly about a specific ETL product in the market. This product is the open-source, community edition of Pentaho Data Integration, also called PDI, or Kettle.

An ETL product is the one part of a BI solution where, if you are on a small budget, you don’t have many options. There are plenty of expensive ETL products in the market from industry leaders like IBM, Oracle and Microsoft, but not many cheap or free options. If you search for free or cheap options, two products are often mentioned – Talend Open Studio and Pentaho Data Integration.