A data warehouse is a central, integrated database that consolidates data from the diverse source systems in a company. The data is transformed to purge inconsistencies, aggregated to summarize it, and finally loaded into the warehouse. The database can then be accessed by numerous users, ensuring that every group in the enterprise works from the same stable, trusted data. To process the large volumes of data coming from the various source systems efficiently, ETL software implements parallel processing. Our DataStage online training will show you how it is done.
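
As a minimal illustration of that transform-aggregate-load flow, here is a Python sketch (not DataStage itself); the sample rows, the cleansing rule, and the load step are assumptions made up for the example:

    # Minimal ETL sketch (illustrative only, not DataStage): rows from two
    # hypothetical source systems are cleansed, aggregated, and "loaded".
    from collections import defaultdict

    def extract():
        # Assumed sample rows; note the inconsistent codes and types
        yield {"region": "EU ", "amount": "100"}
        yield {"region": "eu", "amount": "250"}
        yield {"region": "US", "amount": "300"}

    def transform(rows):
        # Purge inconsistencies: normalize region codes and amounts
        for row in rows:
            yield {"region": row["region"].strip().upper(),
                   "amount": float(row["amount"])}

    def aggregate(rows):
        # Summarize: total amount per region
        totals = defaultdict(float)
        for row in rows:
            totals[row["region"]] += row["amount"]
        return totals

    def load(totals):
        # Stand-in for the load into the warehouse tables
        for region, total in sorted(totals.items()):
            print(region, total)

    load(aggregate(transform(extract())))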

Parallel processing takes two distinct forms – pipeline parallelism and partition parallelism. IBM DataStage allows users to apply both of these methods.

Pipeline Parallelism

DataStage pipelines data from one stage to the next: all stages of a job run at the same time, each operating on rows as they arrive. A downstream stage starts processing as soon as data is available from the upstream stage, rather than waiting for it to finish. Pipeline parallelism therefore eliminates the need to write intermediate results to disk.
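
As a rough analogy, chained Python generators behave like pipelined stages: each row flows downstream as soon as it is produced, all stages are active at once, and nothing is staged to disk in between. This is only a sketch of the concept, not how DataStage implements it.

    # Pipeline-parallelism analogy: chained generators pass each row
    # downstream immediately; no intermediate result is stored on disk.
    def read_stage():
        for i in range(5):
            print("read", i)
            yield i

    def transform_stage(rows):
        for row in rows:
            print("transform", row)
            yield row * 10

    def write_stage(rows):
        for row in rows:
            print("write", row)

    # The interleaved output shows all three stages active at once.
    write_stage(transform_stage(read_stage()))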

Partition Parallelism

The main goal of most partitioning operations is to produce partitions of approximately equal size, ensuring an even load across processors. Partitioning is ideal for handling large volumes of data: the data is broken into several partitions, and each partition is handled by a separate instance of the job's stages.
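
The idea can be sketched in a few lines of Python (again, only an analogy; DataStage manages this inside its parallel engine): rows are split round-robin into roughly equal partitions, and each partition is processed by its own worker process. The partition count and the transform are illustrative assumptions.

    # Partition-parallelism analogy: split rows into roughly equal
    # partitions and process each one in a separate worker process.
    from multiprocessing import Pool

    def process_partition(partition):
        # Each worker runs the same stage logic on its own partition.
        return [row * 10 for row in partition]

    if __name__ == "__main__":
        rows = list(range(100))
        num_partitions = 4
        # Round-robin partitioning keeps partition sizes nearly equal.
        partitions = [rows[i::num_partitions] for i in range(num_partitions)]
        with Pool(num_partitions) as pool:
            results = pool.map(process_partition, partitions)
        print(sum(len(r) for r in results))  # all 100 rows processed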

The greatest performance is achieved by combining the two techniques: the data is partitioned, and the partitions feed the pipelines, so downstream stages are already processing partitioned data while the upstream stages are still running. DataStage allows its users to apply both methods in parallel jobs; you will be able to learn more after completing your DataStage training. DataStage can also repartition the data between stages to match the business logic, and the repartitioned data is passed along without being written to disk.
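
To make repartitioning concrete, here is a small Python sketch (an analogy only, not DataStage internals): rows that were first split round-robin are redistributed by hashing a business key, so every row with the same key ends up in the same partition before a per-key operation such as an aggregation. The key name, row values, and partition count are all illustrative assumptions.

    # Repartitioning analogy: move rows between partitions by hashing a
    # business key, so each key can be aggregated within one partition.
    def repartition_by_key(partitions, num_partitions, key):
        new_partitions = [[] for _ in range(num_partitions)]
        for partition in partitions:
            for row in partition:
                target = hash(row[key]) % num_partitions
                new_partitions[target].append(row)
        return new_partitions

    rows = [{"region": r, "amount": a}
            for r, a in [("EU", 1), ("US", 2), ("EU", 3), ("APAC", 4)]]
    parts = [rows[i::2] for i in range(2)]  # initial round-robin split
    parts = repartition_by_key(parts, 2, "region")
    for p in parts:
        print(p)  # all rows for a region now share one partition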

Our DataStage online training also explains the parallel processing environment. The environment in which your DataStage job runs is determined by your hardware resources and system architecture.

Parallel processing environments fall into two broad categories: SMP systems, and cluster or MPP systems.

SMP or symmetric multiprocessing – shared memory

  • Some of the hardware resources may be shared among the processors.
  • The processors communicate through shared memory and run a single operating system.
  • All CPUs share the system resources.

MPP or massively parallel processing – shared-nothing

  • An MPP system is essentially a set of connected SMP nodes.
  • Each processor has private access to its own hardware resources.
  • The nodes of an MPP system are physically housed in the same box.

Cluster Systems

  • Here, multiple UNIX systems are connected over a network.
  • Unlike MPP nodes, the systems in a cluster can be physically dispersed.
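
Whichever architecture you run on, DataStage describes the environment to a parallel job through a configuration file that lists the processing nodes and their resources. The sketch below shows the typical shape of such a file; the node names, host names, and paths are illustrative assumptions. On an SMP the logical nodes would all name the same host, while on a cluster or MPP each node names a different machine.

    {
        node "node1"
        {
            fastname "host1"
            pools ""
            resource disk "/data/node1" {pools ""}
            resource scratchdisk "/scratch/node1" {pools ""}
        }
        node "node2"
        {
            fastname "host2"
            pools ""
            resource disk "/data/node2" {pools ""}
            resource scratchdisk "/scratch/node2" {pools ""}
        }
    }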

Our DataStage online training covers these concepts in more depth. To get more information about DataStage online training, reach us at www.datastageonlinetrainings.com