Tuesday, September 30, 2008

Oracle: Import/Export, COPY, Datapump utilities expdp/impdp

Refreshing data in DEV/QA from PROD

The following methods can be used:
1. Export/Import utilities. This requires the following steps (a short end-to-end sketch follows this list):
-- First export the tables into a dump file
-- Drop the tables in the DEV/QA environment
-- Import the export dumps (see the exp/imp details below)
2. Oracle Data Pump in Oracle Database 10g (impdp/expdp)
3. COPY command
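
For method 1, a minimal end-to-end sketch of such a refresh, assuming a table EMP owned by SCOTT and TNS aliases PROD and DEV (all hypothetical names):

Export the table from PROD:
exp scott/tiger@PROD file=emp_prod.dmp log=emp_prod_exp.log tables=emp

Drop the table in DEV (from SQL*Plus connected to DEV):
drop table emp;

Import the dump into DEV:
imp scott/tiger@DEV file=emp_prod.dmp log=emp_prod_imp.log fromuser=scott touser=scott tables=emp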


Export/Import:
Oracle's export (exp) and import (imp) utilities are used to perform exports and imports of data objects (such as tables) and therefore provide logical database backup and recovery. When exporting, database objects are dumped to a binary file which can then be imported into another Oracle database. These utilities can be used to move data between different machines, databases or schemas, and between different Oracle versions. However, as they use a proprietary binary file format, they can only be used between Oracle databases; one cannot export data and expect to import it into a non-Oracle database. If the data is exported on a system with a different Oracle version than the one on which it is imported, the export should be taken with the exp utility of the lower (older) release and the import done with the target database's imp. That means, if something needs to be exported from 10g into 9i, it must be exported with 9i's exp.

Syntax :

exp scott/tiger file=emp.dmp log=emp.log tables=emp rows=yes indexes=no
exp scott/tiger file=emp.dmp tables=(emp,dept)

imp scott/tiger file=emp.dmp full=yes
imp scott/tiger file=emp.dmp fromuser=scott touser=scott tables=dept


Using a parameter file:

exp userid=scott/tiger@orcl parfile=export.txt

where export.txt contains:

BUFFER=100000
FILE=account.dmp
FULL=n
OWNER=scott
GRANTS=y
COMPRESS=y

Note: From Oracle 10g, we can choose between the old exp/imp utilities and the newly introduced Data Pump utilities, called expdp and impdp. These new utilities introduce much-needed performance improvements, network-based exports and imports, etc.

Link:
http://www.orafaq.com/wiki/Import_Export_FAQ

Oracle Data Pump in Oracle Database 10g:

Oracle Data Pump is a newer, faster and more flexible alternative to the exp and imp utilities available in previous Oracle versions. In addition to basic import and export functionality, Data Pump provides a PL/SQL API (DBMS_DATAPUMP) and a mechanism to move all or part of a database between databases, thereby providing a server-side infrastructure for fast data and metadata movement.
New Export (expdp) and Import (impdp) utilities are available to make use of the API.

Prerequisites:
Before running Data Pump Export or Data Pump Import, a directory object must be created by a DBA or by any user with the CREATE ANY DIRECTORY privilege. Create the directory as shown below.

alter user scott identified by tiger account unlock;
grant create any directory to scott;

create or replace directory test_dir as '/u01/app/oracle/oradata/';
grant read, write on directory test_dir to scott;
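
As a quick sanity check (not required by Data Pump itself), SCOTT can verify that the directory object is visible by querying the ALL_DIRECTORIES view:

select directory_name, directory_path from all_directories where directory_name = 'TEST_DIR';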

Once the directory is created, run the scripts below to export and import tables. The TABLES parameter is used to specify the tables to be exported or imported.

The TABLE_EXISTS_ACTION=APPEND parameter allows data to be imported into existing tables (a variant using it follows the examples below).

expdp scott/tiger@db10g tables=emp,dept directory=test_dir dumpfile=emp_dept.dmp logfile=expdpemp_dept.log

impdp scott/tiger@db10g tables=emp,dept directory=test_dir dumpfile=emp_dept.dmp logfile=impdpemp_dept.log
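
The import above assumes the tables do not already exist in the target schema. For a DEV/QA refresh where the tables are already present, a sketch of the same import with TABLE_EXISTS_ACTION (APPEND adds to the existing data; TRUNCATE or REPLACE could be used instead to overwrite it):

impdp scott/tiger@db10g tables=emp,dept directory=test_dir dumpfile=emp_dept.dmp logfile=impdpemp_dept_append.log table_exists_action=append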


Schema Exports/Imports:

expdp scott/tiger@db10g schemas=SCOTT directory=TEST_DIR dumpfile=SCOTT.dmp logfile=expdpSCOTT.log

impdp scott/tiger@db10g schemas=SCOTT directory=TEST_DIR dumpfile=SCOTT.dmp logfile=impdpSCOTT.log


Database Export/Import:

expdp system/password@db10g full=Y directory=TEST_DIR dumpfile=DB10G.dmp logfile=expdpDB10G.log

impdp system/password@db10g full=Y directory=TEST_DIR dumpfile=DB10G.dmp logfile=impdpDB10G.log
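
As noted earlier, Data Pump can also move data over the network without an intermediate dump file, using the NETWORK_LINK parameter and a database link to the source database. A minimal sketch, with a hypothetical link name PROD_LINK and connect string PROD:

create database link prod_link connect to scott identified by tiger using 'PROD';

impdp scott/tiger@db10g tables=emp,dept directory=test_dir network_link=prod_link logfile=impdp_network.log

The directory object is still required for the log file; no dump file is written in this mode.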



Links :
http://www.orafaq.com/wiki/Datapump
http://www.oracle-base.com/articles/10g/OracleDataPump10g.php
http://www.oracle.com/technology/obe/obe10gdb/storage/datapump/datapump.htm


COPY Command:

The SQL*Plus COPY command is one of the fastest ways of copying data between databases and schemas.
The format of the COPY command is:

copy from database to database {append|create|insert|replace} destination_table [(column, column, column, ...)] using query;

COPY command supports the following data types:

CHAR
DATE
LONG
NUMBER
VARCHAR2

Example:
copy from scott/tiger@local_db to scott/tiger@remote_db -
create image_table using -
select image_no, image -
from images;


copy from scott/tiger@local_db to scott/tiger@remote_db -
create tgt using -
select * from src -
where x = 5;
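
For the DEV/QA refresh scenario, APPEND (or REPLACE) can be used against a table that already exists in the target schema. A sketch with hypothetical connect strings, using the SQL*Plus settings that control the COPY batch size and commit frequency:

set arraysize 1000
set copycommit 10
copy from scott/tiger@prod_db to scott/tiger@dev_db -
append emp using -
select * from emp;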




Friday, September 26, 2008

Informatica Best Practices

1 Mapping Development Standards
Mappings
The naming convention for mappings will be of the form m_<Subject Area>_<Dimension/Fact/Table>
Subject Area: Data warehouse, ODS or data mart name, etc.
Dimension/Fact/Table: Name of the dimension, fact or related table. Use underscores wherever required to name dimensions and facts.
Ex: m_VISION_ACTION_RESULT_CODES


Mapplets
Use the same convention as the mapping name, with the prefix MPLT_;
E.g. MPLT_<Name>


Transformation Objects
Transformation - Naming Convention
Expression Transformation - EXP_<Name>
Lookup Transformation - LKP_<Name>
Source Qualifier Transformation - SQ_<Name>
Aggregator Transformation - AGG_<Name>
Filter Transformation - FIL_<Name>
Update Strategy Transformation - UPD_<Name>
Normalizer Transformation - NRM_<Name>
Rank Transformation - RNK_<Name>
Stored Procedure Transformation - SP_StoredProcedureName
External Procedure Transformation - EXT_ProcedureName
Joiner Transformation - JNR_<Name>
Sorter Transformation - SRT_<Name>
Router Transformation - RTR_<Name>
Sequence Generator - SEQ_<Name> (Always Reusable)

For any reusable transformation, add the prefix R_.

Mapping Variable
$$<Function or process that is being done>

Mapping Parameter
$$<Function or process that is being done>
2 Mapping Description
It is a good practice to note down the name of the developer, the date of development and a brief description of the functionality of the mapping in the description field of the mapping. Whenever the mapping changes, the change, the date of the change and the person responsible should be noted there. This helps in tracking the history of changes in a mapping and acts as a log.
Similarly, all major expressions should have a brief description of their functionality; for example, a Lookup transformation should carry the lookup override condition and a Source Qualifier should carry its filter condition.
3 TIPS ON using different transformations
Aggregator transformation
The best practices for using the Aggregator transformation are:
- Always sort the input before passing it to the Aggregator; one way of sorting the data is to apply an ORDER BY clause in the Source Qualifier on the GROUP BY columns of the Aggregator (see the sketch after this list).
- In the Aggregator transformation, the GROUP BY ports should be placed in the same order as required for the aggregation.
- While summarizing an Amount field, the input port should always be defaulted to zero (set the default value to zero); a NULL value added to other values results in NULL.
- If the Amount field has 2 decimal places, it should be defaulted to 0.00; defaulting it to 0 results in an integer value, truncating the decimal places.
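A sketch of such a Source Qualifier SQL override, with hypothetical table and column names, where the ORDER BY matches the Aggregator's GROUP BY ports (ACCOUNT_ID, TXN_DATE) so that Sorted Input can be enabled on the Aggregator:

select account_id, txn_date, txn_amount
from transactions
order by account_id, txn_date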
Expression transformation
- To write an expression in an output/variable port, use functions and ports by pointing and clicking instead of typing the names of ports, functions or transformations, to avoid typos.
- For all output ports, delete the default value "ERROR('transformation error')".
Lookup transformation
- While calling an unconnected Lookup transformation in an expression, point and click the lookup from the functions pane instead of typing the expression.
- The data types should match for the ports on each side of a lookup condition. To avoid errors here, copy and paste the lookup ports to create the input ports.
- If there is a Lookup SQL override, the server automatically appends an ORDER BY clause to it based on the order of the lookup ports, so putting your own ORDER BY clause in the override creates an error. The workaround is to write the required ORDER BY clause and then add "--" (the comment marker) at the end of the Lookup SQL override, so the generated ORDER BY is commented out (see the sketch after this list).
- Only the fields required from the Lookup in other transformations should be marked as output ports.
- Remove all unnecessary fields from the lookup.
- All input ports must be prefixed with the keyword IN, e.g. IN_Loan_Number.
- The Location Information of the lookup object must be $Target at the mapping level.
- If there are multiple calls to a single lookup in the mapping, use an unconnected lookup.
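A sketch of a Lookup SQL override using the comment workaround above, with hypothetical table and column names; the trailing "--" comments out the ORDER BY that the server appends, so only the explicit ORDER BY is executed:

SELECT LOANS.LOAN_NUMBER AS LOAN_NUMBER, LOANS.LOAN_STATUS AS LOAN_STATUS FROM LOANS ORDER BY LOANS.LOAN_NUMBER --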
Source Qualifier transformation
- Try to filter as many records as possible at the Source Qualifier level.
- All sources under one folder must point to the same database using the same connect string.
- Link forward only those ports that are required in the mapping.
- All port names should be in Initcap.

4 Workflow Development standards
1) Always run workflows in suspended mode (check the SUSPEND ON ERROR option in the workflow-level properties).

Workflow Object - Naming Convention

Workflow - WF_Name
Worklets - WLT_Name
Assignment - ASGN_Name
Command - CMD_Name
Control - CTL_Name
Decision - DCN_Name
Email - EML_Name
Event Raise - EVTR_Name
Event Wait - EVTW_Name
Session - s_<Mapping Name>
Time - TIM_Name
5 TIPS ON using sessions
- Use generic source and target connection information across all environments (Development, Test, Cert, Production).
- The Location Information of $Source and $Target must be defined (on the session's General Properties tab).
6 COMMIT AND ROLLBACK SETTINGS
Ideally, commit levels should be set high enough that the changes made by an individual session can be completely rolled back if the session is stopped by an error. In practice, depending on the resources available, it may not be possible to roll back a very large number of records. Whenever it is not possible to set the commit interval high enough, it is not advisable to use the PowerCenter session recovery option, as it causes performance issues. In those cases the source should be sorted on the primary key so that it is possible to restart processing the source with a filter condition that picks up data from the failure point (a sketch of such a filter follows).
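
A sketch of such a restart filter for a relational source, assuming a hypothetical source table SRC_TXN with primary key TXN_ID and a target table TGT_TXN holding the rows already committed:

select *
from src_txn
where txn_id > (select nvl(max(txn_id), 0) from tgt_txn)
order by txn_id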
7 Folder object Naming convention
Repository Name - Description
Project_Datamart - Development/Testing/Certification/Production {e.g. LS_ODS}
8 SOURCE/TARGET DEFINITION
All source and target definitions, as stated above, will be created in a central "shared" folder in Informatica, facilitating a single version of the source and target objects; other folders will reference these objects via shortcuts, which deliver full inheritance. Source and target definitions will be imported from the database.
9 TARGET TABLE NAMING CONVENTION
There are often several instances of the same target, usually because of different actions. When looking at a session run, there will be several instances, each with its own successful rows, failed rows, etc. To make observing a session run easier, targets should be named according to the action being executed on that target.
For example, if a mapping has three instances of the LOAN table according to the update strategy (Insert, Update, Delete), and since shortcuts are available for the objects, the tables should be named as follows:
Shortcut_To_LOAN_INS
Shortcut_To_LOAN_UPD
Shortcut_To_LOAN_DEL
10 General Guidelines
- Always make the Sequence Generator transformation reusable to maintain consistency in the incremental load.
- The "Number of Cached Values" property of the Sequence Generator transformation should be set to 1.
- Always use an Expression transformation after the Source Qualifier to trim spaces from all string column values.
11 Migration process from development to production

It is always best to have dedicated environments for development, test and production. If that is not possible, there could be one shared environment for development and test and another dedicated environment for production. From a security standpoint, any user other than the project administrator should have only read access to the production repository. This stops developers from accidentally logging onto the wrong repository and making changes without realizing that they are not in the development repository.
Migration from development to production can be done in primarily two ways - copying entire folders or copying individual mappings into production. There are advantages and disadvantages to each approach, described below.
Copy an Entire Folder
Copying an entire folder gives the ability to quickly promote all of the objects in the folder from test to production. All of the source tables, target tables, reusable transformations, mappings and sessions are promoted at once, which means that everything must be ready to be moved to the next stage. If certain mappings are not ready, then after the folder is copied the users must manually delete them from the new folder. It is possible to copy an entire folder with all of the versions associated with it, or to copy a specific version of the folder. The steps for copying a folder are:
- If using shortcuts, follow these steps:
o In each of the dedicated repositories, create a common folder with the EXACT same name and case sensitivity
o Copy the shortcut objects into the common folder in production, making sure that each shortcut has the exact same name
- Open either the Repository Manager or the Designer, connecting to the repository
- Drag and drop the folder onto the repository icon within the Navigator tree structure
o Copying the entire folder: drag and drop the folder icon just under the repository level
o Copying a specific version: drag and drop the version # icon under the folder level
- Follow the Copy Folder Wizard steps
o If a folder with that name already exists, it must be renamed
- Modify the pre/post session commands as necessary
o It is recommended to set up the pre/post session commands to work in both environments, i.e. the paths (log/bad files, scripts, etc.) are the same in both environments
o In the Server Manager, within the Session Wizard, on the General tab, press the pre/post session button and make the required changes
- Implement appropriate production security
o In development, have the owner of the folders be a user in the development group
o In production, change the owner of the folders to a user in the production/operations group
o Revoke all rights to public other than read, and all privileges other than Browse Repository
o Now if a developer accidentally logs into production, they cannot make any changes
Copy Individual Mappings
Copying individual mappings into the next stage has the same advantages and disadvantages in a dedicated environment as it does in a shared environment. It gives the user the ability to promote into production just the desired mapping, which is the lowest level of granularity.
If the folder the mapping is copied into does not contain the source/target tables or the reusable transformations, the copy will bring them along as well.
There are two disadvantages to copying individual mappings into production. First, only one mapping can be copied at a time, so if a large number of mappings need to be promoted into production it will be time consuming. Second, the sessions must be re-created from scratch, which is a tedious process, especially if pre/post session scripts are used.
- If using shortcuts, follow these steps:
o In each of the dedicated repositories, create a common folder with the EXACT same name and case sensitivity
o Copy the shortcut objects into the common folder in production, making sure that each shortcut has the exact same name
- Copy the mapping from development into production
o In the Designer, connect to both the development and production repositories and open the appropriate folders in each
o Drag and drop the mapping from development into production
- Create a session in the Server Manager to run the mapping
o Enter all the appropriate information in the Session Wizard
- Implement appropriate production security
o In development, have the owner of the folders be a user in the development group
o In production, change the owner of the folders to a user in the production/operations group
o Revoke all rights to public other than read for the production folders
o Now if a developer accidentally logs into production, they cannot make any changes.

It is recommended to promote individual folders into a production environment that already exists. By promoting individual mappings, a finer level of control is attained, but the sessions associated with the mappings must be manually recreated.


12 Session Recovery
There are various approaches to implementing session recovery:
- Using the PowerCenter built-in session recovery feature
- Using a session id
- Setting a high commit level in the sessions
- Using an update strategy
- Using a combination of session id and temp table
13 Using the INFA Server Recovery Feature

When a network or other problem causes a session whose source contains a million rows to fail after only half of the rows are committed to the target, one option is to truncate the target and run the session again from the beginning. But that is not the only option. Rather than processing the first half of the source again, you can tell the server to keep the data already committed to the target database and process only the rest of the source. This results in accurate and complete target data, as if the session had completed successfully in one run.
When you run a session in recovery mode, the server notes the row id of the last row committed to the target database. The server then reads all sources again, but only processes rows from the subsequent row id onwards. For example, if the server commits 1000 rows before the session fails, when you run the session in recovery mode the server reads all source tables and then passes data to the Data Transformation Manager (DTM) starting from row 1001.
This works well when the following criteria and considerations are met:
- All session targets are relational.
- The session is configured for a normal (not bulk) target load. (When you configure a session to load in bulk, the server logs a message in the session log stating that recovery is not supported.)
- The server configuration parameter Disable Recovery is not selected.
- The source data does not change before recovery is performed.
- The mapping used in the session does not use a Sequence Generator or Normalizer.


14 Using Session ID

- All target tables will have a session id number (SIN) as an administrative column to identify which session loaded the data. These session ids are also tracked in the control tables.
- When a session runs, the Unix wrapper script first generates the session id and inserts it into the control tables. The command line has the session name as one of its parameters. Inside the mapping, a lookup is done against the control tables using the session name to find the matching latest session id, and that session id is written to the target tables through the mapping.
- After the session completes the data load, the wrapper script updates the control tables with session metadata such as the number of records loaded, whether the session was successful, the number of records that could not be loaded, etc.
- Using the pmcmd return code it is possible to determine whether the session failed. If it failed, the wrapper script sends an e-mail to the ETL administrator with the name of the session, the error returned, etc.
- If the mapping performs only INSERTs, the recovery strategy is to find the records that were inserted during the failed session and delete them. This can be done with the help of the control tables and the session id number (SIN). Since this involves manual work, a smart way is to prepare SQL scripts with prompts for the session id, target table, etc. to ensure safe and smooth operation (a sketch follows).
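
A sketch of such a cleanup script in SQL*Plus; the control column name SIN and the use of substitution prompts are assumptions for illustration:

accept p_table prompt 'Target table to clean up: '
accept p_sin prompt 'Session id (SIN) of the failed run: '
delete from &p_table where sin = &p_sin;
commit;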


15 Stop at First Error and High Commit Level in Sessions

- This strategy involves instructing the PowerCenter server engine to fail at the first error encountered. Setting a high COMMIT level in the session helps in rolling back all the data loaded into the target; normally this level would be set to about 6 times the size of the incoming data.
- For example, if the mapping needs to load 100,000 rows into multiple target tables, the desired commit level would be 600,000 rows. Once the error is rectified, the mapping or session is re-run by the administrator without any changes to the data.


16 Using Update Strategy
- Suited to cases where the incoming data volumes are not large and a lot of updates need to be performed. The typical candidates are dimension or master table loads, where data volumes are not high but many updates are involved.
- The mapping cross-checks for the presence of the incoming record in the target table; if the record exists it compares the data, and if it finds a change it overwrites the changed values. Using this feature of PowerCenter it is possible to verify the existence of incoming data in the target table.
In case of session failure, when the mapping is re-run this ensures that only new or changed records go into the data warehouse or data mart.

Tuesday, September 16, 2008

Service Oriented Architecture

Introduction:
SOA (service-oriented architecture) has gained popularity largely due to web services.
Before diving into the depths of SOA, it is essential to understand its evolution. To do this we have to look back at the challenges developers have faced over the past decades and at the solutions that were proposed to solve them.
Early programmers realized that writing software was becoming more and more complex, and they needed a better way to reuse the code they kept rewriting. So the concepts of modular design were introduced: with modular design principles, programmers could write subroutines and functions and reuse their code. Later it was realized that cutting and pasting modules into different applications led to maintenance problems. Whenever a bug was found in a function, every application in which that function was used had to be tracked down and fixed, and the deployment nightmare began.
To avoid this problem a higher level of abstraction was needed, so classes and object-oriented software were introduced. But developing and maintaining object-oriented software still grew complex, and it became clear that something was needed to reuse functionality, not only code.
So another layer of abstraction was introduced to handle this: component-based software. It was a good solution for reuse and maintenance, but it could not handle all the complexities of distributed software, application integration, varying platforms, varying protocols, various devices, the Internet, and so on.
SOA (along with web services) answers all of the above: it eliminates the headaches of protocol, platform, and application integration.

Key Components of SOA:
The terms service, message, dynamic discovery, and web services play an essential role in SOA; each of them is explained in detail as we move further.

Definition:
Service Oriented Architecture (SOA) is an architectural style for building software applications whose goal is to achieve loose coupling among interacting components (software agents), so that applications can be reused. In other words, SOA is a design methodology aimed at maximizing the reuse of application-neutral services to increase IT adaptability and efficiency.
SOA is the underlying structure supporting communication between services. In this context, a service is defined as a unit of work to be performed on behalf of some computing entity, such as a human user or another program. SOA defines how two computing entities, such as programs, interact in such a way as to enable one entity to perform a unit of work on behalf of another. Service interactions are defined using a description language. Each description is self-contained and loosely coupled, so that each interaction is independent of any other interaction.

Example:
Take an example from day-to-day life, such as a purchase made online. We look at the catalog available and choose a number of items. Now what happens?
We invoke one service to place the order, which in turn communicates with an inventory service to find out whether the requested items are in stock. The order and shipping details are then submitted to another service, which calculates the total, tells us when the items should arrive, and furnishes a tracking number that, through yet another service, lets us follow the shipment.
Thus the entire process, from the initial order to its delivery, is managed by communication between web services - programs talking to other programs - all made possible by the underlying framework that SOA provides.


Service:
A service is a unit of work done by a service provider to achieve desired end results for a service consumer. Both the provider and consumer roles are played by software agents on behalf of their owners.
Services are business logic components that are protocol-independent and location-agnostic and contain no user state. Services are coarse-grained, meaning a service can perform its logic and return the result in a single call. Services do not contain presentation logic, so they can be reused across diverse applications.
So, a service in SOA is an exposed piece of functionality with three properties:
1. The interface contract to the service is platform independent.
2. The service can be dynamically located and invoked.
3. The service is self-contained. That is, the service maintains its own state.
- A platform-independent interface contract implies that a client from anywhere, on any OS, and in any language, can consume the service.
- Dynamic discovery hints that a discovery service (e.g., a directory service) is available. The directory service enables a look-up mechanism where consumers can go to find a service based on some criteria.
For example, if I were looking for a credit-card authorization service, I might query the directory service for a list of service providers that could authorize a credit card for a fee. Based on the fee, I would select a service.

Loose Coupling: Loose coupling can be defined as the state in which the impact of change (change in consumers, providers, or cross-cutting policies) is minimized across dependencies.
It refers to reduced interdependencies between modules or components, and consequently reduced interoperability risk.


Message:
Service providers and consumers communicate via messages. Services expose an interface contract, which defines the behavior of the service and the messages it accepts and returns. Because the interface contract is platform- and language-independent, the technology used to define messages must also be agnostic to any specific platform or language. Therefore, messages are typically constructed as XML documents that conform to an XML schema. XML provides the functionality, granularity, and scalability required by messages, and enables effective communication between consumers and providers by providing a non-restrictive type system with which to clearly define messages.

Dynamic discovery:
Dynamic discovery is an important piece of SOA. At a high level, SOA is composed of three core pieces: service providers, service consumers, and the directory service. The roles of providers and consumers are apparent, but the directory service needs explanation.
The directory service is an intermediary between providers and consumers. Providers register with the directory service, and consumers query it to find service providers. Embedding a directory service within SOA accomplishes the following:
- Scalability of services; we can add services incrementally.
- Decouples consumers from providers.
- Allows for hot updates of services.
- Provides a look-up service for consumers.

Web Services:
Web services play a major role in SOA because they are built on top of well-known, platform-independent protocols: HTTP, XML, UDDI, WSDL, and SOAP. These protocols fulfill the key requirements of SOA.
That is, SOA requires that a service be dynamically discoverable and invokable; this requirement is fulfilled by UDDI, WSDL, and SOAP.
SOA requires that a service have a platform-independent interface contract; this requirement is fulfilled by XML.
SOA stresses interoperability; this requirement is fulfilled by HTTP. This is why web services lie at the heart of SOA.

WSDL:
Web Services Description Language, used to describe exactly (in XML) what a web service does.
UDDI:
Universal Description, Discovery, and Integration, a protocol used to publish and discover web service descriptions.
SOAP: Simple Object Access Protocol, used to send XML messages over HTTP (which usually runs on top of TCP, typically on port 80).