Friday, September 26, 2008

Informatica Best Practices

1 Mapping Development Standards
Mappings
Naming conventions for mappings will be in the form of m_<Subject Area>_<Dimension/Fact/Table>
Subject Area: data warehouse, ODS or data mart name, etc.
Dimension/Fact/Table: name of the dimension, fact or related table. Use underscores wherever required in dimension and fact names.
Ex: m_VISION_ACTION_RESULT_CODES


Mapplets
Use the same convention as the mapping name with the prefix MPLT_;
e.g. MPLT_<Subject Area>_<Dimension/Fact/Table>


Transformation Objects
Transformation - Naming Convention
Expression Transformation - EXP_<Description>
Lookup Transformation - LKP_<Lookup Table Name>
Source Qualifier Transformation - SQ_<Source Name>
Aggregator Transformation - AGG_<Description>
Filter Transformation - FIL_<Description>
Update Strategy Transformation - UPD_<Description>
Normalizer Transformation - NRM_<Description>
Rank Transformation - RNK_<Description>
Stored Procedure Transformation - SP_<StoredProcedureName>
External Procedure Transformation - EXT_<ProcedureName>
Joiner Transformation - JNR_<Description>
Sorter Transformation - SRT_<Description>
Router Transformation - RTR_<Description>
Sequence Generator - SEQ_<Description> (always reusable)

Prefix any reusable transformation with R_.

Mapping Variable
$$<Function or process that is being done>

Mapping Parameter
$$<Function or process that is being done>
2 Mapping Description
It is good practice to note the developer's name, the date of development and a brief description of the mapping's functionality in the description field of the mapping. Whenever the mapping changes, the change, the date of the change and the person responsible for the change should be noted there as well. This helps in tracking the history of changes to a mapping and acts as a log.
Similarly, all major transformations should carry a brief description of their functionality; for example, a Lookup transformation should document its lookup override condition, and a Source Qualifier its filter condition.
3 TIPS ON using different transformations
Aggregator transformation
The best practices for using the Aggregator transformation are:
Ø Always sort the input before passing it to the Aggregator; one way to sort the data is to apply an ORDER BY clause in the Source Qualifier on the Aggregator's GROUP BY columns (see the sketch after this list).
Ø In the Aggregator transformation, the GROUP BY ports should be placed in the same order as required for the aggregation.
Ø When summarizing an Amount field, the input port should always be defaulted to zero (set its default value to zero); a NULL value added to other values results in NULL.
Ø If the Amount field has a precision of 2 (2 decimal places), it should be defaulted to 0.00; defaulting it to 0 results in an integer value, truncating the decimal places.
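As an illustration, here is a minimal sketch of a Source Qualifier SQL override, assuming a hypothetical LOAN_TXN source table and an Aggregator that groups by LOAN_NUMBER and TXN_TYPE (all table and column names are illustrative only):

-- Hypothetical SQ override: the ORDER BY lists the Aggregator's GROUP BY
-- columns in the same order, so the Aggregator receives pre-sorted data.
SELECT  LOAN_NUMBER,
        TXN_TYPE,
        TXN_AMOUNT
FROM    LOAN_TXN
ORDER BY LOAN_NUMBER, TXN_TYPE

With the data arriving pre-sorted this way, the Aggregator's Sorted Input option can be enabled so it does not need to cache the entire input before aggregating.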
Expression transformation
Ø To write an expression in an output or variable port, select functions and ports by pointing and clicking instead of typing the names of ports, functions or transformations, to avoid typos.
Ø For all output ports, delete the default value "ERROR('transformation error')".
Lookup transformation
Ø When calling an unconnected Lookup transformation in an expression, point and click the lookup from the functions pane instead of typing the call.
Ø The data types of the ports on each side of the lookup condition should match. To avoid errors here, copy and paste the lookup ports to create the input ports.
Ø If there is a Lookup SQL override statement, the server automatically appends an ORDER BY clause to it based on the order of the lookup ports, so putting your own ORDER BY clause in the override creates an error. The workaround is to write the required ORDER BY clause in the override and end the statement with the comment characters "--" so that the appended ORDER BY is commented out (see the sketch after this list).
Ø Only the fields required by downstream transformations should be marked as output ports.
Ø Remove all unnecessary fields from the lookup.
Ø All input ports must be prefixed with IN_, e.g. IN_Loan_Number.
Ø The location information of the lookup object must be $Target at the mapping level.
Ø If the same lookup is called multiple times in a mapping, use an unconnected lookup.
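To make the ORDER BY workaround above concrete, here is a minimal sketch of a Lookup SQL override, assuming a hypothetical LOAN_STATUS lookup table (names are illustrative only):

-- Hypothetical Lookup SQL override. The trailing "--" comments out the
-- ORDER BY clause that the server appends automatically, so only the
-- explicit ORDER BY on LOAN_NUMBER takes effect.
SELECT  LOAN_NUMBER,
        STATUS_CODE
FROM    LOAN_STATUS
ORDER BY LOAN_NUMBER --

Without the trailing comment characters, the generated ORDER BY would be appended after the override and the statement would fail.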
Source Qualifier transformation
Ø Filter as many records as possible at the Source Qualifier level (see the sketch after this list).
Ø All sources under one folder must point to the same database using the same connect string.
Ø Link onward only those ports that are required in the mapping.
Ø All port names should be in Initcap.
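A minimal sketch of filtering at the Source Qualifier, assuming a hypothetical LOAN source table with an Active_Flag column (Oracle-style date syntax; all names are illustrative only):

-- Hypothetical SQ override/source filter: unwanted rows are discarded at
-- the database, so they never enter the Informatica pipeline.
SELECT  Loan_Number,
        Borrower_Name,
        Loan_Amount
FROM    LOAN
WHERE   Active_Flag = 'Y'
  AND   Last_Update_Date >= TO_DATE('2008-09-01', 'YYYY-MM-DD')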

4 Workflow Development standards
1) Always run workflows in suspended mode (check the "Suspend on Error" option in the workflow-level properties).

Workflow Objects
Naming Convention

Workflow - WF_Name
Worklets - WLT_Name
Assignment - ASGN_Name
Command - CMD_Name
Control - CTL_Name
Decision - DCN_Name
Email - EML_Name
Event Raise - EVTR_Name
Event Wait - EVTW_Name
Session - s_Name
Time - TIM_Name
5 TIPS ON using sessions
Ø Use generic source and target connection information across all environments (Development, Test, Cert, Production).
Ø The connection values for $Source and $Target must be defined (on the session's General Properties tab).
6 COMMIT AND ROLLBACK SETTINGS
Ideally, commit levels should be set high enough that the changes made by an individual session can be completely rolled back if the session is stopped by an error. In practice, depending on the resources available, it may not be possible to roll back a very large number of records. Whenever the commit interval cannot be set high enough, it is not advisable to use the PowerCenter session recovery option, as it causes performance issues. In those cases the source should be sorted on the primary key so that processing can be restarted with a filter condition that picks up data from the failure point, as sketched below.
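A minimal sketch of such a restart query, assuming a hypothetical LOAN_TXN source keyed on Loan_Number; the :last_committed_key bind variable is a placeholder for the highest key already committed to the target (looked up manually or from a control table):

-- Hypothetical restart filter: read the source in primary-key order and
-- skip everything that was already committed before the failure.
SELECT  Loan_Number,
        Txn_Date,
        Txn_Amount
FROM    LOAN_TXN
WHERE   Loan_Number > :last_committed_key
ORDER BY Loan_Number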
7 Folder object Naming convention
Folder Name - Description
Project_Datamart - Development/Testing/Certification/Production (e.g. LS_ODS)
8 SOURCE/TARGET DEFINITION
All source and target definitions, as stated above, will be created in a central "shared" folder in Informatica, providing a single version of the source and target objects; other folders will reference these objects via shortcuts, which deliver full inheritance. Source and target definitions will be imported from the database.
9 TARGET TABLE NAMING CONVENTION
There are often several instances of the same target, usually because of different actions. When looking at a session run, there will be several instances each with its own successful rows, failed rows, etc. To make observing a session run easier, targets should be named according to the action being executed on that target.
For example, if a mapping has three instances of the LOAN table according to the update strategy (insert, update, delete), and since shortcuts are used for the objects, the target instances should be named as follows:
Shortcut_To_LOAN_INS
Shortcut_To_LOAN_UPD
Shortcut_To_LOAN_DEL
10 General Guidelines
Ø Always make the Sequence Generator transformation reusable to maintain consistency in the incremental load.
Ø The "Number of Cached Values" property of the Sequence Generator transformation should be set to 1.
Ø Always use an Expression transformation after the Source Qualifier to trim spaces from all string column values.
11 Migration process from development to production

It is always best to have dedicated environments for development, test and production. If that is not possible, there could be one shared environment for development and test and another dedicated environment for production. Also, from a security standpoint, any user other than the project administrator should have only read access to the production repository. This stops developers from accidentally logging onto the wrong repository and making changes without realizing that they are not in the development repository.
Migration from development to production can be done in primarily two ways: copying entire folders or copying individual mappings into production. Each approach has advantages and disadvantages, described below.
Copy an Entire Folder
Copying an entire folder gives the ability to quickly promote all of the objects in the folder from test to production. All of the source tables, target tables, re-usable transformations, mappings, and sessions are promoted at once. What this means is that everything must be ready to be moved to the next stage. If certain mappings are not ready, then after the folder is copied the users must manually delete them from the new folder. It is possible to copy an entire folder with all of the versions associated with it or to copy a specific version of the folder. The steps for copying a folder are:
Ø If using shortcuts follow these steps
o In each of the dedicated repositories create a common folder with the EXACT same name and case sensitivity
o Copy the shortcut objects into the common folder in production, making sure that each shortcut has the exact same name
Ø Open either the Repository Manager or the Designer and connect to the repository
Ø Drag and drop the folder onto the repository icon within the Navigator tree structure
o Copying the entire folder: drag and drop the folder icon just under the repository level
o Copying a specific version: drag and drop the version # icon under the folder level
Ø Follow the Copy Folder Wizard steps
o If a folder with that name already exists, the folder must be renamed
Ø Modify the pre/post session commands as necessary
o It is recommended to set up the pre/post session commands to work in both environments. This means that the paths are the same in both environments (log/bad files, scripts, etc.).
o In the Server Manager, within the Session Wizard's General tab, press the pre/post session button and make the required changes
Ø Implement appropriate production security
o In development, have the owner of the folders be a user in the development group
o In production, change the owner of the folders to a user in the production/operations group
o Revoke all rights to public other than read and all privileges other than browse repository
o Now if a developer accidentally logs into production, they can not make any changes
Copy Individual Mappings
Copying individual mappings into the next stage has the same advantages and disadvantages in the dedicated environment as it does in the shared environment. It gives the user the ability to promote into production the desired mapping, which is the lowest level of granularity.
If the folder the mapping is copied into does not contain the source/target tables or the re-usable transformations, they will be brought along as well.
There are two disadvantages to copying individual mappings into production. First, only one mapping can be copied at a time, so promoting a large number of mappings into production is time consuming. Second, the sessions must be re-created from scratch, which is a tedious process, especially if pre/post session scripts are used.
Ø If using shortcuts follow these steps
o In each of the dedicated repositories create a common folder with the EXACT same name and case sensitivity
o Copy the shortcut objects into the common folder in production, making sure that each shortcut has the exact same name
Ø Copy the mapping from development into production
o In the Designer, connect to both the development and production repositories and open the appropriate folders in each
o Drag and drop the mapping from development into production
Ø Create a session in the Server Manager to run the mapping
o Enter all the appropriate information in the Session Wizard
Ø Implement appropriate production security
o In development, have the owner of the folders be a user in the development group
o In production, change the owner of the folders to a user in the production/operations group
o Revoke all rights to public other than read for the production folders
o Now if a developer accidentally logs into production, they can not make any changes.

It is recommended to promote folders into a production environment that already exists. Promoting individual mappings gives control at a lower level of granularity, but the sessions associated with the mappings must be manually recreated.


12 Session Recovery
There are various approaches in implementing Session Recovery. They are:
Ø Using PowerCenter built-in Session Recovery Feature
Ø Using Session ID
Ø Setting high commit level in the sessions
Ø Using update strategy
Ø Using combination of Session Id and Temp Table
13 Using the INFA Server Recovery Feature

When a network or other problem causes a session whose source contains a million rows to fail after only half of the rows are committed to the target, one option is to truncate the target and run the session again from the beginning. But that is not the only option. Rather than processing the first half of the source again, you can tell the server to keep data already committed to the target database and process the rest of the source. This results in accurate and complete target data, as if the session completed successfully with one run.
When you run a session in recovery mode, the server notes the row ID of the last row committed to the target database. The server then reads all sources again, but only processes rows from the subsequent row ID onward. For example, if the server commits 1,000 rows before the session fails, then when you run the session in recovery mode the server reads all source tables and passes data to the Data Transformation Manager (DTM) starting from row 1,001.
This works well when the following criteria and considerations are met:
Ø All session targets are relational.
Ø The session is configured for a normal (not bulk) target load.
Ø When you configure a session to load in bulk, the server logs a message in the session log stating that recovery is not supported.
Ø The server configuration parameter Disable Recovery is not selected.
Ø Source data does not change before performing recovery.
Ø The mapping used in the session does not use a Sequence Generator or Normalizer.


14 Using Session ID

Ø All target tables will have a session ID number (SIN) as an administrative column to identify which session loaded the data. These session IDs are also tracked in the control tables.
Ø When a session runs, the UNIX wrapper script first generates the session ID and inserts it into the control tables. The command line has the session name as one of its parameters. Inside the mapping, a lookup against the control tables using the session name finds the latest matching session ID, which is then written to the target tables through the mapping.
Ø After the session completes loading, the wrapper script updates the control tables with session metadata such as the number of records loaded, whether the session was successful, the number of records that could not be loaded, etc.
Ø Using the pmcmd return code it is possible to determine whether the session failed. If it did, the wrapper script sends an e-mail to the ETL administrator with the session name, the error returned, etc.
Ø If the mapping performs only inserts, the recovery strategy is to find the records inserted during the failed session and delete them. This can be done with the help of the control tables and the session ID number (SIN). Since this involves manual work, a smart approach is to prepare SQL scripts with prompts for the session ID, target table, etc. to ensure safe and smooth operation (a sketch follows below).
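A minimal sketch of the control-table lookup and the prompted rollback script, assuming a hypothetical ETL_CONTROL table with SESSION_ID and SESSION_NAME columns; the &session_id and &target_table prompts are SQL*Plus-style substitution variables (all names are illustrative only):

-- Hypothetical lookup used inside the mapping: resolve the latest session ID
-- for the running session name.
SELECT MAX(SESSION_ID)
FROM   ETL_CONTROL
WHERE  SESSION_NAME = 's_LOAD_LOAN';

-- Hypothetical insert-only rollback after a failed run: remove the rows
-- tagged with the failed session ID before re-running the session.
DELETE FROM &target_table
WHERE  SESSION_ID = &session_id;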


15 Stop at First Error and High Commit Level in Sessions

Ø This strategy involves instructing the PowerCenter server engine to fail at the first error encountered. Setting a high commit interval in the session allows all data loaded into the target to be rolled back. Normally this interval is set to 6 times the size of the incoming data.
Ø For example, if the mapping needs to load 100,000 rows into multiple target tables, the desired commit interval would be 600,000 rows. Once the error is rectified, the session is re-run by the administrator without any changes to the data.


16 Using Update Strategy
Ø Suited for cases where the incoming data volumes are not large but a lot of updates need to be performed. Typical candidates are dimension or master table loads, where data volumes are not high but many updates are involved.
Ø The mapping checks for the presence of the incoming record in the target table; if the record exists, it compares the data, and if a change is found, it overwrites the changed values. Using this feature of PowerCenter it is possible to verify whether the incoming data already exists in the target table.
When a session fails and the mapping is re-run, this ensures that only new or changed records go into the data warehouse or data mart (see the sketch below).
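To show why a re-run is safe, here is a sketch of the equivalent insert-or-update-on-change logic in plain SQL (Oracle-style syntax; the LOAN_DIM target, LOAN_STG staging source and their columns are hypothetical, and NULL handling is ignored for brevity). The mapping itself implements this with a lookup plus an Update Strategy transformation rather than SQL:

-- Update only the rows that exist in the target and have actually changed.
UPDATE LOAN_DIM d
SET    (Borrower_Name, Loan_Amount) =
       (SELECT s.Borrower_Name, s.Loan_Amount
        FROM   LOAN_STG s
        WHERE  s.Loan_Number = d.Loan_Number)
WHERE  EXISTS (SELECT 1
               FROM   LOAN_STG s
               WHERE  s.Loan_Number = d.Loan_Number
               AND   (s.Borrower_Name <> d.Borrower_Name
                  OR  s.Loan_Amount   <> d.Loan_Amount));

-- Insert only the rows that are not yet present in the target.
INSERT INTO LOAN_DIM (Loan_Number, Borrower_Name, Loan_Amount)
SELECT s.Loan_Number, s.Borrower_Name, s.Loan_Amount
FROM   LOAN_STG s
WHERE  NOT EXISTS (SELECT 1
                   FROM   LOAN_DIM d
                   WHERE  d.Loan_Number = s.Loan_Number);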
