Wednesday, December 18, 2019

Data Analysis And Data Processes

Data transformation is often very complex and is typically the most costly part of the ETL process. Transformations are sometimes performed outside the database using flat files, but they mostly occur within an Oracle database. The transform step applies rules or functions to the extracted data. These rules or functions determine how the data will be analyzed and can involve transformations such as the following (a short sketch of such a transform step is given at the end of this section):

• Data summations
• Data merging
• Data encoding
• Data splitting
• Data calculations
• Creating surrogate keys
• Data aggregation

Once transformed, the data is clean, accurate, consistent and ready for analysis by the data warehouse users. Data can be transformed in two ways:

• Multistage data transformation
• Pipelined data transformation

In pipelined data transformation, most of the tasks are completed outside the database, for example steps 1 to 4 as shown in Figure 6 below, and the result is then inserted into the warehouse table.

FIGURE 6: PIPELINED DATA TRANSFORMATION

2.2.2 Cleansing Data

Data cleansing, also called data cleaning or data scrubbing, is the process of removing incorrect, inappropriate and duplicate data. Not cleansing the data leads to inaccurate and unreliable results. When dirty data is detected during the cleansing process, it has to be corrected, which causes delays in the process. Dirty data can be caused by the following:

• Members of an organization are poorly trained and therefore enter data erroneously.
• Inaccurate system configuration rules are applied.
• Regular data updates are neglected.
• Validation rules are inconsistent or missing.
• Duplicates are not removed.

Data cleansing can be performed on a single data set or across multiple data sets. Problems within a single data set are usually due to misspellings or information left out at data entry. When problems occur across multiple data sets, e.g. in a data warehouse, the need to cleanse data increases greatly, because the data is very likely to contain redundant records from different sources, all of which must be identified and reconciled. Sketches of a pipelined transformation and of a simple cleansing pass are also given below.
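To make the list of transformation rules above concrete, here is a minimal sketch in Python using pandas. Everything in it (the column names, the country-code table, the 20% VAT rate) is an assumption made for illustration, not something taken from the original text.

import pandas as pd

# Hypothetical extracted records; column names are assumed for illustration.
extracted = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "country":   ["UK", "UK"],
    "amount":    [120.50, 99.99],
})

transformed = extracted.copy()

# Data splitting: break one field into two.
transformed[["first_name", "last_name"]] = (
    transformed["full_name"].str.split(" ", n=1, expand=True)
)

# Data encoding: map a free-text value onto a fixed code (assumed table).
country_codes = {"UK": 44, "US": 1}
transformed["country_code"] = transformed["country"].map(country_codes)

# Data calculation: derive a new value from an existing one (assumed 20% VAT).
transformed["amount_with_vat"] = transformed["amount"] * 1.20

# Surrogate key: a warehouse-generated key independent of the source data.
transformed["customer_sk"] = range(1, len(transformed) + 1)

# Data aggregation/summation: roll the detail rows up per country.
summary = transformed.groupby("country_code", as_index=False)["amount"].sum()
print(summary)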
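Pipelined transformation, as described above, passes each row through the transformation steps on its way to the warehouse table instead of materializing intermediate staging tables. A minimal sketch using Python generators; the file name, stage names and validation rule are all hypothetical:

import csv

# Create a tiny hypothetical source file so the sketch is self-contained.
with open("sales.csv", "w", newline="") as f:
    f.write("id,amount\n1,120.50\n2,\n3,99.99\n")

def extract(path):
    # Step 1: read rows from the flat file one at a time.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def validate(rows):
    # Step 2: drop rows that fail a basic rule (missing amount).
    for row in rows:
        if row.get("amount"):
            yield row

def convert(rows):
    # Step 3: apply a transformation to each row as it passes through.
    for row in rows:
        row["amount"] = float(row["amount"])
        yield row

def load(rows):
    # Step 4: insert into the warehouse table (stubbed out here;
    # in practice this would be an INSERT into the target table).
    for row in rows:
        print("INSERT", row)

# The stages are chained, so each row is transformed in flight;
# no intermediate staging table is materialized between the steps.
load(convert(validate(extract("sales.csv"))))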
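The causes of dirty data listed above (missing validation, duplicates, misspellings) translate into a few standard cleansing operations. A minimal sketch with pandas, assuming hypothetical column names and a made-up correction table:

import pandas as pd

raw = pd.DataFrame({
    "customer": ["alice", "ALICE", "bob", None],
    "city":     ["Londno", "London", "London", "London"],  # note the misspelling
})

clean = raw.copy()

# Standardize case so near-duplicates compare equal.
clean["customer"] = clean["customer"].str.lower()

# Fix known misspellings via a lookup table (assumed corrections).
clean["city"] = clean["city"].replace({"Londno": "London"})

# Validation rule: reject rows with missing mandatory fields.
clean = clean.dropna(subset=["customer"])

# Remove the exact duplicates left after standardization.
clean = clean.drop_duplicates()

print(clean)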
