Data warehouse management
Data warehouse is a Process of collection data from different sources and loading the data in one place to get the once format of data to build different kinds of reports which will be used for higher business development and for higher business decisions.
There are different kinds of roles and responsibility involved in data warehouse.
General roles in the data warehouse team for the specific software’s
1. UNIX admin (Aix, Linux, Solaris...Etc)--OS
2. DBAs (oracle, DB2, SQL...Etc)--Database
3. Datastage---ETL Tool
4. Business object developers—ETL reporting tool
I am talking about only about datastage and business object with Oracle/DB2 as database based in aix or Linux environment since the present data warehouse working environment is the same for me.
The datawarehosue project always should be maintained to get the continuous and steady report process.The main area of datawarehosue project where the space maintenance should be done as fallows
1. Unix environment
a. For the daily, weekly and monthly loading process. The data from different sources will be imported to the data warehouse server .The data loaded will be raw data and will be having lot of unwanted data to be purified. Once the data is purified, the 2nd processed data size file will be less compare to the raw data.(Ex:If the raw data is 400GB and after purified it for the exact required data. Then the file size will be less and again it will be in the UNIX server ready for the real data load for the DWH process.)
Once the raw data is processed or purified for the real data. The raw file is waste and will not be used unless if we need to re-run the whole process again. So please indentify the raw files and zip it and take a back up of the file and delete it from the DWH server for the next month load space. If we automate this process. Then every month their will not be any space problem for the data loading in the server.
b. The 2nd processed files which will be less size then raw file but still after the DWH process for the for the current day, week or month. It will be wasted .you can zip it and take a backup of the files and delete it to get the space for the next month load.
c. Always we will take the Database backup to the current database or DWH server every day or week or month. Make use you zip the backup file after database backup and take the tape backup and delete it from the server to make the space for the next month backup.
d. When every the datastage jobs runs, There will be intermediate file(hash or look up files..etc) created in the datastage path ,due to which the mount point will be 100% one day. Please make use the unwanted files are deleted after the DWH process to make the space for the next Run.
e.when every we do some testing ,most of the file will be unloaded to /temp or / by deflat.which will have very less space. Even when datastage jobs run the logs will be created in the /tmp where if it become 100%. Then the datastage job will fail. So please make sure you clean up the old logs and other unwanted files regularly
2. DBA space management
a. The Normal datawarehouse process will always take the full data from the files or other database to the current DWH database.
The DWH database is divided in 3 main parts
a. Staging database
The main loading process will be importing all the raw files data to the staging database .Here most of the data will be unpurified. It will be most of the time always having the same source Table structure for every month, week or day.
So every month the same kind of data loading but for the different table name representing the current month.
We can always take the backup of these tables and can be delete the tables to make the space for the next month load.(Ex:If the table is SMS table ,Then after the file (file size is 25GB)load of 25GB,Then the space 25GB will be taken from Unix file system every month. so if the /database mount point has 100GB space. then within 4 Months it will be full and no space. Or if the DBMS tablespace and tablespace is allocated to 100GB then within 4 months again it will be full and need to be increase the space.) So better to take to backup of this unwanted data and delete it from the database. Later if needed to re-run the DWH process, any time we can load this backup data and run it.
b. processing Database
Usually all the tables in this database will be empty or every time will be updated with new data. since while running the DWH process, we need some intermediate tables to do the merging and other calculations. So these tables will be used and the data will be deleted and will be less in the table. so better to keep the tables as it is. Since it will not impact more on the space problem. Yes but we need to deleted unwanted tables which are created during the implementation or any test process which may not be used after the production/development process is completed.
C. Dimensional database
Dimensional database is the final database table while the real processed data will be stored. The data will be in different tables and may be in one table. The data need to be stored for more than 10 years some time. But the server or database may not be having that much of space to keep the huge amount of data. Some time it will be backup to mirror servers and some time if it is not required regularly, then it will be backed up and kept in the tapes and other storage server.
Note: If the data is very important like banking company. We should be very careful about the data backup and we should make the user can access the Old data when they needed it.
Data from the dimensional tables are viewed as fallows
1.The Processed data can be migrated to another different database where all the users have access to the database using any front end application to access the data. Here the main DWH database dimensional data will not be having problem since the whole data is loaded the database.
2.The materialized query Table will be created in the same DWH dimensional database and which will be storing the data from all the different dimensional tables to one or two materialized to which the users or any reporting application like business object will be connected to get the data.(Note here the since the whole data is physical stored /copied from different dimensional tables to the materialized query table. There will not be any connection to real data in the dimensional table to materialized query table after the process. so to get the current data ,we need to refresh the materialized query table every time to get the current data after the DWH process completed.
3. The general views will be created to see the data from different dimensional tables and the data will be current data here. since it will not store the data physical in the view/table.