Case Study: New York City Higher Education Institution Phase 1
Executive Summary of a Data Governance Project at a New York City University
Introduction
Our initial deployment was at one of the premier colleges within the CUNY network, located in the heart of Manhattan and home to more than 16,000 students. As is the case within all highly functioning enterprises, this College created a centralized and unified set of software applications and databases to serve all departments. Over the past decade, it has improved its IT environment by creating valuable applications for its stakeholders: students, professors and administrators. This diversity in its application environment has helped individual departments realize their immediate goals, but has created a challenge for cross-department functions such as Finance, Career Placement and the Office of the Registrar. Disparate applications and databases for each department lead to a “silo” mentality, inconsistencies within the data environment, and the storage of sensitive data in repositories that may not meet College and regulatory security requirements.
Centralized Data Governance
A relatively small IT team is responsible for the analytical needs of the university community. Several issues can arise when managing a complex information environment:
1. The information needs of the College community are continuously evolving, and the IT department is expected to support new requirements while smoothly maintaining the older information systems.
2. University budgets remain constrained, limiting investment in the resources needed to manage the growing data environment.
3. The complexity of the environment is increasing, and so is the disparity in the data environment.
4. The limited transparency in the environment has led to a lack of awareness about what applications and databases exist in the environment. There is minimal understanding of the redundancy that exists in the environment.
Information management will increase in importance in the coming years as the sheer volume of data and the number of applications that use it increase. To prevent significant lapses in data management operations, it is important for the College to automate as many of its manual operational processes as possible, while systematically improving the maturity of its Data Management infrastructure.
This College chose MyEduLife software to automate the management of its data landscape. MyEduLife employs the same algorithms and techniques that many Fortune 500 firms use for their mission-critical, global data landscapes.
Proof of Concept
The objective of the engagement was to create transparency across software functions and data repositories with minimal usage of client resources. Without prior information about the network or databases, and with read access to an Oracle instance, MyEduLife software was able to profile a sample of the 84 databases associated with that instance.
The College chose 8 Oracle schemas to examine. The databases selected were ones its staff already knew intimately, so that they could verify that the software described the databases accurately. MyEduLife software identified 1,447 tables containing 18,390 columns and 54,353,984 records. The mapping of the schemas included:
• Describing all database, table and column names
• Mapping all column metadata information
• Mapping all column key information
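The mapping steps listed above can be illustrated with a minimal sketch. It is not MyEduLife's implementation: it uses an in-memory SQLite database as a stand-in for the Oracle instance, and the function name map_schema and the sample student table are hypothetical. The idea is only to show how table names, column metadata, key information and record counts can be read from a database catalog with read-only queries.

```python
import sqlite3

def map_schema(conn):
    """Return {table: {"rows": int, "columns": [{"name", "type", "is_key"}, ...]}}."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
        columns = [
            {"name": name, "type": col_type, "is_key": pk > 0}
            for _cid, name, col_type, _notnull, _default, pk
            in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        (rows,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        catalog[table] = {"rows": rows, "columns": columns}
    return catalog

# Hypothetical sample data standing in for one table of a real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, last_name TEXT, zip TEXT)"
)
conn.execute("INSERT INTO student VALUES (1, 'Rivera', '10010')")
catalog = map_schema(conn)
```

Against Oracle, the same information would come from the database's own data dictionary rather than from SQLite's catalog, but the shape of the result (names, types, keys and counts per table) would be similar.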
The next step was to define the quality of the data stored in the databases by analyzing the column data. To begin the process, the software performed an outlier analysis on all columns that had data. The outlier analysis read the column data and examined it to find non-normal data. Outliers were flagged for variance in data type, length, frequency, pattern and value. The outlier analysis uses a six-sigma variance from the mean to flag a column as having variances. It is important to note that outliers do not necessarily indicate a data quality issue. For instance, one column contained three-digit values between 100 and 399. However, there were instances where the column data included three digits and a single alphabetic character. On the surface, this appears to be a data quality issue. However, it was determined to be a false positive because this particular column contained course numbers, and honors courses carry an “h” at the end of the course number to indicate this designation.
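The course-number case can be reproduced with a small pattern-frequency sketch. This is an illustration only, not the software's six-sigma test: it swaps in a simpler rule that flags any value pattern covering at most a given share of the rows. The names shape and pattern_outliers, the 5% threshold and the sample values are all assumptions made for the example.

```python
from collections import Counter

def shape(value):
    """Reduce a value to its pattern: digits -> 9, letters -> A, other characters kept."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in str(value)
    )

def pattern_outliers(values, max_share=0.05):
    """Return {pattern: [values]} for patterns covering at most max_share of the rows."""
    if not values:
        return {}
    shapes = [shape(v) for v in values]
    counts = Counter(shapes)
    rare = {p for p, n in counts.items() if n / len(values) <= max_share}
    flagged = {}
    for value, pattern in zip(values, shapes):
        if pattern in rare:
            flagged.setdefault(pattern, []).append(value)
    return flagged

# Hypothetical column: mostly three-digit course numbers, plus two honors courses.
course_numbers = ["101", "205", "310", "399", "150"] * 10 + ["201h", "305h"]
flagged = pattern_outliers(course_numbers)
print(flagged)  # {'999A': ['201h', '305h']}
```

The honors courses are flagged because their pattern is rare, not because they are wrong, which is why a reviewer who knows the data still has to confirm whether a flagged pattern is a real quality issue.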
After the data quality metrics were reported, columns were then profiled and categorized into both global and business domains. A domain is a standard pattern of data such as a zip code, city or state. Over 90% of the database columns were categorized as either a global domain (a domain that exists across industries such as zip code, social security numbers, city, state, last name, first name etc.) or a business domain (a domain that is unique to a particular business such as a student ID or course number).
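One way to picture domain categorization is a set of patterns tested against each column's values. The sketch below is an assumption about how such a classifier could look, not a description of the product: the pattern table, the classify_column name and the 90% match threshold are hypothetical, and a real classifier would need many more domains and would likely use column names and reference data as well.

```python
import re

# Hypothetical domain patterns: two global domains and one business domain.
DOMAIN_PATTERNS = {
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "course_number": re.compile(r"[1-3]\d{2}h?"),  # 100-399, optional honors suffix
}

def classify_column(values, min_match=0.9):
    """Return the best-matching domain name, or None if no domain reaches min_match."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return None
    best_domain, best_share = None, 0.0
    for domain, pattern in DOMAIN_PATTERNS.items():
        share = sum(1 for v in non_empty if pattern.fullmatch(v)) / len(non_empty)
        if share > best_share:
            best_domain, best_share = domain, share
    return best_domain if best_share >= min_match else None

zip_domain = classify_column(["10010", "10027", "11201-1234"])
course_domain = classify_column(["101", "250h", "399"])
unknown_domain = classify_column(["Rivera", "Chen"])
```

Requiring a high share of matching values rather than all of them lets a column with a few dirty rows still be recognized, while columns that match no pattern stay uncategorized for manual review.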
To properly govern data within the organization, MyEduLife software can scan networks for ports that host database systems. We scanned approximately 160 class C networks (ports 1 to 6,000) to identify unauthorized database servers. Across the networks scanned, one desktop computer running an instance of Oracle Server was identified as being out of compliance.
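A database port check of this kind can be sketched with plain TCP connection attempts. The example below is a simplified illustration under stated assumptions, not the scanner used in the study: it probes only a short, assumed list of default database listener ports instead of the full range of 1 to 6,000, it expects IP addresses rather than host names, and the function names are hypothetical. An open port only suggests a listener; confirming that it is a database server takes a further step.

```python
import socket

# Assumed default listener ports, all within the 1-6,000 range mentioned above.
DATABASE_PORTS = {
    1433: "Microsoft SQL Server",
    1521: "Oracle",
    3306: "MySQL",
    5432: "PostgreSQL",
}

def is_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0

def find_database_listeners(hosts, ports=DATABASE_PORTS):
    """Return (host, port, label) for every open port from `ports` on each host."""
    found = []
    for host in hosts:
        for port, label in ports.items():
            if is_port_open(host, port):
                found.append((host, port, label))
    return found
```

Calling find_database_listeners with the addresses of one class C network (for example, the 254 usable addresses of a /24) would return the hosts that answer on any of the listed ports, which can then be compared against the list of authorized database servers.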
In addition to databases and network ports, data can reside in unstructured files such as spreadsheets and text files. The client had a protected directory of faculty archive files that MyEduLife reviewed to determine whether it contained sensitive student data. The system scanned the directory structure and identified spreadsheets that contained Social Security numbers. While this would normally signal a breach of security policies, the directory was an archive and not available to the organization as a whole. It is worth noting that this proof-of-concept use case served to determine whether MyEduLife could assist in security studies.
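The directory review can be approximated by walking a directory tree and searching file contents for values shaped like Social Security numbers. The sketch below is an assumption-laden illustration, not the product's scanner: it reads only plain-text files (.csv and .txt), whereas binary spreadsheet formats would need a parsing library, and the find_ssn_files name and the simple NNN-NN-NNNN pattern are hypothetical choices that will produce some false positives.

```python
import re
from pathlib import Path

# Matches strings shaped like NNN-NN-NNNN; it does not validate real SSN ranges.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
TEXT_SUFFIXES = {".csv", ".txt"}

def find_ssn_files(root):
    """Return {path: match_count} for text files under root with SSN-like strings."""
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            count = len(SSN_PATTERN.findall(text))
            if count:
                hits[str(path)] = count
    return hits
```

The function reports only file paths and match counts, so the findings can be reviewed without copying the sensitive values themselves into a report.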
Resource Utilization
The study took one person approximately 160 hours to complete, from installation of server hardware to production of the final report, with minimal guidance from College personnel. The client indicated that, with the software tools offered by other vendors, generating the same information would have taken approximately three times longer and would have required effort from its network engineers, database administrators and security administrators.
About MyEduLife
The team at MyEduLife brings together leading experts in Business Data, Education, Career Advancement and Software Creation to address one of the most important topics of the 21st century. Educational data, and how it is aggregated, mined and ultimately used to improve every person’s life, is the passion that drives this team.