Data warehousing project:
Over the last several years I have been engaged in creating and maintaining a data warehouse for a non-profit healthcare organization. This has been a challenging but rewarding project that I was fortunate to be a part of. During this project I have served in multiple roles: acting as project manager; writing integrations and transformations in multiple languages as a software engineer; performing data analysis; defining processes for non-technical staff to create, update, and transform data to be compatible with the needs of the data warehouse; and supervising the technical and non-technical staff engaged in completing the project.
I am a firm believer that no project like this will ever be “done.” However, as the project has moved forward, we have made decisions and performed research that I feel are worth sharing and may help others engaging in similar projects. This post briefly touches on the process we followed and our motivation for selecting the methods and technologies along the way; it is in no way enough information to complete a project of this size on its own.
We chose to use Microsoft SQL Server for our data warehouse’s back end storage solution. While there are other/newer technologies that have been created that may provide options for storing data in less structured, faster or (according to some) better ways, we decided that MS SQL was the best choice for our needs.
One aspect we considered when choosing Microsoft SQL Server was its availability and documentation. SQL Server is a widely used technology with a large community of users and developers, which made it easy to find resources and support when we needed them. Additionally, Microsoft provides extensive documentation on SQL Server, which made it easy for us to learn and understand the technology.
Another reason we chose SQL Server was the wide range of technical staff who have experience with it. Most of our team had prior experience working with SQL Server and similar technologies, which made it easy for us to get up and running with the technology quickly. This also meant that we had a team that was familiar with the technology, which made it easier to troubleshoot and resolve any issues that arose.
We also chose SQL Server because of its compatibility with existing systems and processes already used at the organization. We were able to integrate SQL Server with other systems and applications that were already in place, which made it easy for us to get started with the data warehouse. Additionally, SQL Server was compatible with the organization's existing data structures and relationships, which made it easy to link data across different systems.
Finally, we chose SQL Server because of its low cost to implement. Thanks to Microsoft's software donation and discount programs for non-profits, we were able to implement SQL Server at a lower cost than other options. This helped us to keep the project within budget and allowed us to allocate more resources to other areas of the project.
I chose to use Python as one of our key technologies. Python is a powerful tool that allowed us to perform data transformations, ingest data from standalone data sources, and integrate with external applications and APIs.
One of the reasons Python is so powerful when performing data transformations is its ability to handle large datasets. Python's data manipulation libraries, such as Pandas and Numpy, make it easy to work with large datasets and perform complex data transformations. This was particularly useful when we needed to clean, standardize and enrich the data to make it more useful for analysis.
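As a brief illustration (the column names and records here are hypothetical, not our actual schema), a few lines of Pandas can perform the kind of cleaning and standardization described above:

```python
import pandas as pd

# Hypothetical raw client records with inconsistent formatting.
raw = pd.DataFrame({
    "client_id": [101, 102, 103],
    "name": ["  smith, jane ", "DOE, JOHN", "Lee,  Ann"],
    "dob": ["1985-03-02", "03/14/1990", "1978-11-30"],
})

# Standardize whitespace and casing on names.
raw["name"] = (
    raw["name"].str.strip()
               .str.replace(r"\s+", " ", regex=True)
               .str.title()
)

# Parse mixed date formats element by element into a datetime column.
raw["dob"] = raw["dob"].apply(pd.to_datetime)
print(raw)
```

Small, composable steps like these scale well to much larger frames, which is what made Pandas a good fit for our transformation work.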
In addition to data transformations, Python also proved to be a valuable tool when it came to ingesting data from standalone data sources. For example, we used Python to automate the process of importing data from static reports generated by other applications, as well as manually generated spreadsheets containing operational and scheduling data. This allowed us to easily integrate data from a variety of sources, which was crucial in creating a unified view of the data.
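A sketch of this kind of ingestion, using an in-memory stand-in for a CSV export of a manually maintained scheduling spreadsheet (in practice we read the actual files, e.g. with `pandas.read_excel`; the column names are hypothetical):

```python
import io

import pandas as pd

# Stand-in for a CSV export of a manually maintained scheduling spreadsheet.
csv_data = io.StringIO(
    "Staff,Site,Shift Date,Hours\n"
    "J. Smith,Main Clinic,2023-01-09,8\n"
    "A. Lee,Main Clinic,2023-01-09,6\n"
)

schedule = pd.read_csv(csv_data, parse_dates=["Shift Date"])

# Normalize column names to match warehouse conventions before loading.
schedule.columns = [c.lower().replace(" ", "_") for c in schedule.columns]
print(schedule.columns.tolist())
```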
Another benefit of using Python is the wide range of modules available to be integrated. Python has a large ecosystem of libraries and frameworks that can be used to perform a variety of tasks, such as web scraping, data visualization, and machine learning. This gave us the flexibility to choose the best tools for the job and made it easy to add new functionality as needed.
Finally, Python has strong community support and extensive documentation available. This made it easy for us to find resources and support when we needed them, which was particularly helpful when troubleshooting and resolving issues that arose.
External APIs to expand and enrich our data:
One of the key aspects we considered while executing this project was the need to have accurate, complete and up-to-date information about clients and staff. One way to achieve this is by integrating external APIs to expand and enrich the data stored in the data warehouse.
We used a variety of external APIs to enhance the data stored in our data warehouse. For example, we used Twilio's phone number validation and formatting API to ensure that phone numbers were entered in a consistent format and that they were valid. This helped us to ensure that we had accurate contact information for clients and staff, which was essential for communication and follow-up. Additionally, through other Twilio APIs we are able to perform telephone-based communications with our clients and staff for appointment reminders, quality-of-care surveys, staff alerting, and other operational uses.
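To illustrate the kind of normalization a phone validation service performs, here is a simplified local stand-in for U.S. numbers. It only formats to E.164; unlike a lookup API, it cannot verify that a number is real or in service:

```python
import re
from typing import Optional

def to_e164_us(raw: str) -> Optional[str]:
    """Roughly normalize a U.S. phone number to E.164 format.

    A simplified stand-in for what a validation API provides; it
    cannot confirm the number is actually assigned or reachable.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return None  # not a plausible U.S. number
    return "+1" + digits

print(to_e164_us("(555) 867-5309"))  # → +15558675309
print(to_e164_us("555-867"))         # → None
```

Storing numbers in one canonical format like this is what made downstream matching and automated outreach reliable for us.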
We also used the Geocodio API for address parsing, completion, and standardization, and to append U.S. Census information such as school districts, state legislative districts, congressional districts and contact info, and demographic data covering families, education and veteran status, housing, and income. This allowed us to provide a more complete picture of our clients and staff, and it helped us to make better data-driven decisions.
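As a rough sketch of how such an enrichment request can be assembled (the endpoint version and field names below are assumptions for illustration; consult the Geocodio documentation for the fields available on your plan):

```python
from urllib.parse import urlencode

# Endpoint version is an assumption; check the current Geocodio docs.
GEOCODIO_URL = "https://api.geocod.io/v1.7/geocode"

def build_enrichment_request(address: str, api_key: str) -> str:
    """Build a Geocodio request URL asking for enrichment fields.

    The field names here (census, school districts, state legislative
    and congressional districts) are illustrative examples of the
    `fields` parameter, not an exhaustive or guaranteed list.
    """
    params = {
        "q": address,
        "fields": "census,school,stateleg,cd",
        "api_key": api_key,
    }
    return f"{GEOCODIO_URL}?{urlencode(params)}"

url = build_enrichment_request("1600 Pennsylvania Ave NW, Washington DC", "YOUR_KEY")
print(url)
```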
For example, by integrating demographic data such as income and education, we were able to identify patterns and trends that could be used to improve patient care and financial performance. Additionally, by integrating data on families, we were able to better understand the social context of our clients and staff, which helped us to improve our services and support.
We found that integrating external APIs was an important step in creating and maintaining our data warehouse. By using APIs such as Twilio's Phone number validation and formatting API and the Geocodio API, we were able to ensure that the data stored in the data warehouse was accurate, complete, and up-to-date, which helped us to make better data-driven decisions and improve patient care and financial performance.
From the outset of the project we knew it was important to have the ability to easily analyze and visualize the data stored in the warehouse. That's why we chose Power BI as our data visualization and business intelligence tool. Power BI is a powerful tool that allowed us to integrate with our data warehouse to perform analysis, build reporting, and create visualizations. The ability to easily drill down on key data points and connect data from multiple sources was crucial in identifying trends and patterns that we were able to use to make data-driven decisions.
One of the key advantages of using Power BI is its ability to create interactive dashboards. These dashboards allowed us to quickly and easily explore the data, and they made it easy for us to identify key insights and trends. Additionally, Power BI's drag-and-drop interface made it easy for us to create and customize our visualizations, which helped us to communicate our findings in a clear and effective way. The familiar interface in Power BI, which shares many similar aspects to other Microsoft Office tools, also enabled us to empower non-technical users to create and maintain their own dashboards to explore data.
Another advantage of using Power BI is its ability to publish dashboards to the web. This made it easy and secure for non-technical staff to view and draw value from the data being visualized. This was particularly useful for our healthcare organization, where staff members from different departments needed to access and understand the data in order to make data-driven decisions.
Power BI has proven to be a valuable tool. Its ability to integrate with our data warehouse to perform analysis, build reporting, and create visualizations, along with the ability to publish dashboards to the web, helped us easily identify trends and patterns in the data and make data-driven decisions. This was crucial for our healthcare organization in order to improve patient care and financial performance.
Data governance, security and privacy:
When creating a data warehouse, one of the main challenges is ensuring that the data is accurate, reliable, and secure. Our team faced this challenge head-on by implementing data governance and quality control processes to ensure that the data was accurate and consistent. Additionally, we made sure that the data was accessible to authorized users only.
To control access to the data, we decided to provide access to anonymized data through analytical dashboards instead of direct access to source or transformed data. This allowed authorized users to gain insights and make decisions based on the data without compromising patient privacy. Additionally, we limited access to the data based on group enrollment and through managed exports built on database views and stored procedures.
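One simple technique for this kind of anonymized exposure is to replace identifiers with salted one-way hashes before data reaches a dashboard. A minimal sketch (the identifier format and salt handling here are hypothetical; production use also requires proper key management and review of quasi-identifiers such as DOB and ZIP code):

```python
import hashlib

def pseudonymize(client_id: str, salt: str) -> str:
    """Replace a client identifier with a salted one-way hash so
    dashboard users can count and link records without seeing the
    underlying identifier."""
    return hashlib.sha256((salt + client_id).encode()).hexdigest()[:16]

# Hypothetical record on its way to an analytical dashboard.
records = [{"client_id": "MRN-00123", "service": "counseling"}]
safe = [
    {**r, "client_id": pseudonymize(r["client_id"], salt="s3cret")}
    for r in records
]
print(safe[0]["client_id"])
```

Because the same input always yields the same token, records can still be joined across datasets without exposing the original identifier.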
Another important aspect of the project was ensuring the security of the data; it was essential to have robust security measures in place to protect the data from unauthorized access or breaches. To achieve this, we implemented security protocols such as access control, encryption, and monitoring. We also managed access to the data through single sign-on integration with Office365 accounts using multi-factor authentication. This ensured that access to the data was only possible by identified and authenticated staff, contractors, and partner agencies.
The first step we took when creating our data warehouse was surveying each program area and administrative department. By meeting with each department, we were able to understand what data sources they used, how they were used, what data was stored in each source, and the process for how frequently it was updated. Additionally, we were able to identify who was responsible for capturing, maintaining, and performing quality audits on each data source and process.
This survey also included a review of data that was not stored in databases. This helped us to identify data that was not being captured digitally at all. By identifying this data, we were able to develop process improvements to better capture this information in the future. This was particularly important for a healthcare organization, where accurate and up-to-date data is critical for providing high-quality patient care.
By conducting this survey, we were able to understand the specific data needs and requirements of each department. This helped us to design a data warehouse that was tailored to meet the needs of all stakeholders. We were able to ensure that the data was stored in a format that was easy to use, that it was accurate, and that it was updated on a regular basis. This helped us to create a data warehouse that was not only useful, but also easy to use and maintain.
Extract, Load, Transform:
Next, we integrated our key data source systems. We integrated our data sources in stages, prioritizing the largest and most critical systems first and working iteratively toward smaller, less frequently updated, and/or more complex systems and data sources over time. We first extracted data from key source systems such as electronic health records and billing systems, and loaded it into staging tables in our data warehouse. This process not only helped us to build an archive of the state of the data over time, but it also allowed us to avoid putting too much strain on the source systems by extracting the data in its raw form.
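The load step can be sketched as follows, with SQLite standing in for SQL Server and a hypothetical billing extract kept in its raw, untyped form:

```python
import sqlite3
from datetime import datetime, timezone

import pandas as pd

# Hypothetical raw extract from a billing system, loaded unchanged into a
# staging table with a load timestamp. SQLite stands in for SQL Server here.
extract = pd.DataFrame({
    "claim_id": [1, 2],
    "amount": ["100.00", "250.50"],  # left as text, exactly as extracted
})
extract["loaded_at"] = datetime.now(timezone.utc).isoformat()

conn = sqlite3.connect(":memory:")
extract.to_sql("stg_billing_claims", conn, if_exists="append", index=False)

# The staging table preserves the raw state; transformation happens later.
rows = conn.execute("SELECT COUNT(*) FROM stg_billing_claims").fetchone()[0]
print(rows)  # → 2
```

Appending each extract with a timestamp is what builds the historical archive mentioned above, while keeping the source-system query as cheap as possible.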
Integrating the data was a complex process that required a deep understanding of the data structures and relationships within each system. It was important to make sure that we were pulling in all of the necessary information, and that it was being loaded into the data warehouse in a way that would make it easy to work with later on.
Once the data was integrated, we moved on to transforming it. This step involved cleaning, standardizing, and enriching the data to make it more useful for analysis. One of the key areas we focused on during the transformation process was date manipulation. We calculated ages, age ranges, and grouped data based on dates of birth and diagnosis dates. Additionally, by breaking down service dates by fiscal year, quarter, month, and year, we were able to easily link service-related data to fiscal performance data later on.
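A sketch of these date transformations in Pandas, assuming a fiscal year that starts July 1 (adjust `fy_start_month` to your fiscal calendar; the column names and records are hypothetical):

```python
import pandas as pd

# Hypothetical service records; real column names differ.
services = pd.DataFrame({
    "service_date": pd.to_datetime(["2022-07-15", "2023-01-10"]),
    "dob": pd.to_datetime(["1980-06-01", "2005-02-20"]),
})

# Age at the time of service, in whole years (simple day-count approximation).
services["age"] = (services["service_date"] - services["dob"]).dt.days // 365

# Fiscal year (named for its ending year) and fiscal quarter.
fy_start_month = 7
services["fiscal_year"] = services["service_date"].dt.year + (
    services["service_date"].dt.month >= fy_start_month
).astype(int)
services["fiscal_quarter"] = (
    (services["service_date"].dt.month - fy_start_month) % 12 // 3 + 1
)
print(services[["age", "fiscal_year", "fiscal_quarter"]])
```

With fiscal year and quarter precomputed on every service row, joining service activity to fiscal performance data later becomes a simple key match.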
Overall, the process of extracting, loading, and transforming data is crucial for creating a data warehouse that is accurate, reliable, and useful for analysis. It requires a deep understanding of the data and the systems it comes from, as well as the ability to work with it in a way that makes it easy to use later on. With the data transformed and integrated, we were able to move on to the next step of the process – analysis and reporting.
Master People Index:
An important step in creating the data warehouse was building the master people index. This step involved matching unique identifiers for each client served across the multiple source systems, and performing partial or imperfect matches between systems where names, dates of birth, and addresses may be different or entered incorrectly.
The master people index was critical in ensuring the accuracy and completeness of the data, and it allowed us to link information from different systems to create a single, unified view of each client. To build the master people index, we used a combination of data matching and de-duplication techniques, such as probabilistic matching and deterministic matching.
Probabilistic matching uses statistical algorithms to determine the likelihood that two records refer to the same person, based on factors such as name, date of birth, and address. Deterministic matching uses a set of rules to determine whether two records are a match, such as an exact match on a unique identifier like a Social Security Number.
We also performed partial or imperfect matches, which allowed us to match records that did not have a perfect match but still had a high likelihood of being the same person. This was particularly useful in situations where names, dates of birth, and addresses may be different or entered incorrectly.
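A minimal sketch of both approaches (the weights and similarity measure here are illustrative only, not our production configuration; real implementations typically use dedicated record-linkage libraries and careful threshold tuning):

```python
from difflib import SequenceMatcher

def deterministic_match(a: dict, b: dict) -> bool:
    """Exact-match rule on a unique identifier, when both records have one."""
    return bool(a.get("ssn")) and a.get("ssn") == b.get("ssn")

def probabilistic_score(a: dict, b: dict) -> float:
    """Weighted similarity across name, DOB, and address (illustrative weights)."""
    def sim(x, y):
        return SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return (0.4 * sim(a["name"], b["name"])
            + 0.35 * (1.0 if a["dob"] == b["dob"] else 0.0)
            + 0.25 * sim(a["address"], b["address"]))

# Two hypothetical records for the same person, entered differently.
rec1 = {"name": "Jane Smith", "dob": "1985-03-02", "address": "12 Oak St"}
rec2 = {"name": "Smith, Jane", "dob": "1985-03-02", "address": "12 Oak Street"}

score = probabilistic_score(rec1, rec2)
print(round(score, 2))
```

Records that fail the deterministic rule but score above a chosen threshold become candidates for a partial match, often with human review before they are merged.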
By building the master people index, we were able to ensure that the data was accurate and could be easily linked across different systems, which was crucial for making data-driven decisions.
Analyze the data:
Analyzing the data in our new data warehouse proved to be one of the biggest challenges we faced. Despite being able to connect data across multiple sources, drawing out the insights that mattered most for patient care and financial performance was difficult. This was compounded by having technical staff perform the initial phases of analysis: while these staff (myself included) were able to find connections, correlations, and insights in the data sets, the results were less valuable to the organization than expected.
To overcome this, we engaged key stakeholders across different departments and programs. Through an iterative process, we worked closely with them to understand their specific needs and requirements, and then used that information to guide our analysis. Through this engagement, we were able to identify the key metrics and data points most relevant to their work and focus our analysis on those areas.
We worked with a group of key staff across the organization, granting them access to subsets of data and tools that helped link data, draw out ideas, and explore what was possible. Through each iteration with these staff we were able to demonstrate what data was available and what types of transformations and data joins were possible.
As our data sources continue to evolve and our procedures and requirements change, we have continued to revisit this process so that the data warehouse can continue to grow and support the needs of the organization.
Visualize and report on the data:
Visualizing and reporting on the data is crucial in gaining insights and making data-driven decisions. Our team leveraged tools such as Power BI to create interactive dashboards and reports that allowed us to easily understand and explore the data. This step was essential in gaining insights into patient care, financial performance, and operations, which helped the organization make informed decisions.
Building these dashboards and visualizations was a process that required combining data from different sources and creating meaningful metrics that could be used to track progress and measure performance. This step was critical in providing a clear picture of the data and allowed us to present the insights in a way that was easy to understand and use.
One of the most beneficial areas for the organization early on was financial performance reporting. The structured nature of the financial data made it easy to link revenue, services performed, billing rate codes, managed care organizations, and insurance providers. Through the use of time-based visualizations, we were able to demonstrate the effectiveness of programs over time and compare and contrast regulatory and contractual changes across programs and locations.
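A toy example of this kind of linkage, joining services to billing rates to compute expected revenue (all codes and values here are hypothetical):

```python
import pandas as pd

# Hypothetical service activity and billing rate tables.
services = pd.DataFrame({
    "service_id": [1, 2, 3],
    "rate_code": ["A100", "A100", "B200"],
    "units": [2, 1, 3],
})
rates = pd.DataFrame({
    "rate_code": ["A100", "B200"],
    "rate": [50.0, 80.0],
})

# Join services to billing rates and compute expected revenue per service.
revenue = services.merge(rates, on="rate_code", how="left")
revenue["expected_revenue"] = revenue["units"] * revenue["rate"]
print(revenue["expected_revenue"].sum())  # → 390.0
```

Once joined this way, grouping by fiscal period, program, or payer is a one-line aggregation, which is what made the time-based financial visualizations straightforward to build.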
In addition to financial performance, we also created visualizations that helped us understand patient care using data such as patient demographics, cross-program enrollment, and outcomes. Exploring this type of data allowed us to identify gaps in our data collection processes, make changes to better capture data, and surface trends and patterns that represent quality of care through metrics tied to the social determinants of health.
As we continue to develop and improve our data warehouse, we are constantly looking for ways to analyze and extract insights from the data more effectively. One area that we are particularly interested in is the use of advanced analytics tools and techniques.
One of the key areas we plan to explore is the use of machine learning algorithms. Machine learning is a powerful tool that can be used to identify patterns in data that are not immediately apparent. By using machine learning algorithms, we can identify trends and patterns in the data that can help us make better decisions and improve the overall quality of care.
Another area we plan to explore is the use of natural language processing (NLP). NLP is a technique that can be used to extract insights from unstructured data such as patient notes, service documentation and other clinical documentation. By using NLP, we can gain insights from data that would otherwise be difficult to extract. This can help us identify patterns in patient care and make more informed decisions about how to improve the overall quality of care.
In addition, we also plan to explore the use of predictive analytics, which can help us identify trends, patterns, and relationships in the data that can help us predict future outcomes and make more informed decisions.
Overall, by exploring these advanced analytics techniques and tools, we can gain a deeper understanding of the data and make more informed decisions that will ultimately help us improve the overall quality of care for our patients.
In this post, I review the journey of creating and maintaining a data warehouse for a healthcare organization. It all started with us surveying different departments to get a sense of what data they had, what they needed, and how they wanted to use it. This helped us design the data warehouse to fit the needs of everyone involved.
We relied on technologies like SQL Server, Python, and Power BI to help us organize, transform, and make sense of the data. Along the way, we faced some challenges like making sure the data was accurate and keeping it secure. But we also found ways to enhance the data, like using external APIs to add more information to our records. And in the end, we used Power BI to build dashboards and reports that helped the organization make better decisions with the data.
Looking to the future, we're excited to explore even more advanced analytics tools. We're thinking about using machine learning and natural language processing to uncover insights that might not be obvious at first glance. It's all about finding new ways to make sense of the data and make a real impact on patient care.
I hope that you’ve found this brief explanation of the process to be helpful and informative. I plan to write more in the future to provide a deeper technical review of specific aspects of this project.