I led this project under the mentorship of the Data Science group director at Nokia Bell Labs
Researched, designed and developed a prototype for data exploration in an autoML tool.
June 2022 - May 2023
Exploratory data analysis is a critical process in data science, as it lays the foundation for insights and models to be developed. However, this process can be time-consuming and cognitively overwhelming, with most data scientists spending more than half of their time looking for interesting patterns in datasets with 100+ columns.
The past few years have seen a large rise in DS/ML tools that aim to automate the ML pipeline. However, these tools fail to accelerate data analysis because they rely on completely automating the end-to-end pipeline, leaving humans out of the loop.
The human, curiosity-driven nature of knowledge discovery makes it difficult to automate. As a result, 70-80% of a data science project remains manual and cognitively overwhelming, leaving it up to users to connect the dots in extremely large datasets.
Despite the increasing trend of automating different parts of the DS lifecycle, and the growing usage of autoML tools such as VertexAI and AzureML, data exploration continues to be a painstakingly time-consuming and manual process. Because of the amount of creativity this phase requires, tools must be carefully designed to keep control in the hands of the human. At what points can automation be introduced to augment human creativity and curiosity?
Understand how data scientists would explore an unfamiliar tabular dataset for a classification task and their experience with existing autoML tools.
My research goals for Round 1 were to understand how users would explore an unfamiliar tabular dataset for a classification task and their experience with existing autoML tools. To explore this, I conducted 45-minute semi-structured interviews with 6 data scientists within a research team.
Some users relied heavily on modeling to understand the data, and were looking for tools that would give them a stronger grasp of it beforehand.
Users talked about feeling lost while studying large datasets, not knowing what to do with the data, and the danger of overlooking patterns.
Users described existing autoML tools as a “no-brainer” but also as “black boxes”.
Concept 1: Quality Fingerprint
Understanding the quality of data was something crucial to users in the initial steps. The system could reduce the load placed on users by automatically calculating quality based on commonly used criteria, and visualizing its varying levels in a single view.
Concept 2: Quality Fingerprint with columns
While the previous concept prioritized visualizing the entire dataset in a single view, this concept focused on giving users contextual information about the columns through column names.
It also looked to provide users with a more natural transition into the second level of granularity by substituting colors with values when zooming in.
At the second granularity level, my goal was to allow users to feel in control of the exploration process and further inspect various patterns.
Guiding users using automated insights
At this level, my goal was to help users navigate through the table using automated insights. The insights were designed to highlight anomalies in a dataset that often tap into the curiosity of data scientists during exploration.
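An insight generator of this kind can be sketched as a scan over columns that emits short messages when something noteworthy is found. The two checks below (high missingness, constant columns) and the 20% threshold are illustrative assumptions; the actual insight set was defined with the research team.

```python
from typing import Dict, List, Optional

def generate_insights(columns: Dict[str, List[Optional[float]]]) -> List[str]:
    """Scan columns and surface anomalies as short, human-readable messages."""
    insights = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_rate = 1 - len(present) / len(values) if values else 0.0
        if missing_rate > 0.2:  # assumed default threshold
            insights.append(f"{name}: {missing_rate:.0%} of values are missing")
        if present and len(set(present)) == 1:
            insights.append(f"{name}: column is constant ({present[0]})")
    return insights
```

Surfacing these messages next to the table gives users concrete starting points instead of an undifferentiated wall of columns.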
Visualizing key relationships
Data scientists at the Bell Labs research group have created algorithms to extract relationships from the data. This section houses them together, conveying important statistical features through network visualization diagrams.
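The Bell Labs relationship-extraction algorithms themselves are not public, so as a stand-in, the network's edges can be sketched from pairwise Pearson correlation: columns become nodes, and pairs whose correlation clears a threshold become edges.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def correlation_edges(columns: Dict[str, List[float]],
                      threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return (column_a, column_b, r) edges for strongly correlated pairs."""
    def pearson(xs: List[float], ys: List[float]) -> float:
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy) if vx and vy else 0.0

    edges = []
    for a, b in combinations(columns, 2):  # every unordered column pair
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges
```

Feeding these edges into a force-directed layout produces the network diagram described above.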
The final granularity level helped users study individual columns in isolation.
Understand how quickly users were able to navigate through the dataset at different levels of detail and how well the system supported their data exploration workflow.
The tasks given to users were:
T1, Level 1: Gather an overview of the data using the visualization and insights in Level 1.
T2, Level 2: Use the automated insights in Level 2 to study specific columns.
T3, Level 2: Understand relationships between columns through automated visualizations.
T4, Level 2: Filter to generate automated insights for a subset of columns.
T5, Level 3: Study columns individually and filter values using the interactive visualization.
Key Feedback
The scenarios I created journey maps for are:
Considering the high cognitive load placed on users while studying tabular data, designing comprehensive visualizations to ease users into a dataset was one of my primary goals. To benchmark design concepts, I created a checklist of principles: 3 based on E. Tufte's guidelines and 3 drawn from user feedback.
Initial concept: Encoding quality in an interactive heatmap
Final Concept: Connecting views using a brushing & linking interaction
An example of brushing & linking using D3.js
By creating 3 responsive views, the interaction adds another layer of visualization before users dive deeper into specific columns.
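The prototype implements this with D3's brush events, which depend on the DOM, but the underlying linking logic can be sketched independently of any rendering library. The class below is an illustrative stand-in: a brush in one view updates a shared selection, and every registered view re-renders from it.

```python
from typing import Callable, Dict, List

class LinkedViews:
    """Minimal brushing & linking model for coordinating multiple views."""
    def __init__(self, rows: List[Dict]):
        self.rows = rows
        self.selection = rows  # no brush yet: everything is shown
        self.views: List[Callable[[List[Dict]], None]] = []

    def register(self, render: Callable[[List[Dict]], None]) -> None:
        """Add a view; it will be re-rendered on every brush."""
        self.views.append(render)

    def brush(self, column: str, low: float, high: float) -> None:
        """Keep rows whose value in `column` lies in the brushed range,
        then propagate the new selection to every linked view."""
        self.selection = [r for r in self.rows if low <= r[column] <= high]
        for render in self.views:
            render(self.selection)
```

Registering the three responsive views against one shared selection is what makes a brush in any view instantly update the other two.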
I had 20-minute short interview sessions with 2 visualization experts at Georgia Tech and 2 senior data scientists.
While I received positive feedback on the concepts, visualization experts suggested using less space for the main visualization and thinking about how users would decipher the column names when 50+ data points are involved.
Having laid down the foundation of the tool in the previous design phase, the goal of this design phase was to further refine the system based on feedback from usability tests and interviews.
Users did not find the column-level visualizations very helpful in the previous design. Rather than viewing the distribution of individual columns, they preferred to plot relationships between various columns.
Similarly, users mentioned they would prefer to see data that is more descriptive of the dataset before understanding its quality.
A key finding from usability tests was that users were not willing to trust the visualization without seeing quantitative information. Will this design change be enough for users to trust the overview visualization?
During the usability tests I discovered that visualizing relationships between columns was more critical to users' sensemaking process than studying a single column in isolation. For the final granularity level, my goal was to allow users to visualize relationships between columns without having to write code.
2 users voiced confusion on their first impression of the chart. However, once they spent a few minutes with the view, they were able to grasp the visualization.
3 out of the 4 users seemed skeptical about the visualization. One user even counted the dots in the visualization to make sure all columns were correctly being represented!
4/4 users in the usability tests immediately wanted to know more about how the insights were being calculated.
2/4 users in the usability tests wanted to be able to play around with the thresholds. This also aligned with a key finding from the exploratory interviews where 2/4 users mentioned that their trust in the system overlapped with how much control they had.
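That desire for control translates into a simple design move: expose the insight thresholds as parameters users can override instead of hard-coding them. The specific defaults and insight names below are illustrative assumptions, not the system's actual values.

```python
from typing import Dict, List, Optional

# Defaults the user can override from the UI (illustrative values).
DEFAULT_THRESHOLDS = {"missing_rate": 0.2, "outlier_zscore": 3.0}

def missingness_insight(name: str,
                        values: List[Optional[float]],
                        thresholds: Dict[str, float] = DEFAULT_THRESHOLDS) -> Optional[str]:
    """Emit a missingness insight only when the user-set threshold is crossed,
    so tightening or loosening the slider directly changes what appears."""
    rate = sum(v is None for v in values) / len(values)
    if rate > thresholds["missing_rate"]:
        return f"{name}: {rate:.0%} missing (threshold {thresholds['missing_rate']:.0%})"
    return None
```

Because every insight reads from the same threshold dictionary, a single settings panel gives users system-wide control over what the tool flags.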
4/4 data scientists in the usability tests responded positively to histogram visualizations of each column. Allowing them to preview each column in this screen to validate errors noticed by the system should make data scientists more comfortable with the scores.
Users in this round of usability tests followed a similar thought process while validating automated insights. Their trust in the system increased progressively across the 3 stages of validation. Unlike in the previous rounds of usability tests, users were comfortable with the insights generated by the system.
Unlike in the previous rounds of usability testing, where it took a while for users to grasp what the visualization was conveying, users in this round were immediately able to get an overview of the data. All users further interacted with the visualization by clicking on the bars to test their understanding.
Across previous usability tests, certain users were skeptical about the insights because they were not familiar with the data visualization. However, 4/4 users in this round were comfortable with representing columns using bar graphs and, notably, gave the system higher trust ratings.
Out of the various stages in an ML pipeline, accelerating exploratory data analysis continues to be a challenging problem space, largely due to the numerous ways in which data scientists can approach a problem and choose to conduct their analysis. Many modern tools have tried to automate this analysis and, in the process, left humans out of the loop.
At the end of this project, I propose a human-driven framework that presents automated insights dynamically as the user drills down on a dataset. The structured organization of the system allows users to ease into a dataset without getting overwhelmed by the large number of columns. By designing overview visualizations that give users quick ideas about the dataset, the proposed framework gives direction to their exploration and provides them with various entry points into a dataset.
There remains more work to be done before modern automation-based tools can be introduced into the workflows of experts. Throughout my research, I’ve learned about the skepticism users feel while relying on results from processes not designed by them. Simply being transparent about the methods being used was not enough to gain their trust. As evidenced by the 3-step flow I designed, systems must accommodate various methods for users to further investigate the accuracy of insights.