I led this project under the mentorship of the Data Science group director at Nokia Bell Labs
Researched, designed and developed a prototype for data exploration in an autoML tool.
June 2022 - May 2023
Exploratory data analysis is a critical process in data science, as it lays the foundation for insights and models to be developed. However, this process can be time-consuming and cognitively overwhelming, with most data scientists spending more than half of their time looking for interesting patterns in datasets with 100+ columns.
The past few years have seen a large rise in DS/ML tools that aim to automate the ML pipeline. However, these tools fail to accelerate data analysis because they rely on completely automating the end-to-end pipeline, leaving humans out of the loop.
The human, curiosity-driven nature of knowledge discovery makes it difficult to automate. As a result, 70-80% of a data science project remains manual and cognitively overwhelming, leaving it up to users to connect the dots in extremely large datasets.
Despite the increasing trend of automating different parts of the DS lifecycle, and the growing usage of autoML tools such as VertexAI and AzureML, data exploration continues to be a painstakingly time-consuming and manual process. Because of the amount of creativity this phase requires, tools must be carefully designed to keep control in the hands of the human. At what points can automation be introduced to augment human creativity and curiosity?
Understand how data scientists would explore an unfamiliar tabular dataset for a classification task and their experience with existing autoML tools.
My research goals for Round 1 were to understand how users would explore an unfamiliar tabular dataset for a classification task and their experience with existing autoML tools. To explore this, I conducted 45-minute semi-structured interviews with 6 data scientists within a research team.
Some users relied heavily on modeling to understand the data, and were looking for tools that would give them a stronger grasp of it beforehand.
Users talked about feeling lost while studying large datasets, not knowing what to do with the data, and the danger of overlooking patterns.
Users described existing autoML tools as a “no-brainer” but also as “black boxes”.
Concept 1: Quality Fingerprint
Understanding the quality of data was something crucial to users in the initial steps. The system could reduce the load placed on users by automatically calculating quality based on commonly used criteria, and visualizing its varying levels in a single view.
Concept 2: Quality Fingerprint with columns
While the previous concept prioritized visualizing the entire dataset in a single view, this concept focused on giving users contextual information about the columns through column names.
It also looked to provide users with a more natural transition into the second level of granularity by substituting colors with values when zooming in.
At the second granularity level, my goal was to allow users to feel in control of the exploration process and further inspect various patterns.
Guiding users using automated insights
At this level, my goal was to help users navigate through the table using automated insights. The insights were designed to highlight anomalies in a dataset that often tap into the curiosity of data scientists during exploration.
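An insight generator of this kind can be sketched as a scan over columns that emits short messages when something noteworthy is found. The two checks below (high missingness, constant columns) and the 20% threshold are illustrative assumptions; the actual insight set was defined with the research team.

```python
from typing import Dict, List, Optional

def generate_insights(columns: Dict[str, List[Optional[float]]]) -> List[str]:
    """Scan columns and surface anomalies as short, human-readable messages."""
    insights = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_rate = 1 - len(present) / len(values) if values else 0.0
        if missing_rate > 0.2:  # assumed default threshold
            insights.append(f"{name}: {missing_rate:.0%} of values are missing")
        if present and len(set(present)) == 1:
            insights.append(f"{name}: column is constant ({present[0]})")
    return insights
```

Surfacing these messages next to the table gives users concrete starting points instead of an undifferentiated wall of columns.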
Visualizing key relationships
Data scientists at the Bell Labs research group have created algorithms to extract relationships from the data. This section houses them together, conveying important statistical features through network visualization diagrams.
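The Bell Labs relationship-extraction algorithms themselves are not public, so as a stand-in, the network's edges can be sketched from pairwise Pearson correlation: columns become nodes, and pairs whose correlation clears a threshold become edges.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def correlation_edges(columns: Dict[str, List[float]],
                      threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return (column_a, column_b, r) edges for strongly correlated pairs."""
    def pearson(xs: List[float], ys: List[float]) -> float:
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy) if vx and vy else 0.0

    edges = []
    for a, b in combinations(columns, 2):  # every unordered column pair
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges
```

Feeding these edges into a force-directed layout produces the network diagram described above.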
The final granularity level helped users study individual columns in isolation.
Understand how quickly users were able to navigate through the dataset at different levels of detail and how well the system supported their data exploration workflow.
The tasks given to users were:
T1, Level 1: Gather an overview of the data using the visualization and insights in Level 1.
T2, Level 2: Use the automated insights in Level 2 to study specific columns.
T3, Level 2: Understand relationships between columns through automated visualizations.
T4, Level 2: Filter to generate automated insights for a subset of columns.
T5, Level 3: Study columns individually and filter values using the interactive visualization.
Key Feedback
The scenarios I created journey maps for are:
Considering the high cognitive load placed on users while studying tabular data, designing comprehensive visualizations to ease users into a dataset was one of my primary goals. To benchmark design concepts, I created a checklist of principles: 3 based on E. Tufte's guidelines and 3 drawn from user feedback.
Initial concept: Encoding quality in an interactive heatmap
Final Concept: Connecting views using a brushing & linking interaction
An example of brushing & linking using D3.js
By creating 3 responsive views, the interaction adds another layer of visualization before users dive deeper into specific columns.
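The prototype implements this with D3's brush events, which depend on the DOM, but the underlying linking logic can be sketched independently of any rendering library. The class below is an illustrative stand-in: a brush in one view updates a shared selection, and every registered view re-renders from it.

```python
from typing import Callable, Dict, List

class LinkedViews:
    """Minimal brushing & linking model for coordinating multiple views."""
    def __init__(self, rows: List[Dict]):
        self.rows = rows
        self.selection = rows  # no brush yet: everything is shown
        self.views: List[Callable[[List[Dict]], None]] = []

    def register(self, render: Callable[[List[Dict]], None]) -> None:
        """Add a view; it will be re-rendered on every brush."""
        self.views.append(render)

    def brush(self, column: str, low: float, high: float) -> None:
        """Keep rows whose value in `column` lies in the brushed range,
        then propagate the new selection to every linked view."""
        self.selection = [r for r in self.rows if low <= r[column] <= high]
        for render in self.views:
            render(self.selection)
```

Registering the three responsive views against one shared selection is what makes a brush in any view instantly update the other two.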
I had 20-minute short interview sessions with 2 visualization experts at Georgia Tech and 2 senior data scientists.
While I received positive feedback on the concepts, visualization experts suggested using less space for the main visualization and thinking about how users would decipher the column names when 50+ data points are involved.
Having laid down the foundation of the tool in the previous design phase, the goal of this design phase was to further refine the system based on feedback from usability tests and interviews.
Users did not find the column-level visualizations very helpful in the previous design. Rather than viewing the distribution of individual columns, they preferred to plot relationships between various columns.
Similarly, users mentioned they would prefer to see data that is more descriptive of the dataset before understanding its quality.
A key finding from usability tests was that users were not willing to trust the visualization without seeing quantitative information. Will this design change be enough for users to trust the overview visualization?
During the usability tests I discovered that visualizing relationships between columns was more critical to users' sensemaking process than studying a single column in isolation. For the final granularity level, my goal was to allow users to visualize relationships between columns without having to write code.
2 users voiced confusion on their first impression of the chart. However, once they spent a few minutes with the view, they were able to grasp the visualization.
3 out of the 4 users seemed skeptical about the visualization. One user even counted the dots in the visualization to make sure all columns were correctly being represented!
4/4 users in the usability tests immediately wanted to know more about how the insights were being calculated.
2/4 users in the usability tests wanted to be able to play around with the thresholds. This also aligned with a key finding from the exploratory interviews where 2/4 users mentioned that their trust in the system overlapped with how much control they had.
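That desire for control translates into a simple design move: expose the insight thresholds as parameters users can override instead of hard-coding them. The specific defaults and insight names below are illustrative assumptions, not the system's actual values.

```python
from typing import Dict, List, Optional

# Defaults the user can override from the UI (illustrative values).
DEFAULT_THRESHOLDS = {"missing_rate": 0.2, "outlier_zscore": 3.0}

def missingness_insight(name: str,
                        values: List[Optional[float]],
                        thresholds: Dict[str, float] = DEFAULT_THRESHOLDS) -> Optional[str]:
    """Emit a missingness insight only when the user-set threshold is crossed,
    so tightening or loosening the slider directly changes what appears."""
    rate = sum(v is None for v in values) / len(values)
    if rate > thresholds["missing_rate"]:
        return f"{name}: {rate:.0%} missing (threshold {thresholds['missing_rate']:.0%})"
    return None
```

Because every insight reads from the same threshold dictionary, a single settings panel gives users system-wide control over what the tool flags.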
4/4 data scientists in the usability tests responded positively to histogram visualizations of each column. Allowing them to preview each column in this screen to validate errors noticed by the system should make data scientists more comfortable with the scores.
Users in this round of usability tests followed a similar thought process while validating automated insights. Their trust in the system increased progressively across the 3 stages of validation. Unlike in the previous rounds of usability tests, users were comfortable with the insights generated by the system.
Unlike in the previous rounds of usability testing, where it took a while for users to grasp what the visualization was conveying, users in this round were immediately able to get an overview of the data. All users further interacted with the visualization by clicking on the bars to test their understanding.
Across previous usability tests, certain users were skeptical about the insights because they were not familiar with the data visualization. However, 4/4 users in this round were comfortable with representing columns using bar graphs and, notably, gave the system higher trust ratings.
Out of the various stages in an ML pipeline, accelerating exploratory data analysis continues to be a challenging problem space, largely due to the numerous ways in which data scientists can approach a problem and choose to conduct their analysis. Many modern tools have tried to automate this analysis and, in the process, left humans out of the loop.
At the end of this project, I propose a human-driven framework that presents automated insights dynamically as the user drills down on a dataset. The structured organization of the system allows users to ease into a dataset without getting overwhelmed by the large number of columns. By designing overview visualizations that give users quick ideas about the dataset, the proposed framework gives direction to their exploration and provides them with various entry points into a dataset.
There remains more work to be done before modern automation-based tools can be introduced into the workflows of experts. Throughout my research, I’ve learned about the skepticism users feel while relying on results from processes not designed by them. Simply being transparent about the methods being used was not enough to gain their trust. As evidenced by the 3-step flow I designed, systems must accommodate various methods for users to further investigate the accuracy of insights.