Supercharging data analysis: How “The Box” is transforming Kickstarter research

Unleashing D'Amore-McKim potential with the DASH_Box

At Northeastern University, AI research thrives on cutting-edge computing power. With over 50,000 CPU cores and 500+ GPUs in the Discovery compute cluster, researchers have access to immense resources. But here's the thing: sometimes, you don't need all that firepower to get started.

That's where the DASH_Box comes in. This local computing solution is designed for usability and puts powerful AI capabilities at the fingertips of D'Amore-McKim researchers. Whether they're fine-tuning models, running experiments, or testing new ideas, the DASH_Box makes high-performance computing faster, more accessible, and seamlessly integrated into their workflow. It's not about replacing large-scale infrastructure; it's about making AI research more efficient, one breakthrough at a time.

Key applications of “The Box” in Debashish Ghose's research

In empirical research, cleaning and processing large-scale datasets can be time-consuming, especially when the raw data is unstructured and comes from platforms like Kickstarter. Debashish Ghose's research analyzes Kickstarter project data to uncover trends and patterns. However, working with more than 25 GB of raw data presented significant computational challenges. Before he started using The Box, his local machine ran into memory (RAM) limitations, forcing him to spend valuable time modifying scripts to work around hardware constraints.
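To illustrate the kind of script change that memory limits can force, here is a minimal sketch of streaming records one at a time instead of loading a whole file into RAM. It is not Ghose's code: the newline-delimited JSON layout, the `category` field, and the `count_by_field` helper are all assumptions made for illustration.

```python
import io
import json
from collections import Counter


def count_by_field(lines, field):
    """Tally one field across records, reading a single line at a time.

    `lines` is any iterable of text lines, such as an open file handle,
    so memory use stays flat regardless of file size.
    """
    counts = Counter()
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # count unparseable rows instead of crashing
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        counts[record.get(field, "unknown")] += 1
    return counts, skipped


# Tiny in-memory stand-in for a large file (hypothetical data).
sample = io.StringIO(
    '{"category": "games"}\n'
    '{"category": "film"}\n'
    '{"category": "games"}\n'
    'not json\n'
)
counts, skipped = count_by_field(sample, "category")
print(counts, skipped)
```

In practice the same function could be given an open file handle in place of the `io.StringIO` stand-in. Workarounds like this keep a modest machine running, but they take time to write and maintain.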

The Box has empowered Ghose to efficiently manage and process large-scale web-scraped data, performing tasks such as:

Cleaning and compiling panel datasets from raw data
Parsing JSON-like text fields that are incorrectly formatted
Inferring gender from project creator names using AI models
Fixing location data errors to improve dataset accuracy
Measuring linguistic markers, such as “confidence” and “sentiment,” from short text inputs

These tasks, which once required intensive manual intervention, are now automated and streamlined, allowing Ghose to focus on higher-level analysis and interpretation.
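As a rough illustration of the second task in the list, the sketch below tries progressively more lenient parsers on a JSON-like text field and flags rows that still fail rather than silently dropping them. This is a plain rule-based fallback, not the LLM-based correction described in this article, and the `parse_json_like` helper and sample strings are hypothetical.

```python
import ast
import json


def parse_json_like(text):
    """Return (value, method) for a JSON-like string.

    Tries strict JSON first, then Python literal syntax, which tolerates
    single quotes and trailing commas. Returns (None, "failed") when
    neither parser accepts the text, so the row can be flagged for
    later repair.
    """
    try:
        return json.loads(text), "json"
    except (json.JSONDecodeError, TypeError):
        pass
    try:
        return ast.literal_eval(text), "literal_eval"
    except (ValueError, SyntaxError):
        return None, "failed"


# Hypothetical examples of the formatting problems described above.
samples = [
    '{"name": "Ann", "city": "Boston"}',   # valid JSON
    "{'name': 'Ann', 'city': 'Boston'}",   # single quotes
    '{"name": "Ann", }',                   # trailing comma
    '{"name": "Ann"',                      # truncated
]
results = [parse_json_like(s) for s in samples]
for value, method in results:
    print(method, value)
```

Rows that come back as `"failed"` are the ones a hardcoded pipeline would typically discard; keeping them flagged leaves room for a later correction step.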

One of the most exciting aspects for Ghose is the integration of large language models (LLMs) in his ETL (Extract, Transform, Load) pipeline. With The Box's powerful GPU capabilities, he can now:

Use LLMs to correct and refine raw data, improving the quality of his datasets
Analyze linguistic markers to assess the impact of confidence in language on project outcomes
Enhance data accuracy by incorporating AI-based error correction, replacing older tools like LIWC, which were prone to errors when analyzing short text inputs

Measuring impact: Bigger, better datasets with AI correction

The impact of The Box is already evident in the quality and scale of Ghose's datasets. Previously, a hardcoded approach to cleaning the data and inferring gender from names forced him to drop two-thirds of his raw data because of minor errors in spelling, formatting, and JSON structures. “Using the LLM to ‘correct’ the data before inference is likely to generate a better, bigger dataset. The Box is helping me retain and analyze more data, which improves the quality of my findings,” he says.

He says a key factor in his success was the support from the DASH team. They modified his existing R code and provided a Python script that allowed him to run processes directly from his local machine.