Entity Matching at scale

Louis de Bruijn | Jan. 29, 2023 | #entity resolution #record linkage #deduplication #spark

Introduction

This application provides an interactive interface for training a Spark Matcher model through its active learning component and for predicting on unseen data. It is served as a self-contained FastAPI application, with endpoints available at api.louisdebruijn.com.

Spark Matcher is a scalable entity matching algorithm implemented in PySpark. With Spark Matcher, the user can train an algorithm to solve a custom matching problem. It uses active learning (modAL) to train a classifier (scikit-learn) that decides whether two records refer to the same entity. Comparing every record with every other record grows quadratically (N^2) with table size, so blocking is used to reduce the number of candidate pairs. Because the implementation is in PySpark, Spark Matcher can handle very large tables.
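To make the effect of blocking concrete, here is a minimal sketch in plain Python. It is not Spark Matcher's own code: the records, the first_letter rule, and the helper names are made up for illustration, and a PySpark implementation would more likely express the same idea as a join on the blocking key.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of record indices: n * (n - 1) / 2 comparisons."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, blocking_key):
    """Only pair records whose blocking key is identical."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def first_letter(record):
    # Hypothetical blocking rule: first character of the lower-cased name.
    return record["name"].strip().lower()[:1]


# Toy data, invented for this example.
records = [
    {"name": "Anna Jansen"},
    {"name": "anna janssen"},
    {"name": "Bob de Vries"},
    {"name": "B. de Vries"},
    {"name": "Carla Smit"},
]

candidate_pairs = blocked_pairs(records, first_letter)
print(len(all_pairs(records)), "pairs without blocking")
print(len(candidate_pairs), "pairs with blocking")
```

The trade-off is that a blocking rule that is too strict can separate true matches into different blocks, so they are never compared at all.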

Implementation

FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard Python type hints. The API exposes the following endpoints:

/dataset/upload: upload datasets for training or prediction
/dataset: choose dataset columns
/model/train: train with Spark Matcher active learning
/model/fit: fit your model
/model/predict: predict on unseen data
/predictions/download: download predictions

Train

https://api.louisdebruijn.com/api/v1/dataset/

Upload a dataset

https://api.louisdebruijn.com/api/v1/dataset/upload/{uid}

A .csv file with the column names on the first line.
Columns are separated by a delimiter such as a comma.
Uploading overwrites any previously uploaded dataset.
The file size limit is 200 MB.
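The requirements above can be checked locally before uploading. The following is a rough sketch using only the standard library; check_csv and upload_url are hypothetical helpers, not part of the API, and the sketch assumes UTF-8 input, that "200mb" means 200 * 1024 * 1024 bytes, and that {uid} is an identifier the caller already has.

```python
import csv
import io

# Assumption: the documented limit is interpreted as 200 MiB.
MAX_UPLOAD_BYTES = 200 * 1024 * 1024
BASE_URL = "https://api.louisdebruijn.com/api/v1"


def upload_url(uid):
    """Fill the {uid} placeholder of the upload endpoint."""
    return f"{BASE_URL}/dataset/upload/{uid}"


def check_csv(raw, delimiter=","):
    """Return the header column names, or raise ValueError if a rule is broken."""
    if len(raw) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the upload size limit")
    text = raw.decode("utf-8")  # assumption: the file is UTF-8 encoded
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if not header or any(not name.strip() for name in header):
        raise ValueError("first line must contain the column names")
    for line_no, row in enumerate(reader, start=2):
        # Blank lines come back as empty lists and are skipped.
        if row and len(row) != len(header):
            raise ValueError(f"row {line_no} has {len(row)} fields, expected {len(header)}")
    return [name.strip() for name in header]


sample = b"name,city\nAnna Jansen,Utrecht\nBob de Vries,Delft\n"
columns = check_csv(sample)
print(columns)
print(upload_url("abc123"))  # "abc123" is a made-up uid
```

How the file is actually sent (for example, the form field name of the request) is not specified here, so the sketch stops at validation and URL construction.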
References