Entity Matching at scale

Louis de Bruijn | Jan. 29, 2023 | #entity resolution #record linkage #deduplication #spark

Introduction

This application provides an interactive interface for training a Spark Matcher model through its active learning component and for predicting on unseen data. It is served as a self-contained FastAPI application, with endpoints accessible via api.louisdebruijn.com.

Spark Matcher is a scalable entity matching algorithm implemented in PySpark that lets the user train a model for a custom matching problem. It uses active learning (modAL) to train a classifier (scikit-learn) that decides whether two records refer to the same entity. Comparing every record with every other record scales quadratically (N^2) with table size, so blocking is used to reduce the number of candidate pairs. Because the implementation is in PySpark, Spark Matcher can handle very large tables.
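To make the effect of blocking concrete, here is a minimal pure-Python sketch. It is not Spark Matcher's implementation: the function names, the toy records and the blocking key (the first letter of the name) are assumptions chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of records: n * (n - 1) / 2 comparisons."""
    return list(combinations(records, 2))


def blocked_pairs(records, block_key):
    """Only pair records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def first_letter(name):
    """Hypothetical blocking rule: group names by their first character."""
    return name[0]


records = [
    "anna smith", "anna smyth",
    "bob jones", "bob jonas",
    "carl young", "carla young",
]

print(len(all_pairs(records)))                    # 15 pairs without blocking
print(len(blocked_pairs(records, first_letter)))  # 3 pairs with blocking
```

The trade-off is that two records which land in different blocks are never compared, so the blocking rule has to be loose enough to keep true matches together.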

Implementation

FastAPI is a modern, high-performance web framework for building APIs with Python 3.7+ based on standard Python type hints. This API provides the following endpoints:

`/dataset/upload`	upload datasets for training or prediction
`/dataset`	choose dataset columns
`/model/train`	train with Spark Matcher active learning
`/model/fit`	fit your model
`/model/predict`	predict on unseen data
`/predictions/download`	download predictions
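As a rough sketch of how a client might address these endpoints, the helper below builds full URLs from the paths in the table. The `/api/v1` prefix is taken from the URLs shown in this post. The step names, the `endpoint_url` helper and the idea that a `uid` can be appended to any path are assumptions for illustration; only the upload URL is shown with a `{uid}` here, and the HTTP methods are not specified.

```python
from urllib.parse import quote

# Prefix as it appears in the URLs of this post.
BASE_URL = "https://api.louisdebruijn.com/api/v1"

# Paths copied from the endpoint table; the keys are hypothetical labels.
ENDPOINTS = {
    "upload": "/dataset/upload",
    "columns": "/dataset",
    "train": "/model/train",
    "fit": "/model/fit",
    "predict": "/model/predict",
    "download": "/predictions/download",
}


def endpoint_url(step, uid=None):
    """Return the full URL for a step, optionally with a trailing uid segment."""
    url = BASE_URL + ENDPOINTS[step]
    if uid is not None:
        # Percent-encode the uid so it is safe as a single path segment.
        url += "/" + quote(str(uid), safe="")
    return url


print(endpoint_url("upload", "my-session"))
print(endpoint_url("predict"))
```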

Train

https://api.louisdebruijn.com/api/v1/dataset/

Choose a dataset

Choose a purpose

Choose the data columns

Upload a dataset

https://api.louisdebruijn.com/api/v1/dataset/upload/{uid}

.csv file with column names on the first line

Delimiter: the column separator, such as a comma

Choose a purpose

Uploading overwrites the previously uploaded dataset.

File size limit is 200 MB.
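The upload rules above could be checked before sending a file with something like the following sketch. It is not the server's validation code: the `read_header` helper is hypothetical, UTF-8 encoding is assumed, and the limit is interpreted as 200 * 1024 * 1024 bytes, which may differ from how the API counts it.

```python
import csv
import io
from typing import List

# Assumption: the 200 MB limit is counted in binary megabytes.
MAX_UPLOAD_BYTES = 200 * 1024 * 1024


def read_header(raw: bytes, delimiter: str = ",") -> List[str]:
    """Check the size limit and return the column names from the first line."""
    if len(raw) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the upload size limit")
    # "utf-8-sig" also accepts files that start with a byte order mark.
    text = raw.decode("utf-8-sig")
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if not header or any(not name.strip() for name in header):
        raise ValueError("first line must contain the column names")
    return [name.strip() for name in header]


sample = b"name;city\nAnna;Utrecht\nBob;Delft\n"
print(read_header(sample, delimiter=";"))  # ['name', 'city']
```

Passing the delimiter explicitly mirrors the upload form, where the separator is chosen by the user instead of being guessed from the file.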
