This application provides an interactive interface for training a Spark Matcher model through its active learning component and for predicting on unseen data. It is served by a self-contained FastAPI application, with endpoints accessible via api.louisdebruijn.com.
Spark Matcher is a scalable entity matching algorithm implemented in PySpark. It lets the user train a model for a custom matching problem: active learning (modAL) is used to train a classifier (scikit-learn) that decides whether two records refer to the same entity. Comparing every record with every other record scales quadratically (N^2) with table size, so blocking is used to reduce the number of candidate pairs. Because the implementation is in PySpark, Spark Matcher can handle very large tables.
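To illustrate why blocking helps with the N^2 pair count, here is a minimal plain-Python sketch. It is not Spark Matcher's implementation: the helper names (`all_pairs`, `blocked_pairs`), the toy records, and the first-letter blocking rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of record indices: N * (N - 1) / 2 comparisons."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, block_key):
    """Only pair records that share the same blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[block_key(record)].append(idx)

    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical toy data; a real table would live in a Spark DataFrame.
names = [
    "anna smith",
    "ana smith",
    "bob jones",
    "bobby jones",
    "carl young",
    "anne smyth",
]


def first_letter(name):
    """Assumed blocking rule: records must share their first character."""
    return name[0]


print(len(all_pairs(names)))                    # 15 candidate pairs
print(len(blocked_pairs(names, first_letter)))  # 4 candidate pairs
```

The trade-off is that a blocking rule can discard true matches whose keys differ, so the rule has to be loose enough to keep likely matches in the same block.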
FastAPI is a modern, high-performance web framework for building APIs with Python 3.7+, based on standard Python type hints. This API provides the following endpoints:
| Endpoint | Description |
| --- | --- |
| /dataset/upload | upload datasets for training or prediction |
| /dataset | choose dataset columns |
| /model/train | train with Spark-matcher active learning |
| /model/fit | fit your model |
| /model/predict | predict on unseen data |
| /predictions/download | download predictions |
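The endpoints above appear to form a workflow that runs from upload to download. The following sketch builds (but does not send) one request per step using only the standard library. The `https` scheme, the HTTP methods, and the names `WORKFLOW` and `build_requests` are assumptions; the source lists only the paths, and the real request bodies and parameters are not documented here.

```python
from urllib.parse import urljoin
from urllib.request import Request

# Assumption: the API is reachable over HTTPS at this host.
BASE_URL = "https://api.louisdebruijn.com"

# Assumption: HTTP methods are guesses; only the paths come from the endpoint list.
WORKFLOW = [
    ("POST", "/dataset/upload"),       # upload datasets for training or prediction
    ("POST", "/dataset"),              # choose dataset columns
    ("POST", "/model/train"),          # train with active learning
    ("POST", "/model/fit"),            # fit the model
    ("POST", "/model/predict"),        # predict on unseen data
    ("GET", "/predictions/download"),  # download predictions
]


def build_requests(base_url=BASE_URL):
    """Return unsent Request objects, one per workflow step, in order."""
    return [
        Request(urljoin(base_url, path), method=method)
        for method, path in WORKFLOW
    ]


for request in build_requests():
    print(request.get_method(), request.full_url)
```

Nothing is sent over the network here. Passing a request to `urllib.request.urlopen` would execute it once the actual methods and payloads are confirmed against the running API.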