In addition to a full slate of presentations and panel discussions, the first day of the Ascend New York conference boasted an exciting new event for WorkFusion partners and RPA developers: a one-day hackathon offering a cash prize of $2,000 to the winning team.
Event: WorkFusion’s First Hackathon
Challenge: Build an information extraction model (sample documents: invoices) within exactly 8 hours
Participants: 6 teams representing WorkFusion partners: EPAM (5 members), Cognizant (2), Capgemini (2), EY (3), HCL (2), and Infosys (2)
Details of the challenge were announced at 9 a.m. that day: Build an information extraction model that can extract eight fields from unstructured documents (e.g., invoices).
Only a small amount of advance preparation was allowed, said Hackathon Head Judge and WorkFusion Lead Data Scientist Artsiom Strok.
“We told them we were going to build a model, but not the type, and that everyone would be using the WorkFusion AutoML SDK, along with the related software requirements,” Artsiom said. “We gave one week’s notice so they could try the SDK on their machines and learn about it, and we sent everyone the same training videos.”
The “AutoML SDK” he mentions is WorkFusion’s machine learning software development kit. It allows any developer — such as our Hackathon competitors — to customize ML models and train those models on any given set of data. This provides customers with self-service AI tools for their existing development teams to use — without the need to hire ML engineers or data scientists.
To set up the contest, hackathon organizers split the data set into three parts: a training set, a validation set, and a test set. Every team had access to the training and validation sets, but only the judges had access to the test set.
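As a rough illustration of that setup, a three-way split might be produced along these lines (the split ratios and function name are assumptions for the sketch, not the organizers' actual tooling):

```python
import random

def three_way_split(documents, train_frac=0.7, valid_frac=0.15, seed=42):
    """Shuffle documents, then partition into train/validation/test sets."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_train = int(len(docs) * train_frac)
    n_valid = int(len(docs) * valid_frac)
    train = docs[:n_train]
    valid = docs[n_train:n_train + n_valid]
    test = docs[n_train + n_valid:]  # held back: only the judges see this
    return train, valid, test

# Usage: train, valid, test = three_way_split(invoice_documents)
```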
The teams’ goal was to extract eight fields, each of a different type: a unique identifier, a date, a price, an email address, and so on. Three of the eight were multi-value fields, for which competitors had to extract multiple values and group them correctly, such as the line items in a table.
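To make the target concrete, an extraction result for a single invoice might look something like the following (the field names and values are hypothetical, not the actual hackathon schema):

```python
# Hypothetical extraction result for one invoice (field names are
# illustrative, not the actual hackathon schema).
extraction = {
    "invoice_number": "INV-10293",           # unique identifier
    "invoice_date": "2019-10-15",            # date, normalized to ISO 8601
    "total_amount": "1250.00",               # price
    "contact_email": "billing@example.com",  # email address
    # A multi-value field: every line item must be extracted AND grouped,
    # so each description stays paired with its quantity and price.
    "line_items": [
        {"description": "Widget A", "quantity": "10", "unit_price": "25.00"},
        {"description": "Widget B", "quantity": "40", "unit_price": "25.00"},
    ],
}
```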
Teams were given eight hours to solve this problem and were permitted an unlimited number of attempts to submit their bots. One eager team submitted its first model for assessment within about an hour. The standings changed often over the course of the day, with each team taking the lead at various points.
The teams’ approaches had both similarities and differences, Artsiom said. Strategies differed in scope: some teams worked field by field, while others built baselines across all fields at once. But every team followed the same broad arc: launch a generic out-of-the-box model to establish a baseline, then add normalization and post-processing (including OCR correction) to clean up and format values such as dates and numbers. From there, they all worked on data cleaning and feature engineering to improve results, and they built custom annotators to give their models domain-specific knowledge. A sketch of one such post-processing step follows.
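None of the teams' code was published, but a date-normalization step of the kind described, including correction of common OCR character confusions, might look roughly like this minimal Python sketch (the substitution table, accepted formats, and function name are all illustrative assumptions):

```python
from datetime import datetime

# Character substitutions for common OCR confusions in digit-heavy fields
# (an illustrative table, not any team's actual rules).
OCR_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# Date layouts we are willing to accept from the raw extraction.
DATE_FORMATS = ("%m/%d/%Y", "%d-%m-%Y", "%B %d, %Y", "%Y-%m-%d")

def normalize_date(raw: str) -> str | None:
    """Normalize an extracted date to ISO 8601, retrying with OCR fixes."""
    # Try the raw string first so month names like "October" are not mangled.
    for candidate in (raw.strip(), raw.strip().translate(OCR_FIXES)):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(candidate, fmt).date().isoformat()
            except ValueError:
                continue
    return None  # leave unparseable values for manual review

print(normalize_date("1O/15/2O19"))  # -> 2019-10-15
```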
In the end, only one team could win, and the judges used a straightforward, objective metric: a single F1 score computed across all fields. Artsiom pointed out that the final tally was extremely close, with scores ranging from 0.837 to 0.9179, and second place finished just 0.0058 points behind the winner!
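For readers unfamiliar with the metric: F1 is the harmonic mean of precision and recall, so it rewards submissions that extract field values both completely (few misses) and accurately (few spurious extractions). A quick sketch with made-up counts:

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall over extracted field values."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 890 values extracted correctly, 60 spurious, 95 missed.
print(round(f1_score(890, 60, 95), 4))  # -> 0.9199
```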
Congratulations to all for a great showing and a very exciting day!