What is WebSRC?

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no.

News

Aug 27, 2022: To access the test set, see here.

Sept 1, 2021: The full dataset and baseline are available in: Extraction code: 2mys. Download from Amazon.com.

Aug 25, 2021: Our paper was accepted by EMNLP 2021; the updated version of the dataset will be available soon. The baseline is available on WebSRC-Baseline.

Contact Us

If you have any questions about this dataset, please contact chenlusz@sjtu.edu.cn or galaxychen@sjtu.edu.cn.

Leaderboard

Three metrics are used to evaluate models on the WebSRC test set: Exact Match (EM), F1 score, and Path Overlap Score (POS). Please refer to the paper for details of the evaluation metrics.
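To make the three metrics concrete, here is a minimal Python sketch. It is not the official evaluation script. It assumes EM and F1 follow the common SQuAD-style convention (lowercasing, stripping punctuation and English articles before comparison) and that POS is the overlap between the set of tags on the path from the root to the predicted answer's tag and the corresponding set for the ground-truth answer. The function names and the use of integer tag ids are hypothetical, chosen for illustration only; the paper remains the authoritative definition.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the official script may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def path_overlap(pred_path, true_path) -> float:
    """Overlap of two root-to-answer tag paths (intersection over union).

    Assumption: each path is a sequence of unique tag ids (hypothetical
    integers here), ordered from the root to the tag containing the answer.
    """
    pred, true = set(pred_path), set(true_path)
    if not pred and not true:
        return 1.0
    return len(pred & true) / len(pred | true)


if __name__ == "__main__":
    print(exact_match("The Price", "price"))
    print(f1_score("red apple", "apple"))
    print(path_overlap([1, 2, 3, 7], [1, 2, 3, 9]))
```

These functions return values in [0, 1] for a single question; leaderboard-style numbers would be the average over all test questions, multiplied by 100.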

Rank  Date           Model                             Organization                   Reference                EM     F1     POS
1     Sep 4, 2023    SageGPT-small-v0.2                4paradigm.Inc                  -                        89.11  92.15  N/A
2     Jan 12, 2024   ScreenAI 5B                       Google Research                (Baechler et al.)        84.02  87.24  N/A
3     Oct 12, 2022   DocPrompt (ErnieLayout-Large)     BAIDU-Document Intelligence    (Wu et al.)              77.35  85.04  N/A
4     Mar 01, 2022   TIE (MarkupLM-Large)              Shanghai Jiao Tong University  (Zhao et al., NAACL'22)  76.25  80.51  89.50
5     Mar 01, 2022   TIE (MarkupLM-Large)-3_seeds_ave  Shanghai Jiao Tong University  (Zhao et al., NAACL'22)  75.87  80.19  89.73
6     Nov 30, 2021   MarkupLM-Large                    MSRA+SJTU                      (Li et al., ACL'22)      69.87  77.94  88.09
7     Nov 30, 2021   MarkupLM-Large (3_seeds_ave)      MSRA+SJTU                      (Li et al., ACL'22)      69.09  76.45  87.24
8     Sept 01, 2021  V-PLM (ELECTRA)                   Shanghai Jiao Tong University  (Chen et al., EMNLP'21)  68.07  75.25  84.96
9     Sept 01, 2021  V-PLM (BERT)                      Shanghai Jiao Tong University  (Chen et al., EMNLP'21)  54.84  62.80  76.39