WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no.
News
Aug 27, 2022: To access the test set, see here.
Sept 1, 2021: The full dataset and baseline are available in: Extraction code: 2mys. Download from Amazon.com.
Aug 25, 2021: Our paper has been accepted by EMNLP 2021; the updated version of the dataset will be available soon. The baseline is available on WebSRC-Baseline.
Contact Us
If you have any questions about this dataset, please contact chenlusz@sjtu.edu.cn or galaxychen@sjtu.edu.cn.
Three metrics, Exact Match (EM), F1 score, and Path Overlap Score (POS), are used to evaluate models on the WebSRC test set. Please refer to the paper for more details about the evaluation metrics.
Rank | Date | Model | Affiliation | EM | F1 | POS |
---|---|---|---|---|---|---|
1 | Sep 4, 2023 | SageGPT-small-v0.2 | 4paradigm.Inc | 89.11 | 92.15 | N/A |
2 | Jan 12, 2024 | ScreenAI 5B | Google Research (Baechler et al.) | 84.02 | 87.24 | N/A |
3 | Oct 12, 2022 | DocPrompt (ErnieLayout-Large) | BAIDU-Document Intelligence (Wu et al.) | 77.35 | 85.04 | N/A |
4 | Mar 01, 2022 | TIE (MarkupLM-Large) | Shanghai Jiao Tong University (Zhao et al., NAACL'22) | 76.25 | 80.51 | 89.50 |
5 | Mar 01, 2022 | TIE (MarkupLM-Large)-3_seeds_ave | Shanghai Jiao Tong University (Zhao et al., NAACL'22) | 75.87 | 80.19 | 89.73 |
6 | Nov 30, 2021 | MarkupLM-Large | MSRA+SJTU (Li et al., ACL'22) | 69.87 | 77.94 | 88.09 |
7 | Nov 30, 2021 | MarkupLM-Large (3_seeds_ave) | MSRA+SJTU (Li et al., ACL'22) | 69.09 | 76.45 | 87.24 |
8 | Sept 01, 2021 | V-PLM (ELECTRA) | Shanghai Jiao Tong University (Chen et al., EMNLP'21) | 68.07 | 75.25 | 84.96 |
9 | Sept 01, 2021 | V-PLM (BERT) | Shanghai Jiao Tong University (Chen et al., EMNLP'21) | 54.84 | 62.80 | 76.39 |
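
For readers unfamiliar with the three leaderboard metrics, the following is a minimal sketch of how they are commonly computed for a single question. It is not the official WebSRC evaluation script. The SQuAD-style answer normalization is an assumption, as is modelling POS as the overlap between the sets of DOM elements on the root-to-answer paths. The function names and the integer element ids are hypothetical. Refer to the paper and the WebSRC-Baseline code for the authoritative definitions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the official script may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def path_overlap(pred_path, true_path) -> float:
    """Overlap (intersection over union) of two root-to-answer DOM paths.

    Each path is a sequence of hypothetical element ids, from the root
    element down to the deepest element containing the answer.
    """
    pred, true = set(pred_path), set(true_path)
    if not pred and not true:
        return 1.0
    return len(pred & true) / len(pred | true)


if __name__ == "__main__":
    print(exact_match("The Price", "price"))
    print(f1_score("16 GB RAM", "16 GB"))
    print(path_overlap([1, 2, 5, 9], [1, 2, 5, 7]))
```

Under these assumptions, a leaderboard score would be the mean of the per-question value over the test set, multiplied by 100.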