List of datasets used in state-of-the-art techniques
Quora released a new dataset in January 2017. The dataset consists of over 400K potential duplicate question pairs.
The initial corpus contains 51,524 human-annotated sentence pairs: 42,200 for training and 9,324 for testing. The authors have released data collected over one year, which consists of 2,869,657 candidate pairs.
Microsoft Research Paraphrase Corpus.
This dataset contains 5,801 pairs of sentences, with 4,076 for training and the remaining 1,725 for testing. The training set contains 2,753 true paraphrase pairs and 1,323 false paraphrase pairs; the test set contains 1,147 and 578 pairs, respectively.
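Because the two classes are not balanced, it can help to check the share of true paraphrases in each split before interpreting accuracy figures. The following is a minimal sketch that only re-uses the counts quoted above; the names `SPLITS` and `positive_rate` are illustrative and not part of any official tooling.

```python
# Pair counts as reported above for the Microsoft Research Paraphrase Corpus.
SPLITS = {
    "train": {"paraphrase": 2753, "non_paraphrase": 1323},
    "test": {"paraphrase": 1147, "non_paraphrase": 578},
}


def positive_rate(counts):
    """Return the fraction of pairs in a split labelled as true paraphrases."""
    total = counts["paraphrase"] + counts["non_paraphrase"]
    return counts["paraphrase"] / total


for name, counts in SPLITS.items():
    total = sum(counts.values())
    # A classifier that always answers "paraphrase" would reach this accuracy.
    print(f"{name}: {total} pairs, {positive_rate(counts):.1%} paraphrases")
```

Roughly two thirds of the pairs in each split are true paraphrases, so a system that always predicts "paraphrase" already scores about that well on accuracy.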
The training set contains 5,000 true paraphrase pairs and 5,000 false paraphrase pairs; the test set contains 1,500 and 1,500 pairs, respectively. The test collection from the PAN 2010 plagiarism detection competition was used to generate the sentence-level PAN dataset. The PAN 2010 dataset consists of 41,233 text documents from Project Gutenberg in which 94,202 cases of plagiarism have been inserted. The plagiarism was created either by using an algorithm or by explicitly asking Turkers to paraphrase passages from the original text. Only the human-created plagiarism instances were used here.
To generate the sentence-level PAN dataset, a heuristic alignment algorithm was used to find corresponding pairs of sentences within a passage pair linked by the plagiarism relationship. The alignment algorithm used only bag-of-words overlap and length ratios, and no MT metrics. For negative evidence, sentences were sampled from the same document, and sentence pairs with at least 4 content words in common were extracted. Then, from the positive and negative evidence files, a training set of 10,000 sentence pairs and a test set of 3,000 sentence pairs were created through random sampling.
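The negative-evidence filter described above (keep a sentence pair only if it shares at least 4 content words) can be sketched as follows. This is an illustration under stated assumptions, not the dataset authors' code: the text does not say how content words were identified, so the small `STOPWORDS` list, the regex tokenizer, and the function names are all hypothetical choices.

```python
import re

# Assumption: a tiny illustrative stopword list. The source does not specify
# how "content words" were determined when the dataset was built.
STOPWORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "for", "with", "by", "is", "are", "was", "were", "be", "been", "it",
    "this", "that", "as", "he", "she", "they", "his", "her", "not",
})


def content_words(sentence):
    """Lower-case the sentence, tokenize it, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {token for token in tokens if token not in STOPWORDS}


def shared_content_words(first, second):
    """Count the distinct content words that occur in both sentences."""
    return len(content_words(first) & content_words(second))


def is_negative_candidate(first, second, min_shared=4):
    """Apply the 'at least 4 content words in common' rule from the text."""
    return shared_content_words(first, second) >= min_shared


# Made-up sentences, used only to exercise the functions.
first = "The old captain walked slowly along the harbour wall at dawn."
second = "At dawn the captain left the harbour and walked into town."
print(shared_content_words(first, second), is_negative_candidate(first, second))
```

Requiring some lexical overlap makes the negative pairs harder to separate from true paraphrases than randomly paired sentences would be.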
In this dataset, each sentence pair has a relatedness score ∈ [0, 5], with higher scores indicating that the two sentences are more closely related. The dataset comprises pairs of sentences drawn from the publicly available datasets listed below.
- Microsoft Research Paraphrase Corpus: 750 pairs of sentences.
- Microsoft Research Video Description Corpus: 750 pairs of sentences.
- SMTeuroparl: WMT2008 development dataset (Europarl section): 734 pairs of sentences.
- Pascal Dataset: 1000 images with 5 different sentences describing the corresponding image.
- Flickr8k: 7678 images from Flickr with 5 different sentences describing the corresponding image.
- Flickr30k: An image caption corpus consisting of 158,915 crowd-sourced captions describing 31,783 images.
- MSCOCO: 328,000 images with 5 different sentences describing the corresponding image.
- MSR-VTT Dataset: Comprised of 10,000 videos with 20 sentences each describing the videos.
This dataset consists of 9,927 sentence pairs, with 4,500 for training, 500 as a development set, and the remaining 4,927 in the test set. The sentences are drawn from image and video descriptions. Each sentence pair is annotated with a relatedness score ∈ [1, 5], with higher scores indicating that the two sentences are more closely related.
The PPDB contains more than 220 million paraphrase pairs of which 73 million are phrasal paraphrases and 140 million are paraphrase patterns that capture syntactic transformations of sentences.
The WikiAnswers corpus contains clusters of questions tagged by WikiAnswers users as paraphrases. Each cluster optionally contains an answer provided by WikiAnswers users. There are 30,370,994 clusters containing an average of 25 questions per cluster. 3,386,256 (11%) of the clusters have an answer.
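The share of answered clusters quoted above could be re-checked on a downloaded file with a short script. The sketch below assumes the line layout documented for the corpus files (one cluster per line, tab-separated fields prefixed by q: or a:) and assumes UTF-8 text; the function names and any file path passed in are hypothetical.

```python
import gzip


def parse_cluster(line):
    """Split one cluster line into (questions, answers).

    Fields are tab-separated; questions start with "q:" and answers with "a:".
    """
    questions, answers = [], []
    for field in line.rstrip("\n").split("\t"):
        if field.startswith("q:"):
            questions.append(field[2:])
        elif field.startswith("a:"):
            answers.append(field[2:])
    return questions, answers


def answered_fraction(path):
    """Stream one gzip-compressed corpus file and return the fraction of
    clusters that contain at least one answer."""
    total = answered = 0
    # Assumption: the files decode as UTF-8 text.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines defensively
            _, answers = parse_cluster(line)
            total += 1
            answered += bool(answers)
    return answered / total if total else 0.0


# Made-up cluster line, used only to show the return shape.
print(parse_cluster("q:how old is the earth?\tq:what is the age of the earth?\ta:about 4.5 billion years\n"))
```

Reading line by line keeps memory use flat, which matters given the decompressed size of the corpus.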
The data can be downloaded from: http://knowitall.cs.washington.edu/oqa/data/wikianswers/. The corpus is split into 40 gzip-compressed files. The total compressed filesize is 8GB; the total decompressed filesize is 40GB. Each file contains one cluster per line. Each cluster is a tab-separated list of questions and answers. Questions are prefixed by q: and answers are prefixed by a:. Here is an example cluster (tabs replaced with newlines):