Saturday, April 9, 2016

Conversational datasets to train a chatbot

Over the last two months I have read a lot about chatbots, which awakened in me the desire to develop my own. And of course the most trendy approach is some deep learning. That's why, as a first step, I decided to collect the available conversation datasets, which are definitely needed for training. Here is the list of English conversation datasets I found (if you know about more, please leave a comment):

Datasets collected by Chenhao Tan:

  • Argument trees, "successful persuasion" metadata, and related data from the subreddit ChangeMyView. First release 2016.

  • Multi-community engagement (users posting, or not posting, in different subreddits since Reddit's inception). Data includes the texts of posts made and associated metadata, such as the subreddit, the number of upvotes, and the timestamp. First release 2015.

  • Cornell natural-experiment tweet pairs: data for investigating whether phrasing affects message propagation, controlling for user and topic. The zip file can be retrieved from the given URL (first release 2014).

  • Supreme Court dialogs corpus: conversations and metadata (such as vote outcomes) from oral arguments before the US Supreme Court (first release 2012)
  • Wikipedia editor conversations corpus: the zip file can be retrieved from the page I've linked to (first release 2012).
  • Cornell movie-dialogs corpus: conversations and metadata (IMDB rating, genre, character gender, etc.) from movie scripts (first release 2011). This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters.
  • Microsoft Research Social Media Conversation Corpus. A collection of 12,696 Tweet Ids representing 4,232 three-step conversational snippets extracted from Twitter logs. Each row in the dataset represents a single context-message-response triple that has been evaluated by crowdsourced annotators as scoring an average of 4 or higher on a 5-point Likert scale measuring quality of the response in the context.
  • And a conversation on Reddit about a Reddit corpus.
  • The Santa Barbara corpus is an interesting one because it's a transcription of spoken dialogues.
  • The NPS Chat Corpus is part of the Python NLTK. Release 1.0 consists of 10,567 posts out of approximately 500,000 posts we have gathered from various online chat services in accordance with their terms of service. Future releases will contain more posts from more domains. 
  • The NUS SMS Corpus is a collection of SMS messages. It includes both an English and a Chinese corpus.
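
To give a feel for working with one of these datasets: the Cornell movie-dialogs corpus ships as plain-text files that use " +++$+++ " as a field separator (in movie_lines.txt the fields are line ID, character ID, movie ID, character name, and the utterance text). Below is a minimal parsing sketch under that assumption; the field names in the returned dict are my own choice, not part of the corpus.

```python
# Minimal sketch: parsing one record of the Cornell movie-dialogs corpus.
# movie_lines.txt uses " +++$+++ " as its field separator; the fields are
# lineID, characterID, movieID, character name, and utterance text.
SEP = " +++$+++ "

def parse_movie_line(raw):
    """Split one movie_lines.txt record into a dict of its five fields."""
    line_id, char_id, movie_id, name, text = raw.rstrip("\n").split(SEP, 4)
    return {
        "line_id": line_id,      # e.g. "L1045"
        "char_id": char_id,      # e.g. "u0"
        "movie_id": movie_id,    # e.g. "m0"
        "character": name,       # e.g. "BIANCA"
        "text": text,            # the utterance itself
    }

# A line in the documented format:
sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
record = parse_movie_line(sample)
print(record["character"], "->", record["text"])  # BIANCA -> They do not!
```

In practice you would map each line ID to its text this way, then walk movie_conversations.txt (which lists line-ID sequences) to reassemble full exchanges for training.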

  • Off-topic: during my research for conversation datasets I found a relatively large collection of public datasets here.

    EDIT: you can also check the collection of QA datasets.
    Also check out this more comprehensive list of dialogue datasets.


    Vishal said...

    This is so helpful! Thanks... I owe you at least a beer or a coffee!

    Izabelle Fernando said...

    The blog was very informative, I am really crazy about chatbots. I really appreciate your work.