To train the model, I needed a dataset of clean jokes in question-answer text format.
While I did find a dataset of jokes in question-answer format, it was scraped from Reddit’s r/jokes subreddit. Going through the file, I did not like most of the jokes at all, as most of them were highly problematic: often racist, sexist, queerphobic, etc. I would rather compile my own dataset than feed bad data into my model.
One option would be to filter this dataset using a set of “bad” keywords, but filtering a heavily biased dataset was less appealing to me than creating a new one entirely. An alternative would be to write a scraper for r/cleanjokes that keeps only question-answer format jokes, but I didn’t want to invest too much time or energy in this toy project, and I’m personally not a fan of using Reddit for training data in general.
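For reference, here is a rough sketch of what the keyword-filtering option could look like. This is not something I actually ran: the function names, the `(setup, punchline)` tuple layout, and the blocklist contents are all hypothetical, and a real blocklist would need to be much longer and would still miss jokes that are offensive without using any flagged word.

```python
import re

# Hypothetical placeholder; a real blocklist would be far longer.
BLOCKLIST = {"rudeword"}


def is_question_answer(setup, punchline):
    """Treat a joke as question-answer if the setup ends in '?' and has a punchline."""
    return setup.strip().endswith("?") and bool(punchline.strip())


def is_clean(text, blocklist=BLOCKLIST):
    """Return True if none of the lowercase words in text appear in the blocklist."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(blocklist)


def filter_jokes(jokes, blocklist=BLOCKLIST):
    """Keep only question-answer jokes that contain no blocklisted word."""
    return [
        (setup, punchline)
        for setup, punchline in jokes
        if is_question_answer(setup, punchline)
        and is_clean(setup + " " + punchline, blocklist)
    ]
```

The weakness is visible in the code itself: it only matches exact words, so it says nothing about the overall tone of what is left, which is a big part of why I went with a hand-compiled set instead.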
I ended up compiling my own small dataset of clean jokes in the question-answer format, consisting of a little over 500 jokes total. A major trade-off was that the model’s vocabulary is relatively limited, but I enjoyed the jokes much more and felt much better about the data I was feeding into the model.
For the joke2punchline and punchline2joke models, the teacher forcing ratio was set to 0.5. I’d be curious to adjust this parameter and see the results. I would expect a lower ratio to result in more nonsensical output, whereas a higher ratio would likely result in more outputs that are directly from the training set.
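To make the parameter concrete, here is a minimal sketch of how a teacher forcing ratio typically enters a decoder training loop. It is an illustration only, not my actual training code: `choose_decoder_inputs`, `predict_next`, and the `<SOS>` token are hypothetical names, and I am assuming the decision is made once per training sequence (some implementations decide per step instead).

```python
import random


def choose_decoder_inputs(target_tokens, predict_next,
                          teacher_forcing_ratio=0.5, rng=random, sos="<SOS>"):
    """Return the token fed to the decoder at each step of one training sequence.

    predict_next is a hypothetical stand-in for the decoder: it maps the
    current input token to the model's predicted next token.
    """
    # Assumption: one coin flip per sequence decides whether to teacher-force.
    use_teacher_forcing = rng.random() < teacher_forcing_ratio

    decoder_input = sos
    fed_tokens = []
    for target in target_tokens:
        fed_tokens.append(decoder_input)
        predicted = predict_next(decoder_input)
        # Teacher forcing: feed the ground-truth token as the next input.
        # Otherwise: feed the model's own prediction back in.
        decoder_input = target if use_teacher_forcing else predicted
    return fed_tokens
```

With a ratio of 1.0 the decoder always sees the ground-truth previous token, and with 0.0 it always sees its own previous guess; 0.5 mixes the two across training sequences.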
I think an ideal setup would be to lower the teacher forcing ratio in addition to having a much larger training set.
I do think it would be fun to generate jokes and punchlines using an RNN or LSTM before feeding them into these models, so that there is less human intervention (i.e. writing fake jokes/punchlines manually).
I also think the model would be way more fun to play with if I could train it on a much larger dataset, e.g. 10K+ jokes.