Massif


Massif is a free, open-source project to collect, curate, and index content for foreign language learners [GitHub]. Currently, the only target language is Japanese. See the README for more of the thinking behind it.

The current index contains ~30 million sentences extracted from the top 2,000 series on the user-generated novel site 小説家になろう.

Sentences are ranked by quality, which is estimated using a machine learning model (gpt2-japanese). Sentences that the model deems more probable tend to be more idiomatic, use common collocations, and generally use more common words, so they usually make better example sentences. (There has been some research on this specific idea, and related work on applying language models to language learning.)
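
To make the ranking idea concrete, here is a minimal sketch of scoring sentences by how probable a language model finds them. It is not Massif's actual pipeline: a tiny character-bigram model with add-one smoothing stands in for gpt2-japanese so the example runs without any downloads. The function names (`train_bigram`, `avg_log_prob`, `rank_sentences`) and the toy corpus are hypothetical, and averaging the log-probability over sentence length is an assumption made here so that long sentences are not penalized just for being long.

```python
import math
from collections import Counter

START = "<s>"  # marker for the start of a sentence


def train_bigram(corpus):
    """Count character bigrams and their contexts over a list of sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        chars = [START] + list(sentence)
        for prev, cur in zip(chars, chars[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    vocab = {c for s in corpus for c in s}
    # +1 reserves probability mass for characters never seen in training
    return contexts, bigrams, len(vocab) + 1


def avg_log_prob(sentence, model):
    """Average per-character log-probability (higher means more probable)."""
    contexts, bigrams, vocab_size = model
    chars = [START] + list(sentence)
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        # Add-one smoothing keeps unseen bigrams from having zero probability
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += math.log(p)
    return total / max(len(sentence), 1)


def rank_sentences(candidates, model):
    """Sort candidates from most to least probable under the model."""
    return sorted(candidates, key=lambda s: avg_log_prob(s, model), reverse=True)


# Toy training data (hypothetical); a real index would use a far larger model
corpus = ["今日は天気がいい", "今日は雨が降る", "明日は天気がいい"]
model = train_bigram(corpus)

candidates = ["降雨天今気日", "今日は天気がいい"]
for s in rank_sentences(candidates, model):
    print(f"{avg_log_prob(s, model):.3f}  {s}")
```

The natural sentence, built from character sequences the model has seen, scores higher than the scrambled one. A neural model such as GPT-2 applies the same principle with far richer context.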

If you have any questions or feedback, feel free to send me an email.