Chapter 2 Data sources

2.1 Description of dataset

We developed a web crawler to collect data from MyAnimeList, a professional anime rating website in Europe and North America, retrieving only the data needed for our analysis. According to Wikipedia, the site received 120 million visitors a month in 2015. Because the data are public only through the website itself, there were few alternatives to crawling; collecting the data manually would have been possible but is not practical.

The collected data comprise four tables. The first is the Anime table (anime ID, name, average rating, genre and synopsis), which contains information on about 18,000 anime of all types. The second is the User Profile table (user_id, gender, birthday, age, favorite_anime_list), which contains information on about 80,000 users. The third is the Rating table (anime_id, user_id, rating, watching_status, watched_episodes), with about 20 million entries. The last is the Review table (user_id, anime_id, review_text, review_score), with about 80,000 reviews. We also pre-processed the datasets to compute the features needed for later analysis.

All of the data were created by real users of the website, which raises some potential issues. Some reviews and ratings are quite casual and may not reflect the users' true opinions. For example, a user may rate an anime after watching only a small proportion of its episodes, so the rating may not reflect the actual quality of the anime. In addition, some users choose not to reveal certain personal information, and some of the information provided may be incorrect (for example, filled in at random), which could introduce errors into the analysis. Overall, however, the website maintains a well-structured database with many careful ratings and review texts, so we still consider it a trustworthy source of data.
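To make the table layout above concrete, the four tables can be sketched as plain Python records. This is only an illustrative sketch: the column names follow the description above, but the field types (for example, integer IDs and a list of genres) are assumptions, and the helper `drop_casual_ratings` with its `min_watched_episodes` threshold is hypothetical. It shows one possible way to screen out ratings given after very few episodes, not necessarily the pre-processing actually applied.

```python
from dataclasses import dataclass
from typing import List, Optional


# Field types below are assumptions; only the column names come from the text.
@dataclass
class Anime:
    anime_id: int
    name: str
    average_rating: Optional[float]  # may be missing for unrated titles
    genre: List[str]
    synopsis: str


@dataclass
class UserProfile:
    user_id: int
    gender: Optional[str]    # users may leave personal fields empty
    birthday: Optional[str]
    age: Optional[int]
    favorite_anime_list: List[int]


@dataclass
class Rating:
    anime_id: int
    user_id: int
    rating: int
    watching_status: str
    watched_episodes: int


@dataclass
class Review:
    user_id: int
    anime_id: int
    review_text: str
    review_score: int


def drop_casual_ratings(ratings: List[Rating],
                        min_watched_episodes: int = 3) -> List[Rating]:
    """Hypothetical filter: keep ratings from users who watched at least
    `min_watched_episodes` episodes (the threshold is an arbitrary example)."""
    return [r for r in ratings if r.watched_episodes >= min_watched_episodes]
```

A fixed episode threshold is the simplest choice here because the tables as described do not include the total episode count of each anime; with that count available, a proportion-based threshold would match the concern raised above more closely.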

Anime