We start by assessing the relative importance of ingredients scraped from recipes on foodnetwork.com.
About five main types of cuisines can be distinguished. Twenty or even seven topics appears to be too many, but with four topics the topics are well defined and clearly separated. In the seven-topic model, the model is able to recover something close to Mexican, Italian, Greek, Asian, and 'French' cuisines as different topics.
For example, using data containing information about purchases, such as customer IDs, ingredients used in products, and geographical information, we may be able to discover groups of people who prefer characteristic kinds of ingredients. We may also be able to discover groups of ingredients or preparation processes underlying the success or failure of different food products, and use these models to predict whether new products are likely to succeed.
When salt, garlic, and black pepper are included in the lists of ingredients, the word clouds for different cuisines look quite similar. Eliminating these common spices draws out the differences, so that there is less overlap in key ingredients.
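As a rough sketch, the filtering step might look like the following; the ingredient lists and the stop list here are illustrative placeholders, not the actual scraped data:

```python
# Remove ubiquitous ingredients before building word clouds,
# so that cuisine-specific ingredients stand out.
STOP_INGREDIENTS = {"salt", "garlic", "black pepper"}

def filter_ingredients(recipes):
    """Drop ingredients that appear in nearly every cuisine."""
    return [[ing for ing in recipe if ing not in STOP_INGREDIENTS]
            for recipe in recipes]

# Toy example (not the real scraped data):
recipes = [
    ["salt", "soy sauce", "scallions", "garlic"],
    ["black pepper", "sauerkraut", "onions", "salt"],
]
filtered = filter_ingredients(recipes)
# filtered == [["soy sauce", "scallions"], ["sauerkraut", "onions"]]
```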
According to the data, French and Polish dishes tend to contain ingredients common in baked goods and dairy products. This is partly because searching with the keyword 'french' pulls up 'french toast' in many, but not all, results. "French" dishes tend toward the sweet side, while Polish dishes contain ingredients like sauerkraut and onions. Greek and Italian dishes are similar, though there may be a slight emphasis on citrus ingredients in Greek cuisine that is less present in Italian. Japanese, Korean, Thai, Chinese, and Vietnamese food share ingredients not present in other dishes, including scallions and soy sauce. However, Vietnamese food tends more towards sweetness, Korean food emphasises sesame oil, Chinese food emphasises ginger (as do Indian and Japanese food), and Thai food features lime juice and cilantro leaves.
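One simple way to surface these distinguishing ingredients is to rank each ingredient by how much more often it appears in one cuisine's recipes than in everything else. A minimal sketch, with toy data standing in for the scraped recipes:

```python
from collections import Counter

def top_distinctive(cuisine_recipes, other_recipes, n=3):
    """Rank ingredients by how much more frequent they are in one
    cuisine's recipes than in the rest (add-one smoothing on the
    denominator avoids division by zero)."""
    ours = Counter(ing for r in cuisine_recipes for ing in r)
    others = Counter(ing for r in other_recipes for ing in r)
    score = {ing: ours[ing] / (others[ing] + 1) for ing in ours}
    return sorted(score, key=score.get, reverse=True)[:n]

# Toy data, not the real scraped set:
thai = [["lime juice", "cilantro", "fish sauce"],
        ["lime juice", "cilantro", "chili"],
        ["lime juice"]]
rest = [["soy sauce", "ginger"],
        ["sesame oil", "soy sauce"]]
print(top_distinctive(thai, rest))
# 'lime juice' ranks first, then 'cilantro'
```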
I thought it would be interesting to see if machine learning algorithms can pick up on similarities and differences in unlabelled data. I used gensim to train an LDA (Latent Dirichlet Allocation) model and pyLDAvis.gensim to visualize the model.
With 20 topics, many of the topics aren't really well defined, but some of the main categories show up. You can check which words are most important in a given topic, and in which topics certain words are most important, with the interactive tool generated with pyLDAvis.gensim here.
A 7-topic model is able to distinguish something close to Mexican, Italian, Greek, Asian, and 'French' cuisines as different topics. So far, LDA is not able to distinguish more than that, but as more features are added and the dataset is balanced and enlarged, more cuisines should become distinguishable.
Finally, with fewer than 5 topics, LDA is able to separate out the different topics robustly. Here we find something corresponding roughly to Asian, Italian, Greek, and 'French' cuisines.
My name is Daniel. I am a physics PhD student at UC San Diego. My dissertation involves developing computational models of neurons, estimating parameters for these models, and detecting patterns in sets of estimated parameters.
I enjoy discovering insights from data which can inform decision making. When I am not learning about science and technology, I like to run, swing dance, and play violin. I look forward to applying my technical skills in industry to create interesting and valuable products.