Mini Project
Changes in Wine Taste Trends Over the Years
We started the project with a team consisting of five members from different background but everyone wants to become a data analysis. Our team name is 'mini 4t', and for our first mini project together, we conducted data analysis.
We utilized a wine information database from Kaggle to examine the trend of changes in wine taste over the years.
Project Name | Description |
Changes in Wine Taste Trends Over the Years | Utilizing wine information from Kaggle, analyze how the taste preferences of people change over the years based on data and determine which flavors should be targeted for new product releases. |
Category | Description |
Language | SQL |
DB Tool | DBeaver My SQL |
Visualization tool | Canva |
Category | Address |
Result (PPT) | https://www.canva.com/design/DAGCvjGVSvk/pxh8geyleeyFIebe-rpfuQ/view?utm_content=DAGCvjGVSvk&utm_campaign=designshare&utm_medium=link&utm_source=editor |
https://www.kaggle.com/datasets/dev7halo/wine-information
Data informations.
1) id - wine ID
2) name - wine Name
3) producer -
4) nation - Country of origin
5) local1 - region1
6) local2 - region2
7) local3 - region3
8) local4 - region4
9) ~ 20) varieties1 ~ varieties12 - wine varieties
21) type - wine type
22) use - wine use
23) abv - Alcohol by volume
24) degree - Drinking temperature
25) sweet - Sweet rank (score between 1 and 5)
26) acidity - Acidity rank (score between 1 and 5)
27) body - Body rank (score between 1 and 5)
28) tannin - Tannin rank (score between 1 and 5)
29) price - wine price that based on WON NOT dollars!! (₩, WON)
30) year - Production year
31) ml - wine Volume (milliliter)
Acknowledgements
Data was collected on May 24th and preprocessed until May 28th.
Since the data was collected on the Korean website, some countries and regions data includes Korean and English, and the price is in Korean won, not dollars.
We first discussed what purpose to create this data for. Various opinions were suggested, and we decided to analyze the 'wine trends based on global consumer tastes for new product development' using this data.
However, there was a big issue before we starting the project, The CSV file downloaded from Kaggle didn't work properly in MySQL. Although we were able to load the data, we encountered an error where the character format was not converted to numbers. To resolve this issue, we ran MySQL Community, created a new server with localhost, and modified the data format of the CSV file. You can see the process of how we solved it through the link below.
https://junhyundata.tistory.com/40
After solving the issue, we began the data preprocessing process to extract the necessary information from the 21,600 data entries.
To get the desired results, we reviewed how many different years and counted flavors were produced in the data.
The years range from 1900 to 2020, and we utilized columns for each flavor category such as Sweet1, 2, 3, 4, 5, acidity1, 2, 3, 4, 5, body 1, 2, 3, 4, 5, tannin 1, 2, 3, 4, 5.
After reviewing the data, we found that the data before the year 2000 was too sparse, so we replaced missing values with the mode.
Additionally, we grouped the range of years to simplify it. The groups are as follows: 0~1999, 2000~2004, 2005~2009, 2010~2014, 2015~2020.
Using the following code, I discovered information about wine varieties related to sweetness by year.
select years, -- Sweetcolumn
max(if(sweet='sweet1',T,0))"sweet1",
max(if(sweet='sweet2',T,0))"sweet2",
max(if(sweet='sweet3',T,0))"sweet3",
max(if(sweet='sweet4',T,0))"sweet4",
max(if(sweet='sweet5',T,0))"sweet5"
from -- year range
(select case when year between 0 and 1999 then '0~1999'
when year between 2000 and 2004 then '2000~2004'
when year between 2005 and 2009 then '2005~2009'
when year between 2010 and 2014 then '2010~2014'
when year between 2015 and 2020 then '2015~2020'
end YEARS,
sweet,
count(*) T
from cleansingWine cw
group by 1,2
order by 1)A
group by 1
Using by this code, I got results for sweetness, acidity, body, and tannin, totaling four different outcome measures.
Interpretation of Results
'SWEET1' wines increased significantly compared to other wines.
▶ Consumers prefer wines with lower sweetness.
It can be observed that products other than 'SWEET1' are preferred only by enthusiasts.
▶ Wines other than 'SWEET1' are not widely
consumed."
Interpretation of Results
There is a tendency for recent wines to have higher acidity, especially those produced between 2015 and 2020, which exhibit the highest acidity levels.
Flavors such as Acidity1 and Acidity5 are not widely consumed.
Interpretation of Results
Compared to other flavors, lighter intensity flavors are not preferred even over time.
There is a tendency for recent wines to have higher body, especially those produced between 2015 and 2020, which exhibit the highest body levels.
Interpretation of Results
The proportion of Tannin1 wines was high in wines produced in the 2000s, but then sharply declined, only to increase again in consumption.
Excluding Tannin1, the concentration of other tannins increased as the variety expanded.
Based on the numerical data, we created the final graph.
The following code to generate this graph.
select case when year between 0 and 1999 then '0~1999'
when year between 2000 and 2004 then '2000~2004'
when year between 2005 and 2009 then '2005~2009'
when year between 2010 and 2014 then '2010~2014'
when year between 2015 and 2020 then '2015~2020'
end 'YEARS_range',
avg(cast(right(tannin,1) as unsigned)) 'avg_tannin',
avg(cast(right(sweet,1) as unsigned)) 'avg_sweet',
avg(cast(right(acidity,1) as unsigned))'avg_acidity',
avg(cast(right(body,1) as unsigned)) 'avg_body'
from
(
select price, acidity,tannin, sweet, body, year
from cleansingWine cw
group by 1, 2, 3, 4, 5, 6
order by year
) a
group by 1
The results from the code I wrote and the code written by others were different.
SELECT
-- 'year'
CASE
WHEN year BETWEEN 0 AND 1999 THEN '0~1999'
WHEN year BETWEEN 2000 AND 2004 THEN '2000~2004'
WHEN year BETWEEN 2005 AND 2009 THEN '2005~2009'
WHEN year BETWEEN 2010 AND 2014 THEN '2010~2014'
WHEN year BETWEEN 2015 AND 2020 THEN '2015~2020'
ELSE 'Unknown' -- not included
END AS year_range,
-- average
AVG(CASE WHEN tannin = 'tannin1' THEN 1
WHEN tannin = 'tannin2' THEN 2
WHEN tannin = 'tannin3' THEN 3
WHEN tannin = 'tannin4' THEN 4
WHEN tannin = 'tannin5' THEN 5
ELSE NULL
END) AS avg_tannin,
AVG(CASE WHEN sweet = 'sweet1' THEN 1
WHEN sweet = 'sweet2' THEN 2
WHEN sweet = 'sweet3' THEN 3
WHEN sweet = 'sweet4' THEN 4
WHEN sweet = 'sweet5' THEN 5
ELSE NULL
END) AS avg_sweet,
AVG(CASE WHEN acidity = 'acidity1' THEN 1
WHEN acidity = 'acidity2' THEN 2
WHEN acidity = 'acidity3' THEN 3
WHEN acidity = 'acidity4' THEN 4
WHEN acidity = 'acidity5' THEN 5
ELSE NULL
END) AS avg_acidity,
AVG(CASE WHEN body = 'body1' THEN 1
WHEN body = 'body2' THEN 2
WHEN body = 'body3' THEN 3
WHEN body = 'body4' THEN 4
WHEN body = 'body5' THEN 5
ELSE NULL
END) AS avg_body
FROM
cleansingwine
GROUP BY
year_range;
Upon comparing, I realized that there was an issue with my query. I had placed the group by clause inside the subquery, making it difficult to obtain accurate average values. Additionally, the use of the right function to retrieve characters did not assist in correctly converting the column values into numerical ones.
Here is the final query
SELECT years,
AVG(tannin_c) AS tannin_avg,
AVG(sweet_c) AS sweet_avg,
AVG(acidity_c) AS acidity_avg,
AVG(body_c) AS body_avg
FROM (
SELECT CASE
WHEN year BETWEEN 0 AND 1999 THEN '0~1999'
WHEN year BETWEEN 2000 AND 2004 THEN '2000~2004'
WHEN year BETWEEN 2005 AND 2009 THEN '2005~2009'
WHEN year BETWEEN 2010 AND 2014 THEN '2010~2014'
WHEN year BETWEEN 2015 AND 2020 THEN '2015~2020'
END AS years,
CASE
WHEN tannin LIKE '%1' THEN 1
WHEN tannin LIKE '%2' THEN 2
WHEN tannin LIKE '%3' THEN 3
WHEN tannin LIKE '%4' THEN 4
WHEN tannin LIKE '%5' THEN 5
END AS tannin_c,
CASE
WHEN sweet LIKE '%1' THEN 1
WHEN sweet LIKE '%2' THEN 2
WHEN sweet LIKE '%3' THEN 3
WHEN sweet LIKE '%4' THEN 4
WHEN sweet LIKE '%5' THEN 5
END AS sweet_c,
CASE
WHEN acidity LIKE '%1' THEN 1
WHEN acidity LIKE '%2' THEN 2
WHEN acidity LIKE '%3' THEN 3
WHEN acidity LIKE '%4' THEN 4
WHEN acidity LIKE '%5' THEN 5
END AS acidity_c,
CASE
WHEN body LIKE '%1' THEN 1
WHEN body LIKE '%2' THEN 2
WHEN body LIKE '%3' THEN 3
WHEN body LIKE '%4' THEN 4
WHEN body LIKE '%5' THEN 5
END AS body_c
FROM cleansingWine cw
) AS A
GROUP BY 1
ORDER BY 1
Finally, I got the results, and the corresponding graph looks as follows:
Hypothesis: Wine preferences change over time.
Conclusion:
Since the 2000s, there has been a trend towards preferring higher TANNIN and lower SWEET compared to before. Additionally, the variability of scores for each indicator is not significant, suggesting that there has not been a meaningful change in wine preferences over time since the 2000s.
What I felt through this project
Areas for Improvement:
1. Additional Data: It would have been beneficial to have wine rating data. With this information, correlations between taste and ratings could have been analyzed.
2. Sales Volume Information: If data on wine sales volume per wine were available, it would have been possible to understand consumer needs better, leading to more diverse analyses.
Lessons Learned:
1. Clarity of Purpose: Learned the importance of having a clear objective when interpreting and analyzing data.
2. Data Preparation: Learned methods such as data cleansing, handling missing values, and detecting outliers to enhance the accuracy and reliability of analysis.
3. Order of Operations: Learned the significance of the sequence in which sets are grouped during data manipulation, as it can affect the resulting values.
'Project ( Portfolio ) > Mini Project' 카테고리의 다른 글
Mini Project (USA Balance Sheet, Select Years 1980-2020) (0) | 2024.05.03 |
---|