본문 바로가기
Project ( Portfolio )/Mini Project

Mini Project - Changes in Wine Taste Trends Over the Years

by jhleeatl 2024. 4. 16.

Mini Project

Changes in Wine Taste Trends Over the Years

 

 

We started the project with a team consisting of five members from different background but everyone wants to become a data analysis. Our team name is 'mini 4t', and for our first mini project together, we conducted data analysis.

 

We utilized a wine information database from Kaggle to examine the trend of changes in wine taste over the years.

 


 

Project Name Description
Changes in Wine Taste Trends Over the Years Utilizing wine information from Kaggle, analyze how the
taste preferences of people change over the years based on data and determine which flavors should be targeted for
new product releases.

 

Category Description
Language SQL
DB Tool DBeaver My SQL
Visualization tool Canva

 

Category Address
Result (PPT) https://www.canva.com/design/DAGCvjGVSvk/pxh8geyleeyFIebe-rpfuQ/view?utm_content=DAGCvjGVSvk&utm_campaign=designshare&utm_medium=link&utm_source=editor

 

 

 

https://www.kaggle.com/datasets/dev7halo/wine-information

 

Wine Information

Wine Information with nation, varieties, flavor, price, etc

www.kaggle.com

 

 

Data informations.


1) id - wine ID
2) name - wine Name
3) producer -
4) nation - Country of origin
5) local1 - region1
6) local2 - region2
7) local3 - region3
8) local4 - region4
9) ~ 20) varieties1 ~ varieties12 - wine varieties
21) type - wine type
22) use - wine use
23) abv - Alcohol by volume
24) degree - Drinking temperature
25) sweet - Sweet rank (score between 1 and 5)
26) acidity - Acidity rank (score between 1 and 5)
27) body - Body rank (score between 1 and 5)
28) tannin - Tannin rank (score between 1 and 5)
29) price - wine price that based on WON NOT dollars!! (₩, WON)
30) year - Production year
31) ml - wine Volume (milliliter)

 

Acknowledgements

Data was collected on May 24th and preprocessed until May 28th.

Since the data was collected on the Korean website, some countries and regions data includes Korean and English, and the price is in Korean won, not dollars.

 


 

 

We first discussed what purpose to create this data for. Various opinions were suggested, and we decided to analyze the 'wine trends based on global consumer tastes for new product development' using this data.

 

 

However, there was a big issue before we starting the project, The CSV file downloaded from Kaggle didn't work properly in MySQL. Although we were able to load the data, we encountered an error where the character format was not converted to numbers. To resolve this issue, we ran MySQL Community, created a new server with localhost, and modified the data format of the CSV file. You can see the process of how we solved it through the link below.

 

 

https://junhyundata.tistory.com/40

 

How to add localhost at DBeaver (MY SQL)

An error occurred during the process of converting text to numbers in DBeaver. When importing a CSV file into DBeaver, it failed to recognize text as numbers, resulting in this error. T o resolve this, MySQL Community was installed and localhost was config

junhyundata.tistory.com

 

 

After solving the issue, we began the data preprocessing process to extract the necessary information from the 21,600 data entries.

 

To get the desired results, we reviewed how many different years and counted flavors were produced in the data.

 

The years range from 1900 to 2020, and we utilized columns for each flavor category such as Sweet1, 2, 3, 4, 5, acidity1, 2, 3, 4, 5, body 1, 2, 3, 4, 5, tannin 1, 2, 3, 4, 5.

 

 

After reviewing the data, we found that the data before the year 2000 was too sparse, so we replaced missing values with the mode.

 

 

 

Additionally, we grouped the range of years to simplify it. The groups are as follows: 0~1999, 2000~2004, 2005~2009, 2010~2014, 2015~2020.

 

 

Using the following code, I discovered information about wine varieties related to sweetness by year.

 

select years, -- Sweetcolumn
       max(if(sweet='sweet1',T,0))"sweet1",
       max(if(sweet='sweet2',T,0))"sweet2",
       max(if(sweet='sweet3',T,0))"sweet3",
       max(if(sweet='sweet4',T,0))"sweet4",
       max(if(sweet='sweet5',T,0))"sweet5"
from -- year range
(select case when year between 0 and 1999 then '0~1999'
       when year between 2000 and 2004 then '2000~2004'
       when year between 2005 and 2009 then '2005~2009'
       when year between 2010 and 2014 then '2010~2014'
       when year between 2015 and 2020 then '2015~2020'
       end YEARS,
       sweet,
       count(*) T
from cleansingWine cw 
group by 1,2
order by 1)A
group by 1

 

 

 

 

Using by this code, I got results for sweetness, acidity, body, and tannin, totaling four different outcome measures.

 

 

 

 

Interpretation of Results

 

'SWEET1' wines increased significantly compared to other wines.

   ▶ Consumers prefer wines with lower sweetness.

 

It can be observed that products other than 'SWEET1' are preferred only by enthusiasts.

   ▶ Wines other than 'SWEET1' are not widely     

       consumed."

 

 

 

 

 

Interpretation of Results

 

There is a tendency for recent wines to have higher acidity, especially those produced between 2015 and 2020, which exhibit the highest acidity levels.

 

Flavors such as Acidity1 and Acidity5 are not widely consumed.

 

 

 

 

 

 

 

 

Interpretation of Results

 

Compared to other flavors, lighter intensity flavors are not preferred even over time.

 

There is a tendency for recent wines to have higher body, especially those produced between 2015 and 2020, which exhibit the highest body levels.

 

 

 

 

 

Interpretation of Results

 

The proportion of Tannin1 wines was high in wines produced in the 2000s, but then sharply declined, only to increase again in consumption.

 

Excluding Tannin1, the concentration of other tannins increased as the variety expanded.

 

 

 

 

 

 


 

Based on the numerical data, we created the final graph. 

 

The following code to generate this graph.

 

select case when year between 0 and 1999 then '0~1999'
       when year between 2000 and 2004 then '2000~2004'
       when year between 2005 and 2009 then '2005~2009'
       when year between 2010 and 2014 then '2010~2014'
       when year between 2015 and 2020 then '2015~2020'
       end 'YEARS_range',
			 avg(cast(right(tannin,1) as unsigned)) 'avg_tannin',
			 avg(cast(right(sweet,1) as unsigned)) 'avg_sweet',
			 avg(cast(right(acidity,1) as unsigned))'avg_acidity',
			 avg(cast(right(body,1) as unsigned)) 'avg_body'
from
(
select price, acidity,tannin, sweet, body, year
from cleansingWine cw 
group by 1, 2, 3, 4, 5, 6
order by year
) a
group by 1

 

 

The results from the code I wrote and the code written by others were different.

 

SELECT
    -- 'year' 
    CASE
        WHEN year BETWEEN 0 AND 1999 THEN '0~1999'
        WHEN year BETWEEN 2000 AND 2004 THEN '2000~2004'
        WHEN year BETWEEN 2005 AND 2009 THEN '2005~2009'
        WHEN year BETWEEN 2010 AND 2014 THEN '2010~2014'
        WHEN year BETWEEN 2015 AND 2020 THEN '2015~2020'
        ELSE 'Unknown' -- not included
    END AS year_range,
    -- average
    AVG(CASE WHEN tannin = 'tannin1' THEN 1
             WHEN tannin = 'tannin2' THEN 2
             WHEN tannin = 'tannin3' THEN 3
             WHEN tannin = 'tannin4' THEN 4
             WHEN tannin = 'tannin5' THEN 5
             ELSE NULL
        END) AS avg_tannin,
    AVG(CASE WHEN sweet = 'sweet1' THEN 1
             WHEN sweet = 'sweet2' THEN 2
             WHEN sweet = 'sweet3' THEN 3
             WHEN sweet = 'sweet4' THEN 4
             WHEN sweet = 'sweet5' THEN 5
             ELSE NULL
        END) AS avg_sweet,
    AVG(CASE WHEN acidity = 'acidity1' THEN 1
             WHEN acidity = 'acidity2' THEN 2
             WHEN acidity = 'acidity3' THEN 3
             WHEN acidity = 'acidity4' THEN 4
             WHEN acidity = 'acidity5' THEN 5
             ELSE NULL
        END) AS avg_acidity,
    AVG(CASE WHEN body = 'body1' THEN 1
             WHEN body = 'body2' THEN 2
             WHEN body = 'body3' THEN 3
             WHEN body = 'body4' THEN 4
             WHEN body = 'body5' THEN 5
             ELSE NULL
        END) AS avg_body
FROM
    cleansingwine
GROUP BY
    year_range;

 

 

 

Upon comparing, I realized that there was an issue with my query. I had placed the group by clause inside the subquery, making it difficult to obtain accurate average values. Additionally, the use of the right function to retrieve characters did not assist in correctly converting the column values into numerical ones.

 

Here is the final query 

 

SELECT years,
       AVG(tannin_c) AS tannin_avg,
       AVG(sweet_c) AS sweet_avg,
       AVG(acidity_c) AS acidity_avg,
       AVG(body_c) AS body_avg
FROM (
    SELECT CASE 
               WHEN year BETWEEN 0 AND 1999 THEN '0~1999'
               WHEN year BETWEEN 2000 AND 2004 THEN '2000~2004'
               WHEN year BETWEEN 2005 AND 2009 THEN '2005~2009'
               WHEN year BETWEEN 2010 AND 2014 THEN '2010~2014'
               WHEN year BETWEEN 2015 AND 2020 THEN '2015~2020'
           END AS years,
           CASE 
               WHEN tannin LIKE '%1' THEN 1
               WHEN tannin LIKE '%2' THEN 2
               WHEN tannin LIKE '%3' THEN 3
               WHEN tannin LIKE '%4' THEN 4
               WHEN tannin LIKE '%5' THEN 5
           END AS tannin_c,
           CASE 
               WHEN sweet LIKE '%1' THEN 1
               WHEN sweet LIKE '%2' THEN 2
               WHEN sweet LIKE '%3' THEN 3
               WHEN sweet LIKE '%4' THEN 4
               WHEN sweet LIKE '%5' THEN 5
           END AS sweet_c,
           CASE 
               WHEN acidity LIKE '%1' THEN 1
               WHEN acidity LIKE '%2' THEN 2
               WHEN acidity LIKE '%3' THEN 3
               WHEN acidity LIKE '%4' THEN 4
               WHEN acidity LIKE '%5' THEN 5
           END AS acidity_c,
           CASE 
               WHEN body LIKE '%1' THEN 1
               WHEN body LIKE '%2' THEN 2
               WHEN body LIKE '%3' THEN 3
               WHEN body LIKE '%4' THEN 4
               WHEN body LIKE '%5' THEN 5
           END AS body_c
    FROM cleansingWine cw
) AS A
GROUP BY 1
ORDER BY 1

 

 

 

Finally, I got the results, and the corresponding graph looks as follows:

 

 

 

 

Hypothesis:  Wine preferences change over time.

 


Conclusion: 

Since the 2000s, there has been a trend towards preferring higher TANNIN and lower SWEET compared to before. Additionally, the variability of scores for each indicator is not significant, suggesting that there has not been a meaningful change in wine preferences over time since the 2000s.

 


 

What I felt through this project 

 

Areas for Improvement:

1. Additional Data: It would have been beneficial to have wine rating data. With this information, correlations between taste and ratings could have been analyzed.

2. Sales Volume Information: If data on wine sales volume per wine were available, it would have been possible to understand consumer needs better, leading to more diverse analyses.

 

 

Lessons Learned:

 

1. Clarity of Purpose: Learned the importance of having a clear objective when interpreting and analyzing data.

 

2. Data Preparation: Learned methods such as data cleansing, handling missing values, and detecting outliers to enhance the accuracy and reliability of analysis.

 

3. Order of Operations: Learned the significance of the sequence in which sets are grouped during data manipulation, as it can affect the resulting values.