According to Rameker (2015), alcohol consumption is a major factor that correlates with poor academic performance. impressionable generation. Secondary school students are in a developmental transition, and this comes with debilitating effects such as risky alcohol use … While …
There are a few columns which we think could be further clarified or changed. First, open the student-por.csv file in the student_performance source. The dataset which we will be exploring will be the dataset containing 2014.
Since we attempt to predict each student’s level of alcohol consumption, high or low, we mutate the targets to join the weekday and weekend drinking levels, and then set the result to high (1) or low (0). We prefer to use some sort of configuration so that we can input any dataset and perform most of the same analysis. To get an idea of how features interact with each other, we can rank the features by their association with a target, in this case the level of drinking. at Kaggle.
I will be using the Student Alcohol Consumption dataset provided by UCI Machine Learning, which is available in their machine learning repository.
This helps you to understand whether the distribution of the numeric variable is significantly different at different levels of the categorical target. For numeric data, correlations help determine whether we should join the information of highly correlated features. Generally, many models work best with features that are independent of each other and have low correlations.
As we all know, human relationships play a major role in people's lives.
One way would be to create a new feature, FeduMedu, where the value is Medu * 10 + Fedu, and to keep FeduMedu categorical.
http://www.who.int/substance_abuse/publications/global_alcohol_report/en/. /r/datasets.
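As a rough illustration of the FeduMedu idea mentioned above (Medu * 10 + Fedu, kept categorical), the sketch below encodes the two education levels as a two-character label. The function name, the zero-padded string format, and the sample rows are assumptions made for this example; the text only gives the formula.

```python
def combine_parent_education(medu: int, fedu: int) -> str:
    """Encode mother's and father's education (each 0-4) as one categorical label."""
    for name, value in (("Medu", medu), ("Fedu", fedu)):
        if not 0 <= value <= 4:
            raise ValueError(f"{name} must be in 0..4, got {value}")
    # Medu * 10 + Fedu: the tens digit is the mother's level, the units digit the father's.
    code = medu * 10 + fedu
    # Return a zero-padded string so later steps treat it as a category, not a number.
    return f"{code:02d}"


# Hypothetical sample rows, not taken from the dataset.
students = [{"Medu": 4, "Fedu": 4}, {"Medu": 1, "Fedu": 3}, {"Medu": 0, "Fedu": 2}]
for row in students:
    row["FeduMedu"] = combine_parent_education(row["Medu"], row["Fedu"])
```

Keeping the result as a string is one way to stop a model from reading the code as an ordered quantity; a one-hot encoding of the label would be a natural next step.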
For the data exploratory exercise, we choose to examine three columns: This helps you to understand the top dependent variables (grouped by numerical and categorical).
Alcohol is an often abused substance that troubles many individuals in their adulthood as they struggle to cope with emotional and physical stress that
Remove the skewness from the numeric data.
This dataset was collected in order to study alcohol consumption in young people and its effects on students’ academic performance. “Using Data Mining to Predict Secondary School Student Alcohol Consumption.” Department of Computer Science, University of Camerino. fulfilling the Data Mining course in Multimedia University.
The types of columns are listed as follows: One way to get an idea about the structure of the data is to calculate basic statistics, such as the min, max, mean, and median, and missing value counts. courses of mathematics and Portuguese. A twin study of marital status and alcohol consumption.
The original data contains the following attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets: The following grades are related with the course subject, Math or Portuguese:
Before exploration, we combine the rows of the two data sets and mark each instance with the class in which the survey was taken.
Essentially, the blue rectangles show that the observed counts and expected counts (derived from a loglinear model) coincide well, and since the rectangles are large, the confidence covers a majority of the observations.
YAML is a good tool for setting up configurations, but in this case, we will set the configurations manually. This will be attempted
Global Status Report on Alcohol and Health 2014.
Dinescu, D., Turkheimer E., Beam, C.R., Horn, E.E., Duncan, G., Emery, R.E.
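The two preparation steps described above, stacking the rows of the two course files with a marker for the class and then computing basic statistics with a missing-value count, could look roughly like this. It is a minimal sketch: the column name "course", the helper names, and the tiny in-memory rows are all assumptions for illustration, and real code would read student-mat.csv and student-por.csv instead.

```python
from statistics import mean, median


def combine_courses(math_rows, por_rows):
    """Stack the rows of both surveys and tag each row with its course."""
    combined = []
    for course, rows in (("math", math_rows), ("portuguese", por_rows)):
        for row in rows:
            # Copy the row so the input lists are left untouched.
            combined.append({**row, "course": course})
    return combined


def summarize(rows, column):
    """Return min, max, mean, median and the missing-value count for one column."""
    present = [r[column] for r in rows if r.get(column) is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "missing": len(rows) - len(present),
    }


# Made-up rows standing in for the two CSV files.
math_rows = [{"absences": 6}, {"absences": None}]
por_rows = [{"absences": 0}, {"absences": 2}]

combined = combine_courses(math_rows, por_rows)
stats = summarize(combined, "absences")
```

A mean well above the median in such a summary is the kind of hint of skewness the text refers to.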
The following plot shows the prominence of the target: This shows that the target is imbalanced, so we may benefit from oversampling or under-sampling when building our model.
emotion. It does not state the level of intimacy between them.
The data collected, in locations such as Gabriel Pereira and Mousinho da Silveira, includes several pertinent values.
Nicolas Raj.
need to take column 23 (romantic), column 27 (workday alcohol consumption) and/or column 28 (weekend alcohol consumption) into consideration.
Section 2c.
Other Cool Sets.
It would be easy to assume that alcohol consumption reduces the student’s health on a long-term basis. by Dinescu et al. Then we can find out if alcohol consumption will impact the final result indicated by column “G3”.
For the data exploratory exercise, we choose to examine three columns: workday alcohol consumption, weekend alcohol consumption and their relationship status. would be the relationship between their grades with respect to their workday and weekend alcohol consumption. weekend alcohol consumption and their health.
Google Trends - look at what’s going on in the world.
Next Steps in Preparing the Data for a Model
https://archive.ics.uci.edu/ml/datasets/STUDENT%20ALCOHOL%20CONSUMPTION
http://www.who.int/substance_abuse/publications/global_alcohol_report/en/
Data Exploratory Analysis – Student Alcohol Consumption
school – student’s school (binary: ‘GP’ – Gabriel Pereira or ‘MS’ – Mousinho da Silveira)
sex – student’s sex (binary: ‘F’ – female or ‘M’ – male)
age – student’s age (numeric: from 15 to 22)
address – student’s home address type (binary: ‘U’ – urban or ‘R’ – rural)
famsize – family size (binary: ‘LE3’ – less or equal to 3 or ‘GT3’ – greater than 3)
Pstatus – parent’s cohabitation status (binary: ‘T’ – living together or ‘A’ – apart)
Medu – mother’s education (numeric: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu – father’s education (numeric: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob – mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g.
This information can give you a hint of the skewness and of possible outliers.
Using Python to Analyze Secondary School Student Alcohol Consumption and Their Academic Performance. Poonam Kumari and Aditya Pratap, Research Scholars, Department of Computer Science, IITM Janakpuri, New Delhi, India.
Most of the government data sites are utilitarian and simple enough to get the data across in an easy-to-understand way.
Thus, their final grade would be the perfect measure of
However, many do not consider the effect of this intoxicating substance in the context of the younger, more
To compare categorical variables, correlations shouldn’t be used unless the underlying values are ordinal (e.g., going out with friends [numeric: from 1 – very low to 5 – very high]).
This modification coincides with the original report, where the authors modified the target with the formula Alc = (Dalc * 5 + Walc * 2) / 7 and then assumed that values of 3 or more indicated heavy drinkers.
The traditional consensus is that students who consume alcohol at high levels …
You may want to explore combining the grades into one feature since G3 is likely derived from G1 and G2.
comes with the mantle of adulthood. Examples of
The data set consists of two files, one for students in a math class and the other for students in a Portuguese class.
Since the main purpose of the dataset is to find correlations between students with their alcohol consumption patterns, the most conspicuous relationship would be the relationship between their grades with respect to their workday and weekend alcohol consumption.
As you will see in the data, on average, our campus sends at least one student to the emergency room per week who is in some kind of trouble connected with alcohol.
because it would be less accurate for the classification model to predict a numeric value ranging from 0 to 20.
EuroEducation.net.
Fedu and Medu correlate more than some others, so we might want to combine the information.
Economics of Education Review, 30(1), 1-15.
Derived output: Alc = (Walc * 2 + Dalc * 5) / 7, again, in the range of 1 – 5.
following: Figure 1 illustrates the high-level description of our classification.
The traditional
The datasets have a total of 33 attribute columns, from which we could select a subset based on certain parameters.
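The weighted weekly score and the heavy-drinker cutoff quoted above can be written out as a short sketch. The function names and the `inclusive` switch are assumptions for this example; the switch exists because this text states the cutoff as "3 or more" here, while another passage phrases it as strictly greater than 3.0.

```python
def weekly_alcohol_score(dalc: int, walc: int) -> float:
    """Alc = (Dalc * 5 + Walc * 2) / 7: five workdays and two weekend days."""
    for name, value in (("Dalc", dalc), ("Walc", walc)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    return (dalc * 5 + walc * 2) / 7


def is_heavy_drinker(dalc: int, walc: int, threshold: float = 3.0,
                     inclusive: bool = True) -> int:
    """Return 1 (high consumption) or 0 (low consumption)."""
    score = weekly_alcohol_score(dalc, walc)
    # inclusive=True follows the "3 or more" wording; False means "greater than 3".
    high = score >= threshold if inclusive else score > threshold
    return 1 if high else 0


# Hypothetical (Dalc, Walc) pairs, not rows from the dataset.
examples = [(1, 1), (2, 5), (3, 3), (3, 4), (5, 5)]
labels = [is_heavy_drinker(d, w) for d, w in examples]
```

Because both inputs are on a 1 to 5 scale and the weights sum to 7, the score also stays between 1 and 5, which matches the stated range of the derived output.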
workday alcohol consumption, weekend alcohol consumption and their family relationship.
GStatus is derived from the final period grade (G3, column 33), where, according to EuroEducation.net (n.d.),
While I recognize that having a great many students living on campus may be contributing to these numbers, and while I am relieved that students know how and when to seek care, I am c…
This may not hold true because it is a possibility that the
They are:
address - U/R for urban or rural respectively
famsize - LE3/GT3 for less than or equal to three, or greater than three, family members
Pstatus - T/A for parents living together or apart, respectively
Medu - 0 (none) / 1 (primary-4th grade) / 2 (5th - 9th grade) / 3 (secondary) / 4 (higher) for mother's education
Fedu - 0 (none) / 1 (primary-4th grade) / 2 (5th - 9th grade) / 3 (secondary) / 4 (higher) for father's education
Mjob - 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' for the student's mother's job
Fjob - 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' for the student's father's job
reason - close to 'home', school 'reputation', 'course' preference or 'other' for the choice of school
guardian - mother/father/other as the student's guardian
traveltime - 1 (<15 mins) / 2 (15 - 30 mins) / 3 (30 mins - 1 hr) / 4 (>1 hr) for time from home to school
studytime - 1 (<2 hrs) / 2 (2 - 5 hrs) / 3 (5 - 10 hrs) / 4 (>10 hrs) for weekly study time
failures - 1-3/4 for number of class failures (if more than 3, then record 4)
schoolsup - yes/no for extra educational support
famsup - yes/no for family educational support
paid - yes/no for extra paid classes for Math or Portuguese
activities - yes/no for extra-curricular activities
nursery - yes/no for whether attended nursery school
higher - yes/no for desire to continue studies
internet - yes/no for internet access at home
romantic - yes/no for relationship status
famrel - 1-5 scale on quality of family relationships
freetime - 1-5 scale on how much free time after school
goout - 1-5 scale on how much student goes out with friends
Dalc - 1-5 scale on how much alcohol consumed on weekdays
Walc - 1-5 scale on how much alcohol consumed on weekend
absences - 0-93 for number of absences from school
the amount of time a student studies (studytime, column 14)
whether the student joins any extra paid classes (paid, column 18)
whether the student participates in any extra co-curricular activities (activities, column 19)
whether the student is involved in any romantic relationship (romantic, column 23)
the quality of the student's family relationships (famrel, column 24)
the tendency of the student to go out with friends (goout, column 26)
weekday alcohol consumption (Dalc, column 27)
weekend alcohol consumption (Walc, column 28)
activities (column 19), romantic (column 23), famrel (column 24), goout (column 26), Dalc (column 27), Walc (column 28)
Section 3b.
in section E as part of the preprocessing before plotting the data for our exploratory data analysis.
Short exploratory data analysis focusing on the alcohol variables from the Portuguese school dataset.
From this analysis, what might we preprocess before creating the model?
The columns and how they are recorded are as listed below:
Since the main purpose of the dataset is to find correlations between students with their alcohol consumption patterns, the most conspicuous relationship
For the data exploratory exercise, we choose to examine four columns: workday alcohol consumption, first period grade, second period grade and their final grade.
The most recent statistics from the National Institute on Alcohol Abuse and Alcoholism (NIAAA) estimate that about 1,519 college students ages 18 to 24 die from alcohol-related unintentional injuries, including motor vehicle crashes.
workday and/or weekend alcohol consumption would also be lower. The results make sense.
Balsa, A. I., Giuliano, L. M., & French, M. T. (2011).
recorded to have participated.
However, if more elaborate data mining techniques were to be used, more features could be selected and used in order to
2016.
The primary purpose of this data was to examine the relationship between drinking and grades.
The scope of these data sets varies a lot, since they’re all user-submitted, but they tend to be very interesting and nuanced.
result as pass/fail rather than as a discrete number.
relationship with his/her family has a low value.
Five columns play a major role in this: column 27 (workday alcohol consumption)
The number of mathematics students involved in the collection was 395, whereas 649 Portuguese language students were
The original data comes from a survey conducted by a professor in Portugal.
and/or column 28 (weekend alcohol consumption), column 31 (first period grade), column 32 (second period grade) and
13. Testing correlation between alcohol consumption and social, gender, study time, and grade attributes for each student.
This would help the classification model to more accurately predict the class GStatus
(n.d.).
It gives you data about … It’s called the datasets subreddit, or /r/datasets.
avoid drinking in order to prevent their health from further deterioration.
obtain more accurate insights.
Since the distribution is log normal, applying the log transformation would be the most appropriate.
(romantic), only gives information on whether or not the student has a partner.
The reason for this change is that it is easier to classify a student's
We could perform this merge differently later by performing a full join and then dealing with the NA values, by performing the analysis on the individual sets, or by inner joining the two sets and just working with that data.
We could check to see if that hypothesis has a concrete basis by using column 24 (famrel), column 27 (workday alcohol
(workday alcohol consumption) and/or column 28 (weekend alcohol consumption).
We can use studytime (column 14), paid (column 18),
(2016).
With the Student Alcohol Consumption data set from UCI Machine Learning Archive (Fabio Pagnotta 2016), we thought it would be interesting to see what features are important to determine if the student is a heavy drinker or not.
Section 2a.
the passing mark for a student in Portugal would be 10 out of 20.
As a direct outcome of this research, more efficient student prediction
You can browse the subreddit here.
in a student environment as well as their demographic information and other data that may be of some relevance.
https://archive.ics.uci.edu/ml/datasets/STUDENT%20ALCOHOL%20CONSUMPTION.
We assume that a father’s education level is similar to a mother’s education level, so let us visualize the association: The above plot shows that the education levels of mother and father do coincide fairly often, so we might want to explore this further or consider joining these features when preprocessing the data before model building.
Its value for the week is normalized as (workday_alcohol_consumption * 5 + weekend_alcohol_consumption * 2) / 7. If the value is greater than 3.0, then alcohol consumption is considered too high.
3. Click on the arrow near the name of each column to open the context menu.
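The pass/fail recoding of the final grade mentioned above could be sketched as follows, using the stated passing mark of 10 out of 20. The function name and the label strings are assumptions for this example, as is the reading that a grade of exactly 10 counts as a pass.

```python
def grade_status(g3: int, passing_mark: int = 10) -> str:
    """Map a final grade on the 0-20 scale to 'pass' or 'fail'."""
    if not 0 <= g3 <= 20:
        raise ValueError(f"G3 must be in 0..20, got {g3}")
    # Assumption: reaching the passing mark exactly is enough to pass.
    return "pass" if g3 >= passing_mark else "fail"


# Hypothetical final grades, not rows from the dataset.
final_grades = [0, 9, 10, 15, 20]
statuses = [grade_status(g) for g in final_grades]
```

Turning a 21-value grade into two classes is what makes the problem a binary classification task rather than a prediction of a number between 0 and 20.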
If the hypothesis holds true, we would expect to see an increasing level of alcohol
Assuming the romantic relationship in our dataset is of an intimate level, we can find out if this statement holds true.
Our explanation would be more focused on the final grade because we think that students will be
In April 2016, 3000 undergraduate students were randomly selected to participate in the survey, and 802 undergraduate students responded to at least part of the survey.
administrative or police), ‘at_home’ or ‘other’)
reason – reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian – student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime – home to school travel time (numeric: 1 – <15 min., 2 – 15 to 30 min., 3 – 30 min.
Many students in college experiment with drugs and alcohol, and sometimes these two things negatively affect their academic performance.
However, the data reveals that a total of 382 students were in both datasets; this was evident in the exact
Fabio Pagnotta and Hossain Amran, University of Camerino. February 2016. DOI: 10.13140/RG.2.1.1465.8328.
We look a bit closer at the distribution of absences and test for normality. When lambda = 0, the log transform is used.
Best part, these are all free, free…
consumption (both column 27 and 28) when famrel has a low value.
such data are records of demographic information, grades, and alcohol consumption.
The following shows basic statistics of each feature: Addressing skewness, the mean of absences is 4.4348659 and the median is 2, indicating that the data is right-skewed. Given the spread between the min and max, the skewness is significant.
With the Student Alcohol Consumption data set, we predict high or low alcohol consumption of students.
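To make the skewness discussion above concrete, here is a minimal sketch of a moment-based skewness measure and the Box-Cox transform, in which lambda = 0 reduces to the log transform. The sample values are made up, and shifting absences by 1 before transforming is an assumption added here because the column contains zeros and the transform needs strictly positive input.

```python
import math
from statistics import mean, median


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def box_cox(value, lam):
    """Box-Cox transform of one strictly positive value."""
    if value <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        # The lambda = 0 case is defined as the natural log.
        return math.log(value)
    return (value ** lam - 1) / lam


# Made-up, right-skewed counts standing in for the absences column.
absences = [0, 0, 2, 2, 4, 6, 10, 16, 30]
shifted = [a + 1 for a in absences]          # assumption: shift so zeros are allowed
logged = [box_cox(v, 0) for v in shifted]

raw_skew = skewness(absences)
log_skew = skewness(logged)
```

On this toy sample the mean sits above the median and the skewness drops sharply after the log transform, which is the pattern the text describes for absences. Estimating the best lambda from data is a separate step that a statistics library would normally handle.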
To test this, we will also apply the Box-Cox method to determine the parameter that indicates which transformation is best. The box plot portion of the graph also helps us identify outliers.
5. Section 3a.
The violin plot of absences shows more of a log normal distribution, and a large number of outliers lie well outside of the top whisker.
The Core Survey helps us determine the patterns of alcohol and other drug consumption and examine attitudes and perceptions of alcohol and other drug use among Northwestern students.
This analysis was done as part of
The dataset was originally designed for estimating high school students’ performance, with alcohol consumption used as one of the parameters.
National Institute on Alcohol Abuse and Alcoholism: Alcohol Use and Consumption Tables. A large number of HTML and text files on alcohol use and consumption.
Although student achievement is highly influenced by past evaluations, an explanatory analysis has shown that there are also other relevant features (e.g.
consensus is that students who consume alcohol at high levels tend to skip more classes and perform worse in their studies, thus resulting in lower
Journal of Family Psychology, Vol 30(6), Sep 2016, 698-707.
The students included in the survey were in the
Singapore, however, brightens it up with colorful visualizations, splashes of color in the graphs, and a “Similar Datasets” section at the bottom of every data set to encourage readers to explore.
column 33 (final grade).
The following results show the skewness for the numeric features: As we suspected, the feature ‘absences’ contains the most skew.
The targets are the weekday drinking level (1 to 5) and the weekend drinking level (1 to 5).
Depending on the model you choose, removing skewness could help improve the predictive ability of the model.
that particular student's success.
We think that classification is the best data mining technique to be employed because we can build a classification model to
school period grades are available.
You can see the level of correlation by the degree of the ellipse.
grades.
Retrieved from http://www.euroeducation.net/prof/porco.htm.
The dataset was built from two sources: school reports and questionnaires.
For categorical values, we use Cramér’s V. For numeric values, we use the eta-squared value.
For this analysis, we combine the rows of the data sets.
Our main goal is to use data mining to predict school student alcohol consumption and to find the significant factors.
The x-axis is the level of the categorical target.
The dataset we chose is the Student Alcohol Consumption dataset by UCI Machine Learning which can be obtained
Secondary school student alcohol consumption data with social, gender and study information.
drinking alcohol for consolation.
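The two association measures named above can be sketched in plain Python: Cramér's V for a categorical feature against the categorical target, and eta-squared for a numeric feature against it. This is an illustrative sketch, not the code behind the analysis described here; the function names and toy inputs are assumptions, and the Cramér's V below is the uncorrected version without any bias adjustment.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two categorical sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    rows, cols = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, row_total in rows.items():
        for y, col_total in cols.items():
            expected = row_total * col_total / n
            chi2 += (joint.get((x, y), 0) - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    # With a single category on either side there is no association to measure.
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def eta_squared(values, groups):
    """Share of the variance in a numeric feature explained by group membership."""
    grand_mean = sum(values) / len(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return ss_between / ss_total if ss_total else 0.0


# Toy data: the target is 1 for high and 0 for low consumption.
target = [1, 1, 0, 0]
v_perfect = cramers_v(["yes", "yes", "no", "no"], target)
v_none = cramers_v(["yes", "no", "yes", "no"], target)
eta_perfect = eta_squared([5, 5, 1, 1], target)
eta_none = eta_squared([1, 5, 1, 5], target)
```

Both measures run from 0 (no association) to 1 (the feature fully determines, or is fully separated by, the target), which is what makes them usable for ranking features of different types on one scale.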