Link to dataset: kaggle.com/c/kaggle-survey-2020
Abstract – There is considerable buzz in the computer science community about machine learning, data science, AI, and related fields. Data science is very popular and in high demand, yet there isn’t much information about what working with data looks like or what it takes. There is also confusion about the actual roles, such as data analyst, business analyst, or machine learning engineer. This report is based on the responses to a survey that Kaggle conducted among its members to identify trends within this quickly growing field.
I. BUSINESS DATA UNDERSTANDING
A. Data Selection
Kaggle conducted a survey to present a comprehensive picture of the state of data science and machine learning. The survey, 2020 Kaggle Machine Learning and Data Science, was live for 3.5 weeks in October 2020 and had 20,036 responses. The results shed light on who is working with data and what is happening with machine learning across industries, and they identify the best ways to break into the field of data science.
B. Data Understanding
The survey has 20,036 usable responses from respondents in 171 different countries and territories. If a country or territory had fewer than 50 respondents, it was placed in the “Other” category for anonymity. The survey answers cover the demographics, education, employment, and technology usage of Kaggle community members. The data has 355 attributes (columns). All questions are optional, which leads to empty values. Overall, the data is a mix of numeric and alphanumeric values.
Questions 1 to 5 focus on demographic details of the respondent. The field of machine learning and data science is relatively new, so details such as age, gender, and country are interesting to explore. A review of each respondent’s current role and highest level of education provides context on how people operate in this industry. Questions 6 to 39 focus solely on technologies and their usage. Because the data set is a snapshot in time, the most useful visualizations are comparisons among items, relationships, and static compositions.
There is a large amount of missing data throughout the dataset. Overall, 87.6% of the values are missing. There are blanks in both single-choice questions and multiple-choice (“select all that apply”) questions. Single-choice questions with blanks account for almost 55% of the data set.
C. Business Use Case
The primary interest in this dataset is identifying target niches for content development for data scientists and machine learning enthusiasts. After exploring the dataset, a model will be run in the hope of gaining insight that leads to interesting, engaging content. The dataset, 2020 Kaggle Machine Learning and Data Science, was selected by Group 9.
D. Preliminary Conclusions
Group 9 ran a cursory review of the data and suspects that the majority of respondents are male, under 30, and from Asia. A successful data scientist would be one with a master’s degree who is currently employed, has 4+ years of programming experience, and earns $100,000 on average. Python would be the most popular language used. Women would account for less than 30% of respondents, based on current statistics on women employed in computer science.
E. Visualizations
The cursory review led Group 9 to build visualizations to begin gaining insight from the data, which will allow for further hypothesis testing and research. Due to the ambiguity of the ‘Other’ category, it is removed from the visualizations. Questions 1 to 5 build the background of the Kaggle community. The field of machine learning and data science is fairly new, so it is interesting to see how age, gender, and country are distributed.
Figure 1. Top 5 Countries With The Most Respondents
Figure 1 shows that India, the United States, and Brazil are the most responsive countries. Up to 54 countries responded to the survey. India is the largest group, accounting for about 32% of responses. India is a densely populated country with a high interest in STEM, so it makes sense that it is highest on the list. It is followed by the United States at 12.11% and, surprisingly, Brazil at 3.76%. India and the United States together account for 43.77% of the responses; these two countries are the largest sub-groups of the Kaggle community.
Figure 2. Age | 25 - 29 Group Are The Majority of Respondents
Figure 2 shows that the largest age group among respondents is 25 to 29, which accounts for 20.44% of respondents. Over 50% of the respondents are younger than 30, so the Kaggle community is fairly young. A further 17.64% of respondents are between 18 and 21, which points to strong interest in data science among young people.
Figure 3. Gender and Age Ratio
Figure 3 represents the ratio between gender and age. As stated above, over 50% of respondents are under the age of 30. However, a higher percentage of women are 18 to 21 than in any other age group, which shows promise of more women entering the field. The highest percentage of men is between the ages of 30 and 34.
Figure 4. Gender & Country of Origin Ratio
Figure 4 shows the gender and country of origin ratio, comparing the number of male and female respondents from each country. India, the United States, and Brazil are the largest groups of respondents in the Kaggle community. In addition, males are the dominant respondents, whereas female respondents make up less than 20%. This reflects the youth of the field of machine learning and data science.
Figure 5. Country of Origin and Role Ratio
Figure 5 is a stacked bar chart representing the ratio between country of origin and role. Overall, the most reported role is student, which indicates that a large number of work-ready graduates will be looking for work. It also implies an attractive target market for content: students. Again, India leads because of its large population. China has more students and data analysts than other roles. The United States has more data scientists than students, and more research scientists than India.
Figure 6. Gender and Role Ratio
Figure 6 represents the ratio between gender and role. Data scientist is the leading role in the survey among both men and women. The chart shows that there are more men than women among students, data scientists, and software engineers; the considerable gender gap in computer science is already well known. However, women are most represented as data scientists, data analysts, software engineers, and research scientists. These appear to be the roles in which women’s participation is growing fastest.
Figure 7. Most Frequent Education Level
Figure 7 represents the education levels of respondents. Bachelor’s and master’s degrees appear most often.
Figure 8. Most Used Programming Language
Figure 8 represents the most used programming languages. The top two are Python and SQL. This is helpful to know because it indicates trends in the industry.
II. DATA PRE-PROCESSING (E)
For the most part, the data is ready for visualization but not necessarily for analysis. Knowledge discovery in databases (KDD) involves multiple processes, and the one that accounts for about 80% of the effort is preparing the data. This means organizing the data into a standard form that is ready for data mining and that improves data mining performance.
Group 9 prepared the final dataset for modeling in the following manner. Nineteen duplicates were found and removed. Error detection was performed by calculating the min, max, average, and number of blanks. For the country variable, all records listed as ‘Other’ were removed: any country with fewer than 50 respondents had been labeled ‘Other’, and since this category is only a small portion of the dataset, it was removed entirely. For multiple-choice (‘select all that apply’) questions, answers were converted to binary, with blanks becoming 0.
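The preparation steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows, the column names (`country`, `Q7_Python`, `Q7_SQL`), and the helper `preprocess` are hypothetical and do not come from the survey file or from the tools Group 9 used.

```python
# Hypothetical miniature of the survey table; the column names are assumptions.
rows = [
    {"country": "India", "Q7_Python": "Python", "Q7_SQL": ""},
    {"country": "Other", "Q7_Python": "", "Q7_SQL": "SQL"},
    {"country": "Brazil", "Q7_Python": "Python", "Q7_SQL": "SQL"},
    {"country": "India", "Q7_Python": "Python", "Q7_SQL": ""},  # duplicate of the first row
]
MULTI_SELECT = ["Q7_Python", "Q7_SQL"]


def preprocess(records, multi_select_columns):
    """Drop 'Other' countries, drop exact duplicates, binarize multi-select answers."""
    seen = set()
    cleaned = []
    for record in records:
        if record["country"] == "Other":
            continue  # the 'Other' category is removed entirely
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        encoded = dict(record)
        for column in multi_select_columns:
            # A blank means the option was not selected, so it becomes 0.
            encoded[column] = 0 if record[column].strip() == "" else 1
        cleaned.append(encoded)
    return cleaned


cleaned = preprocess(rows, MULTI_SELECT)
for record in cleaned:
    print(record)
```

With the four hypothetical rows, two records remain: one from India and one from Brazil, each with 0/1 values in the multi-select columns.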
There are 6,231,964 blanks in the entire dataset. The multiple-choice blanks are changed to ‘blanks’ and then a dummy conversion is applied; all dummies created for the blanks are removed, since there is no need to analyze blanks. Records with single-choice blanks, which make up about 55% of the dataset, were removed, reducing the dataset from 20,036 records to 9,142. The dataset is analyzed in Tableau.
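The missing-data figures quoted in this report can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the report (20,036 rows, 355 columns, 6,231,964 blanks, 9,142 remaining records) and assumes that every blank is counted once per cell.

```python
# Figures quoted in the report.
TOTAL_ROWS = 20_036
TOTAL_COLUMNS = 355
TOTAL_BLANKS = 6_231_964
ROWS_AFTER_CLEANING = 9_142

# Assumption: the table has one cell per row/column pair.
total_cells = TOTAL_ROWS * TOTAL_COLUMNS
missing_share = TOTAL_BLANKS / total_cells
rows_removed_share = 1 - ROWS_AFTER_CLEANING / TOTAL_ROWS

print(f"cells: {total_cells:,}")                     # cells: 7,112,780
print(f"missing share: {missing_share:.1%}")         # missing share: 87.6%
print(f"rows removed: {rows_removed_share:.1%}")     # rows removed: 54.4%
```

Both results agree with the text: about 87.6% of all cells are blank, and removing records with single-choice blanks drops roughly 55% of the rows.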
III. MODELING
F. Discussion
Because the data is a snapshot in time, the most appropriate models may be classification, linear regression, multiple linear regression, and clustering. Classification is a simple, effective method that assigns labels to values. It is helpful for categorical data and for determining which class a value belongs to. However, it has limitations: it can overfit, and it can be slow on large datasets.
The hypothesis to be tested is which skills (education and tools) yield the most success. Success is measured by annual salary, on the assumption that the higher the compensation, the more successful one is. Identifying these skills will also aid in developing content that helps others become successful in this emerging field. This calls for predictive modeling techniques.
Prediction is best suited to continuous variables: it allows a continuous value to be predicted from another variable or a set of variables. Linear regression would help explore whether there is a relationship between salary and education. One limitation of linear regression is its sensitivity to outliers, which can skew predictions.
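To make the outlier caveat concrete, the following minimal sketch fits a one-variable least-squares line in plain Python. The data points are hypothetical and are not taken from the survey, and `fit_line` is an illustrative helper rather than anything used in the report.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: a perfectly linear relationship, y = 2x.
x = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]
# The same data with the last observation replaced by an outlier.
y_outlier = [2, 4, 6, 8, 40]

clean_intercept, clean_slope = fit_line(x, y_clean)
outlier_intercept, outlier_slope = fit_line(x, y_outlier)

print(clean_intercept, clean_slope)
print(outlier_intercept, outlier_slope)
```

A single extreme value moves the fitted slope from 2 to 8 in this toy example, which is why outliers in compensation deserve a look before fitting.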
Multiple linear regression goes a step further than linear regression. It will help determine which skills and/or locations yield the highest compensation. The independent variables (predictors) will be age, education, gender, programming language used, origin (country), current role, visualization tools, ML usage, BI tools, analytical tools, and cloud platform used; compensation is the dependent variable.
As stated before, the hypothesis is to explore direct relationships between the variables age, technology, skills, and education. The models to be used are linear regression, multiple linear regression, and stepwise regression. The variables to be used are: age, gender, education, role, years of programming experience, most used language, most used IDE, most used visualization tool, ML experience, most used ML framework, compensation, and most used cloud platform.
The plan is to convert all non-numeric, categorical variables to dummy variables. Numeric variables, all of which are ranges, will be converted to the maximum of their range.
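The two conversions in this plan can be sketched as follows. This is a minimal illustration in plain Python, not the procedure used in XLMiner; the helper names `range_upper_bound` and `one_hot` and the sample labels are assumptions made for the example.

```python
import re


def range_upper_bound(label):
    """Return the largest number in a range label such as '25-29' or '1,000-1,999'."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    if not numbers:
        raise ValueError(f"no number found in {label!r}")
    return max(numbers)


def one_hot(values):
    """Turn a list of category labels into one 0/1 dummy dict per value."""
    categories = sorted(set(values))
    return [{category: int(value == category) for category in categories}
            for value in values]


# Hypothetical sample labels.
print(range_upper_bound("25-29"))        # 29
print(range_upper_bound("1,000-1,999"))  # 1999
print(one_hot(["Master's", "Bachelor's", "Master's"]))
```

When the dummies feed a regression that has an intercept, one category is normally left out as the baseline so that the dummy columns are not perfectly collinear.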
G. Model Application
Linear Regression
Linear regression was applied to education and compensation. The intent is to see whether education is a key predictor of compensation, which would aid in identifying target markets for content development. Education was converted to dummy variables and blanks were removed. Compensation was listed as a series of ranges, which were converted to the maximum compensation for each range. For example, $0-999 was converted to 1,000.
Table 1. Linear Regression Summary - Education Level & Compensation
Table 1 is the regression summary for education level and compensation. It shows that the adjusted R2 is .0243, which is low: the data is not close to the fitted model. This could be due to the qualitative nature of the data, with categorical values converted to binary. There is not enough quantitative data to get a well-fitted model for compensation and education. The summary also includes the coefficients table, which shows that the doctoral degree has the most significant effect on compensation, with a p-value of 2.48E-05.
Let’s attempt to predict compensation for someone with a PhD, assuming the person has a bachelor’s and a master’s as well, using the coefficients:
Compensation = 30783.02 + 9498.23 × bachelors + 42155.30 × doctoral + 24806.75 × masters + 1672.54 × noFormaledu + 10240.54 × professional + 16735.16 × someEdu
Without a bachelor’s and master’s but just a doctorate: $72,938.32 = 30783.02 + 9498.23 × 0 + 42155.30 × 1 + 24806.75 × 0 + 1672.54 × 0 + 10240.54 × 0 + 16735.16 × 0
With a bachelor’s, master’s, and a doctorate: $107,243.30 = 30783.02 + 9498.23 × 1 + 42155.30 × 1 + 24806.75 × 1 + 1672.54 × 0 + 10240.54 × 0 + 16735.16 × 0
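The prediction with all three degrees can be reproduced with a short script. The intercept and coefficients are the ones listed in the equation above; the function name `predict_compensation` and the dictionary layout are assumptions made for this sketch.

```python
# Intercept and coefficients as reported in the regression equation.
INTERCEPT = 30783.02
COEFFICIENTS = {
    "bachelors": 9498.23,
    "doctoral": 42155.30,
    "masters": 24806.75,
    "noFormaledu": 1672.54,
    "professional": 10240.54,
    "someEdu": 16735.16,
}


def predict_compensation(indicators):
    """indicators maps a coefficient name to 0 or 1; missing names count as 0."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * indicators.get(name, 0) for name in COEFFICIENTS
    )


all_three = predict_compensation({"bachelors": 1, "masters": 1, "doctoral": 1})
print(f"${all_three:,.2f}")  # $107,243.30
```

Each indicator set to 1 simply adds its coefficient to the intercept, so the prediction is the intercept plus the three degree coefficients.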
This implies that someone with a PhD would make approximately $110,000 annually. The calculation assumes that a PhD cannot be earned without a bachelor’s and a master’s, so those terms must be added in as well. A review of the data for respondents with a PhD found that their average compensation was $74,755.71; that figure is for the PhD alone, without combining the prerequisite degrees. When averaging across those who have all three, the expected compensation came out to be $173,946.43. The variance from the regression, as well as the poor fit, must be accounted for when performing these calculations. From this data, it can be inferred that someone with a PhD working in data science would earn somewhere between $107,243.30 and $173,946.43. This doesn’t include other factors such as years of programming experience, years of machine learning experience, and primary programming language used.
Table 2. Descriptive Statistics for Compensation based on Education Levels
Table 2 shows that the average compensation for someone with a PhD is about $75k. However, that doesn’t seem correct: after further review, the average annual compensation for those with a doctorate is $102k. This is why, to predict the compensation of someone with a doctorate, it is best to include previous education levels.
Multiple Linear Regression
Let’s explore the relationship between compensation and age, education, country, gender, most used programming language, and years of experience with both programming and machine learning. To reiterate, we’re looking for what it takes to be a successful member of the data science and machine learning field.
Due to the limitations of our current XLMiner license, only the top five countries are taken into account: India, United States, Brazil, Japan, and Russia. Gender takes into account only those who identified as either a man or a woman, and is converted to a binary variable (1 = man, 0 = woman). Education levels are converted to binary dummy variables.
Table 3. Regression Summary
Table 4. Validation Summaries
Table 3 is the regression summary of the multiple linear regression model. The R2 value is .47, or 47%, which is low and implies that the data is not close to the fitted model. Table 4 shows that the validation set performed better than the testing data set, as indicated by its lower RMSE; a lower RMSE indicates a better fit. The R2 is higher than the R2 value from the linear regression model summary; however, R2 generally increases as more predictors are added, whether or not they improve the model.
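For readers who want to see how the fit statistics discussed here are computed, the sketch below implements RMSE, R2, and adjusted R2 in plain Python. The actual and predicted values are hypothetical and are not taken from Tables 3 or 4.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: lower means predictions are closer to the data."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_rows, n_predictors):
    """R2 penalized for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)


# Hypothetical values for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

r2 = r_squared(actual, predicted)
print(round(rmse(actual, predicted), 4))
print(round(r2, 4))
print(round(adjusted_r_squared(r2, n_rows=4, n_predictors=1), 4))
```

Adjusted R2 is the fairer figure to compare across models with different numbers of predictors, because plain R2 does not decrease when a predictor is added.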
Let’s attempt to predict compensation using the coefficients for the following profile: 34 years old, male, from the United States, has a master’s, five years of programming experience, uses Python primarily, two years of ML experience, and is a data scientist.
Table 5. Regression Model Test
Table 5 shows that a 34-year-old male from the United States who has a master’s, five years of programming experience, and two years of ML experience, uses Python primarily, and is a data scientist would likely earn $124,402.99.
Multiple Linear Regression with Stepwise Regression
Stepwise regression is a method for identifying the ‘best’ subset of predictors for a regression model. The idea is to reduce the number of predictors, isolating the key ones to get the best-fitting model. Stepwise regression is applied to the same data used for the multiple linear regression model above. From the results in Table 6, the best subset is 17.
Table 6. Best Subset Details
Subset 17 is the ‘good’ model because of its RSS, Mallows’s Cp, and adjusted R2 values: its Cp is the closest to the number of predictors, and its R2 and adjusted R2 values are the closest to 1. Subset 17 suggests that the following predictors are ‘good’: gender, Brazil, India, Russia, United States, doctoral, professional, some college, experience programming, Javascript, Bash, MATLAB, ML experience, Data Scientist, Product/Project Manager, and Research Scientist. With these predictors, another multiple linear regression model is run.
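The Cp criterion used to pick a subset can be illustrated with a short sketch. It uses the standard definition Cp = RSS_p / MSE_full − n + 2p, where p counts the fitted parameters including the intercept. The candidate subsets, their RSS values, and the helper names are hypothetical; they are not the values from Table 6, and the selection rule shown is only one reasonable reading of "Cp closest to the number of predictors".

```python
def mallows_cp(rss_subset, mse_full, n_rows, n_parameters):
    """Mallows's Cp = RSS_p / MSE_full - n + 2p, with p counting the intercept."""
    return rss_subset / mse_full - n_rows + 2 * n_parameters


def closest_to_parameter_count(candidates, mse_full, n_rows):
    """Pick the (label, rss, n_parameters) candidate whose Cp is nearest to p."""
    return min(
        candidates,
        key=lambda c: abs(mallows_cp(c[1], mse_full, n_rows, c[2]) - c[2]),
    )


# Hypothetical candidates: (label, residual sum of squares, parameter count).
candidates = [
    ("A", 260.0, 3),
    ("B", 192.0, 5),
    ("C", 188.0, 8),
]
N_ROWS = 100     # hypothetical sample size
MSE_FULL = 2.0   # hypothetical mean squared error of the full model

for label, rss, p in candidates:
    print(label, p, mallows_cp(rss, MSE_FULL, N_ROWS, p))

best = closest_to_parameter_count(candidates, MSE_FULL, N_ROWS)
print("selected:", best[0])
```

Note that the full model always has Cp equal to its own parameter count by construction, so it is usually excluded from this comparison and the criterion is read alongside RSS and adjusted R2, as the report does.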
Table 7. Subset 17 Regression Summary
Table 8. Performance Summaries
Table 7 shows that the fit is weaker than that of the previous model. Table 8 shows that the validation dataset performed better than the training dataset. In this dataset, the most important predictor is years of programming experience, followed by machine learning experience.
With this model, let’s determine the compensation of a male data scientist from the United States with five years of programming experience and two years of ML experience.
With this subset, a male data scientist from the United States with five years of programming experience and two years of ML experience would earn $122,633.62.
IV. Evaluation and Deployment
Group 9 chose this dataset, the 2020 Kaggle Machine Learning and Data Science survey, to identify the elements that lead to success in the field of data science and machine learning. Performing an exploratory data analysis and investigating the relationship between compensation and other variables led to the following questions: Which country had the most participants in the survey? Did more males or females participate in the survey? What age range had the most participants in the survey? What age range participated most from the male group in the survey? What age range participated most from the female group in the survey?
After analysis, it was clear that the original dataset was highly qualitative and required extensive preparation for regression analysis. The visualizations made it clear that the Kaggle community is predominantly male and Indian, prefers Python, is mostly younger than 30 (57%), is employed as data scientists, and earns $52,800.00 on average. After cleaning the data and running the regression analysis, it was found that maximum compensation is directly related to the amount of programming and machine learning experience, country of residence, and a high education level.
The survey data has made it clear that content for those pursuing a career in data science and machine learning should help them accrue 2-5 years of programming and machine learning experience in Python and SQL. Content will need to be geared towards Python. The analysis has led Group 9 to decide that the target market is women aged 24 and younger who live in India, Brazil, Russia, Japan, and the United States. This is valuable insight for content development.
Further analysis will be needed to identify media channels for marketing based on geographic region, gender, and age, and to identify which platforms should host the content. A set of questions near the end of the survey would be helpful in determining which content is likely to be consumed in the next two years. Additional analysis will also be needed to identify which cloud computing platforms, cloud products, BI tools, and big data products are currently in use versus where interest is growing.