Introduction to Exploratory Data Analysis with Python
This four-session, hands-on introductory program teaches essential data analysis tools and techniques in Python through guided activities. The program covers Google Colab, Python, NumPy, Pandas data frames, exploratory data analysis, and visualization. The target audience is beginners looking to develop core data analysis skills in Python.
S1- Intro to Google Colab, Python, and NumPy 8:00 to 9:30 AM EDT, January 15th
Resources:
Session Materials:
S2- Introduction to the Pandas Dataframe 8:00 to 9:30 AM EDT, January 22nd
S3- Exploratory Data Analysis using Pandas 8:00 to 9:30 AM EDT, January 29th
- Descriptive Statistics
- Pivot Tables
- Final Submission Research Question Example
- Lecture Video
S4- Visual Exploratory Data Analysis 8:00 to 9:30 AM EDT, February 5th
- Exploratory visualization versus Explanatory visualization
- Seaborn visualization toolbox
- Lecture Video
Mini Project (Required for the Certificate)
Due Date: Feb 12, 2024 11:59 PM EST.
Introduction:
Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics. It typically shows us what the data can tell us before any formal modeling or hypothesis testing. EDA uses several approaches, such as:
- Descriptive analysis describes or summarizes a data set using statistical measures such as the mean, variance, and median, and using visualization, including:
  - Exploratory visualization, which helps us understand the data better and prepare for deeper predictive analysis (scatter plots, histograms, box plots, etc.).
  - Explanatory visualization, which tells a story about our findings or the data itself (interactive dashboards and infographics).
- Cluster analysis aims to uncover hidden patterns in a data set (k-means, hierarchical, or density-based clustering).
- Affinity analysis tries to discover relationships among the variables (rule mining, correlation analysis).
Each of these approaches has specific applications depending on the data at hand and the goals of the analysis. Therefore, the first step is to define the objectives of the study and ask research questions.
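As a minimal sketch of the descriptive and correlation-based approaches listed above, the following cell computes a few summary statistics with Pandas. The values are made up for illustration and are not taken from the cotton data set; only the column names `yield_lbs_per_acre` and `precipitation` follow the data description further down.

```python
import pandas as pd

# Made-up illustrative values; NOT taken from the real cotton data set.
df = pd.DataFrame({
    "yield_lbs_per_acre": [600.0, 500.0, 800.0, 900.0, 1200.0, 1400.0],
    "precipitation": [20.0, 15.0, 45.0, 50.0, 8.0, 10.0],
})

# Descriptive analysis: summary statistics for one numeric column.
yield_col = df["yield_lbs_per_acre"]
summary = {
    "mean": yield_col.mean(),
    "median": yield_col.median(),
    "variance": yield_col.var(),  # sample variance (ddof=1) by default
}
print(summary)

# describe() reports count, mean, std, min, quartiles, and max in one call.
print(df.describe())

# A simple form of affinity analysis: pairwise Pearson correlations.
corr = df[["yield_lbs_per_acre", "precipitation"]].corr()
print(corr)
```

With the real data, the same calls can be applied to any of the numeric columns to get a first impression before choosing research questions.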
Forming Research Questions for EDA:
Research questions for EDA differ from those of predictive and prescriptive data analysis and modeling. Predictive analysis asks research questions about unknown events. For example, “Can we predict the cotton yield as a function of drought indicators?” EDA research questions are broad and exploratory. The goal is to discover patterns, anomalies, relationships, or trends in the data. For example, “Are there any differences in yield among the states?” or “Are there any unusual weather patterns in the data set that we should examine more closely?”
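An exploratory question about yield differences among states could be approached with a group-by summary and a pivot table, roughly as sketched here. The rows are made up for illustration; the column names `year`, `state`, and `yield_lbs_per_acre` follow the data description further down.

```python
import pandas as pd

# Made-up illustrative values; NOT taken from the real cotton data set.
df = pd.DataFrame({
    "year": [2019, 2020, 2019, 2020],
    "state": ["Texas", "Texas", "Georgia", "Georgia"],
    "yield_lbs_per_acre": [600.0, 500.0, 800.0, 900.0],
})

# Group by state and summarize the yield, highest average first.
yield_by_state = (
    df.groupby("state")["yield_lbs_per_acre"]
      .agg(["mean", "min", "max"])
      .sort_values("mean", ascending=False)
)
print(yield_by_state)

# Pivot table: one row per year, one column per state, mean yield in each cell.
yield_table = df.pivot_table(
    index="year",
    columns="state",
    values="yield_lbs_per_acre",
    aggfunc="mean",
)
print(yield_table)
```

The group-by output answers "which states differ on average", while the pivot table makes it easier to see whether those differences hold from year to year.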
Required Assignment for Introduction to Exploratory Data Analysis with Python Certificate:
- Formulate three to five research questions based on the data set given below about US cotton yields.
- Explain why you ask each research question.
- Use techniques we covered in the workshop, such as Pandas’ descriptive statistical tools, group by, pivot tables, and the Seaborn visualization package.
- Create a Google Colab Notebook in the following format for each research question:
  - Text Cell: State your research question (in bold) and briefly justify the question and the methods you selected to answer it.
  - Python Cells: Create a Python cell for each method you use to answer the research question. You may use multiple approaches and methods, for example descriptive statistics or charts. Put the code for each one in a separate cell.
  - Text Cell: Briefly interpret your findings and observations.
- Share your Colab notebook publicly by selecting "Anyone with the link" with view permission. Please run all the cells so that the output is visible to others, and then enter the URL of your notebook and the other requested information in the following form by Feb 12, 2024, 11:59 PM EST:
- https://forms.gle/wPkarAEeCDfTjUxr9
- I prepared a starter notebook with an example for you. You can copy this file to your Colab space to get started.
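To illustrate what a single Python cell for a chart might look like, here is a small Seaborn sketch. It is not the starter notebook, and the values are made up for illustration; only the column names `state` and `yield_lbs_per_acre` follow the data description below.

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook; omit this line in Colab
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up illustrative values; NOT taken from the real cotton data set.
df = pd.DataFrame({
    "state": ["Texas", "Texas", "Texas", "Georgia", "Georgia", "Georgia"],
    "yield_lbs_per_acre": [600.0, 500.0, 650.0, 800.0, 900.0, 850.0],
})

# Box plot: compare the distribution of yield across states.
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=df, x="state", y="yield_lbs_per_acre", ax=ax)
ax.set_title("Cotton yield by state (made-up sample values)")
ax.set_xlabel("State")
ax.set_ylabel("Yield (lbs per harvested acre)")
fig.tight_layout()
```

In the notebook, a text cell before this one would state the research question, and a text cell after it would interpret what the chart shows.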
Data Set:
The USA cotton data was downloaded from Kaggle (Effect of climate change on commodity yields, https://www.kaggle.com/datasets/abhisaha97/effect-of-climate-change-on-commodity-yields) and modified to have simpler column names.
You can download the modified data file from:
https://sites.psu.edu/auk3/files/2024/01/cotton-a69096c342c63445.csv
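A minimal sketch of loading the file with Pandas follows. It assumes the file is a comma-separated CSV with a header row; the helper name `load_cotton` is hypothetical. So that the sketch runs without network access, it is demonstrated on a small in-memory stand-in with made-up rows rather than on the real file.

```python
import io

import pandas as pd

# URL as given in the text; reading from it requires network access.
COTTON_URL = "https://sites.psu.edu/auk3/files/2024/01/cotton-a69096c342c63445.csv"


def load_cotton(source=COTTON_URL):
    """Read the cotton CSV from a URL, a file path, or a file-like object.

    Hypothetical helper; assumes a comma-separated file with a header row.
    """
    return pd.read_csv(source)


# Stand-in with made-up rows so the sketch runs offline.
sample_csv = io.StringIO(
    "year,state,planted_acres,harvested_acres,yield_lbs_per_acre\n"
    "2019,Texas,100.0,80.0,600.0\n"
    "2020,Texas,110.0,90.0,500.0\n"
)
df = load_cotton(sample_csv)

# First checks after loading: size, column types, and missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
```

In Colab, calling `load_cotton()` with no argument would read directly from the URL above.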
Column | Name | Description
year | Year | The year of the data observation.
state | State | The name of the state where the data was recorded.
planted_acres | Planted (1000 Acres) | The area in thousands of acres that was planted with crops.
harvested_acres | Harvested (1000 Acres) | The area in thousands of acres that was harvested for crops.
yield_lbs_per_acre | Yield (Pounds/Harvested Area) | The average yield of crops in pounds per harvested area.
avg_temp | Average Temperature Value | The average temperature recorded during the specified period.
avg_temp_anomaly | Average Temperature Anomaly | The deviation from the long-term average temperature for the specified period.
max_temp | Maximum Temperature Value | The highest temperature recorded during the specified period.
max_temp_anomaly | Maximum Temperature Anomaly | The deviation from the long-term average maximum temperature for the specified period.
min_temp | Minimum Temperature Value | The lowest temperature recorded during the specified period.
min_temp_anomaly | Minimum Temperature Anomaly | The deviation from the long-term average minimum temperature for the specified period.
precipitation | Precipitation Value | The total amount of precipitation (rainfall) recorded during the specified period.
precipitation_anomaly | Precipitation Anomaly | The deviation from the long-term average precipitation for the specified period.
cooling_degree_days | Cooling Degree Days Value | The number of cooling degree days recorded during the specified period.
cooling_deg_days_anomaly | Cooling Degree Days Anomaly | The deviation from the long-term average cooling degree days for the specified period.
heating_degree_days | Heating Degree Days Value | The number of heating degree days recorded during the specified period.
heating_deg_days_anomaly | Heating Degree Days Anomaly | The deviation from the long-term average heating degree days for the specified period.
pdsi | Palmer Drought Severity Index (PDSI) Value | The value of the Palmer Drought Severity Index, which indicates the severity of drought conditions.
pdsi_anomaly | Palmer Drought Severity Index (PDSI) Anomaly | The deviation from the long-term average Palmer Drought Severity Index for the specified period.
phdi | Palmer Hydrological Drought Index (PHDI) Value | The value of the Palmer Hydrological Drought Index, which indicates hydrological drought conditions.
phdi_anomaly | Palmer Hydrological Drought Index (PHDI) Anomaly | The deviation from the long-term average Palmer Hydrological Drought Index for the specified period.
pmdi | Palmer Modified Drought Index (PMDI) Value | The value of the Palmer Modified Drought Index, which indicates modified drought conditions.
pmdi_anomaly | Palmer Modified Drought Index (PMDI) Anomaly | The deviation from the long-term average Palmer Modified Drought Index for the specified period.
z_index | Palmer Z-Index Value | The value of the Palmer Z-Index, which indicates drought conditions based on soil moisture.
z_index_anomaly | Palmer Z-Index Anomaly | The deviation from the long-term average Palmer Z-Index for the specified period.
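The units in the table above suggest a couple of derived columns that may be useful for research questions. The sketch below computes them on made-up values. The names `harvested_share` and `production_lbs` are hypothetical, and the production formula is an assumption based on the column descriptions (acres are in thousands, yield is in pounds per harvested acre).

```python
import pandas as pd

# Made-up illustrative values; NOT taken from the real cotton data set.
df = pd.DataFrame({
    "state": ["Texas", "Georgia"],
    "planted_acres": [100.0, 50.0],      # thousands of acres
    "harvested_acres": [80.0, 45.0],     # thousands of acres
    "yield_lbs_per_acre": [600.0, 800.0],
})

# Share of the planted area that was actually harvested.
df["harvested_share"] = df["harvested_acres"] / df["planted_acres"]

# Assumed total production in pounds: yield per acre times harvested acres,
# where harvested_acres is converted from thousands of acres to acres.
df["production_lbs"] = df["yield_lbs_per_acre"] * df["harvested_acres"] * 1000

print(df[["state", "harvested_share", "production_lbs"]])
```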