Introduction to Exploratory Data Analysis with Python

This four-session, hands-on introductory program teaches essential data analysis tools and techniques in Python through practical activities. The program covers Google Colab, Python, NumPy, Pandas data frames, exploratory data analysis, and visualization. The target audience is beginners looking to develop core data analysis skills in Python.

S1- Intro to Google Colab, Python, and NumPy 8:00 to 9:30 AM EST, January 15th

Resources:

Session Materials:

S2- Introduction to the Pandas Dataframe 8:00 to 9:30 AM EST, January 22nd

S3- Exploratory Data Analysis using Pandas 8:00 to 9:30 AM EST, January 29th

S4- Visual Exploratory Data Analysis 8:00 to 9:30 AM EST, February 5th

Mini Project (Required for the Certificate)

Due Date: Feb 12, 2024 11:59 PM EST. 

Introduction:

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics. It helps us see what the data can tell us before any formal modeling or hypothesis testing. EDA uses several approaches, such as:

  • Descriptive analysis describes or summarizes a data set using statistical measures such as the mean, variance, and median, and using visualization, including:
    • Exploratory visualization, which helps us gain a better understanding of the data and prepare for deeper predictive analysis (scatter plots, histograms, box plots, etc.).
    • Explanatory visualization, which tells a story about our findings or the data itself (interactive dashboards and infographics).
  • Cluster analysis aims to uncover hidden patterns in a data set (k-means, hierarchical, or density-based clustering).
  • Affinity analysis tries to discover relationships among the variables (rule mining, correlation analysis).

Each of these approaches has specific applications depending on the data at hand and the goals of the analysis. Therefore, the first step is to define the objectives of the study and ask research questions.
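As a minimal sketch of two of these approaches, the cell below computes descriptive statistics and a correlation with Pandas. The toy data frame and its values are made up for illustration; only the column names are borrowed from the data dictionary further down this page.

```python
import pandas as pd

# Hypothetical toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "C", "C"],
    "yield_lbs_per_acre": [600, 640, 820, 800, 500, 540],
    "precipitation": [30.0, 34.0, 45.0, 44.0, 22.0, 26.0],
})

# Descriptive analysis: mean, median, and variance of one variable.
summary = toy["yield_lbs_per_acre"].agg(["mean", "median", "var"])
print(summary)

# Correlation analysis: Pearson correlation between two numeric columns.
corr = toy["yield_lbs_per_acre"].corr(toy["precipitation"])
print(f"Correlation between yield and precipitation: {corr:.2f}")
```

A strong correlation in a sample like this only suggests a relationship worth exploring; it does not show that one variable causes the other.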

Forming Research Questions for EDA:

Research questions for EDA differ from those of predictive and prescriptive data analysis and modeling. Predictive analysis asks research questions about unknown events. For example, “Can we predict the cotton yield as a function of drought indicators?” EDA research questions are broad and exploratory. The goal is to discover patterns, anomalies, relationships, or trends in the data. For example, “Are there any differences between the yields of the states?” or “Are there any unusual weather patterns in the data set that we should examine more carefully?”
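One way to turn an exploratory question about yield differences between states into code is a group-by summary. The sketch below assumes a data frame with `state` and `yield_lbs_per_acre` columns, as listed in the data dictionary, and uses made-up values in place of the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the cotton data set.
toy = pd.DataFrame({
    "year": [2000, 2001, 2000, 2001],
    "state": ["A", "A", "B", "B"],
    "yield_lbs_per_acre": [600, 640, 820, 800],
})

# Summarize yield per state and sort so the highest average comes first.
yield_by_state = (
    toy.groupby("state")["yield_lbs_per_acre"]
    .agg(["mean", "min", "max"])
    .sort_values("mean", ascending=False)
)
print(yield_by_state)
```

The table does not answer the question by itself; it gives a first view that can prompt follow-up questions, such as whether the gap between states is stable over the years.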

Required Assignment for Introduction to Exploratory Data Analysis with Python Certificate:

  • Formulate three to five research questions based on the data set given below about US cotton yields.
  • Explain why you ask each research question.
  • Use techniques we covered in the workshop, such as Pandas’ descriptive statistics tools, group by, pivot tables, and the Seaborn visualization package.
  • Create a Google Colab Notebook in the following format for each research question:
    • Text Cell: Provide your research question (bolded) and briefly explain your justification for the research question and methods selected to answer the research question.
    • Python Cells: Create a Python cell for each method you use to answer the research question. You may use multiple approaches and methods, for example descriptive statistics or charts. Put the code for each method in its own cell.
    • Text Cell:  Briefly interpret your findings and observations.
  • Share your Colab Notebook publicly by selecting “Anyone with the link” with View permission. Please run all the cells so that the output is visible to others, and then enter the URL of your notebook and the other requested information in the following form by Feb 12, 2024 11:59 PM EST.
    • https://forms.gle/wPkarAEeCDfTjUxr9
  • I prepared a starter notebook with an example for you. You can copy this file to your Colab space to start.
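For orientation, a Python cell for one method might look like the sketch below, which uses a pivot table for a hypothetical research question ("How has the yield changed over time in each state?"). This is not the example from the starter notebook, and the values are invented; only the column names come from the data dictionary.

```python
import pandas as pd

# Hypothetical toy data; replace with the cotton data frame in your notebook.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "state": ["A", "B", "A", "B"],
    "yield_lbs_per_acre": [600, 820, 640, 800],
})

# One row per year, one column per state, mean yield in each cell.
yield_table = toy.pivot_table(
    index="year",
    columns="state",
    values="yield_lbs_per_acre",
    aggfunc="mean",
)
print(yield_table)

# Year-over-year change in yield for each state (first row is NaN).
change = yield_table.diff()
print(change)
```

In the notebook, a text cell before this code would state the research question in bold and justify the method, and a text cell after it would interpret the table.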

Data Set:

The USA Cotton Data was downloaded from Kaggle (Effect of climate change on commodity yields) and modified to have easier column names (https://www.kaggle.com/datasets/abhisaha97/effect-of-climate-change-on-commodity-yields).

You can download the modified data file from:

https://sites.psu.edu/auk3/files/2024/01/cotton-a69096c342c63445.csv

Data Dictionary:
Column Name | Description
year | Year – The year of the data observation.
state | State – The name of the state where the data was recorded.
planted_acres | Planted (1000 Acres) – The area in thousands of acres that was planted with crops.
harvested_acres | Harvested (1000 Acres) – The area in thousands of acres that was harvested for crops.
yield_lbs_per_acre | Yield (Pounds/Harvested Area) – The average yield of crops in pounds per harvested area.
avg_temp | Average Temperature Value – The average temperature recorded during the specified period.
avg_temp_anomaly | Average Temperature Anomaly – The deviation from the long-term average temperature for the specified period.
max_temp | Maximum Temperature Value – The highest temperature recorded during the specified period.
max_temp_anomaly | Maximum Temperature Anomaly – The deviation from the long-term average maximum temperature for the specified period.
min_temp | Minimum Temperature Value – The lowest temperature recorded during the specified period.
min_temp_anomaly | Minimum Temperature Anomaly – The deviation from the long-term average minimum temperature for the specified period.
precipitation | Precipitation Value – The total amount of precipitation (rainfall) recorded during the specified period.
precipitation_anomaly | Precipitation Anomaly – The deviation from the long-term average precipitation for the specified period.
cooling_degree_days | Cooling Degree Days Value – The number of cooling degree days recorded during the specified period.
cooling_deg_days_anomaly | Cooling Degree Days Anomaly – The deviation from the long-term average cooling degree days for the specified period.
heating_degree_days | Heating Degree Days Value – The number of heating degree days recorded during the specified period.
heating_deg_days_anomaly | Heating Degree Days Anomaly – The deviation from the long-term average heating degree days for the specified period.
pdsi | Palmer Drought Severity Index (PDSI) Value – The value of the Palmer Drought Severity Index, which indicates the severity of drought conditions.
pdsi_anomaly | Palmer Drought Severity Index (PDSI) Anomaly – The deviation from the long-term average Palmer Drought Severity Index for the specified period.
phdi | Palmer Hydrological Drought Index (PHDI) Value – The value of the Palmer Hydrological Drought Index, which indicates hydrological drought conditions.
phdi_anomaly | Palmer Hydrological Drought Index (PHDI) Anomaly – The deviation from the long-term average Palmer Hydrological Drought Index for the specified period.
pmdi | Palmer Modified Drought Index (PMDI) Value – The value of the Palmer Modified Drought Index, which indicates modified drought conditions.
pmdi_anomaly | Palmer Modified Drought Index (PMDI) Anomaly – The deviation from the long-term average Palmer Modified Drought Index for the specified period.
z_index | Palmer Z-Index Value – The value of the Palmer Z-Index, which indicates drought conditions based on soil moisture.
z_index_anomaly | Palmer Z-Index Anomaly – The deviation from the long-term average Palmer Z-Index for the specified period.
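A possible first cell for the mini project is sketched below. It assumes the CSV at the download link above uses the column names from this data dictionary; the helper names `load_cotton` and `first_look` are hypothetical and not part of the workshop materials.

```python
import pandas as pd

# Download link given on this page for the modified data file.
COTTON_URL = "https://sites.psu.edu/auk3/files/2024/01/cotton-a69096c342c63445.csv"

# A few columns from the data dictionary that we assume are present.
EXPECTED_COLUMNS = [
    "year", "state", "planted_acres", "harvested_acres", "yield_lbs_per_acre",
]


def load_cotton(source=COTTON_URL):
    """Read the cotton CSV from a URL, a file path, or a file-like object."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


def first_look(df):
    """Return the shape, missing values per column, and the year range."""
    return {
        "shape": df.shape,
        "missing_values": df.isna().sum().to_dict(),
        "year_range": (int(df["year"].min()), int(df["year"].max())),
    }
```

In a Colab cell, `cotton = load_cotton()` would download the file, and `first_look(cotton)`, `cotton.head()`, and `cotton.describe()` give a quick overview before you write your research questions.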