Python Data Science: Getting Started with Pandas
Pandas is one of the most widely used Python libraries for data manipulation and analysis. Let's walk through the basics: creating DataFrames, reading files, and exploring, manipulating, and plotting data.
What is Pandas?
Pandas provides:
- DataFrames: Labeled, two-dimensional tables for structured data
- Data cleaning: Tools for handling missing data and duplicates
- Data analysis: Statistical operations and aggregations
- File I/O: Read from CSV, Excel, JSON, and databases
Installation and Setup
Install Pandas with pip, along with NumPy and Matplotlib, which the examples below also use:
pip install pandas numpy matplotlib
Your First DataFrame
import pandas as pd
import numpy as np
# Create a DataFrame from a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
}
df = pd.DataFrame(data)
print(df)
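Once the DataFrame exists, a few attributes help confirm it has the shape you expect. The sketch below rebuilds the same sample data and checks its dimensions, column labels, one column, and one row. The variable names (`n_rows`, `columns`, `ages`, `first_row`) are just illustrative.

```python
import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
}
df = pd.DataFrame(data)

n_rows, n_cols = df.shape      # (number of rows, number of columns)
columns = list(df.columns)     # column labels, in insertion order
ages = df['age']               # selecting one column returns a Series
first_row = df.iloc[0]         # position-based row access

print(n_rows, n_cols)
print(columns)
print(ages.mean())
print(first_row['name'])
```

Selecting with `df['age']` uses the column label, while `df.iloc[0]` uses the row position; keeping that distinction in mind avoids a lot of early confusion.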
Reading Data from Files
# Read from CSV
df = pd.read_csv('data.csv')
# Read from Excel (.xlsx files need an engine such as openpyxl: pip install openpyxl)
df = pd.read_excel('data.xlsx')
# Read from JSON
df = pd.read_json('data.json')
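If you don't have a data file handy, you can still try the readers by passing a file-like object instead of a path. The following is a minimal sketch that assumes a small made-up CSV string in place of a real `data.csv`, reads it, writes it back out, and reads it again.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """name,age,salary
Alice,25,50000
Bob,30,60000
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Write back to CSV; index=False leaves out the row index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the written text again should give the same table
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```

Passing `index=False` when writing is a common choice, since otherwise the row index is saved as an extra unnamed column and shows up when the file is read back.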
Data Exploration
# Basic info about the dataset
df.info()  # prints its summary directly and returns None, so no print() needed
print(df.describe())
# View first and last rows
print(df.head())
print(df.tail())
# Check for missing values
print(df.isnull().sum())
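Counting missing values is usually followed by deciding what to do with them. Here is a rough sketch of the cleaning tools mentioned earlier, using a small made-up DataFrame with one duplicate row and one missing salary. Whether to drop or fill depends on your data; filling with the median is just one possible choice.

```python
import numpy as np
import pandas as pd

# Made-up sample: 'Bob' appears twice and Diana's salary is missing
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Diana'],
    'salary': [50000.0, 60000.0, 60000.0, np.nan],
})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Remove exact duplicate rows (keeps the first occurrence)
deduped = df.drop_duplicates()

# Option 1: drop rows where 'salary' is missing
dropped = deduped.dropna(subset=['salary'])

# Option 2: fill missing salaries with the column median
filled = deduped.fillna({'salary': deduped['salary'].median()})

print(missing_per_column)
print(dropped)
print(filled)
```

None of these calls modify `df` in place; each returns a new DataFrame, which is why the results are assigned to new names.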
Data Manipulation
# Filter data
high_earners = df[df['salary'] > 55000]
# Group by and aggregate
avg_salary_by_age = df.groupby('age')['salary'].mean()
# Sort data
df_sorted = df.sort_values('salary', ascending=False)
# Add new columns (assuming 'salary' is an annual figure)
df['monthly_salary'] = df['salary'] / 12
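Filters and group-bys become more useful once you combine conditions or compute several statistics at once. The sketch below uses the sample data plus a hypothetical `department` column, added here only so there is something meaningful to group by.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
    # Hypothetical column, not part of the original sample data
    'department': ['Eng', 'Eng', 'Sales', 'Sales'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
young_high_earners = df[(df['salary'] > 55000) & (df['age'] <= 30)]

# Compute several aggregations per group in one call
summary = df.groupby('department')['salary'].agg(['mean', 'max'])

print(young_high_earners)
print(summary)
```

Python's `and` and `or` do not work element-wise on a Series, which is why the bitwise operators `&` and `|` are used here.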
Data Visualization
import matplotlib.pyplot as plt
# Simple plotting
df['salary'].hist(bins=10)
plt.title('Salary Distribution')
plt.show()
# Box plot
df.boxplot(column='salary')
plt.show()
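In scripts or on servers without a display, you may prefer to save a chart to a file rather than call `plt.show()`. This is a minimal sketch under that assumption: it selects Matplotlib's non-interactive `Agg` backend and writes a bar chart to a temporary directory. The output file name is arbitrary.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'salary': [50000, 60000, 70000, 55000],
})

# Create the figure and axes explicitly so they can be saved and closed
fig, ax = plt.subplots()
df.plot.bar(x='name', y='salary', ax=ax, legend=False)
ax.set_ylabel('Salary')
ax.set_title('Salary by Person')
fig.tight_layout()

# Arbitrary output location chosen for illustration
out_path = os.path.join(tempfile.mkdtemp(), 'salary_by_person.png')
fig.savefig(out_path)
plt.close(fig)  # free the figure's memory

print(out_path)
```

Working with an explicit `fig` and `ax` makes it clear which figure each call affects, which helps once a script produces more than one plot.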
Conclusion
Pandas is an essential tool for anyone working with data in Python. Master these basics and you'll be well on your way to becoming proficient in data analysis and data science.
Next, explore libraries like NumPy for numerical computing and Matplotlib/Seaborn for advanced visualization!