Python Data Science: Getting Started with Pandas

Pandas is one of the most widely used Python libraries for manipulating and analyzing tabular data. This post walks through the basics: creating a DataFrame, loading files, exploring and transforming data, and plotting the results.

What is Pandas?

Pandas provides:

DataFrames: Labeled, two-dimensional tables whose columns can each hold a different data type
Data cleaning: Tools for handling missing data and duplicates
Data analysis: Statistical operations and aggregations
File I/O: Read from CSV, Excel, JSON, and databases

Installation and Setup

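Install Pandas with pip, along with NumPy and Matplotlib, which the later examples use. openpyxl is included because read_excel needs it to open .xlsx files:

pip install pandas numpy matplotlib openpyxl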

Your First DataFrame

import pandas as pd
import numpy as np

# Create a DataFrame from a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
}

df = pd.DataFrame(data)
print(df)
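
To check what the constructor produced, here is a minimal sketch that rebuilds the same sample data and inspects its shape, column labels, and a single cell. The variable names (shape, columns, ages, second_name) are only illustrative.

```python
import pandas as pd

# Same sample data as in the example above
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
}
df = pd.DataFrame(data)

# (number of rows, number of columns)
shape = df.shape

# Column labels, in the order the dictionary keys were given
columns = list(df.columns)

# Selecting one column returns a Series
ages = df['age']

# Label-based lookup: row label 1, column 'name'
second_name = df.loc[1, 'name']

print(shape)
print(columns)
print(second_name)
```

Because no index was passed, Pandas assigns the default integer row labels 0 to 3, which is why df.loc[1, 'name'] refers to the second row.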

Reading Data from Files

# Read from CSV
df = pd.read_csv('data.csv')

# Read from Excel (.xlsx files require the openpyxl package)
df = pd.read_excel('data.xlsx')

# Read from JSON
df = pd.read_json('data.json')
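
The calls above expect files that already exist on disk. To try read_csv without one, the sketch below feeds it an in-memory string through io.StringIO. The CSV content is made up for illustration, and its last row deliberately leaves the salary empty to show how a missing field is loaded.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,age,salary
Alice,25,50000
Bob,30,60000
Charlie,35,
"""

# read_csv accepts a file-like object as well as a path
df_csv = pd.read_csv(io.StringIO(csv_text))

# The empty salary field is read as NaN, so the column becomes float
missing_salaries = int(df_csv['salary'].isna().sum())

# Write back to CSV text (index=False omits the row labels) and reload it
round_trip = pd.read_csv(io.StringIO(df_csv.to_csv(index=False)))

print(df_csv)
print(df_csv.dtypes)
```

Passing index=False to to_csv keeps the row labels out of the output, so reloading the text does not produce an extra unnamed column.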

Data Exploration

# Basic info about the dataset
# df.info() prints its summary itself and returns None, so it needs no print()
df.info()
print(df.describe())

# View first and last rows
print(df.head())
print(df.tail())

# Check for missing values
print(df.isnull().sum())
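
Once missing values have been counted, the usual next step is to decide what to do with them. The sketch below shows one possible approach on a small made-up table: remove exact duplicate rows, then either drop rows with a missing salary or fill the gap with the median. Whether dropping or filling is appropriate depends on your data, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicated row and one missing salary
raw = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'salary': [50000.0, 60000.0, 60000.0, np.nan],
})

# Remove exact duplicate rows, keeping the first occurrence
deduped = raw.drop_duplicates()

# Option 1: drop rows where 'salary' is missing
dropped = deduped.dropna(subset=['salary'])

# Option 2: fill missing salaries with the median of the known values
median_salary = deduped['salary'].median()
filled = deduped.fillna({'salary': median_salary})

print(deduped)
print(dropped)
print(filled)
```

None of these calls modify raw in place; each returns a new DataFrame, which makes it easy to compare the options side by side.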

Data Manipulation

# Filter data
high_earners = df[df['salary'] > 55000]

# Group by and aggregate
avg_salary_by_age = df.groupby('age')['salary'].mean()

# Sort data
df_sorted = df.sort_values('salary', ascending=False)

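# Add a new column (this assumes 'salary' holds a monthly figure;
# if it is already annual, multiplying by 12 is not what you want)
df['salary_per_year'] = df['salary'] * 12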
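
Grouping is most useful when several rows share a key. In the sample data every age is unique, so the sketch below adds a hypothetical 'department' column (not part of the original data) to show a combined filter and a multi-statistic aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'department': ['Eng', 'Eng', 'Sales', 'Sales'],  # hypothetical column
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
young_high_earners = df[(df['salary'] > 52000) & (df['age'] < 30)]

# Compute several statistics per group in one call
summary = df.groupby('department')['salary'].agg(['mean', 'max', 'count'])

print(young_high_earners)
print(summary)
```

The parentheses around each condition matter because & binds more tightly than comparison operators in Python; without them the expression raises an error or gives an unintended result.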

Data Visualization

import matplotlib.pyplot as plt

# Simple plotting
df['salary'].hist(bins=10)
plt.title('Salary Distribution')
plt.show()

# Box plot
df.boxplot(column='salary')
plt.show()
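
plt.show() opens a window, which does not work in scripts that run on a server or in a scheduled job. One alternative, sketched below, is to draw on an explicit Axes object and save the figure instead. The choice of the 'Agg' backend and an in-memory buffer are assumptions made so the example runs anywhere; in practice you would more likely pass a filename to savefig.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend: renders without opening a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'salary': [50000, 60000, 70000, 55000]})

# Create the figure and axes explicitly so labels can be set on the axes
fig, ax = plt.subplots()
df['salary'].plot(kind='hist', bins=10, ax=ax)
ax.set_title('Salary Distribution')
ax.set_xlabel('Salary')
ax.set_ylabel('Count')

# Save as PNG into an in-memory buffer (a file path works the same way)
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
plt.close(fig)  # release the figure's memory

png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```

Labeling the axes is worth the two extra lines: a histogram without units is easy to misread once it leaves your notebook.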

Conclusion

Pandas is an essential tool for anyone working with data in Python. Master these basics and you'll be well on your way to becoming proficient in data analysis and data science.

Next, explore libraries like NumPy for numerical computing and Matplotlib/Seaborn for advanced visualization!