Last updated August 10, 2022
The Pandas Data Analysis Library brings SQL-like sorting and querying to semi-structured data in Python. The examples below were shamelessly lifted from the book "Python for Data Analysis."
Installing Python Pandas:
From the command line, install the Python package manager pip if you haven't done so yet:
    sudo apt-get install python-pip
Pandas requires numpy, so install both from pip:
    sudo pip install numpy
    sudo pip install pandas
At the start of your Python program, import the necessary libraries:
    from pandas import Series
    from pandas import DataFrame
    import pandas as pd
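A quick way to confirm the installation worked is to import both libraries and print their versions. This is only a minimal sketch; the version numbers printed depend on what pip installed on your machine.

```python
import numpy as np
import pandas as pd

# If both imports succeed, the packages are installed and visible to Python
print("numpy", np.__version__)
print("pandas", pd.__version__)
```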
Working with Arrays: Series
(You can run the code below from this file)
To know pandas, you need to know series and data frames. Let's start with a series. A series is a one-dimensional array of data paired with an index of labels. Pandas will let you create a series:
    obj = Series([13, 23, 2, 15])
If no index is given, a default integer index (0, 1, 2, ...) is created automatically. You can also create a series and define the index yourself:
    obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
    obj2['d'] = 6
Use the index to assign a certain value:
    obj2['a'] = 14
You can create a series from a Python dict:
    Dict2SeriesData = {'Monday': 2200, 'Tuesday': 3528, 'Wednesday': 123299, 'Thursday': 3234}
    Dict2Series = Series(Dict2SeriesData)
Order a series by passing the index labels in the order you want (note: pandas assigns NaN to any label it does not find in the data):
    Days = ['Wednesday', 'Friday', 'Monday', 'Tuesday']
    SortedDays = Series(Dict2SeriesData, index=Days)
You can combine two series into a single one by adding them; values are matched by index label:
    Dict3SeriesData = {'Monday': 1400, 'Tuesday': 10000, 'Wednesday': 5, 'Sunday': 2365}
    Dict3Series = Series(Dict3SeriesData)
    Dailies = Dict3Series + Dict2Series
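When two series are added, pandas lines the values up by index label, so a label that appears in only one of them ends up as NaN in the result. The sketch below repeats the two series from above and shows one way around that, using `Series.add` with `fill_value=0`. The variable name `Filled` is just an illustration, not part of the original example.

```python
import pandas as pd
from pandas import Series

Dict2Series = Series({'Monday': 2200, 'Tuesday': 3528, 'Wednesday': 123299, 'Thursday': 3234})
Dict3Series = Series({'Monday': 1400, 'Tuesday': 10000, 'Wednesday': 5, 'Sunday': 2365})

# Plain addition: 'Thursday' and 'Sunday' each exist in only one series,
# so their sums come out as NaN
Dailies = Dict3Series + Dict2Series
print(Dailies)

# Treat a missing label as 0 so every day keeps a numeric value
Filled = Dict3Series.add(Dict2Series, fill_value=0)
print(Filled)
```

Whether a missing day should count as zero or stay NaN depends on what the numbers mean, so treat `fill_value=0` as a choice rather than a default.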
Working with Arrays: Data Frames
A data frame is a two-dimensional labeled data structure, with columns of potentially different data types, that resembles a spreadsheet. It has an index for both the rows and the columns (operational code samples for this section are available here).
    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
            'year': [2001, 2001, 2002, 2001, 2002],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
    frame = DataFrame(data)
    print(frame)
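Once a data frame exists, the row and column indexes are what you use to pull data back out. The following is a small sketch built on the same `frame` as above; the variable names (`states`, `year_2002`, `first_row`, `by_pop`) are made up for illustration.

```python
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2001, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

# Select one column by name; the result is a Series
states = frame['state']

# Keep only the rows where a condition holds (similar to a SQL WHERE clause)
year_2002 = frame[frame['year'] == 2002]

# Select one row by its index label; the result is a Series keyed by column name
first_row = frame.loc[0]

# Sort the rows by a column (similar to a SQL ORDER BY ... DESC)
by_pop = frame.sort_values('pop', ascending=False)

print(year_2002)
print(by_pop)
```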
Reading in Data:
The next example requires users.dat, ratings.dat, and movies.dat. Run the code here.
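    # Run these commands in IPython, or as a stand-alone Python program
    import pandas as pd

    # '::' is a multi-character separator, which requires the Python parsing engine
    unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
    users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')
    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')
    mnames = ['movie_id', 'title', 'genres']
    movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')

    # Look at the first five rows of each table, then the full ratings table
    print(users[:5])
    print(movies[:5])
    print(ratings)

    # Merge the three tables on their shared columns (user_id, then movie_id)
    data = pd.merge(pd.merge(ratings, users), movies)
    # Count the ratings each title received, then keep titles with at least 250
    ratings_by_title = data.groupby('title').size()
    active_titles = ratings_by_title.index[ratings_by_title >= 250]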
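If you do not have the .dat files on hand, the same read, merge, and count steps can be tried on a few rows held in memory. The sketch below assumes made-up sample rows in the same '::' format (they are not taken from the real files) and uses a threshold of 3 instead of 250 so that the tiny data set produces a result.

```python
import io
import pandas as pd

# Invented sample rows in the same '::'-separated layout as the .dat files
ratings_text = (
    "1::10::5::978300760\n"
    "1::20::3::978302109\n"
    "2::10::4::978301968\n"
    "3::10::2::978300275\n"
    "3::20::4::978824291\n"
)
movies_text = (
    "10::Toy Story (1995)::Animation\n"
    "20::Jumanji (1995)::Adventure\n"
    "30::Heat (1995)::Action\n"
)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(io.StringIO(ratings_text), sep='::', header=None,
                      names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_csv(io.StringIO(movies_text), sep='::', header=None,
                     names=mnames, engine='python')

# merge() joins on the column both tables share (movie_id); a movie
# with no ratings drops out of the default inner join
data = pd.merge(ratings, movies)

# Number of ratings per title, then the titles that reach the threshold
ratings_by_title = data.groupby('title').size()
active_titles = ratings_by_title.index[ratings_by_title >= 3]

# Average rating per title, as a follow-on query
mean_ratings = data.groupby('title')['rating'].mean()

print(ratings_by_title)
print(list(active_titles))
print(mean_ratings)
```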