
pandas is an open source Python library for data analysis and manipulation, built on top of NumPy for numerical operations; Matplotlib is commonly used alongside it for data visualization. This post is a practical guide to the library. For a more comprehensive treatment, see the official pandas documentation. For a handy quick reference, see the official pandas cheat sheet. There is also a Cookbook with more complex recipes, advanced strategies, and useful links.

0. Data structures

To make analytical tasks easier and more flexible, pandas introduced two new types of objects for storing data: Series, which have a list-like structure, and DataFrames, which have a tabular structure. You can think of a DataFrame as a collection of Series, one per column.
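
To make this concrete, here is a minimal sketch (the column names and values are made up):

import pandas as pd

# a Series: one-dimensional and list-like, with an index
s = pd.Series([10, 20, 30], name='price')

# a DataFrame: two-dimensional and tabular; each column is a Series
df = pd.DataFrame({'price': [10, 20, 30], 'qty': [1, 2, 3]})
df['price'] # selecting a single column returns a Series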

1. Read data

The first step in getting started with pandas is loading data into it.

1.1 Tabular data

Tabular data is data structured into rows, each of which contains information about a single record.

1.1.1 Manage delimited files

import pandas as pd

# basic case
df = pd.read_csv('path/to/file.csv')

# use the `sep` parameter to specify the delimiting character, e.g., ";"
df = pd.read_csv('path/to/file.csv', sep=';')

# specify headers and rows to skip
df = pd.read_csv('path/to/file.csv', header=3, skiprows=3)

# read date columns using parse_dates
df = pd.read_csv('path/to/file.csv', parse_dates=['date_1', 'date_2'])

For more parameters, check out the pandas.read_csv() official documentation.

1.1.2 Manage Excel files

To read Excel files in Python, use the pandas.read_excel() function; it shares many of its parameters with pandas.read_csv().
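
A minimal sketch (the file path and sheet name are assumptions):

# reading .xlsx files requires an engine such as openpyxl to be installed
df = pd.read_excel('path/to/file.xlsx')

# pick a sheet by name (or position) and parse date columns, as with read_csv
df = pd.read_excel('path/to/file.xlsx', sheet_name='Sheet1', parse_dates=['date_1'])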

1.1.3 Manage databases

Python can be used with databases; the first step is to create a connection (check this tutorial for an example).
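
For example, a connection can be created with SQLAlchemy (a sketch; the connection URL is an assumption):

from sqlalchemy import create_engine

# URL format: dialect[+driver]://user:password@host/dbname
engine = create_engine('postgresql:///db_name')

Either the engine or a plain connection string can then be passed as con: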

# extract data from a table
df = pd.read_sql_table(table_name='table_name', con='postgresql:///db_name')

# run a query
query = """
SELECT col_1,
       col_2
FROM table_name
WHERE col_3 > 4
"""
df = pd.read_sql_query(query, 'postgresql:///db_name')

1.2 JSON data

JSON (JavaScript Object Notation) is plain text in the format of an object; it is often used to exchange data on the web.

import pandas as pd

df = pd.read_json('data.json')

print(df.to_string()) # print the entire DataFrame

There are two more constructors for building DataFrames from Python objects (such as parsed JSON):

  • from_dict() builds a DataFrame from a dict of array-likes (or dicts), oriented by columns or by index, and allows dtype specification.
>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
  • from_records() method for a structured ndarray, sequence of tuples or dicts, or DataFrame.
>>> import numpy as np
>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')],
...                 dtype=[('col_1', 'i4'), ('col_2', 'U1')])
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

2. Basic data interrogation

After reading data into a pandas DataFrame, the next step is to assess it.

2.1 Dimensions

>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
...                    'col3': [5, 6]})
>>> df.shape # returns a tuple containing the number of rows and columns
(2, 3)
>>> len(df) # returns the number of rows
2
>>> df.dtypes # returns the data type of each column
col1    int64
col2    int64
col3    int64
dtype: object

2.2 Samples

df.head(n) # return the first n rows, 5 by default
df.tail(n) # return the last n rows, 5 by default
df.sample(n) # return a random sample of n rows, 1 by default

# find the top n rows by a column
df.sort_values('list_sort_by', ascending=False).head(n)
# inspect all columns of the first rows by transposing
df.head().transpose() # equivalent to the `.T` accessor

2.3 Statistics

How to calculate summary statistics?

The most basic way to summarize the data is the describe() method. You can also identify the data types and the number of non-missing values using the info() method.
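
For example:

df.describe() # count, mean, std, min, quartiles and max for numeric columns
df.info() # index dtype, column dtypes, non-null counts and memory usage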

3. Filter data

Filtering data is one of the most common ways to interact with pandas DataFrame.

  • Use iloc[] to choose rows and columns by position (iat[] is the single-value analogue)
  • Use loc[] to choose rows and columns by label (at[] is the single-value analogue)
  • Be explicit about both rows and columns, even if it’s with :.

3.1 Columns

There are two ways to select a column in pandas: df['column_name'] and df.column_name. The former is generally preferred: it works for any column name (including names with spaces or names that collide with DataFrame attributes and methods), and it is required when assigning to a column.

To extract multiple columns, a list of column names is needed.

df[['col_1', 'col_2', 'col_3']]
# double square brackets are needed

# to make it clearer
columns_to_extract = ['col_1', 'col_2', 'col_3']
df[columns_to_extract]

To select columns based on their position without knowing their name, we can use the iloc[] method.

df.iloc[<row numbers>, <column numbers>]
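
For example (the positions are illustrative):

df.iloc[0:3, [0, 2]] # first three rows; first and third columns
df.iloc[:, 0] # all rows of the first column, as a Series
df.iloc[-1, :] # the last row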

3.2 Rows

3.2.1 Select based on index

Every DataFrame has an index for its rows by default, which works like the names do for columns. Use df.index to look at the index of the DataFrame df. The set_index() method sets the DataFrame index using existing columns or arrays (of the correct length), and the sort_index() method sorts the object by its labels (along an axis).
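
A short sketch (the column name is an assumption):

df = df.set_index('col_1') # use an existing column as the row index
df = df.sort_index() # sort the rows by their index labels
df = df.sort_index(axis=1) # or sort the columns by their labels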

The label-based counterpart of iloc[] is loc[], which accesses a group of rows and columns by label(s) or a boolean array. Note that, contrary to usual Python slices, both the start and the stop are included.

df.loc[<row indices>, <column names>]

Sometimes we just want to select some rows for all columns, the <column names> can be omitted, i.e., df.loc[[1, 2, 3]] is the same as df.loc[[1, 2, 3], :].
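
To see the inclusive slicing in action:

>>> df = pd.DataFrame({'col_1': [10, 20, 30, 40]})
>>> df.loc[0:2] # both endpoints included: rows 0, 1 and 2
   col_1
0     10
1     20
2     30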

3.2.2 Select based on value

# return all rows where the value in the "filter_col" column is greater than n
df[df['filter_col'] > n]
# repeating df is needed

# to make it clearer (mask avoids shadowing the built-in filter)
mask = df['filter_col'] > n
# a boolean Series with the same length as the DataFrame
df[mask] # or df.loc[mask]; returns the rows with a True value

# select every 3rd row
every_3rd_row = [i % 3 == 0 for i in range(len(df))]
df[every_3rd_row]

3.2.3 Select based on multiple conditions

The following boolean operators will be helpful.

Operator   Meaning
&          AND condition
|          OR condition
~          negation

mask = ((df['col_1'] > a) & (df['col_2'] < b)) | (df['col_3'] == c)
df.loc[mask, ['col_4', 'col_5']]

Note: the boolean operation will fail without parentheses, because &, | and ~ bind more tightly than the comparison operators.

3.2.4 Select based on advanced booleans

  • df.isin() returns whether each element in the DataFrame is contained in the given values.
  • df.isna() returns a mask of bools indicating, for each element in the DataFrame, whether it is an NA value (isnull() is an alias; np.isnan() is the NumPy counterpart).
  • pandas.Series.between returns a boolean Series equivalent to left <= series <= right (see the sketch after this list).
  • Working with text data
  • Time series / date functionality
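
A short sketch of the first three (the column names are assumptions):

df[df['col_1'].isin([2, 4, 8])] # rows where col_1 equals one of the listed values
df[df['col_2'].isna()] # rows where col_2 is missing
df[df['col_3'].between(10, 20)] # rows where 10 <= col_3 <= 20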

4. Aggregation

Aggregation for efficient summarization is an essential piece of analysis of large data.

4.1 Basic

4.1.1 Most common methods

These methods are used mostly with numeric data; a short sketch follows the list.

  • count() — count non-NA cells for each column (default) or row (set axis=1)
  • nunique() — count the number of distinct elements
  • sum()/mean() — return the sum/mean of the values
  • min()/max()/median() — return the minimum/maximum/median of the values
  • quantile() — return values at the given quantile, defaults to the 50th percentile (q=0.5, the median)
  • var()/std() — return the (unbiased) variance / (sample) standard deviation of the values. Normalized by N-1 (the sample formulation) by default; set ddof=0 (the divisor is N-ddof) for the population formulation.
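
For example, on a single numeric column (the column name is an assumption):

df['col_1'].nunique() # number of distinct values
df['col_1'].quantile(0.75) # 75th percentile
df['col_1'].std() # sample standard deviation (ddof=1)
df['col_1'].std(ddof=0) # population standard deviation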

4.1.2 Use groupby()

When doing aggregation, more often than not, we would like to analyze data by some categories. groupby() allows us to specify a column (or multiple columns) to aggregate the values for better analysis.

The name GroupBy should be quite familiar to those who have used a SQL-based tool, in which you can write code like:

SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
GROUP BY Column1, Column2

Let’s take a simple example.

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
  • df.groupby('A') is just syntactic sugar for df.groupby(df['A']).
  • We can group by different levels of a hierarchical index using the level parameter.
  • We can also choose whether to include NA in the group keys by setting the dropna parameter (default True); see the sketch below.
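
For instance, keeping NA group keys (a minimal sketch; dropna requires pandas 1.1+):

df = pd.DataFrame({'A': ['x', 'x', None], 'B': [1, 2, 3]})
df.groupby('A', dropna=False).sum() # the None key gets its own group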

4.1.3 Use agg()

If we want to aggregate using one or more operations and have one or more columns to aggregate, then we can use the agg() (agg is an alias for aggregate) method.

>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])

# Aggregate these functions over the rows.
>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

# Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0

# Aggregate different functions over the columns and rename the index of the resulting DataFrame.
>>> df.agg(x=('A', 'max'), y=('B', 'min'), z=('C', np.mean))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0

4.2 Advanced

4.2.1 Combine groupby() and agg()

>>> df = pd.DataFrame({'A': [1, 1, 2, 2],
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})

>>> df
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

# Multiple aggregations
>>> df.groupby('A').agg(['min', 'max'])
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

# Select a column for aggregation
>>> df.groupby('A').B.agg(['min', 'max'])
   min  max
A
1    1    2
2    3    4

# Different aggregations per column
>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907

The columns have been put on two levels; this structure is called a MultiIndex. You can think of a MultiIndex as an array of tuples where each tuple is unique.

  • selecting ['B'] on the aggregated result returns a DataFrame with the two columns under B
  • selecting [('B', 'max')] returns the specific B/max column
  • to flatten the column index into one level, use to_flat_index() (see the sketch below)
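
A sketch of flattening, continuing the grouped result above:

res = df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
res.columns = ['_'.join(col) for col in res.columns.to_flat_index()] # B_min, B_max, C_sum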

4.2.2 Custom

In addition to using the default aggregations, we can also create our own aggregation functions and call them via agg(). The function should take a Series (or list) and return a single value. Let's look at the example of the Pythagorean means.

import numpy as np

# calculate the harmonic mean
def harmonic_mean(s):
    return len(s) / sum(1 / x for x in s)

# calculate the geometric mean
def geometric_mean(s):
    return np.exp(np.log(s).mean())

# built-in aggregations are passed as strings; custom functions are passed directly
df.groupby('A')['B'].agg(['mean', harmonic_mean, geometric_mean])

4.2.3 Transform

The transform() method could save you a lot of time when you want aggregated values without aggregating your Series/DataFrame.

>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...                           'foo', 'bar'],
...                    'B' : ['one', 'one', 'two', 'three',
...                           'two', 'two'],
...                    'C' : [1, 5, 5, 2, 5, 5],
...                    'D' : [2.0, 5., 8., 1., 2., 9.]})
>>> grouped = df.groupby('A')
>>> grouped.transform(lambda x: (x - x.mean()) / x.std())
          C         D
0 -1.154701 -0.577350
1  0.577350  0.000000
2  0.577350  1.154701
3 -1.154701 -1.000000
4  0.577350 -0.577350
5  0.577350  1.000000

In the example above, transform() applies the function group-wise and returns a result with the same shape as the original DataFrame: aggregated values such as the group mean and standard deviation are broadcast back to every row of their group.
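
A common use is attaching a group-level statistic to every row (a sketch using the df above; the new column name is made up):

df['C_group_mean'] = df.groupby('A')['C'].transform('mean') # every row gets its group's mean of C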

5. Pivot data

If you have an Excel or SQL background, you might be very comfortable using pivot tables. But how do you recreate that functionality in pandas? Let's look at several ways to do just that.

5.1 The manual way

To get a deeper understanding of pivoting data, we will start with the manual way, combining methods covered in previous sections (set_index() and groupby()) with a new method called unstack() to pivot your data any way you want.

>>> stacked
first  second
bar    one     A   -0.727965
               B   -0.589346
       two     A    0.339969
               B   -0.693205
baz    one     A   -0.339355
               B    0.593616
       two     A    0.884345
               B    1.591431
dtype: float64
>>> stacked.unstack()
                     A         B
first second
bar   one    -0.727965 -0.589346
      two     0.339969 -0.693205
baz   one    -0.339355  0.593616
      two     0.884345  1.591431
>>> stacked.unstack(1)
second        one       two
first
bar   A -0.727965  0.339969
      B -0.589346 -0.693205
baz   A -0.339355  0.884345
      B  0.593616  1.591431
>>> stacked.unstack(0)
first          bar       baz
second
one    A -0.727965 -0.339355
       B -0.589346  0.593616
two    A  0.339969  0.884345
       B -0.693205  1.591431

What unstack() does is take a specified index level and convert each value in that level into its own column. By default, it unstacks the last level (level=-1).
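
The stacked Series above can be produced like this (a sketch; the values are random):

import numpy as np

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
df = pd.DataFrame(np.random.randn(4, 2), index=index, columns=['A', 'B'])
stacked = df.stack() # moves the column labels into the innermost index level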

5.1.1 With aggregation

The typical scenario is one where the data you want to pivot involves aggregation.

df.groupby(['A', 'B'])['C'].agg('sum').unstack('A') # 'sum' stands in for any aggregation

Here we first groupby() the columns that will become indexes, then agg() the column that provides the values, and finally unstack() the index level (or levels) whose categories we want as columns.

5.1.2 Without aggregation

What if we already have aggregated data like this:

df_agg = df.groupby(['A', 'B'])['C'].agg('sum').reset_index()

Technically, we could do the same as above using groupby(). Because every row already represents a unique combination of A and B, aggregating again would not change the values. There's a better way using set_index(), though.

df_agg.set_index(['A', 'B'])['C'].unstack('A')

5.2 Use pivot()

The pivot() method is a one-shot way of doing what we did in the previous section with set_index() — pivoting an already aggregated DataFrame.

Note: a Series object has no pivot attribute, and a groupby object after aggregation can be a Series. To convert it to a DataFrame, use to_frame() directly, or use reset_index() to turn the index into columns.

df_agg.pivot(index='B', columns='A', values='C')

Here’s a more detailed example from the official documentation.

>>> df
    lev1 lev2 lev3 lev4 values
0   1    1    1    1    0
1   1    1    2    2    1
2   1    2    1    3    2
3   2    1    2    4    3
4   2    1    1    5    4
5   2    2    2    6    5
>>> df.pivot(index="lev1", columns=["lev2", "lev3"], values="values")
lev2    1         2
lev3    1    2    1    2
lev1
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0
>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values")
      lev3    1    2
lev1  lev2
   1     1  0.0  1.0
         2  2.0  NaN
   2     1  4.0  3.0
         2  NaN  5.0

Notes:

  • A ValueError is raised if there are any duplicates (index/columns combinations with multiple values), so you'd better aggregate your data before using pivot() (see the sketch after these notes).
  • The values parameter is optional. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
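
To see the duplicate issue concretely (a minimal sketch):

df = pd.DataFrame({'A': ['x', 'x'], 'B': ['y', 'y'], 'C': [1, 2]})
# df.pivot(index='A', columns='B', values='C') # raises ValueError: Index contains duplicate entries
pd.pivot_table(df, index='A', columns='B', values='C', aggfunc='sum') # aggregates to 3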

5.3 Use pandas.pivot_table()

pandas.pivot_table() creates a spreadsheet-style pivot table as a DataFrame. If pivot() is the equivalent of the "without aggregation" version (df_agg.set_index(['A', 'B'])['C'].unstack('A')), then pivot_table() is the equivalent of the "with aggregation" version (df.groupby(['A', 'B'])['C'].agg('sum').unstack('A')). That is, when the data still needs aggregating, pivot with pivot_table(), which avoids the ValueError that pivot() would raise.

pd.pivot_table(df, values='C', index=['B'], columns=['A'], aggfunc='sum') # 'sum' stands in for any aggregation

The difference between pivot() and pivot_table():

  • pivot() is a method of the DataFrame class, while pivot_table() also exists as a top-level pandas function, which means we can pass a DataFrame to pd.pivot_table().
  • pivot_table() is a generalization of pivot() that can handle duplicate values for one pivoted index/column pair.
  • pivot_table() accepts the three parameters we used with the pivot() method (index, columns, and values), and additionally provides other parameters such as aggfunc (mean by default).
  • pivot_table() also supports using multiple columns for the index and column of the pivoted table to generate a hierarchical index.

Function examples:

>>> df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                     columns=['C'], aggfunc=np.sum, fill_value=0)
>>> table
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                     aggfunc={'D': np.mean,
...                              'E': [min, max, np.mean]})
>>> table
                  D    E
               mean  max      mean  min
A   C
bar large  5.500000  9.0  7.500000  6.0
    small  5.500000  9.0  8.500000  8.0
foo large  2.000000  5.0  4.500000  4.0
    small  2.333333  6.0  4.333333  2.0

5.4 Use pandas.crosstab()

pandas.crosstab() is another pandas function, similar to pivot_table() but more general in one way: for crosstab() we don't have to pass a DataFrame. Instead, we pass actual columns of data as the index, columns, and values (optional; by default it computes a frequency table of the factors). To perform another aggregation, an array of values and an aggregation function are required.

Let’s reproduce pivoting data again:

pd.crosstab(index=df['B'], columns=df['A'], values=df['C'], aggfunc='sum') # 'sum' stands in for any aggregation

Function example:

>>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> b = np.array(["one", "one", "one", "two", "one", "one",
...               "one", "two", "two", "two", "one"], dtype=object)
>>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
...               "shiny", "dull", "shiny", "shiny", "shiny"],
...              dtype=object)
>>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b   one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2

Keep in mind: use crosstab() when you don't have a DataFrame. It is the direct route if you're only interested in the aggregated result; otherwise pivot_table() is nicer and cleaner.

6. Join tables

In data science, joining two or more tables based on shared columns or indexes is one of the most fundamental operations. It is JOIN in SQL and VLOOKUP or INDEX-MATCH in Excel. pandas provides various facilities with various kinds of set logic. We will look at how to combine pandas objects using several different methods.

6.1 Use merge()

The merge() method merges DataFrame or named Series objects (a named Series is treated as a DataFrame with a single named column) with a database-style join.

>>> df1
    lkey value
0   foo      1
1   bar      2
2   baz      3
3   foo      5
>>> df2
    rkey value
0   foo      5
1   bar      6
2   baz      7
3   foo      8
>>> df1.merge(df2, left_on='lkey', right_on='rkey',
...           suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  foo           5  foo            5
3  foo           5  foo            8
4  bar           2  bar            6
5  baz           3  baz            7

In the example above, we joined the tables on df1.lkey == df2.rkey. The result carries two columns with the same information, lkey and rkey; merge() cannot do much to avoid that when the join columns have different names. An alternative is to use drop() to remove the duplicated column.
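
For example:

df1.merge(df2, left_on='lkey', right_on='rkey').drop(columns='rkey') # keep only lkey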

6.2 Use join()

The join() method joins columns with another DataFrame, either on the index or on a key column. We can pass a list of DataFrame objects to join them all by index at once. Working much the same as merge(), join() is a more specific and concise version: it is designed to join on an index, which may be a column or multiple columns, and is especially handy for many-to-one joins (where one of the DataFrames is already indexed by the join key).

left.join(right, on=key_or_keys) is equivalent to left.merge(right, left_on=key_or_keys, right_index=True, how='left', sort=False)

The default for join() is to perform a left join (essentially a "VLOOKUP" operation, for Excel users). To perform other join types, for example an inner join:

left.join(right, on=['key1', 'key2'], how='inner')

More detailed example:

>>> df
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
>>> other
  key   B
0  K0  B0
1  K1  B1
2  K2  B2
# Join using indexes
>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN
# Join using the key columns, by setting both key columns to be the index
>>> df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN
# Join using the key columns, by the `on` parameter. The original `df` index preserved.
>>> df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN

join() always uses other's index, but we can join on any column of df. Unlike merge(), which takes a single suffixes parameter, join() handles overlapping column names through separate lsuffix and rsuffix parameters.

6.3 Use pandas.concat()

pandas.concat() concatenates pandas objects along a particular axis, with optional set logic (union or intersection) along the other axes. Primarily, we use concat() to stack two DataFrames vertically; to combine DataFrames horizontally (along the columns axis), pass axis=1.

>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0    a
1    b
0    c
1    d
dtype: object
# Add a hierarchical index with the `key` option and label the index keys
>>> pd.concat([s1, s2], keys=['s1', 's2'],
...           names=['Series name', 'Row ID'])
Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object

Use the keys and names parameters if you want to keep track of which DataFrame each row originally came from. Append .reset_index() to the pd.concat() call to convert the hierarchical index back into regular columns.

6.4 Use append()

The append() method appends the rows of other to the end of the caller. Just as join() is a more specific version of merge(), append() is a shortcut to concat() for streamlined appending. It has an ignore_index parameter but no join parameter: append() always performs an outer join. (Note that append() was deprecated in pandas 1.4 and removed in 2.0; use concat() instead.)

When we need to repeatedly add one row at a time to generate DataFrames, there are two ways (while not recommended):

  • Less efficient:
>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
...     df = df.append({'A': i}, ignore_index=True)
>>> df
   A
0  0
1  1
2  2
3  3
4  4
  • More efficient:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4

Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy and may be expensive. It is therefore recommended to pass a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
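
For example, collect the records first and construct the DataFrame once:

rows = [{'A': i} for i in range(5)]
df = pd.DataFrame(rows) # one construction instead of five appends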

6.5 Differences between JOIN methods

  • merge() combines DataFrames on common columns or indexes; join() joins tables on their indexes; concat() concatenates pandas objects aligned by index, depending on the axis option.
  • By default, merge() performs an inner join, join() performs a left join, and concat() performs an outer join.
  • merge() exists both as a top-level pandas function and as a DataFrame method; join() is a DataFrame method; concat() is a top-level pandas function.
  • merge() and join() handle duplicates on the joining index (or columns) by performing a cartesian product. concat() with axis=1, by contrast, appends horizontally: if all DataFrames have the same indexes and the same number of rows, it matches row for row, even if there are duplicate values in the index; if either condition is not met, it throws an error.

7. To be Added (Under construction)

  • DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note it does not include the index or column labels in the output. NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. So to_numpy() can be an expensive operation when your DataFrame has columns with different data types.
  • melt()
  • shift()

Operations in general exclude missing data.

Time Series: to_datetime()

Text: .str, re, astype(), rename()

Missing data: df.isnull().values.any(), dropna(axis=1, how='any'), df.fillna(value)

Plot: value_counts()

Rewrite SQL

We will use the OurAirports Datasets. Examples are from Irina Truong.

SQL:    SELECT DISTINCT type FROM airports
Pandas: airports.type.unique()

SQL:    SELECT id FROM airports WHERE ident = 'KLAX' LIMIT 3
Pandas: airports[airports.ident == 'KLAX'].id.head(3)

SQL:    SELECT ident, name, municipality FROM airports WHERE iso_region = 'US-CA' AND type = 'large_airport'
Pandas: airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]

SQL:    SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type DESC
Pandas: airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)

SQL:    SELECT * FROM airports WHERE type NOT IN ('heliport', 'balloonport')
Pandas: airports[~airports.type.isin(['heliport', 'balloonport'])]

SQL:    SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, COUNT(*) DESC
Pandas: airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])

SQL:    SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US' GROUP BY type HAVING COUNT(*) > 1000 ORDER BY COUNT(*) DESC
Pandas: airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)

SQL:    SELECT iso_country FROM by_country ORDER BY size DESC LIMIT 10 OFFSET 10
Pandas: by_country.nlargest(20, columns='airport_count').tail(10)

SQL:    SELECT MAX(length_ft), MIN(length_ft), AVG(length_ft) FROM runways
Pandas: runways.agg({'length_ft': ['min', 'max', 'mean']}).T

SQL:    SELECT surface, MIN(length_ft) AS min_length_ft, MAX(width_ft) AS max_width_ft FROM runways GROUP BY surface
Pandas: runways.groupby('surface').agg(min_length_ft=('length_ft', 'min'), max_width_ft=('width_ft', 'max'))

SQL:    SELECT airport_ident, type, description, frequency_mhz FROM airport_freq JOIN airports ON airport_freq.airport_ref = airports.id WHERE airports.ident = 'KLAX'
Pandas: airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref', right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]

SQL:    SELECT name, municipality FROM airports WHERE ident = 'KLAX' UNION SELECT name, municipality FROM airports WHERE ident = 'KLGB'
Pandas: pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']], airports[airports.ident == 'KLGB'][['name', 'municipality']]]).drop_duplicates()

SQL:    CREATE TABLE heroes (id INTEGER, name TEXT);
        INSERT INTO heroes VALUES (1, 'Harry Potter'), (2, 'Ron Weasley'), (3, 'Hermione Granger');
Pandas: df1 = pd.DataFrame({'id': [1], 'name': ['Harry Potter']})
        df2 = pd.DataFrame({'id': [2, 3], 'name': ['Ron Weasley', 'Hermione Granger']})
        pd.concat([df1, df2]).reset_index(drop=True)

SQL:    UPDATE airports SET home_link = 'fixed_url' WHERE ident = 'KLAX'
Pandas: airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'fixed_url'

SQL:    DELETE FROM lax_freq WHERE type = 'MISC'
Pandas: lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index)
