<p>uniQiao: Math Towards Data Science, by Qiao Huang</p>
<p>My Jupyter Notebooks (2022-01-15)</p>
<p>Welcome to <strong><a href="https://nb.uniqiao.com/">my Jupyter Notebooks</a></strong> 📙, a site under a new subdomain: <em><<strong><span style="color: darkorange">nb</span></strong>.uniQiao.com></em></p>
<p><img src="https://nb.uniqiao.com/images/diagram.png" alt="flow-diagram" /></p>
<p>Built with <a href="https://github.com/fastai/fastpages">
<img src="/projects/assets/images/fastai-logo.png" alt="fastai logo" title="fastpages" height="48" />
</a> + <a href="https://github.com/features/actions">
<img src="/projects/assets/images/actions-logo.png" alt="actions logo" title="GitHub Actions" height="48" />
</a></p>
<p>Pandas Guide, by Qiao Huang (2021-10-23)</p>
<p><strong>Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#0-data-structures" id="markdown-toc-0-data-structures">0. Data structures</a></li>
<li><a href="#1-read-data" id="markdown-toc-1-read-data">1. Read data</a> <ul>
<li><a href="#11-tabular-data" id="markdown-toc-11-tabular-data">1.1 Tabular data</a> <ul>
<li><a href="#111-manage-delimited-files" id="markdown-toc-111-manage-delimited-files">1.1.1 Manage delimited files</a></li>
<li><a href="#112-manage-excel-files" id="markdown-toc-112-manage-excel-files">1.1.2 Manage Excel files</a></li>
<li><a href="#113-manage-databases" id="markdown-toc-113-manage-databases">1.1.3 Manage databases</a></li>
</ul>
</li>
<li><a href="#12-json-data" id="markdown-toc-12-json-data">1.2 JSON data</a></li>
</ul>
</li>
<li><a href="#2-basic-data-interrogation" id="markdown-toc-2-basic-data-interrogation">2. Basic data interrogation</a> <ul>
<li><a href="#21-dimensions" id="markdown-toc-21-dimensions">2.1 Dimensions</a></li>
<li><a href="#22-samples" id="markdown-toc-22-samples">2.2 Samples</a></li>
<li><a href="#23-statistics" id="markdown-toc-23-statistics">2.3 Statistics</a></li>
</ul>
</li>
<li><a href="#3-filter-data" id="markdown-toc-3-filter-data">3. Filter data</a> <ul>
<li><a href="#31-columns" id="markdown-toc-31-columns">3.1 Columns</a></li>
<li><a href="#32-rows" id="markdown-toc-32-rows">3.2 Rows</a> <ul>
<li><a href="#321-select-based-on-index" id="markdown-toc-321-select-based-on-index">3.2.1 Select based on index</a></li>
<li><a href="#322-select-based-on-value" id="markdown-toc-322-select-based-on-value">3.2.2 Select based on value</a></li>
<li><a href="#323-select-based-on-multiple-conditions" id="markdown-toc-323-select-based-on-multiple-conditions">3.2.3 Select based on multiple conditions</a></li>
<li><a href="#324-select-based-on-advanced-booleans" id="markdown-toc-324-select-based-on-advanced-booleans">3.2.4 Select based on advanced booleans</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#4-aggregation" id="markdown-toc-4-aggregation">4. Aggregation</a> <ul>
<li><a href="#41-basic" id="markdown-toc-41-basic">4.1 Basic</a> <ul>
<li><a href="#411-most-common-methods" id="markdown-toc-411-most-common-methods">4.1.1 Most common methods</a></li>
<li><a href="#412-use-groupby" id="markdown-toc-412-use-groupby">4.1.2 Use <code class="language-plaintext highlighter-rouge">groupby()</code></a></li>
<li><a href="#413-use-agg" id="markdown-toc-413-use-agg">4.1.3 Use <code class="language-plaintext highlighter-rouge">agg()</code></a></li>
</ul>
</li>
<li><a href="#42-advanced" id="markdown-toc-42-advanced">4.2 Advanced</a> <ul>
<li><a href="#421-combine-groupby-and-agg" id="markdown-toc-421-combine-groupby-and-agg">4.2.1 Combine <code class="language-plaintext highlighter-rouge">groupby()</code> and <code class="language-plaintext highlighter-rouge">agg()</code></a></li>
<li><a href="#422-custom" id="markdown-toc-422-custom">4.2.2 Custom</a></li>
<li><a href="#423-transform" id="markdown-toc-423-transform">4.2.3 Transform</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#5-pivot-data" id="markdown-toc-5-pivot-data">5. Pivot data</a> <ul>
<li><a href="#51-the-manual-way" id="markdown-toc-51-the-manual-way">5.1 The manual way</a> <ul>
<li><a href="#511-with-aggregation" id="markdown-toc-511-with-aggregation">5.1.1 With aggregation</a></li>
<li><a href="#512-without-aggregation" id="markdown-toc-512-without-aggregation">5.1.2 Without aggregation</a></li>
</ul>
</li>
<li><a href="#52-use-pivot" id="markdown-toc-52-use-pivot">5.2 Use <code class="language-plaintext highlighter-rouge">pivot()</code></a></li>
<li><a href="#53-use-pandaspivot_table" id="markdown-toc-53-use-pandaspivot_table">5.3 Use <code class="language-plaintext highlighter-rouge">pandas.pivot_table()</code></a></li>
<li><a href="#54-use-pandascrosstab" id="markdown-toc-54-use-pandascrosstab">5.4 Use <code class="language-plaintext highlighter-rouge">pandas.crosstab()</code></a></li>
</ul>
</li>
<li><a href="#6-join-tables" id="markdown-toc-6-join-tables">6. Join tables</a> <ul>
<li><a href="#61-use-merge" id="markdown-toc-61-use-merge">6.1 Use <code class="language-plaintext highlighter-rouge">merge()</code></a></li>
<li><a href="#62-use-join" id="markdown-toc-62-use-join">6.2 Use <code class="language-plaintext highlighter-rouge">join()</code></a></li>
<li><a href="#63-use-pandasconcat" id="markdown-toc-63-use-pandasconcat">6.3 Use <code class="language-plaintext highlighter-rouge">pandas.concat()</code></a> <ul>
<li><a href="#64-use-append" id="markdown-toc-64-use-append">6.4 Use <code class="language-plaintext highlighter-rouge">append()</code></a></li>
</ul>
</li>
<li><a href="#65-differences-between-join-methods" id="markdown-toc-65-differences-between-join-methods">6.5 Differences between <code class="language-plaintext highlighter-rouge">JOIN</code> methods</a></li>
</ul>
</li>
<li><a href="#7-to-be-added-under-construction" id="markdown-toc-7-to-be-added-under-construction">7. To be Added (Under construction)</a></li>
<li><a href="#rewrite-sql" id="markdown-toc-rewrite-sql">Rewrite SQL</a></li>
<li><a href="#more" id="markdown-toc-more">More</a></li>
</ul>
<p><strong><a href="https://pandas.pydata.org/">pandas</a></strong> is an open-source Python library for data analysis and manipulation.
It is built on <a href="https://numpy.org/">NumPy</a> for numerical operations, and its plotting methods use <a href="https://matplotlib.org/">Matplotlib</a> for data visualization. This post is a practical guide to using pandas. For a more comprehensive understanding, check the official <a href="https://pandas.pydata.org/docs/index.html">pandas documentation</a>. For a handy quick reference, check the official <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">pandas cheat sheet</a>. There is also a <a href="https://pandas.pydata.org/docs/user_guide/cookbook.html">Cookbook</a> for advanced strategies with more complex recipes and useful links.</p>
<h2 id="0-data-structures">0. Data structures</h2>
<p>To make analytical tasks easier and more flexible, pandas introduces two types of objects for storing data: <strong>Series</strong>, which have a list-like structure, and <strong>DataFrames</strong>, which have a tabular structure. You can think of a DataFrame as a collection of Series that share the same index.</p>
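<p>As a minimal sketch of the two structures (the animal names and column names below are made up for illustration), selecting one column of a DataFrame gives back a Series:</p>

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
speeds = pd.Series([380.0, 24.0], index=["Falcon", "Parrot"], name="max_speed")

# A DataFrame is a table whose columns are Series sharing one index.
animals = pd.DataFrame(
    {"max_speed": [380.0, 24.0], "legs": [2, 2]},
    index=["Falcon", "Parrot"],
)

column = animals["max_speed"]   # selecting one column gives back a Series
print(type(column).__name__)    # Series
print(column.equals(speeds))    # True
```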
<h2 id="1-read-data">1. Read data</h2>
<p>The first step with pandas is getting data into it.</p>
<h3 id="11-tabular-data">1.1 Tabular data</h3>
<p>Tabular data is organized into rows and columns: each row holds information about one thing, and each column holds one attribute of it.</p>
<h4 id="111-manage-delimited-files">1.1.1 Manage delimited files</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="c1"># basic case
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'path/to/file.csv'</span><span class="p">)</span>
<span class="c1"># use the `sep` parameter to specify the delimiting character, e.g., ";"
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'path/to/file.csv'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">';'</span><span class="p">)</span>
<span class="c1"># specify headers and rows to skip
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'path/to/file.csv'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">skiprows</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="c1"># read date columns using parse_dates
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'path/to/file.csv'</span><span class="p">,</span> <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s">'date_1'</span><span class="p">,</span> <span class="s">'date_2'</span><span class="p">])</span>
</code></pre></div></div>
<p>For more parameters, check out the <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html"><code class="language-plaintext highlighter-rouge">pandas.read_csv()</code> official documentation</a>.</p>
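<p>The parameters above can be tried without a file on disk. The sketch below assumes a made-up export with two junk lines before a <code class="language-plaintext highlighter-rouge">;</code>-delimited table, and feeds it to <code class="language-plaintext highlighter-rouge">read_csv()</code> through <code class="language-plaintext highlighter-rouge">io.StringIO</code>:</p>

```python
import io
import pandas as pd

# Hypothetical file contents: two non-table lines, then a ';'-delimited table.
raw = """exported by some tool
generated 2021-10-23
id;date_1;value
1;2021-01-05;3.5
2;2021-02-10;4.0
"""

df = pd.read_csv(
    io.StringIO(raw),        # any file-like object works in place of a path
    sep=";",
    skiprows=2,              # drop the two leading non-table lines
    parse_dates=["date_1"],  # convert this column to a datetime dtype
)

print(df.shape)              # (2, 3)
print(df["date_1"].dtype)    # a datetime64 dtype
```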
<h4 id="112-manage-excel-files">1.1.2 Manage Excel files</h4>
<p>To read Excel files in Python, use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html"><code class="language-plaintext highlighter-rouge">pandas.read_excel()</code></a> function, which takes many of the same parameters as <code class="language-plaintext highlighter-rouge">pandas.read_csv()</code>.</p>
<h4 id="113-manage-databases">1.1.3 Manage databases</h4>
<p>Python can also read from databases. The first step is to create a connection; see this <a href="https://www.w3schools.com/python/python_mysql_getstarted.asp">tutorial</a> for an example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># passing a database URI string as `con` requires SQLAlchemy to be installed
</span><span class="c1"># extract data from a table
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql_table</span><span class="p">(</span><span class="n">table_name</span><span class="o">=</span><span class="s">'table_name'</span><span class="p">,</span> <span class="n">con</span><span class="o">=</span><span class="s">'postgresql:///db_name'</span><span class="p">)</span>
<span class="c1"># run a query
</span><span class="n">query</span> <span class="o">=</span> <span class="s">"""
SELECT col_1,
col_2
FROM table_name
WHERE col_3 > 4
"""</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql_query</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="s">'postgresql:///db_name'</span><span class="p">)</span>
</code></pre></div></div>
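<p>To try the query pattern without a database server, one option is an in-memory SQLite database from the standard library. This is only a sketch: the table and its rows are invented, and SQLite stands in for whatever database is actually used.</p>

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real server (an assumption of this sketch).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_name (col_1 TEXT, col_2 REAL, col_3 INTEGER)")
con.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [("a", 1.0, 3), ("b", 2.0, 5), ("c", 3.0, 7)],
)

query = """
SELECT col_1,
       col_2
FROM table_name
WHERE col_3 > ?
"""
# `params` passes the threshold separately from the SQL text.
df = pd.read_sql_query(query, con, params=(4,))
con.close()

print(df["col_1"].tolist())   # ['b', 'c']
```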
<h3 id="12-json-data">1.2 JSON data</h3>
<p><a href="https://en.wikipedia.org/wiki/JSON">JSON</a> (JavaScript Object Notation) is a plain-text format based on JavaScript object syntax. It is often used to exchange data on the web.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_json</span><span class="p">(</span><span class="s">'data.json'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">to_string</span><span class="p">())</span> <span class="c1"># print the entire DataFrame
</span></code></pre></div></div>
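<p>A self-contained variant, assuming a small made-up payload in place of <code class="language-plaintext highlighter-rouge">data.json</code>:</p>

```python
import io
import pandas as pd

# Hypothetical JSON payload: a list of records, one object per row.
raw = '[{"col_1": 3, "col_2": "a"}, {"col_1": 2, "col_2": "b"}]'

# Wrapping the string in StringIO avoids relying on a file on disk.
df = pd.read_json(io.StringIO(raw))

print(df.shape)               # (2, 2)
print(df["col_1"].tolist())   # [3, 2]
```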
<p>Once JSON has been parsed into Python objects, two more methods can build a DataFrame from them:</p>
<ul>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html"><code class="language-plaintext highlighter-rouge">from_dict()</code></a> builds a DataFrame from a dictionary, by columns or by index, and allows dtype specification.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">'col_1'</span><span class="p">:</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s">'col_2'</span><span class="p">:</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">]}</span>
<span class="o">>>></span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">col_1</span> <span class="n">col_2</span>
<span class="mi">0</span> <span class="mi">3</span> <span class="n">a</span>
<span class="mi">1</span> <span class="mi">2</span> <span class="n">b</span>
<span class="mi">2</span> <span class="mi">1</span> <span class="n">c</span>
<span class="mi">3</span> <span class="mi">0</span> <span class="n">d</span>
</code></pre></div></div>
<ul>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html"><code class="language-plaintext highlighter-rouge">from_records()</code></a> builds a DataFrame from a structured ndarray, a sequence of tuples or dicts, or another DataFrame.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([(</span><span class="mi">3</span><span class="p">,</span> <span class="s">'a'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'b'</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'c'</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s">'d'</span><span class="p">)],</span>
<span class="p">...</span> <span class="n">dtype</span><span class="o">=</span><span class="p">[(</span><span class="s">'col_1'</span><span class="p">,</span> <span class="s">'i4'</span><span class="p">),</span> <span class="p">(</span><span class="s">'col_2'</span><span class="p">,</span> <span class="s">'U1'</span><span class="p">)])</span>
<span class="o">>>></span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">.</span><span class="n">from_records</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">col_1</span> <span class="n">col_2</span>
<span class="mi">0</span> <span class="mi">3</span> <span class="n">a</span>
<span class="mi">1</span> <span class="mi">2</span> <span class="n">b</span>
<span class="mi">2</span> <span class="mi">1</span> <span class="n">c</span>
<span class="mi">3</span> <span class="mi">0</span> <span class="n">d</span>
</code></pre></div></div>
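<p>To connect these two methods back to JSON, here is a rough sketch that parses a made-up JSON string with the standard <code class="language-plaintext highlighter-rouge">json</code> module first. The <code class="language-plaintext highlighter-rouge">row_1</code>/<code class="language-plaintext highlighter-rouge">row_2</code> labels are invented for illustration.</p>

```python
import json
import pandas as pd

# Hypothetical JSON text, parsed into a list of dicts by the standard library.
records = json.loads('[{"col_1": 3, "col_2": "a"}, {"col_1": 2, "col_2": "b"}]')

by_rows = pd.DataFrame.from_records(records)   # one dict per row

# from_dict with orient="index" treats the outer keys as row labels instead.
by_index = pd.DataFrame.from_dict(
    {"row_1": {"col_1": 3, "col_2": "a"}, "row_2": {"col_1": 2, "col_2": "b"}},
    orient="index",
)

print(by_rows.shape)              # (2, 2)
print(by_index.index.tolist())    # ['row_1', 'row_2']
```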
<h2 id="2-basic-data-interrogation">2. Basic data interrogation</h2>
<p>After reading data into a pandas DataFrame, the next step is to inspect it.</p>
<h3 id="21-dimensions">2.1 Dimensions</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="o">>>></span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'col1'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="s">'col2'</span><span class="p">:</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
<span class="p">...</span> <span class="s">'col3'</span><span class="p">:</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]})</span>
<span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">shape</span> <span class="c1"># return a tuple containing the number of rows and columns
</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="o">>>></span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span> <span class="c1"># return the number of rows
</span><span class="mi">2</span>
<span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">dtypes</span> <span class="c1"># return the data type of each column
</span></code></pre></div></div>
<h3 id="22-samples">2.2 Samples</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="c1"># return the first n rows, 5 by default
</span><span class="n">df</span><span class="p">.</span><span class="n">tail</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="c1"># return the last n rows, 5 by default
</span><span class="n">df</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="c1"># return n samples from an axis, 1 by default
</span>
<span class="c1"># find the top n rows
</span><span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'list_sort_by'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">head</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="c1"># show the first rows transposed (columns become rows), handy for wide tables
</span><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">().</span><span class="n">transpose</span><span class="p">()</span> <span class="c1"># same as df.head().T
</span></code></pre></div></div>
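<p>A small runnable illustration of these calls, using an invented six-row table (<code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">score</code> are placeholder columns):</p>

```python
import pandas as pd

df = pd.DataFrame({"name": list("abcdef"), "score": [3, 9, 1, 7, 5, 8]})

print(df.head(2)["name"].tolist())   # ['a', 'b']
print(df.tail(2)["name"].tolist())   # ['e', 'f']

# random_state makes the random sample reproducible between runs
picked = df.sample(3, random_state=0)
print(len(picked))                   # 3

# top 3 rows by score
top3 = df.sort_values("score", ascending=False).head(3)
print(top3["name"].tolist())         # ['b', 'f', 'd']
```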
<h3 id="23-statistics">2.3 Statistics</h3>
<p><a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html">How to calculate summary statistics?</a></p>
<p>The most basic way to summarize the data is the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html"><code class="language-plaintext highlighter-rouge">describe()</code></a> method. You can also identify the data types and the number of non-missing values using the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html"><code class="language-plaintext highlighter-rouge">info()</code></a> method.</p>
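<p>For instance, on a toy DataFrame with one missing value (the data is invented), the two methods might be used like this:</p>

```python
import io
import pandas as pd

df = pd.DataFrame({"col_1": [1.0, 2.0, None, 4.0], "col_2": ["a", "b", "c", "d"]})

summary = df.describe()                 # numeric columns only by default
print(summary.loc["count", "col_1"])    # 3.0, missing values are not counted
print(summary.loc["mean", "col_1"])     # mean of 1, 2 and 4

buffer = io.StringIO()
df.info(buf=buffer)                     # info() prints; `buf` captures the text instead
print(buffer.getvalue())
```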
<h2 id="3-filter-data">3. Filter data</h2>
<p>Filtering data is one of the most common ways to interact with a pandas DataFrame.</p>
<ul>
<li>Use <code class="language-plaintext highlighter-rouge">iloc[]</code> to choose rows and columns by position (<code class="language-plaintext highlighter-rouge">iat[]</code> is its counterpart for a single scalar value)</li>
<li>Use <code class="language-plaintext highlighter-rouge">loc[]</code> to choose rows and columns by label (<code class="language-plaintext highlighter-rouge">at[]</code> is its counterpart for a single scalar value)</li>
<li>Be explicit about both rows and columns, even if it’s with <code class="language-plaintext highlighter-rouge">:</code>.</li>
</ul>
<h3 id="31-columns">3.1 Columns</h3>
<p>There are two ways to select a column in pandas: <code class="language-plaintext highlighter-rouge">df['column_name']</code> and <code class="language-plaintext highlighter-rouge">df.column_name</code>. The former is safer: it works for names that contain spaces or clash with DataFrame attributes and methods, and it is the only form that can create a new column.</p>
<p>To extract multiple columns, a list of column names is needed.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[[</span><span class="s">'col_1'</span><span class="p">,</span> <span class="s">'col_2'</span><span class="p">,</span> <span class="s">'col_3'</span><span class="p">]]</span>
<span class="c1"># double square brackets are needed: the inner pair is a list of column names
</span>
<span class="c1"># to make it clearer
</span><span class="n">columns_to_extract</span> <span class="o">=</span> <span class="p">[</span><span class="s">'col_1'</span><span class="p">,</span> <span class="s">'col_2'</span><span class="p">,</span> <span class="s">'col_3'</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="n">columns_to_extract</span><span class="p">]</span>
</code></pre></div></div>
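<p>One reason to prefer bracket notation for single columns is sketched below. The column names <code class="language-plaintext highlighter-rouge">count</code> and <code class="language-plaintext highlighter-rouge">unit price</code> are chosen on purpose to clash with a DataFrame method and to contain a space.</p>

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2], "unit price": [9.5, 3.0]})

print(df["count"].tolist())        # [1, 2]
print(callable(df.count))          # True: attribute access finds the method, not the column
print(df["unit price"].tolist())   # names with spaces only work with brackets

df["total"] = df["count"] * df["unit price"]   # brackets can create a column
print(df["total"].tolist())        # [9.5, 6.0]
```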
<p>To select columns based on their position without knowing their name, we can use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html"><code class="language-plaintext highlighter-rouge">iloc[]</code></a> indexer.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="o"><</span><span class="n">row</span> <span class="n">numbers</span><span class="o">></span><span class="p">,</span> <span class="o"><</span><span class="n">column</span> <span class="n">numbers</span><span class="o">></span><span class="p">]</span>
</code></pre></div></div>
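<p>A concrete version of the template above, on a small invented DataFrame:</p>

```python
import pandas as pd

df = pd.DataFrame(
    {"col_1": [10, 20, 30, 40], "col_2": [1.5, 2.5, 3.5, 4.5], "col_3": list("wxyz")}
)

first_two_cols = df.iloc[:, :2]       # all rows, columns at positions 0 and 1
block = df.iloc[[0, 2], [0, 2]]       # rows 0 and 2, columns 0 and 2
last_row = df.iloc[-1]                # negative positions count from the end
cell = df.iat[1, 0]                   # iat returns a single scalar

print(list(first_two_cols.columns))   # ['col_1', 'col_2']
print(block.shape)                    # (2, 2)
print(cell)                           # 20
```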
<h3 id="32-rows">3.2 Rows</h3>
<h4 id="321-select-based-on-index">3.2.1 Select based on index</h4>
<p>Every DataFrame has an index for rows by default, which works like names for columns. Use <code class="language-plaintext highlighter-rouge">df.index</code> to look at the index for the DataFrame <code class="language-plaintext highlighter-rouge">df</code>. The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html"><code class="language-plaintext highlighter-rouge">set_index()</code></a> method sets the DataFrame index using existing columns or arrays (of the correct length), and the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html"><code class="language-plaintext highlighter-rouge">sort_index()</code></a> method sorts the object by labels (along an axis).</p>
<p>The label-based counterpart of <code class="language-plaintext highlighter-rouge">iloc[]</code>, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html"><code class="language-plaintext highlighter-rouge">loc[]</code></a> accesses a group of rows and columns by label(s) or a boolean array. Note that contrary to usual Python slices, <strong>both</strong> the start and the stop are included.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="o"><</span><span class="n">row</span> <span class="n">indices</span><span class="o">></span><span class="p">,</span> <span class="o"><</span><span class="n">column</span> <span class="n">names</span><span class="o">></span><span class="p">]</span>
</code></pre></div></div>
<p>To select some rows for all columns, the <code class="language-plaintext highlighter-rouge"><column names></code> part can be omitted, i.e., <code class="language-plaintext highlighter-rouge">df.loc[[1, 2, 3]]</code> is the same as <code class="language-plaintext highlighter-rouge">df.loc[[1, 2, 3], :]</code>.</p>
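<p>The difference between label slices and position slices is easy to check on a toy DataFrame (the <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">value</code> columns are placeholders):</p>

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "value": [1, 2, 3, 4]}
).set_index("name")

by_label = df.loc["b":"c"]     # label slice: both "b" and "c" are included
by_position = df.iloc[1:2]     # position slice: the stop (2) is excluded

print(by_label.index.tolist())                # ['b', 'c']
print(by_position.index.tolist())             # ['b']
print(df.loc[["a", "d"], "value"].tolist())   # [1, 4]
```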
<h4 id="322-select-based-on-value">3.2.2 Select based on value</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># return all rows where the value in the "filter_col" column is greater than n
</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'filter_col'</span><span class="p">]</span> <span class="o">></span> <span class="n">n</span><span class="p">]</span>
<span class="c1"># df appears twice: the inner df[...] builds a boolean mask, the outer one applies it
</span>
<span class="c1"># to make it clearer (named `mask` to avoid shadowing the built-in filter())
</span><span class="n">mask</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'filter_col'</span><span class="p">]</span> <span class="o">></span> <span class="n">n</span>
<span class="c1"># `mask` is a boolean Series with the same length as the DataFrame
</span><span class="n">df</span><span class="p">[</span><span class="n">mask</span><span class="p">]</span> <span class="c1"># or df.loc[mask], return the rows with a True value
</span>
<span class="c1"># select every 3rd row
</span><span class="n">every_3rd_row</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="o">%</span> <span class="mi">3</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))]</span>
<span class="n">df</span><span class="p">[</span><span class="n">every_3rd_row</span><span class="p">]</span>
</code></pre></div></div>
<h4 id="323-select-based-on-multiple-conditions">3.2.3 Select based on multiple conditions</h4>
<p>Conditions are combined element-wise with the following operators.</p>
<table>
<thead>
<tr>
<th>Character</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>&</td>
<td>AND condition</td>
</tr>
<tr>
<td>|</td>
<td>OR condition</td>
</tr>
<tr>
<td>~</td>
<td>negation</td>
</tr>
</tbody>
</table>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="n">df</span><span class="p">[</span><span class="s">'col_1'</span><span class="p">]</span> <span class="o">></span> <span class="n">a</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'col_2'</span><span class="p">]</span> <span class="o"><</span> <span class="n">b</span><span class="p">))</span> <span class="o">|</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'col_3'</span><span class="p">]</span> <span class="o">==</span> <span class="n">c</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">mask</span><span class="p">,</span> <span class="p">[</span><span class="s">'col_4'</span><span class="p">,</span> <span class="s">'col_5'</span><span class="p">]]</span>
</code></pre></div></div>
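<p>Note: wrap each comparison in parentheses. <code class="language-plaintext highlighter-rouge">&</code> and <code class="language-plaintext highlighter-rouge">|</code> bind more tightly than comparison operators, so without parentheses the expression is parsed differently and typically raises an error.</p>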
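<p>Here is a minimal runnable sketch of combining conditions, with invented data and thresholds. The last part shows what can go wrong when the parentheses around each comparison are left out.</p>

```python
import pandas as pd

df = pd.DataFrame(
    {"col_1": [1, 5, 8, 2], "col_2": [9, 3, 7, 1], "col_3": ["x", "y", "x", "z"]}
)

# (col_1 > 4 AND col_2 < 5) OR col_3 == "z"
mask = ((df["col_1"] > 4) & (df["col_2"] < 5)) | (df["col_3"] == "z")
print(df.loc[mask].index.tolist())    # [1, 3]

# ~ negates a mask element-wise
print(df.loc[~mask].index.tolist())   # [0, 2]

# Without parentheses, & is applied before the comparisons, and the
# resulting chained comparison cannot be reduced to a single True/False.
try:
    df[df["col_1"] > 4 & df["col_2"] < 5]
except ValueError as exc:
    print("ValueError:", exc)
```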
<h4 id="324-select-based-on-advanced-booleans">3.2.4 Select based on advanced booleans</h4>
<ul>
<li><code class="language-plaintext highlighter-rouge">df.isin()</code> returns whether each element in the DataFrame is contained in the given values.</li>
<li><code class="language-plaintext highlighter-rouge">df.isna()</code> returns a boolean mask that indicates, for each element in the DataFrame, whether it is an NA value. <em>(<code class="language-plaintext highlighter-rouge">isnull()</code> is an alias; <code class="language-plaintext highlighter-rouge">np.isnan()</code> is the NumPy analogue for floating-point arrays)</em></li>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.between.html"><code class="language-plaintext highlighter-rouge">pandas.Series.between</code></a> returns a boolean Series equivalent to <code class="language-plaintext highlighter-rouge">left <= series <= right</code>.</li>
<li><a href="https://pandas.pydata.org/docs/user_guide/text.html">Working with text data</a></li>
<li><a href="https://pandas.pydata.org/docs/user_guide/timeseries.html">Time series / date functionality</a></li>
</ul>
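<p>The first three helpers can be tried on a small made-up table of cities and temperatures:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Lima", "Kyiv"], "temp": [3.0, np.nan, 19.0, 8.0]}
)

in_list = df["city"].isin(["Oslo", "Lima"])   # membership test per element
missing = df["temp"].isna()                   # True where the value is NA
mild = df["temp"].between(5, 20)              # 5 <= temp <= 20, NA gives False

print(df.loc[in_list, "city"].tolist())   # ['Oslo', 'Lima']
print(df.loc[missing, "city"].tolist())   # ['Rome']
print(df.loc[mild, "city"].tolist())      # ['Lima', 'Kyiv']
```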
<h2 id="4-aggregation">4. Aggregation</h2>
<p>Aggregation, the summarization of many values into a few, is an essential part of analyzing large datasets.</p>
<h3 id="41-basic">4.1 Basic</h3>
<h4 id="411-most-common-methods">4.1.1 Most common methods</h4>
<p>These methods are mostly used on numeric data.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">count()</code>: count non-NA cells for each column (default) or each row (set <code class="language-plaintext highlighter-rouge">axis=1</code>)</li>
<li><code class="language-plaintext highlighter-rouge">nunique()</code>: count the number of distinct elements</li>
<li><code class="language-plaintext highlighter-rouge">sum()</code>/<code class="language-plaintext highlighter-rouge">mean()</code>: return the sum/mean of the values</li>
<li><code class="language-plaintext highlighter-rouge">min()</code>/<code class="language-plaintext highlighter-rouge">max()</code>/<code class="language-plaintext highlighter-rouge">median()</code>: return the minimum/maximum/median of the values</li>
<li><code class="language-plaintext highlighter-rouge">quantile()</code>: return values at the given quantile, defaults to the 50th percentile (q=0.5, the median)</li>
<li><code class="language-plaintext highlighter-rouge">var()</code>/<code class="language-plaintext highlighter-rouge">std()</code>: return the variance/standard deviation of the values. Both divide by N - ddof, with <code class="language-plaintext highlighter-rouge">ddof=1</code> by default (the unbiased sample formulation, N - 1); set <code class="language-plaintext highlighter-rouge">ddof=0</code> for the population formulation.</li>
</ul>
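<p>As a quick check of these methods, including how missing values and <code class="language-plaintext highlighter-rouge">ddof</code> affect the results, consider this small invented Series:</p>

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 6.0, None])

print(s.count())          # 4, the missing value is not counted
print(s.nunique())        # 3 distinct values: 2, 4 and 6
print(s.mean())           # 4.0
print(s.quantile(0.5))    # 4.0, same as s.median()
print(s.var())            # sample variance, divides by N - 1 = 3
print(s.var(ddof=0))      # population variance, divides by N = 4
```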
<h4 id="412-use-groupby">4.1.2 Use <code class="language-plaintext highlighter-rouge">groupby()</code></h4>
<p>When aggregating, we usually want to analyze the data by category. <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html"><code class="language-plaintext highlighter-rouge">groupby()</code></a> lets us split the rows by one or more columns so that values are aggregated within each group.</p>
<p>The name GroupBy should be quite familiar to those who have used a SQL-based tool, in which you can write code like:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">Column1</span><span class="p">,</span> <span class="n">Column2</span><span class="p">,</span> <span class="k">AVG</span><span class="p">(</span><span class="n">Column3</span><span class="p">),</span> <span class="k">SUM</span><span class="p">(</span><span class="n">Column4</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">SomeTable</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">Column1</span><span class="p">,</span> <span class="n">Column2</span>
</code></pre></div></div>
<p>Let’s take a simple example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'Animal'</span><span class="p">:</span> <span class="p">[</span><span class="s">'Falcon'</span><span class="p">,</span> <span class="s">'Falcon'</span><span class="p">,</span>
<span class="p">...</span> <span class="s">'Parrot'</span><span class="p">,</span> <span class="s">'Parrot'</span><span class="p">],</span>
<span class="p">...</span> <span class="s">'Max Speed'</span><span class="p">:</span> <span class="p">[</span><span class="mf">380.</span><span class="p">,</span> <span class="mf">370.</span><span class="p">,</span> <span class="mf">24.</span><span class="p">,</span> <span class="mf">26.</span><span class="p">]})</span>
<span class="o">>>></span> <span class="n">df</span>
<span class="n">Animal</span> <span class="n">Max</span> <span class="n">Speed</span>
<span class="mi">0</span> <span class="n">Falcon</span> <span class="mf">380.0</span>
<span class="mi">1</span> <span class="n">Falcon</span> <span class="mf">370.0</span>
<span class="mi">2</span> <span class="n">Parrot</span> <span class="mf">24.0</span>
<span class="mi">3</span> <span class="n">Parrot</span> <span class="mf">26.0</span>
<span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'Animal'</span><span class="p">]).</span><span class="n">mean</span><span class="p">()</span>
<span class="n">Max</span> <span class="n">Speed</span>
<span class="n">Animal</span>
<span class="n">Falcon</span> <span class="mf">375.0</span>
<span class="n">Parrot</span> <span class="mf">25.0</span>
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">df.groupby('A')</code> is just syntactic sugar for <code class="language-plaintext highlighter-rouge">df.groupby(df['A'])</code>.</li>
<li>We can group by different levels of a hierarchical index using the <code class="language-plaintext highlighter-rouge">level</code> parameter.</li>
<li>We can also choose whether to include NA in the group keys by setting the <code class="language-plaintext highlighter-rouge">dropna</code> parameter (default is <em>True</em>, which drops them).</li>
</ul>
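<p>The three points above can be sketched as follows. The data, including the <code class="language-plaintext highlighter-rouge">Type</code> index level and the missing animal name, is hypothetical and only meant to exercise <code class="language-plaintext highlighter-rouge">level</code> and <code class="language-plaintext highlighter-rouge">dropna</code>.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing group key
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', np.nan],
                   'Max Speed': [380., 370., 24., 26.]})

# Grouping by a column name or by the column itself gives the same result
by_name = df.groupby('Animal')['Max Speed'].mean()
by_series = df.groupby(df['Animal'])['Max Speed'].mean()

# dropna=False keeps rows whose key is NA as a group of their own
with_na = df.groupby('Animal', dropna=False)['Max Speed'].mean()

# level= groups by one level of a hierarchical index instead of a column
index = pd.MultiIndex.from_arrays(
    [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
     ['Captive', 'Wild', 'Captive', 'Wild']],
    names=['Animal', 'Type'])
speeds = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]}, index=index)
by_type = speeds.groupby(level='Type').mean()

print(by_name, with_na, by_type, sep='\n\n')
```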
<h4 id="413-use-agg">4.1.3 Use <code class="language-plaintext highlighter-rouge">agg()</code></h4>
<p>If we want to aggregate using one or more operations and have one or more columns to aggregate, then we can use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html"><code class="language-plaintext highlighter-rouge">agg()</code></a> (<em>agg</em> is an alias for <em>aggregate</em>) method.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<span class="p">...</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">...</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">],</span>
<span class="p">...</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">nan</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">nan</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">nan</span><span class="p">]],</span>
<span class="p">...</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'B'</span><span class="p">,</span> <span class="s">'C'</span><span class="p">])</span>
<span class="c1"># Aggregate these functions over the rows.
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">agg</span><span class="p">([</span><span class="s">'sum'</span><span class="p">,</span> <span class="s">'min'</span><span class="p">])</span>
<span class="n">A</span> <span class="n">B</span> <span class="n">C</span>
<span class="nb">sum</span> <span class="mf">12.0</span> <span class="mf">15.0</span> <span class="mf">18.0</span>
<span class="nb">min</span> <span class="mf">1.0</span> <span class="mf">2.0</span> <span class="mf">3.0</span>
<span class="c1"># Different aggregations per column.
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">agg</span><span class="p">({</span><span class="s">'A'</span> <span class="p">:</span> <span class="p">[</span><span class="s">'sum'</span><span class="p">,</span> <span class="s">'min'</span><span class="p">],</span> <span class="s">'B'</span> <span class="p">:</span> <span class="p">[</span><span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">]})</span>
<span class="n">A</span> <span class="n">B</span>
<span class="nb">sum</span> <span class="mf">12.0</span> <span class="n">NaN</span>
<span class="nb">min</span> <span class="mf">1.0</span> <span class="mf">2.0</span>
<span class="nb">max</span> <span class="n">NaN</span> <span class="mf">8.0</span>
<span class="c1"># Aggregate different functions over the columns and rename the index of the resulting DataFrame.
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">agg</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="p">(</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">),</span> <span class="n">y</span><span class="o">=</span><span class="p">(</span><span class="s">'B'</span><span class="p">,</span> <span class="s">'min'</span><span class="p">),</span> <span class="n">z</span><span class="o">=</span><span class="p">(</span><span class="s">'C'</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">))</span>
<span class="n">A</span> <span class="n">B</span> <span class="n">C</span>
<span class="n">x</span> <span class="mf">7.0</span> <span class="n">NaN</span> <span class="n">NaN</span>
<span class="n">y</span> <span class="n">NaN</span> <span class="mf">2.0</span> <span class="n">NaN</span>
<span class="n">z</span> <span class="n">NaN</span> <span class="n">NaN</span> <span class="mf">6.0</span>
</code></pre></div></div>
<h3 id="42-advanced">4.2 Advanced</h3>
<h4 id="421-combine-groupby-and-agg">4.2.1 Combine <code class="language-plaintext highlighter-rouge">groupby()</code> and <code class="language-plaintext highlighter-rouge">agg()</code></h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'A'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
<span class="p">...</span> <span class="s">'B'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
<span class="p">...</span> <span class="s">'C'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">)})</span>
<span class="o">>>></span> <span class="n">df</span>
<span class="n">A</span> <span class="n">B</span> <span class="n">C</span>
<span class="mi">0</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mf">0.362838</span>
<span class="mi">1</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mf">0.227877</span>
<span class="mi">2</span> <span class="mi">2</span> <span class="mi">3</span> <span class="mf">1.267767</span>
<span class="mi">3</span> <span class="mi">2</span> <span class="mi">4</span> <span class="o">-</span><span class="mf">0.562860</span>
<span class="c1"># Multiple aggregations
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'A'</span><span class="p">).</span><span class="n">agg</span><span class="p">([</span><span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">])</span>
<span class="n">B</span> <span class="n">C</span>
<span class="nb">min</span> <span class="nb">max</span> <span class="nb">min</span> <span class="nb">max</span>
<span class="n">A</span>
<span class="mi">1</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mf">0.227877</span> <span class="mf">0.362838</span>
<span class="mi">2</span> <span class="mi">3</span> <span class="mi">4</span> <span class="o">-</span><span class="mf">0.562860</span> <span class="mf">1.267767</span>
<span class="c1"># Select a column for aggregation
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'A'</span><span class="p">).</span><span class="n">B</span><span class="p">.</span><span class="n">agg</span><span class="p">([</span><span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">])</span>
<span class="nb">min</span> <span class="nb">max</span>
<span class="n">A</span>
<span class="mi">1</span> <span class="mi">1</span> <span class="mi">2</span>
<span class="mi">2</span> <span class="mi">3</span> <span class="mi">4</span>
<span class="c1"># Different aggregations per column
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'A'</span><span class="p">).</span><span class="n">agg</span><span class="p">({</span><span class="s">'B'</span><span class="p">:</span> <span class="p">[</span><span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">],</span> <span class="s">'C'</span><span class="p">:</span> <span class="s">'sum'</span><span class="p">})</span>
<span class="n">B</span> <span class="n">C</span>
<span class="nb">min</span> <span class="nb">max</span> <span class="nb">sum</span>
<span class="n">A</span>
<span class="mi">1</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mf">0.590716</span>
<span class="mi">2</span> <span class="mi">3</span> <span class="mi">4</span> <span class="mf">0.704907</span>
</code></pre></div></div>
<p>The columns are now arranged on two levels; this kind of hierarchical index is called a <a href="https://pandas.pydata.org/docs/user_guide/advanced.html">MultiIndex</a>. You can think of <code class="language-plaintext highlighter-rouge">MultiIndex</code> as an array of tuples where each tuple is unique. Applied to the aggregated result:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">df['B']</code> returns a DataFrame with the two columns under <code class="language-plaintext highlighter-rouge">B</code></li>
<li><code class="language-plaintext highlighter-rouge">df[('B', 'max')]</code> returns the specific <code class="language-plaintext highlighter-rouge">B</code> <code class="language-plaintext highlighter-rouge">max</code> column</li>
<li>To flatten the index into one level, use <a href="https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.to_flat_index.html"><code class="language-plaintext highlighter-rouge">to_flat_index</code></a></li>
</ul>
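<p>A minimal sketch of these three operations follows. The name <code class="language-plaintext highlighter-rouge">result</code>, the fixed values in column <code class="language-plaintext highlighter-rouge">C</code>, and the underscore-joined column names are choices made here for illustration.</p>

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [10., 20., 30., 40.]})  # fixed values instead of random ones

# The aggregated result has two-level (MultiIndex) columns
result = df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})

both_b = result['B']           # DataFrame with columns 'min' and 'max'
b_max = result[('B', 'max')]   # Series for the single (B, max) column

# Flatten the two levels into plain strings such as 'B_min'
flat = result.copy()
flat.columns = ['_'.join(col) for col in flat.columns.to_flat_index()]

print(flat)
```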
<h4 id="422-custom">4.2.2 Custom</h4>
<p>In addition to using the default aggregations, we can also create our own aggregation functions and call them using <code class="language-plaintext highlighter-rouge">agg()</code>. The function should take a Series (or list) and return a single value. Let’s look at the example of the <a href="https://en.wikipedia.org/wiki/Pythagorean_means">Pythagorean means</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="c1"># Calculate the harmonic mean
</span><span class="k">def</span> <span class="nf">harmonic_mean</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">/</span> <span class="nb">sum</span><span class="p">([</span><span class="mi">1</span><span class="o">/</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">s</span><span class="p">])</span>
<span class="c1"># Calculate the geometric mean
</span><span class="k">def</span> <span class="nf">geometric_mean</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">s</span><span class="p">).</span><span class="n">mean</span><span class="p">())</span>
<span class="c1"># Built-in aggregations are passed as strings; custom functions are passed as function objects
</span><span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'A'</span><span class="p">)[</span><span class="s">'B'</span><span class="p">].</span><span class="n">agg</span><span class="p">([</span><span class="s">'mean'</span><span class="p">,</span> <span class="n">harmonic_mean</span><span class="p">,</span> <span class="n">geometric_mean</span><span class="p">])</span>
</code></pre></div></div>
<h4 id="423-transform">4.2.3 Transform</h4>
<p>The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transform.html"><code class="language-plaintext highlighter-rouge">transform()</code></a> method can save you a lot of time when you want group-level computations while keeping the shape of the original Series/DataFrame.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'A'</span> <span class="p">:</span> <span class="p">[</span><span class="s">'foo'</span><span class="p">,</span> <span class="s">'bar'</span><span class="p">,</span> <span class="s">'foo'</span><span class="p">,</span> <span class="s">'bar'</span><span class="p">,</span>
<span class="p">...</span> <span class="s">'foo'</span><span class="p">,</span> <span class="s">'bar'</span><span class="p">],</span>
<span class="p">...</span> <span class="s">'B'</span> <span class="p">:</span> <span class="p">[</span><span class="s">'one'</span><span class="p">,</span> <span class="s">'one'</span><span class="p">,</span> <span class="s">'two'</span><span class="p">,</span> <span class="s">'three'</span><span class="p">,</span>
<span class="p">...</span> <span class="s">'two'</span><span class="p">,</span> <span class="s">'two'</span><span class="p">],</span>
<span class="p">...</span> <span class="s">'C'</span> <span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span>
<span class="p">...</span> <span class="s">'D'</span> <span class="p">:</span> <span class="p">[</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">5.</span><span class="p">,</span> <span class="mf">8.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">9.</span><span class="p">]})</span>
<span class="o">>>></span> <span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'A'</span><span class="p">)[[</span><span class="s">'C'</span><span class="p">,</span> <span class="s">'D'</span><span class="p">]]</span>
<span class="o">>>></span> <span class="n">grouped</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">x</span><span class="p">.</span><span class="n">mean</span><span class="p">())</span> <span class="o">/</span> <span class="n">x</span><span class="p">.</span><span class="n">std</span><span class="p">())</span>
<span class="n">C</span> <span class="n">D</span>
<span class="mi">0</span> <span class="o">-</span><span class="mf">1.154701</span> <span class="o">-</span><span class="mf">0.577350</span>
<span class="mi">1</span> <span class="mf">0.577350</span> <span class="mf">0.000000</span>
<span class="mi">2</span> <span class="mf">0.577350</span> <span class="mf">1.154701</span>
<span class="mi">3</span> <span class="o">-</span><span class="mf">1.154701</span> <span class="o">-</span><span class="mf">1.000000</span>
<span class="mi">4</span> <span class="mf">0.577350</span> <span class="o">-</span><span class="mf">0.577350</span>
<span class="mi">5</span> <span class="mf">0.577350</span> <span class="mf">1.000000</span>
</code></pre></div></div>
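<p>In the example above, the <code class="language-plaintext highlighter-rouge">transform()</code> method returns a result with the same index as the original DataFrame: each value is standardized using the mean and standard deviation of its own group. If you pass an aggregation such as <code class="language-plaintext highlighter-rouge">'mean'</code> instead, the aggregated value of each group is repeated for every row of that group.</p>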
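<p>To contrast <code class="language-plaintext highlighter-rouge">agg()</code> and <code class="language-plaintext highlighter-rouge">transform()</code> on the same groups, here is a small sketch. The column names <code class="language-plaintext highlighter-rouge">C_total</code> and <code class="language-plaintext highlighter-rouge">C_share</code> are made up for illustration.</p>

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'C': [1, 5, 5, 2, 5, 5]})

# agg() collapses each group to a single row
group_total = df.groupby('A')['C'].agg('sum')

# transform() broadcasts the group total back onto every original row
df['C_total'] = df.groupby('A')['C'].transform('sum')

# which makes row-level calculations against the group value straightforward
df['C_share'] = df['C'] / df['C_total']

print(group_total)
print(df)
```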
<h2 id="5-pivot-data">5. Pivot data</h2>
<p>If you have an Excel or SQL background, you are probably comfortable with pivot tables. How can we recreate that functionality in pandas? Let’s look at several ways to do just that.</p>
<h3 id="51-the-manual-way">5.1 The manual way</h3>
<p>To understand pivoting more deeply, we start with the manual way: combining methods covered in previous sections (<code class="language-plaintext highlighter-rouge">set_index()</code> and <code class="language-plaintext highlighter-rouge">groupby()</code>) with a new method called <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html"><code class="language-plaintext highlighter-rouge">unstack()</code></a> to pivot your data any way you want.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">stacked</span>
<span class="n">first</span> <span class="n">second</span>
<span class="n">bar</span> <span class="n">one</span> <span class="n">A</span> <span class="o">-</span><span class="mf">0.727965</span>
<span class="n">B</span> <span class="o">-</span><span class="mf">0.589346</span>
<span class="n">two</span> <span class="n">A</span> <span class="mf">0.339969</span>
<span class="n">B</span> <span class="o">-</span><span class="mf">0.693205</span>
<span class="n">baz</span> <span class="n">one</span> <span class="n">A</span> <span class="o">-</span><span class="mf">0.339355</span>
<span class="n">B</span> <span class="mf">0.593616</span>
<span class="n">two</span> <span class="n">A</span> <span class="mf">0.884345</span>
<span class="n">B</span> <span class="mf">1.591431</span>
<span class="n">dtype</span><span class="p">:</span> <span class="n">float64</span>
<span class="o">>>></span> <span class="n">stacked</span><span class="p">.</span><span class="n">unstack</span><span class="p">()</span>
<span class="n">A</span> <span class="n">B</span>
<span class="n">first</span> <span class="n">second</span>
<span class="n">bar</span> <span class="n">one</span> <span class="o">-</span><span class="mf">0.727965</span> <span class="o">-</span><span class="mf">0.589346</span>
<span class="n">two</span> <span class="mf">0.339969</span> <span class="o">-</span><span class="mf">0.693205</span>
<span class="n">baz</span> <span class="n">one</span> <span class="o">-</span><span class="mf">0.339355</span> <span class="mf">0.593616</span>
<span class="n">two</span> <span class="mf">0.884345</span> <span class="mf">1.591431</span>
<span class="o">>>></span> <span class="n">stacked</span><span class="p">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">second</span> <span class="n">one</span> <span class="n">two</span>
<span class="n">first</span>
<span class="n">bar</span> <span class="n">A</span> <span class="o">-</span><span class="mf">0.727965</span> <span class="mf">0.339969</span>
<span class="n">B</span> <span class="o">-</span><span class="mf">0.589346</span> <span class="o">-</span><span class="mf">0.693205</span>
<span class="n">baz</span> <span class="n">A</span> <span class="o">-</span><span class="mf">0.339355</span> <span class="mf">0.884345</span>
<span class="n">B</span> <span class="mf">0.593616</span> <span class="mf">1.591431</span>
<span class="o">>>></span> <span class="n">stacked</span><span class="p">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">first</span> <span class="n">bar</span> <span class="n">baz</span>
<span class="n">second</span>
<span class="n">one</span> <span class="n">A</span> <span class="o">-</span><span class="mf">0.727965</span> <span class="o">-</span><span class="mf">0.339355</span>
<span class="n">B</span> <span class="o">-</span><span class="mf">0.589346</span> <span class="mf">0.593616</span>
<span class="n">two</span> <span class="n">A</span> <span class="mf">0.339969</span> <span class="mf">0.884345</span>
<span class="n">B</span> <span class="o">-</span><span class="mf">0.693205</span> <span class="mf">1.591431</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">unstack()</code> takes a specified index level and turns each value in that level into its own column. By default, it unstacks the <strong>last level</strong> (<em>level=-1</em>).</p>
<h4 id="511-with-aggregation">5.1.1 With aggregation</h4>
<p>Typically, the data you want to pivot first needs to be aggregated.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'B'</span><span class="p">])[</span><span class="s">'C'</span><span class="p">].</span><span class="n">agg</span><span class="p">(</span><span class="s">'sum'</span><span class="p">).</span><span class="n">unstack</span><span class="p">(</span><span class="s">'A'</span><span class="p">)</span>
</code></pre></div></div>
<p>Here we first <code class="language-plaintext highlighter-rouge">groupby()</code> the columns that will serve as index levels, then <code class="language-plaintext highlighter-rouge">agg()</code> the value column, and finally <code class="language-plaintext highlighter-rouge">unstack()</code> the index level (or levels) whose categories we want as columns.</p>
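<p>A worked version of this three-step pattern might look like the sketch below. The data is hypothetical, and <code class="language-plaintext highlighter-rouge">'sum'</code> is just one possible aggregation.</p>

```python
import pandas as pd

# Hypothetical data: the (foo, one) combination appears twice
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'foo'],
                   'B': ['one', 'two', 'one', 'two', 'one'],
                   'C': [1, 2, 3, 4, 5]})

# 1. group by the future index/column keys, 2. aggregate the values,
# 3. move level 'A' from the row index into the columns
pivoted = df.groupby(['A', 'B'])['C'].agg('sum').unstack('A')

print(pivoted)
```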
<h4 id="512-without-aggregation">5.1.2 Without aggregation</h4>
<p>What if we already have aggregated data like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_agg</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'B'</span><span class="p">])[</span><span class="s">'C'</span><span class="p">].</span><span class="n">agg</span><span class="p">(</span><span class="s">'sum'</span><span class="p">).</span><span class="n">reset_index</span><span class="p">()</span>
</code></pre></div></div>
<p>Technically, we could do the same as above using <code class="language-plaintext highlighter-rouge">groupby()</code>: because every row already represents a unique combination of <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>, aggregating again will not change the values. There’s a better way using <code class="language-plaintext highlighter-rouge">set_index()</code>, though.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_agg</span><span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'B'</span><span class="p">])[</span><span class="s">'C'</span><span class="p">].</span><span class="n">unstack</span><span class="p">(</span><span class="s">'A'</span><span class="p">)</span>
</code></pre></div></div>
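<p>The sketch below assumes <code class="language-plaintext highlighter-rouge">df_agg</code> is a DataFrame with plain columns <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code> and <code class="language-plaintext highlighter-rouge">C</code> (the values are made up). It compares the <code class="language-plaintext highlighter-rouge">set_index()</code> route with re-running <code class="language-plaintext highlighter-rouge">groupby()</code>.</p>

```python
import pandas as pd

# Hypothetical already-aggregated data: one row per (A, B) combination
df_agg = pd.DataFrame({'A': ['bar', 'bar', 'foo', 'foo'],
                       'B': ['one', 'two', 'one', 'two'],
                       'C': [3, 4, 6, 2]})

# No aggregation needed: move A and B into the index, then unstack A
wide = df_agg.set_index(['A', 'B'])['C'].unstack('A')

# Aggregating again yields the same table, at the cost of extra work
again = df_agg.groupby(['A', 'B'])['C'].agg('sum').unstack('A')

print(wide)
```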
<h3 id="52-use-pivot">5.2 Use <code class="language-plaintext highlighter-rouge">pivot()</code></h3>
<p>The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html"><code class="language-plaintext highlighter-rouge">pivot()</code></a> method is a one-shot way of doing what we did in the previous section with <code class="language-plaintext highlighter-rouge">set_index()</code> — pivoting an already aggregated DataFrame.</p>
<p>Note: a <code class="language-plaintext highlighter-rouge">Series</code> object has no <code class="language-plaintext highlighter-rouge">pivot</code> attribute, and the result of aggregating a <code class="language-plaintext highlighter-rouge">groupby</code> object can be a <code class="language-plaintext highlighter-rouge">Series</code>. To convert it to a DataFrame, use <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html"><code class="language-plaintext highlighter-rouge">to_frame</code></a> directly, or use <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.reset_index.html"><code class="language-plaintext highlighter-rouge">reset_index()</code></a> to turn the index levels into columns.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_agg</span><span class="p">.</span><span class="n">reset_index</span><span class="p">().</span><span class="n">pivot</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'B'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'A'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'C'</span><span class="p">)</span>
</code></pre></div></div>
<p>Here’s a more detailed example from the official documentation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df</span>
<span class="n">lev1</span> <span class="n">lev2</span> <span class="n">lev3</span> <span class="n">lev4</span> <span class="n">values</span>
<span class="mi">0</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mi">0</span>
<span class="mi">1</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">2</span> <span class="mi">1</span>
<span class="mi">2</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">1</span> <span class="mi">3</span> <span class="mi">2</span>
<span class="mi">3</span> <span class="mi">2</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">4</span> <span class="mi">3</span>
<span class="mi">4</span> <span class="mi">2</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mi">5</span> <span class="mi">4</span>
<span class="mi">5</span> <span class="mi">2</span> <span class="mi">2</span> <span class="mi">2</span> <span class="mi">6</span> <span class="mi">5</span>
<span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">"lev1"</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"lev2"</span><span class="p">,</span> <span class="s">"lev3"</span><span class="p">],</span> <span class="n">values</span><span class="o">=</span><span class="s">"values"</span><span class="p">)</span>
<span class="n">lev2</span> <span class="mi">1</span> <span class="mi">2</span>
<span class="n">lev3</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">1</span> <span class="mi">2</span>
<span class="n">lev1</span>
<span class="mi">1</span> <span class="mf">0.0</span> <span class="mf">1.0</span> <span class="mf">2.0</span> <span class="n">NaN</span>
<span class="mi">2</span> <span class="mf">4.0</span> <span class="mf">3.0</span> <span class="n">NaN</span> <span class="mf">5.0</span>
<span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"lev1"</span><span class="p">,</span> <span class="s">"lev2"</span><span class="p">],</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"lev3"</span><span class="p">],</span> <span class="n">values</span><span class="o">=</span><span class="s">"values"</span><span class="p">)</span>
<span class="n">lev3</span> <span class="mi">1</span> <span class="mi">2</span>
<span class="n">lev1</span> <span class="n">lev2</span>
<span class="mi">1</span> <span class="mi">1</span> <span class="mf">0.0</span> <span class="mf">1.0</span>
<span class="mi">2</span> <span class="mf">2.0</span> <span class="n">NaN</span>
<span class="mi">2</span> <span class="mi">1</span> <span class="mf">4.0</span> <span class="mf">3.0</span>
<span class="mi">2</span> <span class="n">NaN</span> <span class="mf">5.0</span>
</code></pre></div></div>
<p>Notes:</p>
<ul>
<li>A <code class="language-plaintext highlighter-rouge">ValueError</code> is raised if there are any duplicates <em>(index, columns combinations with multiple values)</em>. So you’d better aggregate your data before using <code class="language-plaintext highlighter-rouge">pivot()</code>.</li>
<li>The <code class="language-plaintext highlighter-rouge">values</code> parameter is optional. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.</li>
</ul>
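<p>The first note can be reproduced with a small sketch. The data is hypothetical, and aggregating with <code class="language-plaintext highlighter-rouge">sum()</code> is only one way to remove the duplicates.</p>

```python
import pandas as pd

# Hypothetical data: the (B='one', A='foo') pair appears twice
df = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two'],
                   'C': [1, 2, 3]})

# pivot() cannot decide which of the duplicate values to keep
try:
    df.pivot(index='B', columns='A', values='C')
    raised = False
except ValueError:
    raised = True

# Aggregating first leaves one row per (A, B) pair, so pivot() succeeds
deduped = df.groupby(['A', 'B'], as_index=False)['C'].sum()
wide = deduped.pivot(index='B', columns='A', values='C')

print(raised)
print(wide)
```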
<h3 id="53-use-pandaspivot_table">5.3 Use <code class="language-plaintext highlighter-rouge">pandas.pivot_table()</code></h3>
<p>The <a href="https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html"><code class="language-plaintext highlighter-rouge">pandas.pivot_table()</code></a> function creates a spreadsheet-style pivot table as a DataFrame. If <code class="language-plaintext highlighter-rouge">pivot()</code> is the equivalent of the “without aggregation” approach, <code class="language-plaintext highlighter-rouge">df_agg.set_index(['A', 'B'])['C'].unstack('A')</code>, then <code class="language-plaintext highlighter-rouge">pivot_table()</code> is the equivalent of the “with aggregation” approach, <code class="language-plaintext highlighter-rouge">df.groupby(['A', 'B'])['C'].agg('sum').unstack('A')</code>. In other words, when the data still needs to be aggregated, <code class="language-plaintext highlighter-rouge">pivot_table()</code> is the better choice: it aggregates and pivots in one step, whereas <code class="language-plaintext highlighter-rouge">pivot()</code> raises a <code class="language-plaintext highlighter-rouge">ValueError</code> on duplicate index/column pairs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'C'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">'B'</span><span class="p">],</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'A'</span><span class="p">],</span> <span class="n">aggfunc</span><span class="o">=</span><span class="s">'agg'</span><span class="p">)</span>
</code></pre></div></div>
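<p>As a rough check of the equivalence described above, the sketch below builds the same table both ways. The DataFrame and the choice of <code class="language-plaintext highlighter-rouge">'mean'</code> as the aggregation are assumptions made for illustration.</p>

```python
import pandas as pd

# Hypothetical data: the pair (A="x", B="p") occurs twice.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "B": ["p", "p", "p", "q"],
    "C": [1.0, 3.0, 5.0, 7.0],
})

# "With aggregation" written out by hand: group, aggregate, unstack.
via_groupby = df.groupby(["A", "B"])["C"].mean().unstack("A")

# The same reshaping expressed with pivot_table().
via_pivot_table = pd.pivot_table(
    df, values="C", index=["B"], columns=["A"], aggfunc="mean"
)

print(via_groupby)
print(via_pivot_table)
```

<p>Both tables should have <code class="language-plaintext highlighter-rouge">B</code> as the index and the values of <code class="language-plaintext highlighter-rouge">A</code> as columns, with <code class="language-plaintext highlighter-rouge">NaN</code> where a pair never occurs.</p>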
<p>The difference between <code class="language-plaintext highlighter-rouge">pivot()</code> and <code class="language-plaintext highlighter-rouge">pivot_table()</code>:</p>
<ul>
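<li><code class="language-plaintext highlighter-rouge">pivot()</code> is used here as a method of the DataFrame class, while <code class="language-plaintext highlighter-rouge">pivot_table()</code> is used as a top-level function of the pandas library, which means we need to pass a DataFrame to <code class="language-plaintext highlighter-rouge">pd.pivot_table()</code>. (pandas also provides <code class="language-plaintext highlighter-rouge">pd.pivot()</code> and <code class="language-plaintext highlighter-rouge">DataFrame.pivot_table()</code>.)</li>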
<li><code class="language-plaintext highlighter-rouge">pivot_table()</code> is a generalization of <code class="language-plaintext highlighter-rouge">pivot()</code> that can handle duplicate values for one <strong>pivoted</strong> index/column pair.</li>
<li><code class="language-plaintext highlighter-rouge">pivot_table()</code> accepts the three parameters we used for the <code class="language-plaintext highlighter-rouge">pivot()</code> method (<code class="language-plaintext highlighter-rouge">index</code>, <code class="language-plaintext highlighter-rouge">columns</code>, and <code class="language-plaintext highlighter-rouge">values</code>), and additionally provides other parameters such as <code class="language-plaintext highlighter-rouge">aggfunc</code> (which defaults to the mean).</li>
<li><code class="language-plaintext highlighter-rouge">pivot_table()</code> also supports using multiple columns for the index and column of the <strong>pivoted</strong> table to generate a hierarchical index.</li>
</ul>
<p>Function examples:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df</span>
<span class="n">A</span> <span class="n">B</span> <span class="n">C</span> <span class="n">D</span> <span class="n">E</span>
<span class="mi">0</span> <span class="n">foo</span> <span class="n">one</span> <span class="n">small</span> <span class="mi">1</span> <span class="mi">2</span>
<span class="mi">1</span> <span class="n">foo</span> <span class="n">one</span> <span class="n">large</span> <span class="mi">2</span> <span class="mi">4</span>
<span class="mi">2</span> <span class="n">foo</span> <span class="n">one</span> <span class="n">large</span> <span class="mi">2</span> <span class="mi">5</span>
<span class="mi">3</span> <span class="n">foo</span> <span class="n">two</span> <span class="n">small</span> <span class="mi">3</span> <span class="mi">5</span>
<span class="mi">4</span> <span class="n">foo</span> <span class="n">two</span> <span class="n">small</span> <span class="mi">3</span> <span class="mi">6</span>
<span class="mi">5</span> <span class="n">bar</span> <span class="n">one</span> <span class="n">large</span> <span class="mi">4</span> <span class="mi">6</span>
<span class="mi">6</span> <span class="n">bar</span> <span class="n">one</span> <span class="n">small</span> <span class="mi">5</span> <span class="mi">8</span>
<span class="mi">7</span> <span class="n">bar</span> <span class="n">two</span> <span class="n">small</span> <span class="mi">6</span> <span class="mi">9</span>
<span class="mi">8</span> <span class="n">bar</span> <span class="n">two</span> <span class="n">large</span> <span class="mi">7</span> <span class="mi">9</span>
<span class="o">>>></span> <span class="n">table</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'D'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'B'</span><span class="p">],</span>
<span class="p">...</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'C'</span><span class="p">],</span> <span class="n">aggfunc</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">table</span>
<span class="n">C</span> <span class="n">large</span> <span class="n">small</span>
<span class="n">A</span> <span class="n">B</span>
<span class="n">bar</span> <span class="n">one</span> <span class="mi">4</span> <span class="mi">5</span>
<span class="n">two</span> <span class="mi">7</span> <span class="mi">6</span>
<span class="n">foo</span> <span class="n">one</span> <span class="mi">4</span> <span class="mi">1</span>
<span class="n">two</span> <span class="mi">0</span> <span class="mi">6</span>
<span class="o">>>></span> <span class="n">table</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">'D'</span><span class="p">,</span> <span class="s">'E'</span><span class="p">],</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'C'</span><span class="p">],</span>
<span class="p">...</span> <span class="n">aggfunc</span><span class="o">=</span><span class="p">{</span><span class="s">'D'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">,</span>
<span class="p">...</span> <span class="s">'E'</span><span class="p">:</span> <span class="p">[</span><span class="nb">min</span><span class="p">,</span> <span class="nb">max</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">]})</span>
<span class="o">>>></span> <span class="n">table</span>
<span class="n">D</span> <span class="n">E</span>
<span class="n">mean</span> <span class="nb">max</span> <span class="n">mean</span> <span class="nb">min</span>
<span class="n">A</span> <span class="n">C</span>
<span class="n">bar</span> <span class="n">large</span> <span class="mf">5.500000</span> <span class="mf">9.0</span> <span class="mf">7.500000</span> <span class="mf">6.0</span>
<span class="n">small</span> <span class="mf">5.500000</span> <span class="mf">9.0</span> <span class="mf">8.500000</span> <span class="mf">8.0</span>
<span class="n">foo</span> <span class="n">large</span> <span class="mf">2.000000</span> <span class="mf">5.0</span> <span class="mf">4.500000</span> <span class="mf">4.0</span>
<span class="n">small</span> <span class="mf">2.333333</span> <span class="mf">6.0</span> <span class="mf">4.333333</span> <span class="mf">2.0</span>
</code></pre></div></div>
<h3 id="54-use-pandascrosstab">5.4 Use <code class="language-plaintext highlighter-rouge">pandas.crosstab()</code></h3>
<p>The <a href="https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html"><code class="language-plaintext highlighter-rouge">pandas.crosstab()</code></a> function is another top-level pandas function, similar to <code class="language-plaintext highlighter-rouge">pivot_table()</code>. However, with <code class="language-plaintext highlighter-rouge">crosstab()</code> we do not pass a DataFrame. Instead, we pass the actual columns of data to <code class="language-plaintext highlighter-rouge">index</code>, <code class="language-plaintext highlighter-rouge">columns</code>, and <code class="language-plaintext highlighter-rouge">values</code>. The <code class="language-plaintext highlighter-rouge">values</code> argument is optional: by default, <code class="language-plaintext highlighter-rouge">crosstab()</code> computes a frequency table of the factors. To perform another aggregation, both an array of values and an aggregation function are required.</p>
<p>Let’s reproduce pivoting data again:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">crosstab</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'B'</span><span class="p">],</span> <span class="n">columns</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'A'</span><span class="p">],</span> <span class="n">values</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'C'</span><span class="p">],</span> <span class="n">aggfunc</span><span class="o">=</span><span class="s">'agg'</span><span class="p">)</span>
</code></pre></div></div>
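<p>A minimal sketch of the two modes, using a hypothetical DataFrame and <code class="language-plaintext highlighter-rouge">'mean'</code> as an assumed aggregation: without <code class="language-plaintext highlighter-rouge">values</code> we get counts, and with <code class="language-plaintext highlighter-rouge">values</code> plus <code class="language-plaintext highlighter-rouge">aggfunc</code> we get an aggregated table.</p>

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "B": ["p", "p", "p", "q"],
    "C": [1.0, 3.0, 5.0, 7.0],
})

# Default behaviour: a frequency table of the (B, A) combinations.
counts = pd.crosstab(index=df["B"], columns=df["A"])

# With values and aggfunc: aggregate C for each (B, A) combination.
means = pd.crosstab(
    index=df["B"], columns=df["A"], values=df["C"], aggfunc="mean"
)

print(counts)
print(means)
```

<p>Note that a combination that never occurs shows up as <code class="language-plaintext highlighter-rouge">0</code> in the frequency table but as <code class="language-plaintext highlighter-rouge">NaN</code> in the aggregated one.</p>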
<p>Function example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="s">"foo"</span><span class="p">,</span> <span class="s">"foo"</span><span class="p">,</span> <span class="s">"foo"</span><span class="p">,</span> <span class="s">"foo"</span><span class="p">,</span> <span class="s">"bar"</span><span class="p">,</span> <span class="s">"bar"</span><span class="p">,</span>
<span class="p">...</span> <span class="s">"bar"</span><span class="p">,</span> <span class="s">"bar"</span><span class="p">,</span> <span class="s">"foo"</span><span class="p">,</span> <span class="s">"foo"</span><span class="p">,</span> <span class="s">"foo"</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">object</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="s">"one"</span><span class="p">,</span> <span class="s">"one"</span><span class="p">,</span> <span class="s">"one"</span><span class="p">,</span> <span class="s">"two"</span><span class="p">,</span> <span class="s">"one"</span><span class="p">,</span> <span class="s">"one"</span><span class="p">,</span>
<span class="p">...</span> <span class="s">"one"</span><span class="p">,</span> <span class="s">"two"</span><span class="p">,</span> <span class="s">"two"</span><span class="p">,</span> <span class="s">"two"</span><span class="p">,</span> <span class="s">"one"</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">object</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="s">"dull"</span><span class="p">,</span> <span class="s">"dull"</span><span class="p">,</span> <span class="s">"shiny"</span><span class="p">,</span> <span class="s">"dull"</span><span class="p">,</span> <span class="s">"dull"</span><span class="p">,</span> <span class="s">"shiny"</span><span class="p">,</span>
<span class="p">...</span> <span class="s">"shiny"</span><span class="p">,</span> <span class="s">"dull"</span><span class="p">,</span> <span class="s">"shiny"</span><span class="p">,</span> <span class="s">"shiny"</span><span class="p">,</span> <span class="s">"shiny"</span><span class="p">],</span>
<span class="p">...</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">object</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">pd</span><span class="p">.</span><span class="n">crosstab</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="p">[</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">],</span> <span class="n">rownames</span><span class="o">=</span><span class="p">[</span><span class="s">'a'</span><span class="p">],</span> <span class="n">colnames</span><span class="o">=</span><span class="p">[</span><span class="s">'b'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">])</span>
<span class="n">b</span> <span class="n">one</span> <span class="n">two</span>
<span class="n">c</span> <span class="n">dull</span> <span class="n">shiny</span> <span class="n">dull</span> <span class="n">shiny</span>
<span class="n">a</span>
<span class="n">bar</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">1</span> <span class="mi">0</span>
<span class="n">foo</span> <span class="mi">2</span> <span class="mi">2</span> <span class="mi">1</span> <span class="mi">2</span>
</code></pre></div></div>
<p>Keep in mind: use <code class="language-plaintext highlighter-rouge">crosstab()</code> when you <em>don’t</em> have a DataFrame. It is a direct route if you’re only interested in the aggregated results; otherwise, <code class="language-plaintext highlighter-rouge">pivot_table()</code> is nicer and cleaner.</p>
<h2 id="6-join-tables">6. Join tables</h2>
<p>In data science, joining two or more tables together based on shared columns or an index is one of the most fundamental operations. It is <code class="language-plaintext highlighter-rouge">JOIN</code> in SQL, and <code class="language-plaintext highlighter-rouge">VLOOKUP</code> or <code class="language-plaintext highlighter-rouge">INDEX-MATCH</code> in Excel. pandas offers several facilities for this, each with its own kind of set logic. We will look at how to combine pandas objects through different methods.</p>
<h3 id="61-use-merge">6.1 Use <code class="language-plaintext highlighter-rouge">merge()</code></h3>
<p>The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html"><code class="language-plaintext highlighter-rouge">merge()</code></a> method merges DataFrame or named Series objects (a named Series is treated as a DataFrame with a single named column) with a database-style join.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df1</span>
<span class="n">lkey</span> <span class="n">value</span>
<span class="mi">0</span> <span class="n">foo</span> <span class="mi">1</span>
<span class="mi">1</span> <span class="n">bar</span> <span class="mi">2</span>
<span class="mi">2</span> <span class="n">baz</span> <span class="mi">3</span>
<span class="mi">3</span> <span class="n">foo</span> <span class="mi">5</span>
<span class="o">>>></span> <span class="n">df2</span>
<span class="n">rkey</span> <span class="n">value</span>
<span class="mi">0</span> <span class="n">foo</span> <span class="mi">5</span>
<span class="mi">1</span> <span class="n">bar</span> <span class="mi">6</span>
<span class="mi">2</span> <span class="n">baz</span> <span class="mi">7</span>
<span class="mi">3</span> <span class="n">foo</span> <span class="mi">8</span>
<span class="o">>>></span> <span class="n">df1</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="s">'lkey'</span><span class="p">,</span> <span class="n">right_on</span><span class="o">=</span><span class="s">'rkey'</span><span class="p">,</span>
<span class="p">...</span> <span class="n">suffixes</span><span class="o">=</span><span class="p">(</span><span class="s">'_left'</span><span class="p">,</span> <span class="s">'_right'</span><span class="p">))</span>
<span class="n">lkey</span> <span class="n">value_left</span> <span class="n">rkey</span> <span class="n">value_right</span>
<span class="mi">0</span> <span class="n">foo</span> <span class="mi">1</span> <span class="n">foo</span> <span class="mi">5</span>
<span class="mi">1</span> <span class="n">foo</span> <span class="mi">1</span> <span class="n">foo</span> <span class="mi">8</span>
<span class="mi">2</span> <span class="n">foo</span> <span class="mi">5</span> <span class="n">foo</span> <span class="mi">5</span>
<span class="mi">3</span> <span class="n">foo</span> <span class="mi">5</span> <span class="n">foo</span> <span class="mi">8</span>
<span class="mi">4</span> <span class="n">bar</span> <span class="mi">2</span> <span class="n">bar</span> <span class="mi">6</span>
<span class="mi">5</span> <span class="n">baz</span> <span class="mi">3</span> <span class="n">baz</span> <span class="mi">7</span>
</code></pre></div></div>
<p>In the example above, we joined the tables on <code class="language-plaintext highlighter-rouge">df1.lkey == df2.rkey</code>. However, the result has two columns with the same information, <code class="language-plaintext highlighter-rouge">lkey</code> and <code class="language-plaintext highlighter-rouge">rkey</code>. With <code class="language-plaintext highlighter-rouge">merge()</code>, we cannot do much to avoid that when the columns we are joining on have different names. A workaround is to use <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html"><code class="language-plaintext highlighter-rouge">drop()</code></a> to remove the duplicate column afterwards.</p>
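<p>A short sketch of that cleanup, rebuilding the <code class="language-plaintext highlighter-rouge">df1</code> and <code class="language-plaintext highlighter-rouge">df2</code> tables shown above and dropping <code class="language-plaintext highlighter-rouge">rkey</code> after the merge:</p>

```python
import pandas as pd

# The two tables from the merge() example above.
df1 = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 5]})
df2 = pd.DataFrame({"rkey": ["foo", "bar", "baz", "foo"], "value": [5, 6, 7, 8]})

merged = df1.merge(
    df2, left_on="lkey", right_on="rkey", suffixes=("_left", "_right")
)

# rkey repeats the information in lkey, so remove it.
merged = merged.drop(columns="rkey")
print(merged)
```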
<h3 id="62-use-join">6.2 Use <code class="language-plaintext highlighter-rouge">join()</code></h3>
<p>The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html"><code class="language-plaintext highlighter-rouge">join()</code></a> method joins the columns of another DataFrame, either on the index or on a key column. We can also pass a list of DataFrame objects to join them by index in one call. <code class="language-plaintext highlighter-rouge">join()</code> works much the same as <code class="language-plaintext highlighter-rouge">merge()</code> but is a more specific and concise version: it is designed to join on the index of the other DataFrame, matched against either the caller’s index or one or more of its columns. This is especially convenient for many-to-one joins, where one of the DataFrames is already indexed by the join key.</p>
<p><code class="language-plaintext highlighter-rouge">left.join(right, on=key_or_keys)</code> is equivalent to <code class="language-plaintext highlighter-rouge">left.merge(right, left_on=key_or_keys, right_index=True, how='left', sort=False)</code></p>
<p>The default for <code class="language-plaintext highlighter-rouge">join()</code> is to perform a left join (essentially a “VLOOKUP” operation for Excel users). To perform another join type, for example an inner join, pass the <code class="language-plaintext highlighter-rouge">how</code> parameter:</p>
<p><code class="language-plaintext highlighter-rouge">left.join(right, on=['key1', 'key2'], how='inner')</code></p>
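<p>The equivalence stated above can be checked with a small sketch. The <code class="language-plaintext highlighter-rouge">left</code> and <code class="language-plaintext highlighter-rouge">right</code> frames below are hypothetical; <code class="language-plaintext highlighter-rouge">right</code> is already indexed by the join key.</p>

```python
import pandas as pd

# Hypothetical data: `right` is indexed by the key, `left` holds it as a column.
left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
})
right = pd.DataFrame({"B": ["B0", "B1"]}, index=["K0", "K1"])

# Left join of left["key"] against right's index.
joined = left.join(right, on="key")

# The same operation spelled out with merge().
merged = left.merge(
    right, left_on="key", right_index=True, how="left", sort=False
)

# An inner join keeps only the keys present in both tables.
inner = left.join(right, on="key", how="inner")

print(joined.equals(merged))
print(inner)
```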
<p>More detailed example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df</span>
<span class="n">key</span> <span class="n">A</span>
<span class="mi">0</span> <span class="n">K0</span> <span class="n">A0</span>
<span class="mi">1</span> <span class="n">K1</span> <span class="n">A1</span>
<span class="mi">2</span> <span class="n">K2</span> <span class="n">A2</span>
<span class="mi">3</span> <span class="n">K3</span> <span class="n">A3</span>
<span class="mi">4</span> <span class="n">K4</span> <span class="n">A4</span>
<span class="mi">5</span> <span class="n">K5</span> <span class="n">A5</span>
<span class="o">>>></span> <span class="n">other</span>
<span class="n">key</span> <span class="n">B</span>
<span class="mi">0</span> <span class="n">K0</span> <span class="n">B0</span>
<span class="mi">1</span> <span class="n">K1</span> <span class="n">B1</span>
<span class="mi">2</span> <span class="n">K2</span> <span class="n">B2</span>
<span class="c1"># Join using indexes
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="n">lsuffix</span><span class="o">=</span><span class="s">'_caller'</span><span class="p">,</span> <span class="n">rsuffix</span><span class="o">=</span><span class="s">'_other'</span><span class="p">)</span>
<span class="n">key_caller</span> <span class="n">A</span> <span class="n">key_other</span> <span class="n">B</span>
<span class="mi">0</span> <span class="n">K0</span> <span class="n">A0</span> <span class="n">K0</span> <span class="n">B0</span>
<span class="mi">1</span> <span class="n">K1</span> <span class="n">A1</span> <span class="n">K1</span> <span class="n">B1</span>
<span class="mi">2</span> <span class="n">K2</span> <span class="n">A2</span> <span class="n">K2</span> <span class="n">B2</span>
<span class="mi">3</span> <span class="n">K3</span> <span class="n">A3</span> <span class="n">NaN</span> <span class="n">NaN</span>
<span class="mi">4</span> <span class="n">K4</span> <span class="n">A4</span> <span class="n">NaN</span> <span class="n">NaN</span>
<span class="mi">5</span> <span class="n">K5</span> <span class="n">A5</span> <span class="n">NaN</span> <span class="n">NaN</span>
<span class="c1"># Join using the key columns, by setting both key columns to be the index
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'key'</span><span class="p">).</span><span class="n">join</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'key'</span><span class="p">))</span>
<span class="n">A</span> <span class="n">B</span>
<span class="n">key</span>
<span class="n">K0</span> <span class="n">A0</span> <span class="n">B0</span>
<span class="n">K1</span> <span class="n">A1</span> <span class="n">B1</span>
<span class="n">K2</span> <span class="n">A2</span> <span class="n">B2</span>
<span class="n">K3</span> <span class="n">A3</span> <span class="n">NaN</span>
<span class="n">K4</span> <span class="n">A4</span> <span class="n">NaN</span>
<span class="n">K5</span> <span class="n">A5</span> <span class="n">NaN</span>
<span class="c1"># Join using the key columns, by the `on` parameter. The original `df` index preserved.
</span><span class="o">>>></span> <span class="n">df</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'key'</span><span class="p">),</span> <span class="n">on</span><span class="o">=</span><span class="s">'key'</span><span class="p">)</span>
<span class="n">key</span> <span class="n">A</span> <span class="n">B</span>
<span class="mi">0</span> <span class="n">K0</span> <span class="n">A0</span> <span class="n">B0</span>
<span class="mi">1</span> <span class="n">K1</span> <span class="n">A1</span> <span class="n">B1</span>
<span class="mi">2</span> <span class="n">K2</span> <span class="n">A2</span> <span class="n">B2</span>
<span class="mi">3</span> <span class="n">K3</span> <span class="n">A3</span> <span class="n">NaN</span>
<span class="mi">4</span> <span class="n">K4</span> <span class="n">A4</span> <span class="n">NaN</span>
<span class="mi">5</span> <span class="n">K5</span> <span class="n">A5</span> <span class="n">NaN</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">join()</code> always uses <em>other</em>’s index, but we can use any column of <em>df</em> through the <code class="language-plaintext highlighter-rouge">on</code> parameter. Suffix handling also differs slightly: when both DataFrames have columns with the same name, <code class="language-plaintext highlighter-rouge">merge()</code> takes a single <code class="language-plaintext highlighter-rouge">suffixes</code> parameter, whereas <code class="language-plaintext highlighter-rouge">join()</code> takes separate <code class="language-plaintext highlighter-rouge">lsuffix</code> and <code class="language-plaintext highlighter-rouge">rsuffix</code> parameters.</p>
<h3 id="63-use-pandasconcat">6.3 Use <code class="language-plaintext highlighter-rouge">pandas.concat()</code></h3>
<p>The <a href="https://pandas.pydata.org/docs/reference/api/pandas.concat.html"><code class="language-plaintext highlighter-rouge">pandas.concat()</code></a> function concatenates pandas objects along a particular axis, with optional set logic (union or intersection) along the other axes. Primarily, we use <code class="language-plaintext highlighter-rouge">concat()</code> to stack two DataFrames together. By default the concatenation is vertical; to combine DataFrames horizontally (side by side), pass <code class="language-plaintext highlighter-rouge">axis=1</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">s1</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">([</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">s2</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">([</span><span class="s">'c'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">])</span>
<span class="mi">0</span> <span class="n">a</span>
<span class="mi">1</span> <span class="n">b</span>
<span class="mi">0</span> <span class="n">c</span>
<span class="mi">1</span> <span class="n">d</span>
<span class="n">dtype</span><span class="p">:</span> <span class="nb">object</span>
<span class="c1"># Add a hierarchical index with the `keys` option and label the index levels with `names`
</span><span class="o">>>></span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">],</span> <span class="n">keys</span><span class="o">=</span><span class="p">[</span><span class="s">'s1'</span><span class="p">,</span> <span class="s">'s2'</span><span class="p">],</span>
<span class="p">...</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'Series name'</span><span class="p">,</span> <span class="s">'Row ID'</span><span class="p">])</span>
<span class="n">Series</span> <span class="n">name</span> <span class="n">Row</span> <span class="n">ID</span>
<span class="n">s1</span> <span class="mi">0</span> <span class="n">a</span>
<span class="mi">1</span> <span class="n">b</span>
<span class="n">s2</span> <span class="mi">0</span> <span class="n">c</span>
<span class="mi">1</span> <span class="n">d</span>
<span class="n">dtype</span><span class="p">:</span> <span class="nb">object</span>
</code></pre></div></div>
<p>Use the <code class="language-plaintext highlighter-rouge">keys</code> and <code class="language-plaintext highlighter-rouge">names</code> parameters if you want to keep track of which DataFrame each row originally came from. Append <code class="language-plaintext highlighter-rouge">.reset_index()</code> to the <code class="language-plaintext highlighter-rouge">pd.concat()</code> call to convert the hierarchical index levels into regular columns.</p>
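<p>A minimal sketch of both ideas with two hypothetical DataFrames: labelling the source of each row with <code class="language-plaintext highlighter-rouge">keys</code> and <code class="language-plaintext highlighter-rouge">names</code>, flattening the result with <code class="language-plaintext highlighter-rouge">.reset_index()</code>, and concatenating side by side with <code class="language-plaintext highlighter-rouge">axis=1</code>. The level names <code class="language-plaintext highlighter-rouge">source</code> and <code class="language-plaintext highlighter-rouge">row</code> are arbitrary.</p>

```python
import pandas as pd

# Two hypothetical DataFrames with the same column.
df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

# Vertical stack; the outer index level records where each row came from.
stacked = pd.concat([df_a, df_b], keys=["a", "b"], names=["source", "row"])

# Turn the two index levels into ordinary columns.
flat = stacked.reset_index()

# Horizontal concatenation: rows are aligned on the index.
side_by_side = pd.concat([df_a, df_b], axis=1)

print(flat)
print(side_by_side)
```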
<h3 id="64-use-append">6.4 Use <code class="language-plaintext highlighter-rouge">append()</code></h3>
<p>The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html"><code class="language-plaintext highlighter-rouge">append()</code></a> method appends the rows of <em>other</em> to the end of the caller. Just as <code class="language-plaintext highlighter-rouge">join()</code> is a more specific version of <code class="language-plaintext highlighter-rouge">merge()</code> for joining tables, <code class="language-plaintext highlighter-rouge">append()</code> is a shortcut to <code class="language-plaintext highlighter-rouge">concat()</code> for streamlined appending. It has an <code class="language-plaintext highlighter-rouge">ignore_index</code> parameter but no <code class="language-plaintext highlighter-rouge">join</code> parameter: <code class="language-plaintext highlighter-rouge">append()</code> always does an outer join. Note that <code class="language-plaintext highlighter-rouge">DataFrame.append()</code> was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions use <code class="language-plaintext highlighter-rouge">pd.concat()</code> instead.</p>
<p>When we need to repeatedly add one row at a time to build a DataFrame, there are two ways (though neither is recommended):</p>
<ul>
<li>Less efficient:</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'A'</span><span class="p">])</span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="p">...</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">'A'</span><span class="p">:</span> <span class="n">i</span><span class="p">},</span> <span class="n">ignore_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">df</span>
<span class="n">A</span>
<span class="mi">0</span> <span class="mi">0</span>
<span class="mi">1</span> <span class="mi">1</span>
<span class="mi">2</span> <span class="mi">2</span>
<span class="mi">3</span> <span class="mi">3</span>
<span class="mi">4</span> <span class="mi">4</span>
</code></pre></div></div>
<ul>
<li>More efficient:</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span><span class="n">i</span><span class="p">],</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'A'</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">)],</span>
<span class="p">...</span> <span class="n">ignore_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">A</span>
<span class="mi">0</span> <span class="mi">0</span>
<span class="mi">1</span> <span class="mi">1</span>
<span class="mi">2</span> <span class="mi">2</span>
<span class="mi">3</span> <span class="mi">3</span>
<span class="mi">4</span> <span class="mi">4</span>
</code></pre></div></div>
<p>Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy and may be expensive. Instead of building a DataFrame by iteratively appending records to it, the recommended approach is to pass a pre-built list of records to the DataFrame constructor.</p>
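<p>A minimal sketch of the recommended pattern, producing the same single-column table as the two loops above: collect the rows in a plain Python list first, then build the DataFrame once.</p>

```python
import pandas as pd

# Collect the rows as plain dictionaries; no DataFrame is copied in the loop.
records = [{"A": i} for i in range(5)]

# Build the DataFrame in a single step from the finished list.
df = pd.DataFrame(records)
print(df)
```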
<h3 id="65-differences-between-join-methods">6.5 Differences between <code class="language-plaintext highlighter-rouge">JOIN</code> methods</h3>
<ul>
<li><code class="language-plaintext highlighter-rouge">merge()</code> is used to combine DataFrames on the common columns or indexes. <code class="language-plaintext highlighter-rouge">join()</code> is used to join tables on the indexes. <code class="language-plaintext highlighter-rouge">concat()</code> is used to concatenate pandas objects aligned by indexes depending on the <code class="language-plaintext highlighter-rouge">axis</code> option.</li>
<li>By default, <code class="language-plaintext highlighter-rouge">merge()</code> performs an inner join, <code class="language-plaintext highlighter-rouge">join()</code> performs a left join, and <code class="language-plaintext highlighter-rouge">concat()</code> performs an outer join.</li>
<li><code class="language-plaintext highlighter-rouge">merge()</code> exists both as a top-level pandas function and a DataFrame method, <code class="language-plaintext highlighter-rouge">join()</code> is a DataFrame method, <code class="language-plaintext highlighter-rouge">concat()</code> is a top-level pandas function.</li>
<li><code class="language-plaintext highlighter-rouge">merge()</code> and <code class="language-plaintext highlighter-rouge">join()</code> handle duplicates on the joining index (or columns) by performing a Cartesian product. When <code class="language-plaintext highlighter-rouge">concat()</code> appends horizontally <em>(<code class="language-plaintext highlighter-rouge">axis=1</code>)</em> and all DataFrames have the same index and the same number of rows, it combines them row for row, even if the index contains duplicate values. If the index contains duplicates and either of the two conditions is not met, it raises an error.</li>
</ul>
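<p>The default join types listed above can be checked on two small frames. This is only a sketch; the <code class="language-plaintext highlighter-rouge">left</code> and <code class="language-plaintext highlighter-rouge">right</code> frames and their column names are invented for the example.</p>

```python
import pandas as pd

# Two made-up frames that share only the keys "b" and "c".
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# merge() defaults to an inner join on the common column.
inner = left.merge(right, on="key")

# join() works on the index and defaults to a left join.
left_join = left.set_index("key").join(right.set_index("key"))

# concat() along axis=1 aligns on the index and defaults to an outer join.
outer = pd.concat([left.set_index("key"), right.set_index("key")], axis=1)

print(inner)
print(left_join)
print(outer)
```

<p>Only the shared keys survive the merge, every left key survives the join (with missing values where the right side has no match), and the concat result contains the union of both indexes.</p>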
<h2 id="7-to-be-added-under-construction">7. To be Added (Under construction)</h2>
<ul>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html"><code class="language-plaintext highlighter-rouge">DataFrame.to_numpy()</code></a> gives a NumPy representation of the underlying data. Note it does not include the index or column labels in the output. NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. So <code class="language-plaintext highlighter-rouge">to_numpy()</code> can be an expensive operation when your DataFrame has columns with different data types.</li>
<li><code class="language-plaintext highlighter-rouge">melt()</code></li>
<li><code class="language-plaintext highlighter-rouge">shift()</code></li>
</ul>
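<p>To illustrate the <code class="language-plaintext highlighter-rouge">to_numpy()</code> note, here is a small sketch with two invented frames: one whose columns share a dtype and one that mixes integers with strings.</p>

```python
import pandas as pd

# Made-up frames: one with a single dtype, one with mixed dtypes.
homogeneous = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

same_dtype_array = homogeneous.to_numpy()
mixed_array = mixed.to_numpy()

# The index and column labels are not part of the output, only the values.
print(same_dtype_array.shape, same_dtype_array.dtype)
# With mixed columns, NumPy has to fall back to a common dtype for the whole array.
print(mixed_array.shape, mixed_array.dtype)
```

<p>In the mixed case every value ends up in a generic <code class="language-plaintext highlighter-rouge">object</code> array, which is where the extra cost comes from.</p>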
<p>Operations in general <em>exclude</em> missing data.</p>
<ul>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html"><code class="language-plaintext highlighter-rouge">shift()</code></a> — Shift index by desired number of periods.</li>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html"><code class="language-plaintext highlighter-rouge">apply()</code></a></li>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html"><code class="language-plaintext highlighter-rouge">pandas.Series.map</code></a></li>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html"><code class="language-plaintext highlighter-rouge">applymap()</code></a></li>
<li><code class="language-plaintext highlighter-rouge">where()</code> and <code class="language-plaintext highlighter-rouge">numpy.where()</code></li>
<li><code class="language-plaintext highlighter-rouge">pd.cut()</code></li>
</ul>
<p>Time Series: <code class="language-plaintext highlighter-rouge">to_datetime()</code></p>
<p>Text: <code class="language-plaintext highlighter-rouge">.str</code>, <code class="language-plaintext highlighter-rouge">re</code>, <code class="language-plaintext highlighter-rouge">astype()</code>, <code class="language-plaintext highlighter-rouge">rename()</code></p>
<p>Missing data: <code class="language-plaintext highlighter-rouge">df.isnull().values.any()</code>, <code class="language-plaintext highlighter-rouge">dropna(axis=1, how='any')</code>, <code class="language-plaintext highlighter-rouge">df.fillna(value)</code></p>
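<p>A short sketch of the missing-data calls listed above, using an invented two-column frame. It also shows that an aggregation such as <code class="language-plaintext highlighter-rouge">mean()</code> skips missing values by default.</p>

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Is there any missing value anywhere in the frame?
has_missing = df.isnull().values.any()

# Aggregations exclude missing data by default: mean of 1.0 and 3.0.
mean_a = df["a"].mean()

# Drop every column that contains at least one missing value.
complete_columns = df.dropna(axis=1, how="any")

# Replace missing values with a constant.
filled = df.fillna(0)

print(has_missing, mean_a)
print(complete_columns)
print(filled)
```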
<p>Plot: <code class="language-plaintext highlighter-rouge">value_counts()</code></p>
<h2 id="rewrite-sql">Rewrite SQL</h2>
<p>We will use the <a href="https://ourairports.com/data/">OurAirports Datasets</a>. Examples are from <a href="https://medium.com/@itruong">Irina Truong</a>.</p>
<table>
<thead>
<tr>
<th style="text-align: center">SQL</th>
<th style="text-align: center">Pandas</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">SELECT DISTINCT type FROM airports</td>
<td style="text-align: center">airports.type.unique()</td>
</tr>
<tr>
<td style="text-align: center">SELECT id FROM airports WHERE ident = 'KLAX' LIMIT 3</td>
<td style="text-align: center">airports[airports.ident == 'KLAX'].id.head(3)</td>
</tr>
<tr>
<td style="text-align: center">SELECT ident, name, municipality FROM airports WHERE iso_region = 'US-CA' AND type = 'large_airport'</td>
<td style="text-align: center">airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]</td>
</tr>
<tr>
<td style="text-align: center">SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type DESC</td>
<td style="text-align: center">airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)</td>
</tr>
<tr>
<td style="text-align: center">SELECT * FROM airports WHERE type NOT IN ('heliport', 'balloonport')</td>
<td style="text-align: center">airports[~airports.type.isin(['heliport', 'balloonport'])]</td>
</tr>
<tr>
<td style="text-align: center">SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, COUNT(*) DESC</td>
<td style="text-align: center">airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])</td>
</tr>
<tr>
<td style="text-align: center">SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US' GROUP BY type HAVING COUNT(*) > 1000 ORDER BY COUNT(*) DESC</td>
<td style="text-align: center">airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)</td>
</tr>
<tr>
<td style="text-align: center">SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10 OFFSET 10</td>
<td style="text-align: center">by_country.nlargest(20, columns='airport_count').tail(10)</td>
</tr>
<tr>
<td style="text-align: center">SELECT MAX(length_ft), MIN(length_ft), AVG(length_ft) FROM runways</td>
<td style="text-align: center">runways.agg({'length_ft': ['min', 'max', 'mean']}).T</td>
</tr>
<tr>
<td style="text-align: center">SELECT surface, MIN(length_ft) AS min_length_ft, MAX(width_ft) AS max_width_ft FROM runways GROUP BY surface</td>
<td style="text-align: center">runways.groupby('surface').agg(min_length_ft=('length_ft', 'min'), max_width_ft=('width_ft', 'max'))</td>
</tr>
<tr>
<td style="text-align: center">SELECT airport_ident, type, description, frequency_mhz FROM airport_freq JOIN airports ON airport_freq.airport_ref = airports.id WHERE airports.ident = 'KLAX'</td>
<td style="text-align: center">airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref', right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]</td>
</tr>
<tr>
<td style="text-align: center">SELECT name, municipality FROM airports WHERE ident = 'KLAX' UNION SELECT name, municipality FROM airports WHERE ident = 'KLGB'</td>
<td style="text-align: center">pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']], airports[airports.ident == 'KLGB'][['name', 'municipality']]]).drop_duplicates()</td>
</tr>
<tr>
<td style="text-align: center">CREATE TABLE heroes (id INTEGER, name TEXT); <br /> INSERT INTO heroes VALUES (1, 'Harry Potter'), (2, 'Ron Weasley'), (3, 'Hermione Granger');</td>
<td style="text-align: center">df1 = pd.DataFrame({'id': [1], 'name': ['Harry Potter']}) <br /> df2 = pd.DataFrame({'id': [2, 3], 'name': ['Ron Weasley', 'Hermione Granger']}) <br /> pd.concat([df1, df2]).reset_index(drop=True)</td>
</tr>
<tr>
<td style="text-align: center">UPDATE airports SET home_link = 'fixed_url' WHERE ident = 'KLAX'</td>
<td style="text-align: center">airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'fixed_url'</td>
</tr>
<tr>
<td style="text-align: center">DELETE FROM lax_freq WHERE type = 'MISC'</td>
<td style="text-align: center">lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index)</td>
</tr>
</tbody>
</table>
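<p>To try a few of these translations without downloading the dataset, here is a sketch on a tiny invented stand-in for the <code class="language-plaintext highlighter-rouge">airports</code> table (the rows are made up, and the <code class="language-plaintext highlighter-rouge">HAVING</code> threshold is lowered to 1 so the toy data produces a result). The <code class="language-plaintext highlighter-rouge">HAVING</code> clause is written here as a boolean mask on the group sizes, a slightly different route from the <code class="language-plaintext highlighter-rouge">filter()</code> call in the table.</p>

```python
import pandas as pd

# Tiny made-up stand-in for the airports table.
airports = pd.DataFrame({
    "ident": ["KLAX", "KSFO", "KSAN", "H1", "EGLL"],
    "type": ["large_airport", "large_airport", "medium_airport",
             "heliport", "large_airport"],
    "iso_country": ["US", "US", "US", "US", "GB"],
})

# SELECT DISTINCT type FROM airports
distinct_types = airports["type"].unique()

# SELECT COUNT(DISTINCT type) FROM airports
distinct_count = airports["type"].nunique()

# SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US'
# GROUP BY type HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC
us_counts = airports[airports["iso_country"] == "US"].groupby("type").size()
having = us_counts[us_counts > 1].sort_values(ascending=False)

print(distinct_types)
print(distinct_count)
print(having)
```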
<h2 id="more">More</h2>
<ul>
<li><a href="https://pandastutor.com/">Pandas Tutor</a> — visualize Python pandas code</li>
</ul>Qiao HuangContentsMy Notes2021-09-28T00:00:00+00:002021-09-28T00:00:00+00:00/my-notes<p>Welcome to <strong><a href="/notes">my notes</a></strong> — a digital garden where I share notes of my learning, and seeds of my thoughts, to be cultivated in public. 🌱</p>
<p>In fact, my notes are on another <a href="https://www.uniqiao.com/notes/">GitHub page</a>, which is hosted on another <a href="https://github.com/qiaohuang/notes">GitHub repo</a>. This post just serves as an index.</p>
<p>Notes are not polished or comprehensive, but serve more like memory jogs. I’m sharing my notes because a) making them public may help motivate me; b) they can give anyone who cares a good sense of what I do; c) they might be useful to others.</p>
<p class="box-warning">I’ll write notes for myself “by default”. So if a note seems confusing, under-explained, or doesn’t make much sense, I’m sorry; it’s probably because I didn’t write it for you.</p>
<h2 id="notes-structrue">Notes structure</h2>
<p>Currently incomplete; check my <a href="https://github.com/qiaohuang/notes">repo</a> for the whole folder.</p>
<ul>
<li><a href="https://www.uniqiao.com/notes/Arts/">Arts</a></li>
<li><a href="https://www.uniqiao.com/notes/Data/">Data</a></li>
<li><a href="https://www.uniqiao.com/notes/Digital%20Marketing/">Digital Marketing</a></li>
<li><a href="https://www.uniqiao.com/notes/SQL/">SQL</a></li>
<li><a href="https://www.uniqiao.com/notes/Stats/">Stats</a></li>
</ul>
<p><span style="color: gray">
Written with <a href="https://foambubble.github.io/foam/">
<img src="/projects/assets/images/foam-icon.png" alt="foam icon" title="Foam" height="24" />
</a> + <a href="https://code.visualstudio.com/">
<img src="/projects/assets/images/vscode-icon.png" alt="vscode icon" title="VS Code" height="24" />
</a>
</span></p>Qiao HuangWelcome to my notes — a digital garden where I share notes of my learning, and seeds of my thoughts, to be cultivated in public. 🌱Master Google-fu2021-05-08T00:00:00+00:002021-05-08T00:00:00+00:00/blog/2021/05/08/google-fu<p><a href="https://en.wiktionary.org/wiki/Google-fu">Google-fu</a> means “skill in using search engines (especially <strong>Google</strong>) to quickly find useful information on the Internet”. Sometimes we have a hard time getting useful results from a search engine. Google-fu is the technique that helps us find more specific results.</p>
<p><img src="/blog/assets/images/google-fu.jpg" alt="google-fu" /></p>
<p><cite>Source: <a href="https://www.google.com/imghp">Google Images</a></cite></p>
<p>In this post, I will show you some crucial tips for refining your Google-fu.</p>
<h2 id="modifiers">Modifiers</h2>
<p>Google Search is always case-insensitive and usually ignores punctuation that isn’t part of a search operator.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">"exact word or phrase"</code> — quotation
<ul>
<li>force an exact-match search with all characters and in the order specified</li>
<li><a href="https://www.google.com/search?q=%22awesome+python%22"><code class="language-plaintext highlighter-rouge">"awesome python"</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">(search groups)</code> — parentheses
<ul>
<li>group terms and control the order without memorizing the precedence</li>
<li><a href="https://www.google.com/search?q=%28python+AND+java%29+salary"><code class="language-plaintext highlighter-rouge">(python AND java) salary</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">OR</code> — logical word
<ul>
<li>“OR” must be in ALL-CAPS; alternatively, you can use the pipe symbol (<code class="language-plaintext highlighter-rouge">|</code>) if your CapsLock is broken :)</li>
<li><a href="https://www.google.com/search?q=python+OR+r"><code class="language-plaintext highlighter-rouge">python OR r</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">-</code> — exclusion
<ul>
<li>place immediately before the term you want to leave out</li>
<li><a href="https://www.google.com/search?q=python+-snake"><code class="language-plaintext highlighter-rouge">python -snake</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">+</code> — inclusion
<ul>
<li>force Google to include the exact term, which it might otherwise discard</li>
<li><a href="https://www.google.com/search?q=python+%2Bsnake"><code class="language-plaintext highlighter-rouge">python +snake</code></a> — Don’t click if you’re afraid of snakes 🐍</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">..</code> — number range
<ul>
<li>put two dots (full stops) between two numbers to match results within that number range</li>
<li><a href="https://www.google.com/search?q=data+science+2019..2021"><code class="language-plaintext highlighter-rouge">data science 2019..2021</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">*</code> — asterisk
<ul>
<li>act as a wildcard to replace any single word or leave blank</li>
<li><a href="https://www.google.com/search?q=data+*+job"><code class="language-plaintext highlighter-rouge">data * job</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">~</code> — tilde
<ul>
<li>include synonyms of the term; this is now the default behavior</li>
<li><a href="https://www.google.com/search?q=~home+for+data+science"><code class="language-plaintext highlighter-rouge">~home for data science</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">to</code> or <code class="language-plaintext highlighter-rouge">in</code>
<ul>
<li>convert measurements from one unit to another</li>
<li><a href="https://www.google.com/search?q=1+BTC+to+USD"><code class="language-plaintext highlighter-rouge">1 BTC to USD</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">@</code> search social media, <code class="language-plaintext highlighter-rouge">$</code> search price, <code class="language-plaintext highlighter-rouge">#</code> search hashtag
<ul>
<li><a href="https://www.google.com/search?q=joe+%40twitter"><code class="language-plaintext highlighter-rouge">joe @twitter</code></a></li>
<li><a href="https://www.google.com/search?q=camera+%24400"><code class="language-plaintext highlighter-rouge">camera $400</code></a></li>
<li><a href="https://www.google.com/search?q=%23datascience"><code class="language-plaintext highlighter-rouge">#datascience</code></a></li>
</ul>
</li>
</ul>
<h2 id="operators">Operators</h2>
<p>Syntax: <code class="language-plaintext highlighter-rouge">operator:search_term</code>. Don’t put spaces between the operator and your search term.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">site:</code>
<ul>
<li>restrict search results to a specific domain; it only accepts a full domain, a root domain, or a Top-Level Domain (TLD)</li>
<li><a href="https://www.google.com/search?q=site%3A.edu+data+science"><code class="language-plaintext highlighter-rouge">site:.edu data science</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">related:</code>
<ul>
<li>return sites that are related to a target domain; it only works for large domains</li>
<li><a href="https://www.google.com/search?q=related%3Apython.org"><code class="language-plaintext highlighter-rouge">related:python.org</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">link:</code>
<ul>
<li>find pages with links to a target domain; it only provides a sample of backlinks</li>
<li><a href="https://www.google.com/search?q=link%3Agoogle.com"><code class="language-plaintext highlighter-rouge">link:google.com</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">cache:</code>
<ul>
<li>show Google’s cached version of a page</li>
<li><a href="https://www.google.com/search?q=cache%3Agithub.com"><code class="language-plaintext highlighter-rouge">cache:github.com</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">filetype:</code>
<ul>
<li>specify a particular file extension, such as “.pdf”, “.doc”, or even “.html”</li>
<li><a href="https://www.google.com/search?q=machine+learning+filetype%3Apdf"><code class="language-plaintext highlighter-rouge">machine learning filetype:pdf</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">intitle:</code> and <code class="language-plaintext highlighter-rouge">allintitle:</code>
<ul>
<li>return only search results that match in the page’s title, for a single word and multiple words respectively</li>
<li><a href="https://www.google.com/search?q=intitle%3Apython"><code class="language-plaintext highlighter-rouge">intitle:python</code></a></li>
<li><a href="https://www.google.com/search?q=allintitle%3Apython+r+sql"><code class="language-plaintext highlighter-rouge">allintitle:python r sql</code></a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">intext:</code> and <code class="language-plaintext highlighter-rouge">allintext:</code>
<ul>
<li>return only results that match in the page’s body/document text</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">inurl:</code> and <code class="language-plaintext highlighter-rouge">allinurl:</code>
<ul>
<li>return only results that match in the page’s URL text</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">inanchor:</code> and <code class="language-plaintext highlighter-rouge">allinanchor:</code>
<ul>
<li>return only results that match in the page’s anchor text or links</li>
</ul>
</li>
</ul>
<h2 id="combos">Combos</h2>
<p>Having all the pieces is only the first step in building a puzzle. The real power of Google-fu comes from combos.</p>
<ul>
<li>Towards Data Science <strong>articles</strong> about Data Visualization or Dashboard in the real world, but not using R, written between 2019 and 2021
<ul>
<li><a href="https://www.google.com/search?q=site%3Atowardsdatascience.com+%22data+visualization%22+OR+dashboard+~%22real+world%22+-r+2019..2021"><code class="language-plaintext highlighter-rouge">site:towardsdatascience.com "data visualization" OR dashboard ~"real world" -r 2019..2021</code></a></li>
</ul>
</li>
<li>A <strong>report</strong> on the different airspeed velocities of common swallows
<ul>
<li><a href="https://www.google.com/search?q=filetype%3Apdf+airspeed+intitle%3Avelocity+of+*swallow"><code class="language-plaintext highlighter-rouge">filetype:pdf airspeed intitle:velocity of *swallow</code></a></li>
</ul>
</li>
</ul>
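<p>The links in this post are just queries that have been URL-encoded. As a rough sketch, a combo can be turned into a shareable link with Python’s standard library. The helper name <code class="language-plaintext highlighter-rouge">google_search_url</code> and the sample query are made up for this example.</p>

```python
from urllib.parse import quote_plus


def google_search_url(query: str) -> str:
    """Return a Google search URL for the given query (hypothetical helper)."""
    # quote_plus() encodes spaces as "+" and characters such as ":" and '"' as %XX.
    return "https://www.google.com/search?q=" + quote_plus(query)


combo_url = google_search_url('site:.edu "data science" filetype:pdf')
exclusion_url = google_search_url("python -snake")

print(combo_url)
print(exclusion_url)
```

<p>The second URL matches the link used for the exclusion modifier earlier in the post.</p>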
<h2 id="others">Others</h2>
<p>Google is more than a search engine.</p>
<h3 id="image-search">Image Search</h3>
<p>Go to <a href="https://google.com/advanced_image_search"><strong>Advanced Image Search</strong></a>, and narrow your results using filters like image size, file type, and even specific colors. Many of the aforementioned tips also work in image search.</p>
<p>Also, you can perform a reverse image search on most browsers. Go to <a href="https://images.google.com/"><strong>Google Images</strong></a> and click <img src="/blog/assets/images/camera-icon.png" alt="camera icon" title="Search by image" height="24" />. Results can include matches for objects in the image, similar images, and websites that contain the image or a similar one.</p>
<h3 id="google-trends">Google Trends</h3>
<p><a href="https://trends.google.com/"><strong>Google Trends</strong></a> lets you explore what the world is searching for by entering terms or topics. It displays interest in a particular search from around the globe or down to city-level geography and uses graphs to compare the search volume of different queries over time. There are some <a href="https://support.google.com/trends/answer/4359582">search tips for trends</a>.</p>
<h3 id="google-alerts">Google Alerts</h3>
<p><a href="https://alerts.google.com/"><strong>Google Alerts</strong></a> is a way to create custom alerts that will notify you any time new results match your search term.</p>
<h3 id="google-news">Google News</h3>
<p><a href="https://news.google.com/"><strong>Google News</strong></a> is a personalized news aggregator that organizes and highlights what’s happening in the world.</p>
<ul>
<li><a href="https://news.google.com/topics/CAAqJAgKIh5DQkFTRUFvS0wyMHZNR3AwTTE5eE14SUNaVzRvQUFQAQ?hl=en-US&gl=US&ceid=US%3Aen">Google News Data Science</a> — the topic on Data Science</li>
<li><a href="https://news.google.com/newspapers">Google News Archive Search</a> — scanned archives of newspapers, also accessible by searching <code class="language-plaintext highlighter-rouge">site:news.google.com/newspapers</code></li>
</ul>
<h3 id="tools">Tools</h3>
<ul>
<li>Dictionary: Put <code class="language-plaintext highlighter-rouge">define:</code> in front of an unfamiliar word for a quick definition.
<ul>
<li><a href="https://www.google.com/search?q=define%3Agoggle"><code class="language-plaintext highlighter-rouge">define:goggle</code></a></li>
</ul>
</li>
<li>Calculator: Just type in the expression for quick math problems without bothering your local calculator.
<ul>
<li><a href="https://www.google.com/search?q=pi"><code class="language-plaintext highlighter-rouge">pi</code></a> also opens an interactive calculator</li>
<li><a href="https://www.google.com/search?q=sin%28x%29%2Fy"><code class="language-plaintext highlighter-rouge">sin(x)/y</code></a> creates an interactive 3D plot using “x” and “y” as free variables</li>
</ul>
</li>
<li>Timer: Just search <code class="language-plaintext highlighter-rouge">timer</code> to launch an embedded timer tool. Note: It will emit a loud beeping sound when it hits zero.
<ul>
<li><a href="https://www.google.com/search?q=time+for+10+seconds"><code class="language-plaintext highlighter-rouge">time for 10 seconds</code></a></li>
</ul>
</li>
</ul>
<h3 id="games">Games</h3>
<p>A host of fun built-in games you can access by Googling them. E.g., <a href="https://www.google.com/search?q=play%20snake"><code class="language-plaintext highlighter-rouge">play snake</code></a> and <a href="https://www.google.com/search?q=play%20PAC-MAN"><code class="language-plaintext highlighter-rouge">play PAC-MAN</code></a>.</p>
<h3 id="easter-eggs">Easter Eggs</h3>
<p>Here are a few cool Easter Eggs, hiding there until you stumble upon them.</p>
<ul>
<li><a href="https://www.google.com/search?q=New+Year%27s+Eve"><code class="language-plaintext highlighter-rouge">New Year's Eve</code></a> — Happy New Year’s Eve</li>
<li><a href="https://www.google.com/search?q=askew"><code class="language-plaintext highlighter-rouge">askew</code></a> — tilt your screen</li>
<li><a href="https://www.google.com/search?q=do+a+barrel+roll"><code class="language-plaintext highlighter-rouge">do a barrel roll</code></a> — execute a roll</li>
<li><a href="https://www.google.com/search?q=blink+HTML"><code class="language-plaintext highlighter-rouge">blink HTML</code></a> — the words “blink” and “HTML” blink on the SERPs (search engine results pages)</li>
<li><a href="https://www.google.com/search?q=Google+in+1998"><code class="language-plaintext highlighter-rouge">Google in 1998</code></a> — the page appears as Google did in 1998</li>
</ul>
<h2 id="useful-links">Useful links</h2>
<ul>
<li><a href="http://www.googleguide.com/">Google Guide</a></li>
<li><a href="https://www.google.com/insidesearch/searcheducation/index.html">Google Search Education</a></li>
<li><a href="https://support.google.com/websearch/answer/134479">How to search on Google</a> — Google Search Help</li>
<li><a href="https://www.edx.org/xseries/google-power-searching-with-google">Google search techniques and tools from a Google expert</a> — edX</li>
</ul>
<p class="box-warning">Caution: Don’t Trust Google Search Results Blindly!</p>
<p>Running into limitations? You may try <a href="https://www.faganfinder.com/">Fagan Finder</a> to get around them.</p>
<p>🔍 Happy Searching! 🔎</p>Qiao HuangGoogle-fu means “skill in using search engines (especially Google ) to quickly find useful information on the Internet”. Sometimes we have a hard time yielding any effective results when using a search engine, however, there is a technique referred to as Google-fu that will help us to find more specific results.Introduction to Git2021-05-04T00:00:00+00:002021-05-04T00:00:00+00:00/blog/2021/05/04/introduction-to-git<p>This post contains my notes on the <a href="https://www.datacamp.com/">DataCamp</a> <a href="https://learn.datacamp.com/courses/introduction-to-git">course</a> led by Greg Wilson, the co-founder of <a href="https://software-carpentry.org/">Software Carpentry</a>.</p>
<h2 id="basic-workflow">Basic workflow</h2>
<p>Git is a modern version control tool created by Linus Torvalds in 2005. It is now very popular with data scientists and software developers. It can keep track of changes to files, notice conflicts between changes made by different people, and synchronize files between different computers.</p>
<p>A repository is the combination of two parts: the files and directories themselves, and the historical information that Git records about them, which is stored in a directory called <code class="language-plaintext highlighter-rouge">.git</code> located in the repository’s root directory.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">git status</code> — check the status of your repository</li>
<li><code class="language-plaintext highlighter-rouge">git diff filename</code> — show you the changes</li>
<li><code class="language-plaintext highlighter-rouge">git add filename</code> — add a file to the staging area</li>
<li><code class="language-plaintext highlighter-rouge">git diff -r HEAD path/to/file</code> — compare the current state of the file with the most recent commit, the <code class="language-plaintext highlighter-rouge">-r</code> flag means “compare to a particular revision”, and <code class="language-plaintext highlighter-rouge">HEAD</code> means “the most recent commit”</li>
<li><code class="language-plaintext highlighter-rouge">nano filename</code> — use Nano to edit <code class="language-plaintext highlighter-rouge">filename</code></li>
<li><code class="language-plaintext highlighter-rouge">git commit -m "some message in quotes"</code> — commit the changes in the staging area with a log message, <code class="language-plaintext highlighter-rouge">git commit --amend -m "new message"</code> changes the message of the most recent commit</li>
<li><code class="language-plaintext highlighter-rouge">git log</code> — view a repository’s history, <code class="language-plaintext highlighter-rouge">git log -3 filename</code> show the last three commits involving a specific file</li>
</ul>
<h2 id="repositories">Repositories</h2>
<p>Git uses a three-level structure for information stored by each commit.</p>
<ol>
<li>A <strong>commit</strong> contains metadata such as the author, the commit message, and the time the commit happened.</li>
<li>A <strong>tree</strong> tracks the names and locations of the files in the repository when that commit happened.</li>
<li>A <strong>blob</strong> (short for <em>binary large object</em>) contains a compressed snapshot of the contents of the file when the commit happened.</li>
</ol>
<p><img src="https://assets.datacamp.com/production/repositories/1545/datasets/1bb404075fe1164d8b3fd78f4065b0bf3d86bc16/gds_2_1_SVG.svg" alt="Git Structure" /></p>
<p>Looking at the <code class="language-plaintext highlighter-rouge">SVG</code> diagram (zoom in for better clarity): in the oldest (top) commit, there were two files tracked by the repository. Then <code class="language-plaintext highlighter-rouge">report.md</code> and <code class="language-plaintext highlighter-rouge">draft.md</code> were changed in the middle commit, so the blobs are shown next to that commit. <code class="language-plaintext highlighter-rouge">data/northern.csv</code> didn’t change in that commit, so the tree links to the blob from the previous commit. Reusing blobs between commits helps make common operations fast and minimizes storage space.</p>
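<p>Blob reuse works because a blob is identified by a hash of its contents. The sketch below is not taken from the course; it recomputes a blob ID the way Git does in a default SHA-1 repository (a <code class="language-plaintext highlighter-rouge">blob &lt;size&gt;</code> header, a NUL byte, then the file contents). The function name <code class="language-plaintext highlighter-rouge">git_blob_id</code> and the sample contents are made up for illustration.</p>

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with these contents."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical file contents at two different commits.
unchanged_before = git_blob_id(b"site,count\nnorth,12\n")
unchanged_after = git_blob_id(b"site,count\nnorth,12\n")
edited = git_blob_id(b"site,count\nnorth,13\n")

print(unchanged_before == unchanged_after)  # identical contents share one blob
print(unchanged_before == edited)           # any edit produces a different blob
```

<p>Because identical contents always map to the same ID, a tree in a later commit can simply point at the existing blob instead of storing the file again.</p>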
<ul>
<li>
<p>A hash is a unique identifier for every commit, which enables Git to share data efficiently between repositories.</p>
</li>
<li>
<p>The special label <code class="language-plaintext highlighter-rouge">HEAD</code> is another way to identify a specific commit. It always refers to the most recent commit. The label <code class="language-plaintext highlighter-rouge">HEAD~1</code> then refers to the commit before it, while <code class="language-plaintext highlighter-rouge">HEAD~2</code> refers to the commit before that, and so on.</p>
</li>
<li><code class="language-plaintext highlighter-rouge">git annotate file</code> — show who made the last change to each line of a file and when.</li>
<li><code class="language-plaintext highlighter-rouge">git diff ID1..ID2</code> — show the changes between two commits, <code class="language-plaintext highlighter-rouge">..</code> is a pair of dots.</li>
<li>A <code class="language-plaintext highlighter-rouge">.gitignore</code> file in the root directory tells Git to ignore certain files.</li>
<li><code class="language-plaintext highlighter-rouge">git clean</code> — only works on untracked files, <code class="language-plaintext highlighter-rouge">git clean -n</code> shows a list of files whose history Git is not currently tracking, <code class="language-plaintext highlighter-rouge">git clean -f</code> deletes those files for good.</li>
<li><code class="language-plaintext highlighter-rouge">git config --list</code> — see what the settings are with one of three additional options:
<ul>
<li><code class="language-plaintext highlighter-rouge">--system</code> — every user on this computer</li>
<li><code class="language-plaintext highlighter-rouge">--global</code> — every one of your projects</li>
<li><code class="language-plaintext highlighter-rouge">--local</code> — one specific project</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">git config --global setting value</code> — change a configuration value for all of your projects on a particular computer.</li>
</ul>
<h2 id="undo">Undo</h2>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">git reset HEAD</code> — unstage the additions, <code class="language-plaintext highlighter-rouge">git reset</code> unstage everything.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">git checkout -- filename</code> — discard the changes that have not yet been staged, <code class="language-plaintext highlighter-rouge">git checkout -- .</code> revert all files in the current directory.</p>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git reset HEAD path/to/file
git checkout <span class="nt">--</span> path/to/file
</code></pre></div> </div>
<p>By combining <code class="language-plaintext highlighter-rouge">git reset</code> with <code class="language-plaintext highlighter-rouge">git checkout</code>, you can undo changes to a file that you staged changes to.</p>
</li>
<li>You can think of committing as saving your work, and checking out as loading that saved version. <code class="language-plaintext highlighter-rouge">git checkout ID filename</code> replaces the current version of a file with the version identified by <code class="language-plaintext highlighter-rouge">ID</code>. Notice that this is the same syntax that you used to undo the unstaged changes, except that <code class="language-plaintext highlighter-rouge">--</code> has been replaced by <code class="language-plaintext highlighter-rouge">ID</code>.</li>
</ul>
<h2 id="working-with-branches">Working with branches</h2>
<p>Branches allow you to have multiple versions of your work and let you track each version systematically. A commit has two parents when branches are merged, which is why Git needs both trees and commits.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">git branch</code> — list all of the branches in a repository</li>
<li><code class="language-plaintext highlighter-rouge">git diff branch-1..branch-2</code> — show the difference between two branches</li>
<li><code class="language-plaintext highlighter-rouge">git checkout branch-name</code> — switch to a branch</li>
<li><code class="language-plaintext highlighter-rouge">git checkout -b branch-name</code> — create a branch and switch to it</li>
<li><code class="language-plaintext highlighter-rouge">git merge source</code> — merge one branch (<code class="language-plaintext highlighter-rouge">source</code>) into the branch that is currently checked out (the destination)</li>
</ul>
<h2 id="collaborating">Collaborating</h2>
<ul>
<li><code class="language-plaintext highlighter-rouge">git init project-name</code> — create a repository for a new project in the current working directory</li>
<li><code class="language-plaintext highlighter-rouge">git init /path/to/project</code> — convert existing projects into repositories</li>
<li><code class="language-plaintext highlighter-rouge">git clone URL</code> — clone a repository; <code class="language-plaintext highlighter-rouge">git clone /existing/project</code> clones from a local path, and <code class="language-plaintext highlighter-rouge">git clone /existing/project newprojectname</code> gives the clone a different name</li>
<li><code class="language-plaintext highlighter-rouge">git remote</code> — list the names of the repository’s remotes; <code class="language-plaintext highlighter-rouge">git remote -v</code> (“v” for “verbose”) also shows the remotes’ URLs</li>
<li><code class="language-plaintext highlighter-rouge">git remote add remote-name URL</code> — add more remotes</li>
<li><code class="language-plaintext highlighter-rouge">git remote rm remote-name</code> — remove existing ones</li>
<li><code class="language-plaintext highlighter-rouge">git pull remote branch</code> — get everything in <code class="language-plaintext highlighter-rouge">branch</code> from the remote repository identified by <code class="language-plaintext highlighter-rouge">remote</code> and merge it into the current branch of your local repository; <code class="language-plaintext highlighter-rouge">git pull</code> is a combination of <code class="language-plaintext highlighter-rouge">git fetch</code> and <code class="language-plaintext highlighter-rouge">git merge</code></li>
<li><code class="language-plaintext highlighter-rouge">git push remote-name branch-name</code> — push the changes you have made locally into a remote repository</li>
</ul>
<h2 id="useful-links">Useful links</h2>
<ul>
<li><a href="https://www.dataschool.io/how-to-contribute-on-github/">Step-by-step guide to contributing on GitHub</a></li>
<li><a href="https://git-scm.com/book/en/v2">Pro Git book</a></li>
<li><a href="http://git.io/sheet">http://git.io/sheet</a> — a list of cool features of Git and GitHub</li>
<li><a href="https://learngitbranching.js.org/">Learn Git Branching</a> — the visual and interactive way</li>
</ul>Qiao HuangThis post is the note on the DataCamp course led by Greg Wilson, the Co-founder of Software Carpentry.Upgrade to Google Analytics 42021-04-19T00:00:00+00:002021-04-19T00:00:00+00:00/blog/2021/04/19/upgrade-to-ga4<p>I just upgraded to Google Analytics 4 — the future of analytics! 🔮</p>
<p>In this post, I’ll share a little of my experience.</p>
<h2 id="introduction">Introduction</h2>
<p>Google Analytics announced <a href="https://blog.google/products/marketingplatform/analytics/new-way-unify-app-and-website-measurement-google-analytics/">a new way to measure apps and websites together</a> in Jul 2019 and launched <a href="https://blog.google/products/marketingplatform/analytics/new_google_analytics/">the next generation Google Analytics</a> in Oct 2020. Unlike the old Universal Analytics (<a href="https://support.google.com/analytics/answer/10220206">UA</a>) properties, the new Google Analytics 4 (<a href="https://support.google.com/analytics/answer/10089681">GA4</a>) properties can be used for a website, an app, or both together. GA4 is built with machine learning at its core to help deliver new insights and comes with better privacy handling.</p>
<h2 id="prerequisites">Prerequisites</h2>
<ul>
<li>A Google Analytics Account — Log in to your Google account and go to <a href="https://analytics.google.com/">https://analytics.google.com/</a></li>
<li>A Universal Analytics Property — <a href="https://support.google.com/analytics/answer/10269537">Set up Analytics for a website</a></li>
</ul>
<h2 id="upgrade">Upgrade</h2>
<h3 id="upgrade-to-ga4-using-setup-assistant">Upgrade to GA4 using <a href="https://support.google.com/analytics/answer/10312255">Setup Assistant</a></h3>
<ul>
<li>You can find <strong>GA4 Setup Assistant</strong> within the Admin console of your Universal Analytics property.</li>
<li>Get started with Google Analytics 4 by auto-creating a new property using the wizard. Click <strong>See your GA4 property</strong>.</li>
<li>Appearing in the Admin section under the Property column, the <a href="https://support.google.com/analytics/answer/10110290">[GA4] Setup Assistant</a> allows you to further customize your GA4 property and share settings from your Universal Analytics property.</li>
</ul>
<h3 id="get-your-ga4-global-site-tag-gtagjs">Get your GA4 global site tag (<a href="https://support.google.com/analytics/answer/10220869">gtag.js</a>)</h3>
<ul>
<li><code class="language-plaintext highlighter-rouge">GA4 Admin</code> –> <code class="language-plaintext highlighter-rouge">Data Streams</code> –> <code class="language-plaintext highlighter-rouge">Click your stream</code> –> <code class="language-plaintext highlighter-rouge">Add new on-page tag</code></li>
</ul>
<p><img src="/blog/assets/images/gtag.png" alt="gtag.js" /></p>
<h3 id="add-your-tag-directly-to-your-web-pages">Add your tag directly to your web pages</h3>
<ul>
<li>Copy the entire Analytics page tag into the <code class="language-plaintext highlighter-rouge"><head></code> section of your HTML. It should look like <a href="https://github.com/qiaohuang/qiaohuang.github.io/blob/master/_includes/google-analytics.html" title="This link takes you to my google-analytics source code">this</a>.</li>
</ul>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"><!--</span> <span class="nx">Global</span> <span class="nx">site</span> <span class="nx">tag</span> <span class="p">(</span><span class="nx">gtag</span><span class="p">.</span><span class="nx">js</span><span class="p">)</span> <span class="o">-</span> <span class="nx">Google</span> <span class="nx">Analytics</span> <span class="o">--></span>
<span class="o"><</span><span class="nx">script</span> <span class="k">async</span> <span class="nx">src</span><span class="o">=</span><span class="dl">"</span><span class="s2">https://www.googletagmanager.com/gtag/js?id=G-XXXXXXX</span><span class="dl">"</span><span class="o">><</span><span class="sr">/script</span><span class="err">>
</span><span class="o"><</span><span class="nx">script</span><span class="o">></span>
<span class="nb">window</span><span class="p">.</span><span class="nx">dataLayer</span> <span class="o">=</span> <span class="nb">window</span><span class="p">.</span><span class="nx">dataLayer</span> <span class="o">||</span> <span class="p">[];</span>
<span class="kd">function</span> <span class="nx">gtag</span><span class="p">(){</span><span class="nx">dataLayer</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">arguments</span><span class="p">);}</span>
<span class="nx">gtag</span><span class="p">(</span><span class="dl">'</span><span class="s1">js</span><span class="dl">'</span><span class="p">,</span> <span class="k">new</span> <span class="nb">Date</span><span class="p">());</span>
<span class="nx">gtag</span><span class="p">(</span><span class="dl">'</span><span class="s1">config</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">G-XXXXXXX</span><span class="dl">'</span><span class="p">);</span>
<span class="o"><</span><span class="sr">/script</span><span class="err">>
</span></code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">gtag.js</code> tag that you added will collect data for your new GA4 property, which is linked to your Universal Analytics property. You are now running UA and GA4 in parallel!</p>
<p class="box-warning">Warning: Don’t remove the old <code class="language-plaintext highlighter-rouge">analytics.js</code> tag. It will continue to collect data for your Universal Analytics property.</p>
<h2 id="postscript">Postscript</h2>
<p>This simplified guide showed how to add a GA4 property alongside an existing UA property in a few steps. You can also use Google Tag Manager (<a href="https://support.google.com/tagmanager/answer/6102821">GTM</a>) to implement a GA4 property, which is a more robust method. If you’re new to Google Analytics and want to <a href="https://support.google.com/analytics/answer/9306384">set up your first Analytics</a>, it will automatically be GA4.</p>
<p>Check out <a href="https://support.google.com/analytics/answer/9744165">Google’s walkthrough</a> to learn more about adding a GA4 property. If you want to explore more, try the Google Analytics <a href="https://support.google.com/analytics/answer/6367342">demo account</a> for free. For a more in-depth understanding of your new property type, take the <a href="https://skillshop.exceedlms.com/student/path/66729/">Google Analytics Skillshop course</a>.</p>
<p>If you run into any issues during your upgrade or find any mistakes in this post, feel free to comment here. Thanks for reading. Cheers! 🍻</p>Qiao HuangI just upgraded to Google Analytics 4 — the future of analytics! 🔮Review of Decision Trees2020-10-21T00:00:00+00:002020-10-21T00:00:00+00:00/blog/2020/10/21/review-of-decision-trees<p>A year or so ago, I received my introduction to machine learning thanks to my supervisor. My first model was a decision tree, since it is one of the most popular machine learning algorithms thanks to its simplicity and ease of implementation. Now I’m back in machine learning, and this post is a brief review of Decision Trees.</p>
<h2 id="introduction">Introduction</h2>
<p>As the name implies, it is a tree that assists us in making decisions. The algorithm belongs to the family of <a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised learning algorithms</a>. Unlike some other supervised learning algorithms, it can be applied to both <a href="https://en.wikipedia.org/wiki/Classification_(general_theory)">classification</a> and <a href="https://en.wikipedia.org/wiki/Regression_analysis">regression</a>.</p>
<p><img src="/blog/assets/images/Decision_Tree.jpg" alt="By Gilgoldm - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=90405437" /></p>
<p>As the famous <a href="https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Decision_Tree.jpg">Titanic Decision tree</a> shows above (“sibsp” is the number of spouses or siblings aboard), the basic structure of a decision tree contains a <em>root node</em>, several <em>interior nodes</em>, and the final <em>leaf nodes</em>.</p>
<h3 id="terminology">Terminology</h3>
<ul>
<li><strong>Root node</strong>: The entire dataset that is further divided</li>
<li><strong>Splitting</strong>: The process of dividing a node into two or more sub-nodes</li>
<li><strong>Interior node</strong>: A sub-node that splits into further sub-nodes</li>
<li><strong>Leaf node</strong>: A node that does not split</li>
<li><strong>Pruning</strong>: Removal of sub-nodes, opposite to splitting</li>
<li><strong>Branch</strong>: Sub-section of the entire tree</li>
<li><strong>Entropy</strong>: The impurity of a dataset</li>
<li><strong>Gini impurity</strong>: An alternative impurity measure that is often used in place of entropy</li>
</ul>
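<p>As a rough illustration of the last entry, Gini impurity can be computed as one minus the sum of the squared class proportions in a node. The sketch below is a minimal example only; the function name <code class="language-plaintext highlighter-rouge">gini_impurity</code> and the sample labels are hypothetical and do not come from a particular library.</p>

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Hypothetical labels: a pure node and an evenly mixed node
pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]
print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5
```

<p>A pure node scores 0, and a two-class node split evenly scores 0.5, which is the maximum for two classes.</p>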
<h3 id="process">Process</h3>
<ol>
<li><strong>Present</strong> a dataset containing several training instances characterized by several input features and target features.</li>
<li><strong>Train</strong> the decision tree model by repeatedly splitting the training instances along the values of the input features, using information gain to choose each split.</li>
<li><strong>Grow</strong> the tree until a stopping condition is reached, then create leaf nodes that hold the predictions for new query instances.</li>
<li><strong>Show</strong> a query instance to the tree and run it down the tree until it reaches a leaf node.</li>
</ol>
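<p>The last step, running a query instance through the tree, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree below is written by hand rather than learned, and the nested-dictionary layout, the feature names, and the <code class="language-plaintext highlighter-rouge">predict</code> helper are all hypothetical choices made for this example.</p>

```python
# A hand-written toy tree (hypothetical, not learned from data).
# Interior nodes are dicts naming the feature to test, with one branch per value;
# leaf nodes are plain strings holding the predicted label.
toy_tree = {
    "feature": "outlook",
    "branches": {
        "overcast": "yes",
        "sunny": {
            "feature": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "rain": {
            "feature": "windy",
            "branches": {"true": "no", "false": "yes"},
        },
    },
}

def predict(tree, instance):
    """Follow the branches that match the instance until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = instance[node["feature"]]
        node = node["branches"][value]
    return node

query = {"outlook": "sunny", "humidity": "normal", "windy": "false"}
print(predict(toy_tree, query))  # yes
```

<p>Each loop iteration corresponds to answering one question at an interior node, so the number of iterations is bounded by the depth of the tree.</p>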
<h2 id="principle">Principle</h2>
<p>Consider an example dataset.</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: right">Temperature</th>
<th style="text-align: right">Outlook</th>
<th style="text-align: right">Humidity</th>
<th style="text-align: right">Windy</th>
<th style="text-align: right">Play Golf?</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td style="text-align: right">hot</td>
<td style="text-align: right">sunny</td>
<td style="text-align: right">high</td>
<td style="text-align: right">false</td>
<td style="text-align: right">no</td>
</tr>
<tr>
<td>1</td>
<td style="text-align: right">hot</td>
<td style="text-align: right">sunny</td>
<td style="text-align: right">high</td>
<td style="text-align: right">true</td>
<td style="text-align: right">no</td>
</tr>
<tr>
<td>2</td>
<td style="text-align: right">hot</td>
<td style="text-align: right">overcast</td>
<td style="text-align: right">high</td>
<td style="text-align: right">false</td>
<td style="text-align: right">yes</td>
</tr>
<tr>
<td>3</td>
<td style="text-align: right">cool</td>
<td style="text-align: right">rain</td>
<td style="text-align: right">normal</td>
<td style="text-align: right">false</td>
<td style="text-align: right">yes</td>
</tr>
<tr>
<td>4</td>
<td style="text-align: right">cool</td>
<td style="text-align: right">overcast</td>
<td style="text-align: right">normal</td>
<td style="text-align: right">true</td>
<td style="text-align: right">yes</td>
</tr>
<tr>
<td>5</td>
<td style="text-align: right">mild</td>
<td style="text-align: right">sunny</td>
<td style="text-align: right">high</td>
<td style="text-align: right">false</td>
<td style="text-align: right">no</td>
</tr>
<tr>
<td>6</td>
<td style="text-align: right">cool</td>
<td style="text-align: right">sunny</td>
<td style="text-align: right">normal</td>
<td style="text-align: right">false</td>
<td style="text-align: right">yes</td>
</tr>
<tr>
<td>7</td>
<td style="text-align: right">mild</td>
<td style="text-align: right">rain</td>
<td style="text-align: right">normal</td>
<td style="text-align: right">false</td>
<td style="text-align: right">yes</td>
</tr>
<tr>
<td>8</td>
<td style="text-align: right">mild</td>
<td style="text-align: right">sunny</td>
<td style="text-align: right">normal</td>
<td style="text-align: right">true</td>
<td style="text-align: right">yes</td>
</tr>
<tr>
<td>9</td>
<td style="text-align: right">mild</td>
<td style="text-align: right">overcast</td>
<td style="text-align: right">high</td>
<td style="text-align: right">true</td>
<td style="text-align: right">yes</td>
</tr>
<tr>
<td>10</td>
<td style="text-align: right">hot</td>
<td style="text-align: right">overcast</td>
<td style="text-align: right">normal</td>
<td style="text-align: right">false</td>
<td style="text-align: right">yes</td>
</tr>
<tr>
<td>11</td>
<td style="text-align: right">mild</td>
<td style="text-align: right">rain</td>
<td style="text-align: right">high</td>
<td style="text-align: right">true</td>
<td style="text-align: right">no</td>
</tr>
<tr>
<td>12</td>
<td style="text-align: right">cool</td>
<td style="text-align: right">rain</td>
<td style="text-align: right">normal</td>
<td style="text-align: right">true</td>
<td style="text-align: right">no</td>
</tr>
<tr>
<td>13</td>
<td style="text-align: right">mild</td>
<td style="text-align: right">rain</td>
<td style="text-align: right">high</td>
<td style="text-align: right">false</td>
<td style="text-align: right">yes</td>
</tr>
</tbody>
</table>
<p>Shannon’s entropy model uses the logarithm function $log_{2}(P(x))$ to measure the entropy. The logarithm makes the measure grow linearly with system size and “behave like information”.</p>
<p>To sum up the entropies of each possible target value and weight it by the probability, we have the baseline for the calculation:</p>
\[H(x) = -\sum_{for \ k \ \in target}(P(x=k)*log_2(P(x=k)))\]
<p>where $P(x=k)$ is the probability that the target feature takes a specific value k.</p>
<p>Applying this formula to the data above, which contains nine yes’s and five no’s, gives the information in the dataset:</p>
\[H(x) = -\frac{9}{14}*log_2(\frac{9}{14})-\frac{5}{14}*log_2(\frac{5}{14}) = 0.94\]
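<p>The same number can be checked with a short Python sketch. The helper name <code class="language-plaintext highlighter-rouge">entropy</code> is an assumption made for this example; it simply applies the formula above to a list of target labels.</p>

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Target column of the table above: nine "yes" and five "no"
play_golf = ["yes"] * 9 + ["no"] * 5
print(round(entropy(play_golf), 2))  # 0.94
```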
<p>We split the dataset on the input feature that carries the most information about the target feature. From now on, we use the <strong>information gain</strong> as a measure of a feature’s “informativeness”. To construct a decision tree on this data, we take the split with the highest information gain first. The process continues until all leaf nodes are pure, or until the information gain is 0. The information gain of a feature is calculated with:</p>
\[InfoGain(feature_{d}) = Entropy(D)-Entropy(feature_{d})\]
<p>The formula for the information gain calculation per feature is:</p>
\[InfoGain(feature_{d},D) = Entropy(D)-\sum_{t \ \in \ feature}(\frac{|feature_{d} = t|}{|D|}*H(feature_{d} = t))\]
\[= Entropy(D)-\sum_{t \ \in \ feature}(\frac{|feature_{d} = t|}{|D|}*(-\sum_{k \ \in \ target}(P(target=k \mid feature_{d} = t)*log_{2}P(target=k \mid feature_{d} = t))))\]
<p>Now we will calculate the information gain for the feature <em>temperature</em>.</p>
<p>In this dataset, there are 4 data points with a <em>hot</em> value, 2 of which have a target variable value of yes and 2 with a value of no. The information of the <em>temperature=hot</em> is calculated using the entropy equation above:</p>
\[H(temperature=hot) = -\frac{2}{4}*log_2(\frac{2}{4})-\frac{2}{4}*log_2(\frac{2}{4}) = 1\]
<p>The data points with a <em>temperature</em> value of <em>cool</em> contain 3 yes’s and 1 no, so we have:</p>
\[H(temperature=cool) = -\frac{3}{4}*log_2(\frac{3}{4})-\frac{1}{4}*log_2(\frac{1}{4}) = 0.81\]
<p>For the node where <em>temperature=mild</em> there are 6 data points, 4 yes’s and 2 no’s. Thus we have:</p>
\[H(temperature=mild) = -\frac{4}{6}*log_2(\frac{4}{6})-\frac{2}{6}*log_2(\frac{2}{6}) = 0.92\]
<p>To find the information of the split, we take the weighted average of these three numbers based on how many observations fell into which node.</p>
\[H(temperature) = \frac{4}{14}*1+\frac{4}{14}*0.81+\frac{6}{14}*0.92 = 0.91\]
<p>Now we can calculate the information gain achieved by splitting on the <em>temperature</em> feature.</p>
\[InfoGain(temperature) = 0.94-0.91 = 0.03\]
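<p>To double-check the arithmetic, the calculation can be repeated for every feature in the table. The following is a minimal Python sketch, assuming the table is stored as a list of tuples; the names <code class="language-plaintext highlighter-rouge">ROWS</code>, <code class="language-plaintext highlighter-rouge">FEATURES</code>, and <code class="language-plaintext highlighter-rouge">information_gain</code> are hypothetical and chosen only for this illustration.</p>

```python
from collections import Counter, defaultdict
from math import log2

# Rows of the table above: (temperature, outlook, humidity, windy, play_golf)
FEATURES = ["temperature", "outlook", "humidity", "windy"]
ROWS = [
    ("hot", "sunny", "high", "false", "no"),
    ("hot", "sunny", "high", "true", "no"),
    ("hot", "overcast", "high", "false", "yes"),
    ("cool", "rain", "normal", "false", "yes"),
    ("cool", "overcast", "normal", "true", "yes"),
    ("mild", "sunny", "high", "false", "no"),
    ("cool", "sunny", "normal", "false", "yes"),
    ("mild", "rain", "normal", "false", "yes"),
    ("mild", "sunny", "normal", "true", "yes"),
    ("mild", "overcast", "high", "true", "yes"),
    ("hot", "overcast", "normal", "false", "yes"),
    ("mild", "rain", "high", "true", "no"),
    ("cool", "rain", "normal", "true", "no"),
    ("mild", "rain", "high", "false", "yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature_index):
    """Entropy of the target minus the weighted entropy after the split."""
    target = [row[-1] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(target) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
best_feature = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("best first split:", best_feature)
```

<p>For <em>temperature</em> this reproduces the gain of about 0.03 derived above, and on this table <em>outlook</em> comes out as the feature with the highest gain.</p>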
<p>To build the tree, we need to calculate the information gain of each possible first split and choose the one that provides the most information gain. The process is repeated for each impure node until the tree is complete.</p>
<h2 id="algorithm">Algorithm</h2>
<p>Invented by Ross Quinlan in 1986, the <a href="https://en.wikipedia.org/wiki/ID3">ID3</a> (Iterative Dichotomiser 3) is an algorithm used to generate a decision tree from a dataset. Besides the ID3 algorithm, there are other popular algorithms like the <a href="https://en.wikipedia.org/wiki/C4.5_algorithm">C4.5</a>, the <a href="https://en.wikipedia.org/wiki/C5.0_algorithm">C5.0</a>, and the <a href="https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees_.28CART.29">CART</a> algorithm. We won’t go into details here.</p>
<p>We now introduce the ID3 algorithm through pseudocode:</p>
<pre><code class="language-pseudocode">ID3 (Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    If all examples are positive, Return the single-node tree Root, with label = +.
    If all examples are negative, Return the single-node tree Root, with label = -.
    If number of predicting attributes is empty, then Return the single node tree Root,
    with label = most common value of the target attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root
</code></pre>
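<p>The pseudocode can be turned into a compact Python sketch. This is one possible reading rather than a reference implementation: the nested-dictionary tree layout and the helper names are assumptions made for this example, and because it only branches on values that actually occur in the examples, the “Examples(vi) is empty” case from the pseudocode never arises here.</p>

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, target, attributes):
    """Return a leaf label, or a dict node with one subtree per attribute value."""
    labels = [ex[target] for ex in examples]
    # All examples share one label: return a leaf
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: return the most common label
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A: the attribute that best classifies the examples
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in sorted({ex[best] for ex in examples}):
        subset = [ex for ex in examples if ex[best] == value]
        node["branches"][value] = id3(subset, target, remaining)
    return node

# The golf table from the Principle section, one row per line
COLUMNS = ["temperature", "outlook", "humidity", "windy", "play"]
RAW = """hot sunny high false no
hot sunny high true no
hot overcast high false yes
cool rain normal false yes
cool overcast normal true yes
mild sunny high false no
cool sunny normal false yes
mild rain normal false yes
mild sunny normal true yes
mild overcast high true yes
hot overcast normal false yes
mild rain high true no
cool rain normal true no
mild rain high false yes"""
examples = [dict(zip(COLUMNS, line.split())) for line in RAW.splitlines()]

tree = id3(examples, "play", COLUMNS[:-1])
print(tree)
```

<p>On this table the sketch splits on <em>outlook</em> first, then on <em>humidity</em> for sunny days and on <em>windy</em> for rainy days, while overcast days become a pure “yes” leaf.</p>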
<p>For R users, there are multiple packages available to implement a decision tree, such as rpart and party (which provides the ctree function).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">rpart</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span><span class="w"> </span><span class="n">y_train</span><span class="p">)</span><span class="w">
</span><span class="c1"># Grow tree</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rpart</span><span class="p">(</span><span class="n">y_train</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="o">=</span><span class="s2">"class"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">fit</span><span class="p">)</span><span class="w">
</span><span class="c1"># Predict output</span><span class="w">
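</span><span class="n">predicted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x_test</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"class"</span><span class="p">)</span><span class="w">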
</span></code></pre></div></div>
<p>For Python users, below is the code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Import necessary libraries like pandas, numpy...
</span><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">tree</span>
<span class="c1"># Create tree object
</span><span class="n">model</span> <span class="o">=</span> <span class="n">tree</span><span class="p">.</span><span class="n">DecisionTreeClassifier</span><span class="p">(</span><span class="n">criterion</span><span class="o">=</span><span class="s">'gini'</span><span class="p">)</span> <span class="c1"># for classification
# Here you can change the algorithm as gini or entropy (information gain), by default it is gini
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
</span><span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="c1"># Predict Output
</span><span class="n">predicted</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x_test</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="extension">Extension</h2>
<p>In this post, we have explored decision trees for machine learning. To keep the post short, we have only briefly reviewed the basic principles. To improve model performance, we should tune the hyperparameters; see the scikit-learn documentation for <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">more details</a>.</p>
<p>Decision trees have many advantages: they are simple to understand and interpret, and they can handle both numerical and categorical data. Their major disadvantage is overfitting, especially when a tree is particularly deep. Fortunately, this issue can be addressed using <a href="https://en.wikipedia.org/wiki/Decision_tree_pruning">pruning</a>. Another way to increase accuracy is to use an ensemble approach such as <a href="https://en.wikipedia.org/wiki/Bootstrap_aggregating">bagging</a> or <a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)">boosting</a>.</p>
<p>Although decision tree learning is an old method, more recent tree-based models, including <a href="https://en.wikipedia.org/wiki/Random_forest">Random forest</a> (bagging), <a href="https://en.wikipedia.org/wiki/Gradient_boosting">Gradient boosting</a> (boosting), and <a href="https://en.wikipedia.org/wiki/XGBoost">XGBoost</a> (boosting), are built on top of decision tree algorithms. Such ensemble models have proven themselves to be more powerful. Therefore, a thorough understanding of decision trees is very helpful in building a good foundation for learning machine learning and data science.</p>
<h2 id="references">References</h2>
<ul>
<li>A. Renyi (1961), <a href="http://projecteuclid.org/euclid.bsmsp/1200512181">On Measures of Entropy and Information</a>, <em>Proc. of the Fourth Berkeley Symposium on Mathematical Statistics and Probability</em>, vol. 1, 547-561.</li>
<li><a href="https://en.wikipedia.org/wiki/Decision_tree_learning">https://en.wikipedia.org/wiki/Decision_tree_learning</a></li>
<li><a href="https://www.python-course.eu/Decision_Trees.php">https://www.python-course.eu/Decision_Trees.php</a></li>
<li><a href="https://en.wikipedia.org/wiki/ID3_algorithm">https://en.wikipedia.org/wiki/ID3_algorithm</a></li>
</ul>Qiao HuangA year or so ago, I received my introduction to machine learning thanks to my supervisor. My first model was a decision tree, since it is one of the most popular machine learning algorithms thanks to its simplicity and ease of implementation. Now I’m back in machine learning, and this post is a brief review of Decision Trees.My Certificates2020-10-20T00:00:00+00:002020-10-20T00:00:00+00:00/my-certificates<p>These certificates are mainly from MOOCs and simply serve as proof of my learning accomplishments. 🔰</p>
<h2 id="completed"><strong>Completed</strong></h2>
<h3 id="data-analytics">Data Analytics</h3>
<ul>
<li><strong><a href="https://www.coursera.org/account/accomplishments/specialization/certificate/JPSCXZDVW59S">Excel Skills for Data Analytics and Visualization</a> <em>Specialization</em></strong> — Macquarie University, Coursera
<ol>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/CF5M9KJSF6FJ">Excel Fundamentals for Data Analysis</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/28PBSELX82RL">Data visualization in Excel</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/J757E7VERRH4">Excel Power Tools for Data Analysis</a></li>
</ol>
</li>
<li><a href="https://jovian.ai/certificate/MFQTENRUGY">Data Analysis with Python: Zero to Pandas</a> — freeCodeCamp, Jovian</li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/C2VY26Y6UY3J">Analyzing and Visualizing Data in Looker</a> — Google Cloud, Coursera</li>
</ul>
<h3 id="data-science">Data Science</h3>
<ul>
<li><strong><a href="https://www.coursera.org/account/accomplishments/specialization/certificate/A4USAAU3W4M6">IBM Data Science</a> <em>Professional Certificate</em></strong> — IBM, Coursera
<ol>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/3G3M8ED75LKK">What is Data Science?</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/6SPBFNDNX5Q4">Tools for Data Science</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/HXR3TWUWULNJ">Data Science Methodology</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/ASKQQ9K48MMQ">Python for Data Science, AI & Development</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/QYMFS8VVT59J">Python Projects for Data Science</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/EN59SS23PHF5">Databases and SQL for Data Science with Python</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/CNNNBGZUR8DN">Data Analysis with Python</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/D52PZF67SKXN">Data visualization with Python</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/2MYKDQ4Y24MW">Machine Learning with Python</a></li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/P57ULF9LQYV4">Applied Data Science Capstone</a></li>
</ol>
</li>
<li><strong><a href="https://www.datacamp.com/statement-of-accomplishment/track/4849489253970474936ad81a1fe5bbaa626926ef">Data Scientist with Python</a> <em>Track</em></strong> — DataCamp</li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/MDKPBC3CG8GS">Data Science Math Skills</a> — Duke University, Coursera</li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/MV62WRYU8634">Getting Started with AWS Machine Learning</a> — Amazon Web Services, Coursera</li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/PUU9APXTERHG">Managing Machine Learning Projects with Google Cloud</a> — Google Cloud, Coursera</li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/PD5V4ZREXBT9">Predict Visitor Purchases with a Classification Model in BQML</a> — Google Cloud, Coursera</li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/MCRJ8MS5F3ZJ">Reinforcement Learning: Qwik Start</a> — Google Cloud, Coursera</li>
</ul>
<h3 id="computer-science">Computer Science</h3>
<ul>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/V37FZ6QMMH2R">Deploy a Hugo Website with Cloud Build and Firebase Pipeline</a> — Google Cloud, Coursera</li>
</ul>
<h3 id="information-technology">Information Technology</h3>
<ul>
<li><a href="https://www.futurelearn.com/awards/bvzokv5">An Introduction to Coding and Design</a> — University of Leeds, FutureLearn</li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/2HDS7YPSGLNQ">Cloud Computing Basics (Cloud 101)</a> — LearnQuest, Coursera</li>
</ul>
<h3 id="digital-marketing">Digital Marketing</h3>
<ul>
<li><a href="https://analytics.google.com/analytics/academy/certificate/K287ueqBS5elY08OFJHiZg">Google Analytics for Beginners</a> — Google Analytics Academy, Google</li>
<li><a href="https://analytics.google.com/analytics/academy/certificate/zfOtNeuTQma-STKtlFAibA">Advanced Google Analytics</a> — Google Analytics Academy, Google</li>
<li><a href="https://skillshop.exceedlms.com/student/award/mG7sSmMNLz2HQQWcER8CdJAb">Google Analytics Individual Qualification</a> — Skillshop, Google</li>
<li><a href="https://www.coursera.org/account/accomplishments/certificate/A6S9ZNUKQRYL">Create an A/B web page marketing test with Google Optimize</a> — Angelo Paolillo, Coursera Project Network</li>
</ul>
<h2 id="badges">Badges</h2>
<ul>
<li><a href="https://www.credly.com/badges/1529862c-0471-48f9-8d87-2753a544e2db">Data & AI Essentials</a> — IBM</li>
<li><a href="https://www.credly.com/badges/c64531c8-f7b9-425c-8721-1ef66151e51a">Docker Essentials: A Developer Introduction</a> — IBM</li>
</ul>
<h2 id="in-progress"><strong>In Progress</strong></h2>
<ul class="task-list">
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://cognitiveclass.ai/learn/big-data">Big Data Fundamentals</a> — IBM, Cognitive Class</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://app.datacamp.com/certification">Data Scientist Professional</a> — DataCamp</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" checked="checked" /><a href="https://www.udacity.com/course/ab-testing--ud257">A/B Testing</a> — Google, Udacity
<ul>
<li>No cert provided</li>
<li><a href="https://github.com/qiaohuang/A-B-Testing/tree/main/Udacity%20AB%20Testing%20by%20Google">GitHub repo</a></li>
</ul>
</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.coursera.org/learn/analytics-tableau">Data Visualization and Communication with Tableau</a> — Duke University, Coursera</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><strong><a href="https://www.coursera.org/specializations/probabilistic-graphical-models">Probabilistic Graphical Models Specialization</a></strong> — Stanford University, Coursera</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.coursera.org/learn/crash-course-in-causality">A Crash Course in Causality: Inferring Causal Effects from Observational Data</a> — University of Pennsylvania, Coursera</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.coursera.org/learn/algorithms-part1">Algorithms, Part I</a> — Princeton University, Coursera</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.coursera.org/specializations/aml">Advanced Machine Learning Specialization</a> — HSE University, Coursera
<ul>
<li><a href="https://github.com/qiaohuang/Advanced-Machine-Learning">GitHub repo</a></li>
</ul>
</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.classcentral.com/study-group/webdev-bootcamp-fall-2021">Free Web Development Bootcamp</a> — freeCodeCamp, Class Central</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://learndigital.withgoogle.com/digitalgarage/course/digital-marketing">The Fundamentals of Digital Marketing</a> — Google Digital Garage</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.coursera.org/learn/writing-english-university">Writing in English at University</a> — Lund University, Coursera</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.coursera.org/learn/feedback">Giving Helpful Feedback</a> — University of Colorado Boulder, Coursera</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.coursera.org/learn/positive-psychology">Positive Psychology</a> — The University of North Carolina at Chapel Hill, Coursera</li>
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><a href="https://www.coursera.org/learn/introclassicalmusic">Introduction to Classical Music</a> — Yale University, Coursera
<ul>
<li><a href="https://www.uniqiao.com/notes/Arts/classical-music">Course notes</a></li>
</ul>
</li>
</ul>Qiao HuangCertificates are mainly from MOOCs and serve as proof of my learning accomplishments. 🔰My Learning Resources2020-10-19T00:00:00+00:002020-10-19T00:00:00+00:00/my-learning-resources<p>My favorite learning resources, mostly free and carefully chosen! 🎁</p>
<blockquote>
<p>I’d recommend trying the resources and sticking with the one that suits you best; it’s up to you.</p>
<blockquote>
<p>Learn the basics (the foundational concepts) first, then compare different resources and check whether the content is any good.</p>
</blockquote>
</blockquote>
<h2 id="math-and-statistics">Math and Statistics</h2>
<ul>
<li><a href="https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/">Linear Algebra</a> — MIT 18.06</li>
<li><a href="https://ocw.mit.edu/resources/res-6-012-introduction-to-probability-spring-2018/">Introduction to Probability</a> — John Tsitsiklis</li>
<li><a href="https://greenteapress.com/wp/think-stats-2e/">Think Stats</a> — Allen B. Downey</li>
<li><a href="http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/">Bayesian Methods for Hackers</a> — Cameron Davidson-Pilon</li>
</ul>
<h2 id="data-science">Data Science</h2>
<ul>
<li><a href="https://www.statlearning.com/">An Introduction to Statistical Learning</a> <!-- [1st edition](https://www.ime.unicamp.br/~dias/Intoduction%20to%20Statistical%20Learning.pdf) -->
<em>(<a href="https://blog.princehonest.com/stat-learning/">Solutions</a>)</em> — Essential</li>
<li><a href="https://web.stanford.edu/~hastie/ElemStatLearn/">The Elements of Statistical Learning</a>
<em>(<a href="https://waxworksmath.com/Authors/G_M/Hastie/WriteUp/Weatherwax_Epstein_Hastie_Solution_Manual.pdf">Solutions</a>)</em> — Extremely useful</li>
<li><a href="https://www.coursera.org/specializations/aml">Advanced Machine Learning Specialization on Coursera</a> — A deeper understanding</li>
</ul>
<h2 id="practical-data-science">Practical Data Science</h2>
<ul>
<li><a href="https://leanpub.com/eds">Executive Data Science</a> — Leanpub</li>
</ul>
<h2 id="computer-science">Computer Science</h2>
<ul>
<li><a href="https://missing.csail.mit.edu/">The Missing Semester of Your CS Education</a> — MIT</li>
<li><a href="https://algs4.cs.princeton.edu/">Algorithms, 4th Edition</a> — Princeton</li>
<li><a href="https://visualgo.net/en">Visualising algorithms through animation</a></li>
</ul>
<h2 id="sql">SQL</h2>
<ul>
<li><a href="https://www.w3schools.com/sql/">SQL Tutorial</a> — W3Schools interactive training, recommended</li>
<li><a href="https://www.sqlzoo.net/">SQLZOO</a> — Practice tasks and quizzes to boost learning</li>
</ul>
<h2 id="pythonr">Python/R</h2>
<ul>
<li><a href="https://learnxinyminutes.com/docs/python/">Learn x in y minutes (X=Python)</a></li>
<li><a href="http://automatetheboringstuff.com/">Automate the Boring Stuff with Python</a></li>
<li><a href="https://github.com/wesm/pydata-book">Python for Data Analysis</a></li>
<li><a href="https://jakevdp.github.io/PythonDataScienceHandbook/">Python Data Science Handbook</a></li>
<li><a href="https://r4ds.had.co.nz/">R for Data Science</a></li>
</ul>
<h2 id="deep-learning">Deep Learning</h2>
<ul>
<li><a href="https://www.deeplearningbook.org/">Deep Learning book</a></li>
<li><a href="https://github.com/fastai/fastbook">The fastai book</a></li>
<li><a href="http://course.fast.ai/">Practical Deep Learning for Coders</a></li>
<li><a href="http://d2l.ai/">Dive into Deep Learning</a></li>
<li><a href="https://www.coursera.org/specializations/deep-learning">Deep Learning Specialization on Coursera</a></li>
</ul>
<h2 id="sitesnews">Sites/News</h2>
<ul>
<li><a href="https://www.kaggle.com/">Kaggle</a> — Home for Data Science</li>
<li><a href="https://towardsdatascience.com/">Towards Data Science</a> — A Medium publication</li>
<li><a href="https://www.classcentral.com/">Class Central</a> — Search engine and reviews site for free online courses</li>
<li><a href="https://www.kdnuggets.com/">KDnuggets</a></li>
<li><a href="https://www.analyticsvidhya.com/blog/">Analytics Vidhya</a></li>
<li><a href="http://www.datatau.com/">DataTau</a> — The “Hacker News” of Data Science</li>
<li><a href="https://hackernoon.com/tagged/data-science">HACKERNOON#data-science</a></li>
<li><a href="https://news.google.com/topics/CAAqJAgKIh5DQkFTRUFvS0wyMHZNR3AwTTE5eE14SUNaVzRvQUFQAQ?hl=en-US&gl=US&ceid=US%3Aen">Google News Data Science</a></li>
</ul>
<h2 id="cheat-sheets">Cheat Sheets</h2>
<ul>
<li><a href="https://web.mit.edu/~csvoss/Public/usabo/stats_handout.pdf">Statistics</a> — MIT (PDF)</li>
<li><a href="http://www.wzchen.com/s/probability_cheatsheet.pdf">Probability</a> — William Chen (PDF)</li>
<li><a href="https://www.datacamp.com/community/data-science-cheatsheets">Data Science</a> — DataCamp (PDF)</li>
<li><a href="https://s3-us-west-2.amazonaws.com/dbshostedfiles/dbs/sql_cheat_sheet_mysql.pdf">MySQL</a>,
<a href="https://s3-us-west-2.amazonaws.com/dbshostedfiles/dbs/sql_cheat_sheet_pgsql.pdf">PostgreSQL</a> — Database Star (PDF)</li>
<li><a href="https://education.github.com/git-cheat-sheet-education.pdf">Git</a> — GitHub (PDF)</li>
<li><a href="https://devhints.io/bash">Bash</a> — Devhints (HTML)</li>
<li><a href="http://www.rexegg.com/regex-quickstart.html#ref">Regex</a> — RexEgg (HTML)</li>
<li><a href="https://markdown-it.github.io/">Markdown</a> — Live demo (HTML)</li>
<li><a href="https://digital.com/tools/html-cheatsheet/">HTML</a> — Digital (HTML)</li>
<li><a href="http://www.datascienceglossary.org/">Glossary</a> — Terms in Data Science (HTML)</li>
</ul>
<p>If you find the list helpful, feel free to bookmark it; it will be continuously updated!</p>Qiao HuangMy favorite learning resources, mostly free and carefully chosen! 🎁Post for Test2020-10-18T00:00:00+00:002020-10-18T00:00:00+00:00/post-for-test<p>Post to try the features out.</p>
<h2 id="playground">Playground</h2>
<p>This is a $\LaTeX{}$ formula.</p>
<p>This is an <abbr title="Hyper Text Markup Language">HTML</abbr> example.</p>
<p>This is a text with a footnote<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
<p class="box-note">This is a CSS box-shadow.</p>
<table>
<tbody>
<tr>
<td><strong>Hypothesis Testing</strong></td>
<td>Test Rejects Null</td>
<td>Test Fails to Reject Null</td>
</tr>
<tr>
<td>Null is True</td>
<td><span style="color: red">Type I Error</span></td>
<td><span style="color: green">Correct decision</span></td>
</tr>
<tr>
<td>Null is False</td>
<td><span style="color: green">Correct decision</span></td>
<td><span style="color: red">Type II Error</span></td>
</tr>
</tbody>
</table>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">def</span> <span class="nf">print_hi</span><span class="p">(</span><span class="nb">name</span><span class="p">)</span>
<span class="nb">puts</span> <span class="s2">"Hi, </span><span class="si">#{</span><span class="nb">name</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span>
<span class="n">print_hi</span><span class="p">(</span><span class="s1">'Qiao'</span><span class="p">)</span>
<span class="c1">#=> prints 'Hi, Qiao' to STDOUT.</span></code></pre></figure>
<h2 id="embeddings">Embeddings</h2>
<p>Here’s a <a href="https://gist.github.com/qiaohuang/c852d8ee2003bed8476f699b054798f3/raw/8812049291c355c738665ab0d084ac59fb0461cd/lists_to_dict.py">GitHub gist</a>. (Can’t display the embedded code? There may be DNS spoofing/GFW in your region.)</p>
<script src="https://gist.github.com/qiaohuang/c852d8ee2003bed8476f699b054798f3.js"></script>
<h2 id="comments">Comments</h2>
<p>If you want to post a comment to see how it works, feel free to test it here. You’ll be able to see the associated pull request in the <a href="https://github.com/qiaohuang/qiaohuang.github.io/pulls">GitHub repo</a> (also provided in the confirmation message). Once I accept it, your comment will appear here as well. <strong>All your information is encrypted</strong>.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>And here is the definition.</p>
<blockquote>
<p>With a quote!</p>
</blockquote>
<p><a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Qiao HuangPost to try the features out.