{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 11. Librería Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**pandas** es una librería *open source* que nos proporciona estructuras de datos y herramientas de análisis de datos potentes y fáciles de usar en Python."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se puede instalar en nuestro entorno virtual con el siguiente comando:\n",
"\n",
"```\n",
"pipenv install pandas\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Se utiliza el alias `pd` como estándar de facto par el uso de **pandas**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Series\n",
"\n",
"Una serie representa una secuencia de datos unidimensional, y se crea pasándole a pandas una lista de datos."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 1.0\n",
"1 3.0\n",
"2 5.0\n",
"3 NaN\n",
"4 6.0\n",
"5 8.0\n",
"dtype: float64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = pd.Series([1,3,5,np.nan,6,8])\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### DataFrame\n",
"\n",
"Un objeto `DataFrame` representa una estructura tabular bi-dimensional que contiene datos potencialmente heterogéneos, con filas etiquetadas.\n",
"\n",
"Se pueden crear a partir de un diccionario, o de un `array` de NumPy."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df = pd.DataFrame({\n",
" 'A' : [1., 2., np.nan, None],\n",
" 'B' : pd.Timestamp('20130102'),\n",
" 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),\n",
" 'D' : np.array([3] * 4,dtype='int32'),\n",
" 'E' : pd.Categorical([\"test\",\"train\",\"test\",\"train\"]),\n",
" 'F' : 'foo'\n",
"})"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
" E | \n",
" F | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1.0 | \n",
" 2013-01-02 | \n",
" 1.0 | \n",
" 3 | \n",
" test | \n",
" foo | \n",
"
\n",
" \n",
" | 1 | \n",
" 2.0 | \n",
" 2013-01-02 | \n",
" 1.0 | \n",
" 3 | \n",
" train | \n",
" foo | \n",
"
\n",
" \n",
" | 2 | \n",
" NaN | \n",
" 2013-01-02 | \n",
" 1.0 | \n",
" 3 | \n",
" test | \n",
" foo | \n",
"
\n",
" \n",
" | 3 | \n",
" NaN | \n",
" 2013-01-02 | \n",
" 1.0 | \n",
" 3 | \n",
" train | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D E F\n",
"0 1.0 2013-01-02 1.0 3 test foo\n",
"1 2.0 2013-01-02 1.0 3 train foo\n",
"2 NaN 2013-01-02 1.0 3 test foo\n",
"3 NaN 2013-01-02 1.0 3 train foo"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"A float64\n",
"B datetime64[ns]\n",
"C float32\n",
"D int32\n",
"E category\n",
"F object\n",
"dtype: object"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Podemos utilizar una serie para especificar la columna de índice."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',\n",
" '2013-01-05', '2013-01-06'],\n",
" dtype='datetime64[ns]', freq='D')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dates = pd.date_range('20130101', periods=6)\n",
"dates"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-01 | \n",
" -0.679399 | \n",
" -0.564244 | \n",
" -0.395166 | \n",
" -0.004622 | \n",
"
\n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
" -1.004833 | \n",
" 0.168517 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 0.398068 | \n",
" -0.536610 | \n",
" -0.773990 | \n",
" -1.075894 | \n",
"
\n",
" \n",
" | 2013-01-04 | \n",
" -1.185011 | \n",
" 1.988697 | \n",
" -0.770427 | \n",
" -0.472499 | \n",
"
\n",
" \n",
" | 2013-01-05 | \n",
" -0.359634 | \n",
" 0.338176 | \n",
" 0.105786 | \n",
" 0.359107 | \n",
"
\n",
" \n",
" | 2013-01-06 | \n",
" -0.555880 | \n",
" 1.115044 | \n",
" -2.108126 | \n",
" 0.139896 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"2013-01-01 -0.679399 -0.564244 -0.395166 -0.004622\n",
"2013-01-02 2.147829 -0.991826 -1.004833 0.168517\n",
"2013-01-03 0.398068 -0.536610 -0.773990 -1.075894\n",
"2013-01-04 -1.185011 1.988697 -0.770427 -0.472499\n",
"2013-01-05 -0.359634 0.338176 0.105786 0.359107\n",
"2013-01-06 -0.555880 1.115044 -2.108126 0.139896"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Ejes\n",
"\n",
"En un DataFrame de `pandas` se pueden realizar operaciones a lo largo de los dos ejes, o `axis`.\n",
"\n",
"- Si en una operación especificamos `axis=0` nos referimos a loas índices, es decir, estaremos diciendo que la operación se realiza para todas las filas.\n",
"- Si en una operación especificamos `axis=1` estaremos diciendo que la operación se realiza para todas las columnas."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Visualización de datos"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-01 | \n",
" -0.679399 | \n",
" -0.564244 | \n",
" -0.395166 | \n",
" -0.004622 | \n",
"
\n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
" -1.004833 | \n",
" 0.168517 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"2013-01-01 -0.679399 -0.564244 -0.395166 -0.004622\n",
"2013-01-02 2.147829 -0.991826 -1.004833 0.168517"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-05 | \n",
" -0.359634 | \n",
" 0.338176 | \n",
" 0.105786 | \n",
" 0.359107 | \n",
"
\n",
" \n",
" | 2013-01-06 | \n",
" -0.555880 | \n",
" 1.115044 | \n",
" -2.108126 | \n",
" 0.139896 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"2013-01-05 -0.359634 0.338176 0.105786 0.359107\n",
"2013-01-06 -0.555880 1.115044 -2.108126 0.139896"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.tail(2)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',\n",
" '2013-01-05', '2013-01-06'],\n",
" dtype='datetime64[ns]', freq='D')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.index"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['A', 'B', 'C', 'D'], dtype='object')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### DataFrame.to_numpy()\n",
"\n",
"El método `.to_numpy()` de un DataFrame nos da una representación en una estructura de datos de `numpy` de los datos del DataFrame,"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.67939947, -0.5642441 , -0.39516608, -0.00462202],\n",
" [ 2.14782856, -0.99182561, -1.00483345, 0.16851747],\n",
" [ 0.39806756, -0.53661026, -0.77399033, -1.07589368],\n",
" [-1.18501088, 1.98869725, -0.77042661, -0.47249893],\n",
" [-0.35963418, 0.3381756 , 0.10578614, 0.35910665],\n",
" [-0.55588001, 1.11504445, -2.10812582, 0.13989579]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.to_numpy()\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Describe\n",
"\n",
"El método `.describe()` nos muestra un resumen estadístico de los datos."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | count | \n",
" 6.000000 | \n",
" 6.000000 | \n",
" 6.000000 | \n",
" 6.000000 | \n",
"
\n",
" \n",
" | mean | \n",
" -0.039005 | \n",
" 0.224873 | \n",
" -0.824459 | \n",
" -0.147582 | \n",
"
\n",
" \n",
" | std | \n",
" 1.188837 | \n",
" 1.148846 | \n",
" 0.739655 | \n",
" 0.534241 | \n",
"
\n",
" \n",
" | min | \n",
" -1.185011 | \n",
" -0.991826 | \n",
" -2.108126 | \n",
" -1.075894 | \n",
"
\n",
" \n",
" | 25% | \n",
" -0.648520 | \n",
" -0.557336 | \n",
" -0.947123 | \n",
" -0.355530 | \n",
"
\n",
" \n",
" | 50% | \n",
" -0.457757 | \n",
" -0.099217 | \n",
" -0.772208 | \n",
" 0.067637 | \n",
"
\n",
" \n",
" | 75% | \n",
" 0.208642 | \n",
" 0.920827 | \n",
" -0.488981 | \n",
" 0.161362 | \n",
"
\n",
" \n",
" | max | \n",
" 2.147829 | \n",
" 1.988697 | \n",
" 0.105786 | \n",
" 0.359107 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"count 6.000000 6.000000 6.000000 6.000000\n",
"mean -0.039005 0.224873 -0.824459 -0.147582\n",
"std 1.188837 1.148846 0.739655 0.534241\n",
"min -1.185011 -0.991826 -2.108126 -1.075894\n",
"25% -0.648520 -0.557336 -0.947123 -0.355530\n",
"50% -0.457757 -0.099217 -0.772208 0.067637\n",
"75% 0.208642 0.920827 -0.488981 0.161362\n",
"max 2.147829 1.988697 0.105786 0.359107"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Transposición\n",
"\n",
"Podemos obtener el DataFrame transpuesto de uno dado a través del atributo `T`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 2013-01-01 | \n",
" 2013-01-02 | \n",
" 2013-01-03 | \n",
" 2013-01-04 | \n",
" 2013-01-05 | \n",
" 2013-01-06 | \n",
"
\n",
" \n",
" \n",
" \n",
" | A | \n",
" -0.679399 | \n",
" 2.147829 | \n",
" 0.398068 | \n",
" -1.185011 | \n",
" -0.359634 | \n",
" -0.555880 | \n",
"
\n",
" \n",
" | B | \n",
" -0.564244 | \n",
" -0.991826 | \n",
" -0.536610 | \n",
" 1.988697 | \n",
" 0.338176 | \n",
" 1.115044 | \n",
"
\n",
" \n",
" | C | \n",
" -0.395166 | \n",
" -1.004833 | \n",
" -0.773990 | \n",
" -0.770427 | \n",
" 0.105786 | \n",
" -2.108126 | \n",
"
\n",
" \n",
" | D | \n",
" -0.004622 | \n",
" 0.168517 | \n",
" -1.075894 | \n",
" -0.472499 | \n",
" 0.359107 | \n",
" 0.139896 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06\n",
"A -0.679399 2.147829 0.398068 -1.185011 -0.359634 -0.555880\n",
"B -0.564244 -0.991826 -0.536610 1.988697 0.338176 1.115044\n",
"C -0.395166 -1.004833 -0.773990 -0.770427 0.105786 -2.108126\n",
"D -0.004622 0.168517 -1.075894 -0.472499 0.359107 0.139896"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.T"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Ordenación\n",
"\n",
"Podemos ordenar los datos por alguno de los ejes o por valores."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" D | \n",
" C | \n",
" B | \n",
" A | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-01 | \n",
" -0.004622 | \n",
" -0.395166 | \n",
" -0.564244 | \n",
" -0.679399 | \n",
"
\n",
" \n",
" | 2013-01-02 | \n",
" 0.168517 | \n",
" -1.004833 | \n",
" -0.991826 | \n",
" 2.147829 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" -1.075894 | \n",
" -0.773990 | \n",
" -0.536610 | \n",
" 0.398068 | \n",
"
\n",
" \n",
" | 2013-01-04 | \n",
" -0.472499 | \n",
" -0.770427 | \n",
" 1.988697 | \n",
" -1.185011 | \n",
"
\n",
" \n",
" | 2013-01-05 | \n",
" 0.359107 | \n",
" 0.105786 | \n",
" 0.338176 | \n",
" -0.359634 | \n",
"
\n",
" \n",
" | 2013-01-06 | \n",
" 0.139896 | \n",
" -2.108126 | \n",
" 1.115044 | \n",
" -0.555880 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" D C B A\n",
"2013-01-01 -0.004622 -0.395166 -0.564244 -0.679399\n",
"2013-01-02 0.168517 -1.004833 -0.991826 2.147829\n",
"2013-01-03 -1.075894 -0.773990 -0.536610 0.398068\n",
"2013-01-04 -0.472499 -0.770427 1.988697 -1.185011\n",
"2013-01-05 0.359107 0.105786 0.338176 -0.359634\n",
"2013-01-06 0.139896 -2.108126 1.115044 -0.555880"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.sort_index(axis=1, ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
" -1.004833 | \n",
" 0.168517 | \n",
"
\n",
" \n",
" | 2013-01-01 | \n",
" -0.679399 | \n",
" -0.564244 | \n",
" -0.395166 | \n",
" -0.004622 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 0.398068 | \n",
" -0.536610 | \n",
" -0.773990 | \n",
" -1.075894 | \n",
"
\n",
" \n",
" | 2013-01-05 | \n",
" -0.359634 | \n",
" 0.338176 | \n",
" 0.105786 | \n",
" 0.359107 | \n",
"
\n",
" \n",
" | 2013-01-06 | \n",
" -0.555880 | \n",
" 1.115044 | \n",
" -2.108126 | \n",
" 0.139896 | \n",
"
\n",
" \n",
" | 2013-01-04 | \n",
" -1.185011 | \n",
" 1.988697 | \n",
" -0.770427 | \n",
" -0.472499 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"2013-01-02 2.147829 -0.991826 -1.004833 0.168517\n",
"2013-01-01 -0.679399 -0.564244 -0.395166 -0.004622\n",
"2013-01-03 0.398068 -0.536610 -0.773990 -1.075894\n",
"2013-01-05 -0.359634 0.338176 0.105786 0.359107\n",
"2013-01-06 -0.555880 1.115044 -2.108126 0.139896\n",
"2013-01-04 -1.185011 1.988697 -0.770427 -0.472499"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.sort_values(by='B')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selección \n",
"\n",
"Podemos obtener una selección de los datos usando los métodos estándar de Python o `numpy` para la obtener *slices* en listas o matrices."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Además, `pandas` proporcia métodos especializados (y optimizados) para el acceso a los datos:"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"`.loc`\n",
"\n",
"Se utiliza principalmente para acceder por etiqueta. Soporta los siguietnes tipos de entradas:\n",
"\n",
"- Una etiqueta única: df.loc['a']\n",
"- Una lista o array de etiqueta: df.loc[['a', 'b', 'c']]\n",
"- Un *slice* con etiquetas: df.loc[a':'f']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"`.iloc`\n",
"\n",
"Se utiliza principalmente para acceder posición. Soporta los siguietnes tipos de entradas:\n",
"\n",
"- Una entero: df.iloc[0]\n",
"- Una lista o array de enteros: df.iloc[[0, 1, 2]]\n",
"- Un *slice* : df.loc[1:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"Tipo de objeto | Selección | Valor retornado\n",
"---------------|----------------|-------------------------------------\n",
"Series | series[label] | valor escalar\n",
"DataFrame | frame[colname] | La serie correspondiente a `colname`"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2013-01-01 -0.679399\n",
"2013-01-02 2.147829\n",
"2013-01-03 0.398068\n",
"2013-01-04 -1.185011\n",
"2013-01-05 -0.359634\n",
"2013-01-06 -0.555880\n",
"Freq: D, Name: A, dtype: float64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2['A']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2013-01-01 -0.679399\n",
"2013-01-02 2.147829\n",
"2013-01-03 0.398068\n",
"2013-01-04 -1.185011\n",
"2013-01-05 -0.359634\n",
"2013-01-06 -0.555880\n",
"Freq: D, Name: A, dtype: float64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.A"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-01 | \n",
" -0.679399 | \n",
" -0.564244 | \n",
" -0.395166 | \n",
" -0.004622 | \n",
"
\n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
" -1.004833 | \n",
" 0.168517 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 0.398068 | \n",
" -0.536610 | \n",
" -0.773990 | \n",
" -1.075894 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"2013-01-01 -0.679399 -0.564244 -0.395166 -0.004622\n",
"2013-01-02 2.147829 -0.991826 -1.004833 0.168517\n",
"2013-01-03 0.398068 -0.536610 -0.773990 -1.075894"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2[0:3]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
" -1.004833 | \n",
" 0.168517 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 0.398068 | \n",
" -0.536610 | \n",
" -0.773990 | \n",
" -1.075894 | \n",
"
\n",
" \n",
" | 2013-01-04 | \n",
" -1.185011 | \n",
" 1.988697 | \n",
" -0.770427 | \n",
" -0.472499 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"2013-01-02 2.147829 -0.991826 -1.004833 0.168517\n",
"2013-01-03 0.398068 -0.536610 -0.773990 -1.075894\n",
"2013-01-04 -1.185011 1.988697 -0.770427 -0.472499"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2['20130102':'20130104']"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"A -0.679399\n",
"B -0.564244\n",
"C -0.395166\n",
"D -0.004622\n",
"Name: 2013-01-01 00:00:00, dtype: float64"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.loc[dates[0]]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-01 | \n",
" -0.679399 | \n",
" -0.564244 | \n",
"
\n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 0.398068 | \n",
" -0.536610 | \n",
"
\n",
" \n",
" | 2013-01-04 | \n",
" -1.185011 | \n",
" 1.988697 | \n",
"
\n",
" \n",
" | 2013-01-05 | \n",
" -0.359634 | \n",
" 0.338176 | \n",
"
\n",
" \n",
" | 2013-01-06 | \n",
" -0.555880 | \n",
" 1.115044 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"2013-01-01 -0.679399 -0.564244\n",
"2013-01-02 2.147829 -0.991826\n",
"2013-01-03 0.398068 -0.536610\n",
"2013-01-04 -1.185011 1.988697\n",
"2013-01-05 -0.359634 0.338176\n",
"2013-01-06 -0.555880 1.115044"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.loc[:, ['A', 'B']]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 0.398068 | \n",
" -0.536610 | \n",
"
\n",
" \n",
" | 2013-01-04 | \n",
" -1.185011 | \n",
" 1.988697 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"2013-01-02 2.147829 -0.991826\n",
"2013-01-03 0.398068 -0.536610\n",
"2013-01-04 -1.185011 1.988697"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.loc['20130102':'20130104', ['A', 'B']]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"A 2.147829\n",
"B -0.991826\n",
"Name: 2013-01-02 00:00:00, dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.loc['20130102', ['A', 'B']]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"A -1.185011\n",
"B 1.988697\n",
"C -0.770427\n",
"D -0.472499\n",
"Name: 2013-01-04 00:00:00, dtype: float64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.iloc[3]"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-04 | \n",
" -1.185011 | \n",
" 1.988697 | \n",
"
\n",
" \n",
" | 2013-01-05 | \n",
" -0.359634 | \n",
" 0.338176 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"2013-01-04 -1.185011 1.988697\n",
"2013-01-05 -0.359634 0.338176"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.iloc[3:5, 0:2]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Indexación condicional\n",
"\n",
"Se puede acceder a las columnas que cumplan una condición concreta, indicando la condición en el selector."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
" -1.004833 | \n",
" 0.168517 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 0.398068 | \n",
" -0.536610 | \n",
" -0.773990 | \n",
" -1.075894 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"2013-01-02 2.147829 -0.991826 -1.004833 0.168517\n",
"2013-01-03 0.398068 -0.536610 -0.773990 -1.075894"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2[df2.A > 0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Operaciones"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se pueden realizar operaciones estadísticas básicas llamando a los métodos correspondientes."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"A -0.039005\n",
"B 0.224873\n",
"C -0.824459\n",
"D -0.147582\n",
"dtype: float64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.mean()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-0.03900473868952752"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2['A'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2013-01-01 -0.410858\n",
"2013-01-02 0.079922\n",
"2013-01-03 -0.497107\n",
"2013-01-04 -0.109810\n",
"2013-01-05 0.110859\n",
"2013-01-06 -0.352266\n",
"Freq: D, dtype: float64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.mean(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se pueden aplicar funciones a los datos."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-01 | \n",
" -0.679399 | \n",
" -0.564244 | \n",
" -0.395166 | \n",
" -0.004622 | \n",
"
\n",
" \n",
" | 2013-01-02 | \n",
" 1.468429 | \n",
" -1.556070 | \n",
" -1.400000 | \n",
" 0.163895 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 1.866497 | \n",
" -2.092680 | \n",
" -2.173990 | \n",
" -0.911998 | \n",
"
\n",
" \n",
" | 2013-01-04 | \n",
" 0.681486 | \n",
" -0.103983 | \n",
" -2.944416 | \n",
" -1.384497 | \n",
"
\n",
" \n",
" | 2013-01-05 | \n",
" 0.321852 | \n",
" 0.234193 | \n",
" -2.838630 | \n",
" -1.025391 | \n",
"
\n",
" \n",
" | 2013-01-06 | \n",
" -0.234028 | \n",
" 1.349237 | \n",
" -4.946756 | \n",
" -0.885495 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"2013-01-01 -0.679399 -0.564244 -0.395166 -0.004622\n",
"2013-01-02 1.468429 -1.556070 -1.400000 0.163895\n",
"2013-01-03 1.866497 -2.092680 -2.173990 -0.911998\n",
"2013-01-04 0.681486 -0.103983 -2.944416 -1.384497\n",
"2013-01-05 0.321852 0.234193 -2.838630 -1.025391\n",
"2013-01-06 -0.234028 1.349237 -4.946756 -0.885495"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.apply(np.cumsum)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
" E | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2013-01-01 | \n",
" -0.679399 | \n",
" -0.564244 | \n",
" -0.395166 | \n",
" -0.004622 | \n",
" 0.674777 | \n",
"
\n",
" \n",
" | 2013-01-02 | \n",
" 2.147829 | \n",
" -0.991826 | \n",
" -1.004833 | \n",
" 0.168517 | \n",
" 3.152662 | \n",
"
\n",
" \n",
" | 2013-01-03 | \n",
" 0.398068 | \n",
" -0.536610 | \n",
" -0.773990 | \n",
" -1.075894 | \n",
" 1.473961 | \n",
"
\n",
" \n",
" | 2013-01-04 | \n",
" -1.185011 | \n",
" 1.988697 | \n",
" -0.770427 | \n",
" -0.472499 | \n",
" 3.173708 | \n",
"
\n",
" \n",
" | 2013-01-05 | \n",
" -0.359634 | \n",
" 0.338176 | \n",
" 0.105786 | \n",
" 0.359107 | \n",
" 0.718741 | \n",
"
\n",
" \n",
" | 2013-01-06 | \n",
" -0.555880 | \n",
" 1.115044 | \n",
" -2.108126 | \n",
" 0.139896 | \n",
" 3.223170 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D E\n",
"2013-01-01 -0.679399 -0.564244 -0.395166 -0.004622 0.674777\n",
"2013-01-02 2.147829 -0.991826 -1.004833 0.168517 3.152662\n",
"2013-01-03 0.398068 -0.536610 -0.773990 -1.075894 1.473961\n",
"2013-01-04 -1.185011 1.988697 -0.770427 -0.472499 3.173708\n",
"2013-01-05 -0.359634 0.338176 0.105786 0.359107 0.718741\n",
"2013-01-06 -0.555880 1.115044 -2.108126 0.139896 3.223170"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c = df2.apply(lambda x: x.max() - x.min(), axis=1)\n",
"df2['E'] = c\n",
"df2"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2013-01-01 1.354177\n",
"2013-01-02 4.157495\n",
"2013-01-03 2.549855\n",
"2013-01-04 4.358719\n",
"2013-01-05 1.078375\n",
"2013-01-06 5.331296\n",
"Freq: D, dtype: float64"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.apply(lambda x: x.max() - x.min(), axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Uniones\n",
"\n",
"La librería `pandas` proporciona diferentes métodos para la unión de Series o DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Concat"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" -0.634450 | \n",
" 0.763724 | \n",
" 0.710228 | \n",
" -0.694768 | \n",
"
\n",
" \n",
" | 1 | \n",
" -0.142616 | \n",
" 1.630704 | \n",
" 1.029687 | \n",
" -1.008484 | \n",
"
\n",
" \n",
" | 2 | \n",
" -0.344466 | \n",
" -0.222917 | \n",
" 0.294177 | \n",
" -0.859483 | \n",
"
\n",
" \n",
" | 3 | \n",
" 1.012883 | \n",
" -0.369916 | \n",
" -0.552784 | \n",
" 1.356238 | \n",
"
\n",
" \n",
" | 4 | \n",
" -0.167002 | \n",
" 1.677076 | \n",
" -0.454767 | \n",
" 1.183958 | \n",
"
\n",
" \n",
" | 5 | \n",
" -0.528190 | \n",
" -0.912389 | \n",
" 0.786753 | \n",
" 1.043857 | \n",
"
\n",
" \n",
" | 6 | \n",
" 0.527898 | \n",
" -0.379471 | \n",
" 1.537252 | \n",
" -1.050597 | \n",
"
\n",
" \n",
" | 7 | \n",
" -0.352473 | \n",
" -1.825571 | \n",
" 0.186576 | \n",
" 0.977988 | \n",
"
\n",
" \n",
" | 8 | \n",
" 0.991172 | \n",
" -0.030169 | \n",
" -1.816031 | \n",
" 0.601092 | \n",
"
\n",
" \n",
" | 9 | \n",
" 1.522968 | \n",
" 0.440188 | \n",
" -1.763289 | \n",
" 1.840091 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3\n",
"0 -0.634450 0.763724 0.710228 -0.694768\n",
"1 -0.142616 1.630704 1.029687 -1.008484\n",
"2 -0.344466 -0.222917 0.294177 -0.859483\n",
"3 1.012883 -0.369916 -0.552784 1.356238\n",
"4 -0.167002 1.677076 -0.454767 1.183958\n",
"5 -0.528190 -0.912389 0.786753 1.043857\n",
"6 0.527898 -0.379471 1.537252 -1.050597\n",
"7 -0.352473 -1.825571 0.186576 0.977988\n",
"8 0.991172 -0.030169 -1.816031 0.601092\n",
"9 1.522968 0.440188 -1.763289 1.840091"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.randn(10, 4))\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[ 0 1 2 3\n",
" 0 -0.634450 0.763724 0.710228 -0.694768\n",
" 1 -0.142616 1.630704 1.029687 -1.008484\n",
" 2 -0.344466 -0.222917 0.294177 -0.859483,\n",
" 0 1 2 3\n",
" 3 1.012883 -0.369916 -0.552784 1.356238\n",
" 4 -0.167002 1.677076 -0.454767 1.183958\n",
" 5 -0.528190 -0.912389 0.786753 1.043857\n",
" 6 0.527898 -0.379471 1.537252 -1.050597,\n",
" 0 1 2 3\n",
" 7 -0.352473 -1.825571 0.186576 0.977988\n",
" 8 0.991172 -0.030169 -1.816031 0.601092\n",
" 9 1.522968 0.440188 -1.763289 1.840091]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pieces = [df[:3], df[3:7], df[7:]]\n",
"pieces"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" -0.634450 | \n",
" 0.763724 | \n",
" 0.710228 | \n",
" -0.694768 | \n",
"
\n",
" \n",
" | 1 | \n",
" -0.142616 | \n",
" 1.630704 | \n",
" 1.029687 | \n",
" -1.008484 | \n",
"
\n",
" \n",
" | 2 | \n",
" -0.344466 | \n",
" -0.222917 | \n",
" 0.294177 | \n",
" -0.859483 | \n",
"
\n",
" \n",
" | 3 | \n",
" 1.012883 | \n",
" -0.369916 | \n",
" -0.552784 | \n",
" 1.356238 | \n",
"
\n",
" \n",
" | 4 | \n",
" -0.167002 | \n",
" 1.677076 | \n",
" -0.454767 | \n",
" 1.183958 | \n",
"
\n",
" \n",
" | 5 | \n",
" -0.528190 | \n",
" -0.912389 | \n",
" 0.786753 | \n",
" 1.043857 | \n",
"
\n",
" \n",
" | 6 | \n",
" 0.527898 | \n",
" -0.379471 | \n",
" 1.537252 | \n",
" -1.050597 | \n",
"
\n",
" \n",
" | 7 | \n",
" -0.352473 | \n",
" -1.825571 | \n",
" 0.186576 | \n",
" 0.977988 | \n",
"
\n",
" \n",
" | 8 | \n",
" 0.991172 | \n",
" -0.030169 | \n",
" -1.816031 | \n",
" 0.601092 | \n",
"
\n",
" \n",
" | 9 | \n",
" 1.522968 | \n",
" 0.440188 | \n",
" -1.763289 | \n",
" 1.840091 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3\n",
"0 -0.634450 0.763724 0.710228 -0.694768\n",
"1 -0.142616 1.630704 1.029687 -1.008484\n",
"2 -0.344466 -0.222917 0.294177 -0.859483\n",
"3 1.012883 -0.369916 -0.552784 1.356238\n",
"4 -0.167002 1.677076 -0.454767 1.183958\n",
"5 -0.528190 -0.912389 0.786753 1.043857\n",
"6 0.527898 -0.379471 1.537252 -1.050597\n",
"7 -0.352473 -1.825571 0.186576 0.977988\n",
"8 0.991172 -0.030169 -1.816031 0.601092\n",
"9 1.522968 0.440188 -1.763289 1.840091"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat(pieces)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Join"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" lval | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" foo | \n",
" 1 | \n",
"
\n",
" \n",
" | 1 | \n",
" foo | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key lval\n",
"0 foo 1\n",
"1 foo 2"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})\n",
"left"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" rval | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" foo | \n",
" 4 | \n",
"
\n",
" \n",
" | 1 | \n",
" foo | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key rval\n",
"0 foo 4\n",
"1 foo 5"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})\n",
"right"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" lval | \n",
" rval | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" foo | \n",
" 1 | \n",
" 4 | \n",
"
\n",
" \n",
" | 1 | \n",
" foo | \n",
" 1 | \n",
" 5 | \n",
"
\n",
" \n",
" | 2 | \n",
" foo | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" | 3 | \n",
" foo | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key lval rval\n",
"0 foo 1 4\n",
"1 foo 1 5\n",
"2 foo 2 4\n",
"3 foo 2 5"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on='key')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Append"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" -0.362774 | \n",
" -0.573908 | \n",
" 0.098044 | \n",
" 1.992482 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1.437667 | \n",
" 0.940580 | \n",
" -0.355047 | \n",
" -0.142454 | \n",
"
\n",
" \n",
" | 2 | \n",
" -1.097556 | \n",
" -0.593504 | \n",
" -1.313146 | \n",
" -0.490131 | \n",
"
\n",
" \n",
" | 3 | \n",
" 1.028989 | \n",
" 0.098031 | \n",
" 0.881277 | \n",
" 0.426499 | \n",
"
\n",
" \n",
" | 4 | \n",
" -0.589829 | \n",
" -0.331404 | \n",
" 0.692164 | \n",
" 0.456827 | \n",
"
\n",
" \n",
" | 5 | \n",
" -0.158751 | \n",
" -0.199149 | \n",
" -0.395195 | \n",
" 0.882798 | \n",
"
\n",
" \n",
" | 6 | \n",
" -0.021648 | \n",
" 0.764384 | \n",
" 0.408657 | \n",
" -1.262260 | \n",
"
\n",
" \n",
" | 7 | \n",
" -1.113406 | \n",
" 0.107256 | \n",
" 0.420511 | \n",
" -0.968303 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"0 -0.362774 -0.573908 0.098044 1.992482\n",
"1 1.437667 0.940580 -0.355047 -0.142454\n",
"2 -1.097556 -0.593504 -1.313146 -0.490131\n",
"3 1.028989 0.098031 0.881277 0.426499\n",
"4 -0.589829 -0.331404 0.692164 0.456827\n",
"5 -0.158751 -0.199149 -0.395195 0.882798\n",
"6 -0.021648 0.764384 0.408657 -1.262260\n",
"7 -1.113406 0.107256 0.420511 -0.968303"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"A 1.028989\n",
"B 0.098031\n",
"C 0.881277\n",
"D 0.426499\n",
"Name: 3, dtype: float64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = df.iloc[3]\n",
"s"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" -0.362774 | \n",
" -0.573908 | \n",
" 0.098044 | \n",
" 1.992482 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1.437667 | \n",
" 0.940580 | \n",
" -0.355047 | \n",
" -0.142454 | \n",
"
\n",
" \n",
" | 2 | \n",
" -1.097556 | \n",
" -0.593504 | \n",
" -1.313146 | \n",
" -0.490131 | \n",
"
\n",
" \n",
" | 3 | \n",
" 1.028989 | \n",
" 0.098031 | \n",
" 0.881277 | \n",
" 0.426499 | \n",
"
\n",
" \n",
" | 4 | \n",
" -0.589829 | \n",
" -0.331404 | \n",
" 0.692164 | \n",
" 0.456827 | \n",
"
\n",
" \n",
" | 5 | \n",
" -0.158751 | \n",
" -0.199149 | \n",
" -0.395195 | \n",
" 0.882798 | \n",
"
\n",
" \n",
" | 6 | \n",
" -0.021648 | \n",
" 0.764384 | \n",
" 0.408657 | \n",
" -1.262260 | \n",
"
\n",
" \n",
" | 7 | \n",
" -1.113406 | \n",
" 0.107256 | \n",
" 0.420511 | \n",
" -0.968303 | \n",
"
\n",
" \n",
" | 8 | \n",
" 1.028989 | \n",
" 0.098031 | \n",
" 0.881277 | \n",
" 0.426499 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"0 -0.362774 -0.573908 0.098044 1.992482\n",
"1 1.437667 0.940580 -0.355047 -0.142454\n",
"2 -1.097556 -0.593504 -1.313146 -0.490131\n",
"3 1.028989 0.098031 0.881277 0.426499\n",
"4 -0.589829 -0.331404 0.692164 0.456827\n",
"5 -0.158751 -0.199149 -0.395195 0.882798\n",
"6 -0.021648 0.764384 0.408657 -1.262260\n",
"7 -1.113406 0.107256 0.420511 -0.968303\n",
"8 1.028989 0.098031 0.881277 0.426499"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.append(s, ignore_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Agrupamientos\n",
"\n",
"Cuando hablamos de agrupar datos en `pandas` nos referimos a un proceso que inplica uno o más de los siguientes pasos:\n",
"\n",
"- Separar los datos en grupos basados en algún criterio\n",
"- Aplicar una función para cada grupo de forma independiente\n",
"- Combinar los resultados en una estructura de datos"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" foo | \n",
" one | \n",
" -1.976302 | \n",
" -0.708903 | \n",
"
\n",
" \n",
" | 1 | \n",
" bar | \n",
" one | \n",
" -1.709147 | \n",
" -0.680945 | \n",
"
\n",
" \n",
" | 2 | \n",
" foo | \n",
" two | \n",
" 0.229683 | \n",
" -0.613908 | \n",
"
\n",
" \n",
" | 3 | \n",
" bar | \n",
" three | \n",
" 0.917311 | \n",
" -0.819363 | \n",
"
\n",
" \n",
" | 4 | \n",
" foo | \n",
" two | \n",
" -1.245424 | \n",
" -1.041576 | \n",
"
\n",
" \n",
" | 5 | \n",
" bar | \n",
" two | \n",
" 0.904258 | \n",
" -1.698605 | \n",
"
\n",
" \n",
" | 6 | \n",
" foo | \n",
" one | \n",
" -1.215414 | \n",
" 1.879422 | \n",
"
\n",
" \n",
" | 7 | \n",
" foo | \n",
" three | \n",
" 1.406019 | \n",
" -0.603691 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"0 foo one -1.976302 -0.708903\n",
"1 bar one -1.709147 -0.680945\n",
"2 foo two 0.229683 -0.613908\n",
"3 bar three 0.917311 -0.819363\n",
"4 foo two -1.245424 -1.041576\n",
"5 bar two 0.904258 -1.698605\n",
"6 foo one -1.215414 1.879422\n",
"7 foo three 1.406019 -0.603691"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',\n",
" 'foo', 'bar', 'foo', 'foo'],\n",
" 'B': ['one', 'one', 'two', 'three',\n",
" 'two', 'two', 'one', 'three'],\n",
" 'C': np.random.randn(8),\n",
" 'D': np.random.randn(8)})\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" | A | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | bar | \n",
" 0.112423 | \n",
" -3.198913 | \n",
"
\n",
" \n",
" | foo | \n",
" -2.801438 | \n",
" -1.088656 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D\n",
"A \n",
"bar 0.112423 -3.198913\n",
"foo -2.801438 -1.088656"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('A').sum()"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" | A | \n",
" B | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | bar | \n",
" one | \n",
" -1.709147 | \n",
" -0.680945 | \n",
"
\n",
" \n",
" | three | \n",
" 0.917311 | \n",
" -0.819363 | \n",
"
\n",
" \n",
" | two | \n",
" 0.904258 | \n",
" -1.698605 | \n",
"
\n",
" \n",
" | foo | \n",
" one | \n",
" -3.191715 | \n",
" 1.170519 | \n",
"
\n",
" \n",
" | three | \n",
" 1.406019 | \n",
" -0.603691 | \n",
"
\n",
" \n",
" | two | \n",
" -1.015741 | \n",
" -1.655484 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D\n",
"A B \n",
"bar one -1.709147 -0.680945\n",
" three 0.917311 -0.819363\n",
" two 0.904258 -1.698605\n",
"foo one -3.191715 1.170519\n",
" three 1.406019 -0.603691\n",
" two -1.015741 -1.655484"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(['A', 'B']).sum()"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}