diff --git a/models/model.joblib b/models/model.joblib
index 9b39aee..533a414 100644
Binary files a/models/model.joblib and b/models/model.joblib differ
diff --git a/notebooks/02-comparative_analysis.ipynb b/notebooks/02-comparative_analysis.ipynb
index 68c85a0..2dd0a67 100644
--- a/notebooks/02-comparative_analysis.ipynb
+++ b/notebooks/02-comparative_analysis.ipynb
@@ -8,6 +8,34 @@
"# Análise comparativa"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "713e7038-de8e-45b1-ae90-f748f7c823ee",
+ "metadata": {},
+ "source": [
+ "## Objective:\n",
+ "This notebook compares several machine learning models applied to forecasting demand for bike sharing/rental. The analysis aims to identify which models achieve the best predictive performance, taking into account the importance of proper data preparation and of robust validation techniques."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a344d8d1-8d0b-458e-bac1-2438cc8a0186",
+ "metadata": {},
+ "source": [
+ "## Context:\n",
+ "Bike rental is a sustainable urban mobility solution that has been gaining popularity in many cities. Accurate demand forecasting is crucial for optimizing bike availability and reducing operating costs. This notebook tackles the demand forecasting problem using a real bike rental dataset. Data preparation and the choice of suitable models are fundamental steps toward accurate predictions, and comparing different models makes it possible to identify the most effective approach for this problem. The \"[Bike Sharing](https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset/data)\" dataset from Kaggle records the use of shared bikes in Washington, D.C. from the beginning of 2011 to the end of 2012. It includes variables such as date, season, whether the day is a holiday or a working day, weather conditions, temperature, feels-like temperature, humidity, and wind speed, as well as the number of casual users, registered users, and total rentals. This information supports analyzing usage patterns and building models that predict bike demand from weather and calendar conditions.\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f7f6bdb1-195d-42a4-a109-6156e33b23c2",
+ "metadata": {},
+ "source": [
+ "## Initial Setup:"
+ ]
+ },
{
"cell_type": "code",
"execution_count": 1,
@@ -21,13 +49,15 @@
"import joblib\n",
"import numpy as np\n",
"import pandas as pd\n",
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n",
"\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder\n",
"from sklearn.compose import ColumnTransformer\n",
"\n",
- "from sklearn.model_selection import ShuffleSplit, GridSearchCV, KFold, cross_validate\n",
+ "from sklearn.model_selection import ShuffleSplit, GridSearchCV, KFold, cross_validate, train_test_split, learning_curve\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.neighbors import KNeighborsRegressor\n",
"from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor\n",
@@ -39,7 +69,23 @@
"id": "d6821558-3c26-4b1b-b2ac-8ef422219bb8",
"metadata": {},
"source": [
- "## 1. Obtenção dos dados:"
+ "## 1. Data Preparation:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8cd15780-f875-4cb2-ab70-07461299672f",
+ "metadata": {},
+ "source": [
+ "Data preparation is an essential step in developing machine learning models, because it ensures the data is in a suitable format for analysis and modeling. This stage covers the following processes: handling missing values and outliers, encoding variables, and normalizing the data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "af991dff-12ed-4ccb-9ddb-a3c85bd8bd8b",
+ "metadata": {},
+ "source": [
+ "### 1.1 Data Acquisition"
]
},
{
@@ -246,10 +292,22 @@
},
{
"cell_type": "markdown",
- "id": "3e67c7e6-c133-4ff8-871a-d4e1a8e132b7",
+ "id": "30be6d70-b36e-4b55-aba6-5dcfc60bd2d2",
"metadata": {},
"source": [
- "---"
+ "- The `instant` column is only a row index, so it can be dropped."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "78247379-94c0-4319-abb2-c2629f28d803",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "df = df.drop(columns=['instant'])"
]
},
{
@@ -257,12 +315,12 @@
"id": "2e0c1fc4-8764-4347-8389-8ad0b7d3291d",
"metadata": {},
"source": [
- "## 2. Preparação de dados:"
+ "### 1.2 Handling Missing Data"
]
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 4,
"id": "3ee02410-060d-482d-ae22-1efce480dc8e",
"metadata": {
"tags": []
@@ -273,7 +331,6 @@
"output_type": "stream",
"text": [
"Dados faltantes por coluna:\n",
- "instant 0\n",
"dteday 0\n",
"season 0\n",
"yr 0\n",
@@ -298,9 +355,299 @@
"print(df.isnull().sum())"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "2b481836-aa87-4c63-9d0f-5d5e925e4ebc",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "No column in the dataset has missing values."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a847829b-43fd-412f-bcc0-402cf64738e6",
+ "metadata": {},
+ "source": [
+ "### 1.3 Outlier Identification"
+ ]
+ },
{
"cell_type": "code",
- "execution_count": 4,
+ "execution_count": 5,
+ "id": "941da82c-d120-458e-8042-47297e1e542a",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "'temp': 0 outliers\n",
+ "'atemp': 0 outliers\n",
+ "'hum': 2 outliers\n",
+ "'windspeed': 13 outliers\n",
+ "'casual': 44 outliers\n",
+ "'registered': 0 outliers\n",
+ "'cnt': 0 outliers\n"
+ ]
+ }
+ ],
+ "source": [
+ "def detect_outliers_iqr(df):\n",
+ "    cols_to_check = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']\n",
+ "    outliers_dict = {}\n",
+ "\n",
+ "    for col in cols_to_check:\n",
+ "        Q1 = df[col].quantile(0.25)\n",
+ "        Q3 = df[col].quantile(0.75)\n",
+ "        IQR = Q3 - Q1\n",
+ "        lower_bound = Q1 - 1.5 * IQR\n",
+ "        upper_bound = Q3 + 1.5 * IQR\n",
+ "\n",
+ "        num_outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()\n",
+ "        outliers_dict[col] = num_outliers\n",
+ "\n",
+ "    for col, num_outliers in outliers_dict.items():\n",
+ "        print(f\"'{col}': {num_outliers} outliers\")\n",
+ "\n",
+ "detect_outliers_iqr(df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f1fbc3f-c841-45f6-8eb6-653ecbe09298",
+ "metadata": {},
+ "source": [
+ "### 1.4 Variable Encoding"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "49eb2f04-9750-4759-a079-daa9fec511ff",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df['dteday'] = pd.to_datetime(df['dteday'])\n",
+ "\n",
+ "df['day'] = df['dteday'].dt.day\n",
+ "\n",
+ "df = df.drop(columns=['dteday'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3161acf-8a8f-473e-b5c0-04467c5e0c63",
+ "metadata": {},
+ "source": [
+ "Since the dataset already has month (`mnth`) and year (`yr`) variables, we extract only the day of the month from the date. As a result, the data dictionary needs to be updated."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "4ad83997-721d-4d24-93ad-bb30f035c126",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ "    .dataframe tbody tr th:only-of-type {\n",
+ "        vertical-align: middle;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe tbody tr th {\n",
+ "        vertical-align: top;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe thead th {\n",
+ "        text-align: right;\n",
+ "    }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>variavel</th>\n",
+ "      <th>descricao</th>\n",
+ "      <th>tipo</th>\n",
+ "      <th>subtipo</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>season</td>\n",
+ "      <td>Estação do ano</td>\n",
+ "      <td>qualitativa</td>\n",
+ "      <td>ordinal</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>yr</td>\n",
+ "      <td>Ano</td>\n",
+ "      <td>qualitativa</td>\n",
+ "      <td>ordinal</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>mnth</td>\n",
+ "      <td>Mês</td>\n",
+ "      <td>qualitativa</td>\n",
+ "      <td>ordinal</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>5</th>\n",
+ "      <td>holiday</td>\n",
+ "      <td>Se o dia é feriado ou não</td>\n",
+ "      <td>qualitativa</td>\n",
+ "      <td>nominal</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>6</th>\n",
+ "      <td>weekday</td>\n",
+ "      <td>Dia da semana</td>\n",
+ "      <td>qualitativa</td>\n",
+ "      <td>nominal</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>7</th>\n",
+ "      <td>workingday</td>\n",
+ "      <td>Se o dia não é fim de semana e nem feriado</td>\n",
+ "      <td>qualitativa</td>\n",
+ "      <td>nominal</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>8</th>\n",
+ "      <td>weathersit</td>\n",
+ "      <td>Clima</td>\n",
+ "      <td>qualitativa</td>\n",
+ "      <td>nominal</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>9</th>\n",
+ "      <td>temp</td>\n",
+ "      <td>Temperatura normalizada em Celsius</td>\n",
+ "      <td>quantitativa</td>\n",
+ "      <td>contínua</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>10</th>\n",
+ "      <td>atemp</td>\n",
+ "      <td>Temperatura de sensação normalizada em Celsius</td>\n",
+ "      <td>quantitativa</td>\n",
+ "      <td>contínua</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>11</th>\n",
+ "      <td>hum</td>\n",
+ "      <td>Umidade do ar</td>\n",
+ "      <td>quantitiva</td>\n",
+ "      <td>contínua</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>12</th>\n",
+ "      <td>windspeed</td>\n",
+ "      <td>Velocidade do vento</td>\n",
+ "      <td>quantitiva</td>\n",
+ "      <td>contínua</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>13</th>\n",
+ "      <td>casual</td>\n",
+ "      <td>Contagem de usuários casuais</td>\n",
+ "      <td>quantitiva</td>\n",
+ "      <td>discreta</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>14</th>\n",
+ "      <td>registered</td>\n",
+ "      <td>Contagem de usuários registrados</td>\n",
+ "      <td>quantitiva</td>\n",
+ "      <td>discreta</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>15</th>\n",
+ "      <td>cnt</td>\n",
+ "      <td>Contagem total de bicicletas alugadas, incluin...</td>\n",
+ "      <td>quantitiva</td>\n",
+ "      <td>discreta</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>16</th>\n",
+ "      <td>day</td>\n",
+ "      <td>Dia no ano</td>\n",
+ "      <td>qualitativa</td>\n",
+ "      <td>ordinal</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " variavel descricao \\\n",
+ "2 season Estação do ano \n",
+ "3 yr Ano \n",
+ "4 mnth Mês \n",
+ "5 holiday Se o dia é feriado ou não \n",
+ "6 weekday Dia da semana \n",
+ "7 workingday Se o dia não é fim de semana e nem feriado \n",
+ "8 weathersit Clima \n",
+ "9 temp Temperatura normalizada em Celsius \n",
+ "10 atemp Temperatura de sensação normalizada em Celsius \n",
+ "11 hum Umidade do ar \n",
+ "12 windspeed Velocidade do vento \n",
+ "13 casual Contagem de usuários casuais \n",
+ "14 registered Contagem de usuários registrados \n",
+ "15 cnt Contagem total de bicicletas alugadas, incluin... \n",
+ "16 day Dia no ano \n",
+ "\n",
+ " tipo subtipo \n",
+ "2 qualitativa ordinal \n",
+ "3 qualitativa ordinal \n",
+ "4 qualitativa ordinal \n",
+ "5 qualitativa nominal \n",
+ "6 qualitativa nominal \n",
+ "7 qualitativa nominal \n",
+ "8 qualitativa nominal \n",
+ "9 quantitativa contínua \n",
+ "10 quantitativa contínua \n",
+ "11 quantitiva contínua \n",
+ "12 quantitiva contínua \n",
+ "13 quantitiva discreta \n",
+ "14 quantitiva discreta \n",
+ "15 quantitiva discreta \n",
+ "16 qualitativa ordinal "
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "new_columns = pd.DataFrame({\n",
+ "    'variavel': ['day'],\n",
+ "    'descricao': [\n",
+ "        'Dia no ano'\n",
+ "    ],\n",
+ "    'tipo': ['qualitativa'],\n",
+ "    'subtipo': ['ordinal']\n",
+ "})\n",
+ "\n",
+ "df_dict = pd.concat([df_dict, new_columns], ignore_index=True)\n",
+ "df_dict = df_dict[~df_dict['variavel'].isin(['instant', 'dteday'])]\n",
+ "df_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
"id": "978ca630-47b5-4011-b108-ab5b6492ec6e",
"metadata": {
"tags": []
@@ -325,7 +672,7 @@
"\n",
"ordinal_columns = (\n",
" df_dict\n",
- " .query(\"subtipo == 'ordinal' and variavel != 'dteday'\")\n",
+ " .query(\"subtipo == 'ordinal'\")\n",
" .variavel\n",
" .to_list()\n",
")\n",
@@ -343,23 +690,17 @@
},
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 9,
"id": "3f0346d7-3228-44af-bfb5-12f0712979ee",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
- "if 'dteday' in X.columns:\n",
- " X['year'] = X['dteday'].dt.year\n",
- " X['month'] = X['dteday'].dt.month\n",
- " X['day'] = X['dteday'].dt.day\n",
- " X = X.drop(columns=['dteday'])\n",
- "\n",
"nominal_preprocessor = Pipeline([\n",
" ('missing', SimpleImputer(strategy='most_frequent')), \n",
- " ('encoding', OneHotEncoder(sparse_output=False, drop='first')), \n",
- " ('normalization', StandardScaler()) \n",
+ " ('encoding', OneHotEncoder(sparse_output=False, drop='first')),\n",
+ " ('normalization', StandardScaler())\n",
"])\n",
"\n",
"continuous_preprocessor = Pipeline([\n",
@@ -396,31 +737,48 @@
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 10,
"id": "e144b0d5-2285-49c9-95b2-15a796e179da",
"metadata": {
"tags": []
},
"outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Colunas após a preparação:\n",
+ "['nominal__holiday_1' 'nominal__weekday_1' 'nominal__weekday_2'\n",
+ " 'nominal__weekday_3' 'nominal__weekday_4' 'nominal__weekday_5'\n",
+ " 'nominal__weekday_6' 'nominal__workingday_1' 'nominal__weathersit_2'\n",
+ " 'nominal__weathersit_3' 'ordinal__season' 'ordinal__yr' 'ordinal__mnth'\n",
+ " 'ordinal__day' 'discrete__casual' 'discrete__registered'\n",
+ " 'continuous__temp' 'continuous__atemp' 'continuous__hum'\n",
+ " 'continuous__windspeed']\n"
+ ]
+ },
{
"data": {
"text/plain": [
"(731, 20)"
]
},
- "execution_count": 6,
+ "execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_prepared = preprocessor.fit_transform(X)\n",
+ "feature_names = preprocessor.get_feature_names_out()\n",
+ "print(\"Colunas após a preparação:\")\n",
+ "print(feature_names)\n",
"X_prepared.shape"
]
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 11,
"id": "e4a589c6-aebc-42b7-8b96-b564784db2b8",
"metadata": {
"tags": []
@@ -466,7 +824,7 @@
"id": "33e33101-a682-4cc2-9adf-362464aa7791",
"metadata": {},
"source": [
- "## 3. Seleção de modelos\n"
+ "## 2. Model Selection"
]
},
{
@@ -492,7 +850,7 @@
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 12,
"id": "d18ec4ad-8cd7-408e-b700-7bb953934592",
"metadata": {
"tags": []
@@ -509,8 +867,9 @@
"max_iter = 1000 \n",
"models = [\n",
" ('K-Nearest Neighbors', KNeighborsRegressor(), {\n",
- " \"n_neighbors\": range(3, 20, 2), \n",
- " 'weights': ['uniform', 'distance']\n",
+ " \"n_neighbors\": [3, 5, 7],\n",
+ " 'weights': ['uniform', 'distance'],\n",
+ " 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']\n",
" }),\n",
" ('Gradient Boosting', GradientBoostingRegressor(random_state=random_state), {\n",
" 'n_estimators': [50, 100, 150],\n",
@@ -531,7 +890,7 @@
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 13,
"id": "38a1312c-22be-4db7-89a4-12a69fe2c940",
"metadata": {
"tags": []
@@ -570,27 +929,24 @@
  "      <th>test_neg_mean_squared_error</th>\n",
  "      <th>test_neg_mean_absolute_error</th>\n",
  "      <th>test_r2</th>\n",
- "      <th>model_name</th>\n",
  "    </tr>\n",
  "  </thead>\n",
  "  <tbody>\n",
  "    <tr>\n",
  "      <th>mean</th>\n",
- "      <td>0.778052</td>\n",
- "      <td>0.029866</td>\n",
- "      <td>-495085.285587</td>\n",
- "      <td>-521.421224</td>\n",
- "      <td>0.867283</td>\n",
- "      <td>K-Nearest Neighbors</td>\n",
+ "      <td>0.958916</td>\n",
+ "      <td>0.086565</td>\n",
+ "      <td>-520945.104332</td>\n",
+ "      <td>-538.463882</td>\n",
+ "      <td>0.859563</td>\n",
  "    </tr>\n",
  "    <tr>\n",
  "      <th>std</th>\n",
- "      <td>0.125487</td>\n",
- "      <td>0.014388</td>\n",
- "      <td>98807.812136</td>\n",
- "      <td>56.917455</td>\n",
- "      <td>0.024541</td>\n",
- "      <td>K-Nearest Neighbors</td>\n",
+ "      <td>0.314371</td>\n",
+ "      <td>0.049576</td>\n",
+ "      <td>80345.848916</td>\n",
+ "      <td>36.727944</td>\n",
+ "      <td>0.024787</td>\n",
  "    </tr>\n",
  "  </tbody>\n",
  "</table>\n",
@@ -598,12 +954,12 @@
],
"text/plain": [
" fit_time score_time test_neg_mean_squared_error \\\n",
- "mean 0.778052 0.029866 -495085.285587 \n",
- "std 0.125487 0.014388 98807.812136 \n",
+ "mean 0.958916 0.086565 -520945.104332 \n",
+ "std 0.314371 0.049576 80345.848916 \n",
"\n",
- " test_neg_mean_absolute_error test_r2 model_name \n",
- "mean -521.421224 0.867283 K-Nearest Neighbors \n",
- "std 56.917455 0.024541 K-Nearest Neighbors "
+ " test_neg_mean_absolute_error test_r2 \n",
+ "mean -538.463882 0.859563 \n",
+ "std 36.727944 0.024787 "
]
},
"metadata": {},
@@ -642,27 +998,24 @@
  "      <th>test_neg_mean_squared_error</th>\n",
  "      <th>test_neg_mean_absolute_error</th>\n",
  "      <th>test_r2</th>\n",
- "      <th>model_name</th>\n",
  "    </tr>\n",
  "  </thead>\n",
  "  <tbody>\n",
  "    <tr>\n",
  "      <th>mean</th>\n",
- "      <td>42.909599</td>\n",
- "      <td>0.016341</td>\n",
- "      <td>-11775.561442</td>\n",
- "      <td>-68.763639</td>\n",
- "      <td>0.996826</td>\n",
- "      <td>Gradient Boosting</td>\n",
+ "      <td>87.815198</td>\n",
+ "      <td>0.043461</td>\n",
+ "      <td>-11767.095009</td>\n",
+ "      <td>-69.842862</td>\n",
+ "      <td>0.996832</td>\n",
  "    </tr>\n",
  "    <tr>\n",
  "      <th>std</th>\n",
- "      <td>8.347175</td>\n",
- "      <td>0.001307</td>\n",
- "      <td>3159.517240</td>\n",
- "      <td>7.482689</td>\n",
- "      <td>0.000856</td>\n",
- "      <td>Gradient Boosting</td>\n",
+ "      <td>15.660049</td>\n",
+ "      <td>0.012917</td>\n",
+ "      <td>2630.227186</td>\n",
+ "      <td>6.533291</td>\n",
+ "      <td>0.000709</td>\n",
  "    </tr>\n",
  "  </tbody>\n",
  "</table>\n",
@@ -670,12 +1023,12 @@
],
"text/plain": [
" fit_time score_time test_neg_mean_squared_error \\\n",
- "mean 42.909599 0.016341 -11775.561442 \n",
- "std 8.347175 0.001307 3159.517240 \n",
+ "mean 87.815198 0.043461 -11767.095009 \n",
+ "std 15.660049 0.012917 2630.227186 \n",
"\n",
- " test_neg_mean_absolute_error test_r2 model_name \n",
- "mean -68.763639 0.996826 Gradient Boosting \n",
- "std 7.482689 0.000856 Gradient Boosting "
+ " test_neg_mean_absolute_error test_r2 \n",
+ "mean -69.842862 0.996832 \n",
+ "std 6.533291 0.000709 "
]
},
"metadata": {},
@@ -714,27 +1067,24 @@
  "      <th>test_neg_mean_squared_error</th>\n",
  "      <th>test_neg_mean_absolute_error</th>\n",
  "      <th>test_r2</th>\n",
- "      <th>model_name</th>\n",
  "    </tr>\n",
  "  </thead>\n",
  "  <tbody>\n",
  "    <tr>\n",
  "      <th>mean</th>\n",
- "      <td>0.684064</td>\n",
- "      <td>0.018473</td>\n",
- "      <td>-59390.980641</td>\n",
- "      <td>-164.245985</td>\n",
- "      <td>0.983967</td>\n",
- "      <td>Decision Tree</td>\n",
+ "      <td>1.580274</td>\n",
+ "      <td>0.043137</td>\n",
+ "      <td>-57611.119230</td>\n",
+ "      <td>-163.296206</td>\n",
+ "      <td>0.984511</td>\n",
  "    </tr>\n",
  "    <tr>\n",
  "      <th>std</th>\n",
- "      <td>0.106449</td>\n",
- "      <td>0.006129</td>\n",
- "      <td>12130.331634</td>\n",
- "      <td>17.579599</td>\n",
- "      <td>0.003618</td>\n",
- "      <td>Decision Tree</td>\n",
+ "      <td>0.257548</td>\n",
+ "      <td>0.014953</td>\n",
+ "      <td>10015.212231</td>\n",
+ "      <td>14.241930</td>\n",
+ "      <td>0.002719</td>\n",
  "    </tr>\n",
  "  </tbody>\n",
  "</table>\n",
@@ -742,12 +1092,12 @@
],
"text/plain": [
" fit_time score_time test_neg_mean_squared_error \\\n",
- "mean 0.684064 0.018473 -59390.980641 \n",
- "std 0.106449 0.006129 12130.331634 \n",
+ "mean 1.580274 0.043137 -57611.119230 \n",
+ "std 0.257548 0.014953 10015.212231 \n",
"\n",
- " test_neg_mean_absolute_error test_r2 model_name \n",
- "mean -164.245985 0.983967 Decision Tree \n",
- "std 17.579599 0.003618 Decision Tree "
+ " test_neg_mean_absolute_error test_r2 \n",
+ "mean -163.296206 0.984511 \n",
+ "std 14.241930 0.002719 "
]
},
"metadata": {},
@@ -786,27 +1136,24 @@
  "      <th>test_neg_mean_squared_error</th>\n",
  "      <th>test_neg_mean_absolute_error</th>\n",
  "      <th>test_r2</th>\n",
- "      <th>model_name</th>\n",
  "    </tr>\n",
  "  </thead>\n",
  "  <tbody>\n",
  "    <tr>\n",
  "      <th>mean</th>\n",
- "      <td>10.112643</td>\n",
- "      <td>0.025550</td>\n",
- "      <td>-18307.274898</td>\n",
- "      <td>-83.639403</td>\n",
- "      <td>0.995059</td>\n",
- "      <td>Random Forest</td>\n",
+ "      <td>25.657500</td>\n",
+ "      <td>0.046849</td>\n",
+ "      <td>-18545.332476</td>\n",
+ "      <td>-85.850864</td>\n",
+ "      <td>0.995005</td>\n",
  "    </tr>\n",
  "    <tr>\n",
  "      <th>std</th>\n",
- "      <td>1.794122</td>\n",
- "      <td>0.014259</td>\n",
- "      <td>5609.330859</td>\n",
- "      <td>11.129464</td>\n",
- "      <td>0.001523</td>\n",
- "      <td>Random Forest</td>\n",
+ "      <td>5.260464</td>\n",
+ "      <td>0.015731</td>\n",
+ "      <td>5785.896650</td>\n",
+ "      <td>10.240212</td>\n",
+ "      <td>0.001532</td>\n",
  "    </tr>\n",
  "  </tbody>\n",
  "</table>\n",
@@ -814,12 +1161,12 @@
],
"text/plain": [
" fit_time score_time test_neg_mean_squared_error \\\n",
- "mean 10.112643 0.025550 -18307.274898 \n",
- "std 1.794122 0.014259 5609.330859 \n",
+ "mean 25.657500 0.046849 -18545.332476 \n",
+ "std 5.260464 0.015731 5785.896650 \n",
"\n",
- " test_neg_mean_absolute_error test_r2 model_name \n",
- "mean -83.639403 0.995059 Random Forest \n",
- "std 11.129464 0.001523 Random Forest "
+ " test_neg_mean_absolute_error test_r2 \n",
+ "mean -85.850864 0.995005 \n",
+ "std 10.240212 0.001532 "
]
},
"metadata": {},
@@ -827,6 +1174,7 @@
}
],
"source": [
+ "model_results = {}\n",
"results = pd.DataFrame({})\n",
"cross_validate_grid_search = KFold(n_splits=n_folds_grid_search)\n",
"cross_validate_comparative_analysis = ShuffleSplit(n_splits=n_splits_comparative_analysis, test_size=test_size, random_state=random_state)\n",
@@ -846,6 +1194,20 @@
  "        ('model', model_grid_search)\n",
  "    ])\n",
  "    \n",
+ "    # Store y_true and y_pred; the entry is overwritten on every split, so only the last split is kept\n",
+ "    for train_index, test_index in cross_validate_comparative_analysis.split(X, y):\n",
+ "        X_train, X_test = X.iloc[train_index], X.iloc[test_index]\n",
+ "        y_train, y_test = y.iloc[train_index], y.iloc[test_index]\n",
+ "        \n",
+ "        approach.fit(X_train, y_train)\n",
+ "        y_pred = approach.predict(X_test)\n",
+ "        \n",
+ "        model_results[model_name] = {\n",
+ "            'y_true': y_test,\n",
+ "            'y_pred': y_pred\n",
+ "        }\n",
+ "\n",
+ "    # Evaluate the model with cross_validate\n",
  "    scores = cross_validate(\n",
  "        estimator=approach,\n",
  "        X=X,\n",
@@ -856,15 +1218,16 @@
  "    )\n",
  "    \n",
  "    scores_df = pd.DataFrame(scores)\n",
- "    aggregated_scores = scores_df.agg(['mean', 'std'])\n",
- "    aggregated_scores['model_name'] = model_name\n",
- "    display(aggregated_scores)\n",
- "    results = pd.concat([results, aggregated_scores], ignore_index=True)"
+ "    scores_df['model_name'] = model_name\n",
+ "    results = pd.concat([results, scores_df], ignore_index=True)\n",
+ "    numeric_scores_df = scores_df.select_dtypes(include=['float64', 'int64'])\n",
+ "    scores_aggregated = numeric_scores_df.agg(['mean', 'std'])\n",
+ "    display(scores_aggregated)"
]
},
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 14,
"id": "5127adfc-d0c6-4f15-b833-bf1fdc91ef1b",
"metadata": {
"tags": []
@@ -873,7 +1236,7 @@
{
"data": {
"text/markdown": [
- "### 3.1 Resultados gerais"
+ "### 2.1 Resultados gerais"
],
"text/plain": [
""
@@ -886,110 +1249,110 @@
"data": {
"text/html": [
"\n",
- "\n",
+ "\n",
" \n",
" \n",
" | \n",
" model_name | \n",
- " Decision Tree | \n",
- " Gradient Boosting | \n",
- " K-Nearest Neighbors | \n",
- " Random Forest | \n",
+ " Decision Tree | \n",
+ " Gradient Boosting | \n",
+ " K-Nearest Neighbors | \n",
+ " Random Forest | \n",
"
\n",
" \n",
" \n",
" \n",
- " fit_time | \n",
- " mean | \n",
- " 0.395256 | \n",
- " 25.628387 | \n",
- " 0.451770 | \n",
- " 5.953382 | \n",
+ " fit_time | \n",
+ " mean | \n",
+ " 1.580274 | \n",
+ " 87.815198 | \n",
+ " 0.958916 | \n",
+ " 25.657500 | \n",
"
\n",
" \n",
- " std | \n",
- " 0.408435 | \n",
- " 24.439324 | \n",
- " 0.461433 | \n",
- " 5.882082 | \n",
+ " std | \n",
+ " 0.257548 | \n",
+ " 15.660049 | \n",
+ " 0.314371 | \n",
+ " 5.260464 | \n",
"
\n",
" \n",
- " score_time | \n",
- " mean | \n",
- " 0.012301 | \n",
- " 0.008824 | \n",
- " 0.022127 | \n",
- " 0.019905 | \n",
+ " score_time | \n",
+ " mean | \n",
+ " 0.043137 | \n",
+ " 0.043461 | \n",
+ " 0.086565 | \n",
+ " 0.046849 | \n",
"
\n",
" \n",
- " std | \n",
- " 0.008728 | \n",
- " 0.010631 | \n",
- " 0.010944 | \n",
- " 0.007984 | \n",
+ " std | \n",
+ " 0.014953 | \n",
+ " 0.012917 | \n",
+ " 0.049576 | \n",
+ " 0.015731 | \n",
"
\n",
" \n",
- " test_neg_mean_squared_error | \n",
- " mean | \n",
- " -23630.324504 | \n",
- " -4308.022101 | \n",
- " -198138.736726 | \n",
- " -6348.972019 | \n",
+ " test_neg_mean_squared_error | \n",
+ " mean | \n",
+ " -57611.119230 | \n",
+ " -11767.095009 | \n",
+ " -520945.104332 | \n",
+ " -18545.332476 | \n",
"
\n",
" \n",
- " std | \n",
- " 50573.204909 | \n",
- " 10560.695413 | \n",
- " 419945.836699 | \n",
- " 16911.594114 | \n",
+ " std | \n",
+ " 10015.212231 | \n",
+ " 2630.227186 | \n",
+ " 80345.848916 | \n",
+ " 5785.896650 | \n",
"
\n",
" \n",
- " test_neg_mean_absolute_error | \n",
- " mean | \n",
- " -73.333193 | \n",
- " -30.640475 | \n",
- " -232.251885 | \n",
- " -36.254970 | \n",
+ " test_neg_mean_absolute_error | \n",
+ " mean | \n",
+ " -163.296206 | \n",
+ " -69.842862 | \n",
+ " -538.463882 | \n",
+ " -85.850864 | \n",
"
\n",
" \n",
- " std | \n",
- " 128.570104 | \n",
- " 53.914296 | \n",
- " 408.947202 | \n",
- " 67.011709 | \n",
+ " std | \n",
+ " 14.241930 | \n",
+ " 6.533291 | \n",
+ " 36.727944 | \n",
+ " 10.240212 | \n",
"
\n",
" \n",
- " test_r2 | \n",
- " mean | \n",
- " 0.493793 | \n",
- " 0.498841 | \n",
- " 0.445912 | \n",
- " 0.498291 | \n",
+ " test_r2 | \n",
+ " mean | \n",
+ " 0.984511 | \n",
+ " 0.996832 | \n",
+ " 0.859563 | \n",
+ " 0.995005 | \n",
"
\n",
" \n",
- " std | \n",
- " 0.693211 | \n",
- " 0.704257 | \n",
- " 0.595908 | \n",
- " 0.702536 | \n",
+ " std | \n",
+ " 0.002719 | \n",
+ " 0.000709 | \n",
+ " 0.024787 | \n",
+ " 0.001532 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
- ""
+ ""
]
},
- "execution_count": 10,
+ "execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
@@ -1001,7 +1364,7 @@
" return np.where(s == np.nanmin(s.values), props, '')\n",
" return np.where(s == np.nanmax(s.values), props, '')\n",
"\n",
- "display(Markdown(\"### 3.1 Resultados gerais\"))\n",
+ "display(Markdown(\"### 2.1 Resultados gerais\"))\n",
"(\n",
" results\n",
" .groupby('model_name')\n",
@@ -1025,17 +1388,52 @@
"Esses resultados sugerem que, se o objetivo é maximizar a precisão e há tempo computacional disponível, o Gradient Boosting é a melhor opção. Se o tempo de treinamento for uma restrição, o Random Forest pode ser uma boa alternativa."
]
},
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "38a47820-08a4-4bd3-b8a8-88a8a8ce31d8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "