E2E Example¶
This notebook demonstrates the data ingestion, feature engineering, and scoring pipeline for a fixed-income fund recommendation system.
This illustrates the steps in our pipeline and explains the recommendation method.
Highlights:
- Interpretable and flexible knowledge-based recommendation system.
- Config-driven pipeline from YAML
Method¶
Our proposed method is simple: it is a knowledge-based recommendation system that relies on CVM data to recommend fixed-income funds on a periodic basis. It is designed to run in batch on a monthly basis, but it also allows exploration and backfill.
The high-level recommendation method is as follows:
- Fetches recent data from CVM. It currently supports CDA, but it can easily be expanded to other data sources, such as ANBIMA, and to other CVM data, such as daily quota values.
- Feature Engineering: It has a built-in feature engine (quite cool, by the way), which allows us to compute flexible feature definitions on parametrized entities and time columns. We currently apply the feature definitions with (CNPJ_FUNDO_CLASSE, DENOM_SOCIAL) as the entity, but it can be replaced by any other entity. The feature definitions are config-driven and are not tied to any specific backend. For example, we currently use pandas to perform the transformations, but we could switch to PySpark, SQL, or another backend.
- Compute Scores: On top of our features, we compute scores. Each score represents a criterion of interest, for example risk or diversification. Right now we use single-feature models based on the z-score, but this step could host richer heuristics or even ML models. Something I would like to test, but did not have time for, is an estimated Sharpe score: take the quota information and train an XGBoost model to predict a fund's future Sharpe ratio.
- Ranking based on Customer Profile Weighting over the Scores: This is the heart of our recommendation system. I perform a weighted sum over the scores to compute a final score, which I use to rank the funds. In a sense, it is a utility function that combines all criteria of interest (the scores) into a utility score that best suits a customer profile.
Concretely, we apply a weighted sum over all criteria, where each weight reflects the customer's profile:
U_i = \sum_{k=1}^{K} w_k \, s_{i,k}
Where:
- U_i — final utility score for fund i
- s_{i,k} — score k for fund i
- w_k — weight of score k derived from customer profile
- K — number of criteria
U_i =
\begin{bmatrix}
w_1 & w_2 & \cdots & w_K
\end{bmatrix}
\begin{bmatrix}
s_{i,1} \\
s_{i,2} \\
\vdots \\
s_{i,K}
\end{bmatrix}
We then normalize the weights so that they sum to 1, and rank the funds by the resulting utility:
U_i = \sum_{k=1}^{K}
\left( \frac{w_k}{\sum_{j=1}^{K} w_j} \right) s_{i,k}
\text{RankedFunds} = \operatorname{argsort}\left( -U_i \right)
Where:
- U_i is the final utility score for fund i
- argsort returns the indices of funds sorted by the given value
- The negative sign ensures descending order (highest score first)
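The normalized weighted sum and the argsort ranking can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas in this section, not the package's own implementation; the function name rank_funds and the toy numbers are assumptions made up for the example.

```python
import numpy as np

def rank_funds(scores, weights):
    """Return fund indices sorted by descending utility, plus the utilities."""
    scores = np.asarray(scores, dtype=float)      # shape (n_funds, K): s_{i,k}
    weights = np.asarray(weights, dtype=float)    # shape (K,): w_k
    normalized = weights / weights.sum()          # w_k / sum_j w_j
    utility = scores @ normalized                 # U_i for every fund i
    order = np.argsort(-utility, kind="stable")   # negative sign: highest first
    return order, utility

# Toy example: 3 funds, 2 criteria (values are made up for illustration).
scores = [[1.0, 0.0],
          [0.0, 1.0],
          [0.5, 0.5]]
order, utility = rank_funds(scores, weights=[3.0, 1.0])
print(order.tolist())    # fund indices, best first
print(utility.tolist())  # utility per fund, in the original fund order
```

Because the weights are normalized inside the function, a profile can be written with any positive scale (for example 3 and 1 instead of 0.75 and 0.25) without changing the ranking.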
Customer Profiles¶
The customer profile weights are currently hardcoded values, but we propose different ways to define them.
To address cold start, we could use a questionnaire with Likert-scale questions (1 to 5) to understand how important each criterion is to the customer, or even a fuzzy-logic heuristic to combine the questionnaire answers into the weights for each score.
Once we have customer feedback on the recommendations, we could feed it into a re-ranker that adjusts the rankings based on the customer's previous portfolio choices.
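As one possible reading of the questionnaire idea, the sketch below turns Likert answers (1 to 5) into profile weights by simple normalization. The helper likert_to_weights and the sample answers are hypothetical; the fuzzy-logic variant mentioned above is not covered here.

```python
def likert_to_weights(answers, low=1, high=5):
    """Map one Likert answer per score name to weights that sum to 1."""
    for name, value in answers.items():
        if not low <= value <= high:
            raise ValueError(f"answer for {name!r} must be in [{low}, {high}]")
    total = sum(answers.values())  # always > 0 because every answer >= low >= 1
    return {name: value / total for name, value in answers.items()}

# Hypothetical answers: how important is each criterion to this customer?
answers = {
    "size_score": 5,
    "credit_risk_score": 3,
    "governance_risk_score": 2,
}
weights = likert_to_weights(answers)
print(weights)
```

The resulting dictionary has the same shape as a profile entry in the manifest (score name mapped to weight), so it could in principle be used in place of a hardcoded profile.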
Troubleshooting¶
Prerequisites: Python packages pandas, requests, and pyyaml. Optional: pyarrow or fastparquet for Parquet I/O.
Note: For quick demos this notebook may use local CSV fallbacks; substitute fetch_manifest(...) to run the full end-to-end pipeline against remote sources.
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
Quick check¶
Run a quick sanity check to ensure the package is importable and that basic helpers (like hello()) work as expected. This is useful to confirm the development environment is set up correctly before running heavier pipeline steps.
from fif_recsys import hello
# This will print to the notebook output
hello()
Hello from fif_recsys!
Configuration manifest (YAML)¶
The configuration dictionary (config_d) defines how data is fetched and how features and scores are computed.
- fetch: datasets to download. Each dataset includes base_url, periods, and filename_template.
- feature: registry of features to compute, including aggregation method and optional adjustments.
- score: scoring definitions (type, feature source, and adjustments like invert).
- profile: named profile weightings used to aggregate scores into a single ranking for each investor profile.
Edit these values to match your data sources and scoring preferences.
import yaml
config_d = yaml.safe_load("""
fetch:
  cda:
    base_url: "https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/"
    periods:
      - "202501"
      - "202502"
      - "202503"
      - "202504"
      - "202505"
      - "202506"
      - "202507"
      - "202508"
      - "202509"
      - "202510"
      - "202511"
      - "202512"
    filename_template: "cda_fi_{period}.zip"
  cotas:
    base_url: "https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/"
    periods:
      - "202301"
      - "202302"
      - "202303"
      - "202304"
      - "202305"
      - "202306"
      - "202307"
      - "202308"
      - "202309"
      - "202310"
      - "202311"
      - "202312"
      - "202401"
      - "202402"
      - "202403"
      - "202404"
      - "202405"
      - "202406"
      - "202407"
      - "202408"
      - "202409"
      - "202410"
      - "202411"
      - "202412"
      - "202501"
      - "202502"
      - "202503"
      - "202504"
      - "202505"
      - "202506"
      - "202507"
      - "202508"
      - "202509"
      - "202510"
      - "202511"
      - "202512"
    filename_template: "inf_diario_fi_{period}.zip"
feature:
  group_keys:
    - CNPJ_FUNDO_CLASSE
    - DENOM_SOCIAL
    - reference_date
  feature_registry:
    cda:
      patrimonio_liq:
        description: "Maximum reported net asset value per fund-month."
        method: max
        args:
          - VL_PATRIM_LIQ
      log_aum:
        description: "Log-transformed maximum reported net asset value per fund-month."
        method: max
        args:
          - VL_PATRIM_LIQ
        adjustment:
          - log
      total_posicao:
        description: "Sum of final market value of all positions in the period."
        method: sum
        args:
          - VL_MERC_POS_FINAL
      n_ativos:
        description: "Number of unique assets in the fund portfolio."
        method: nunique
        args:
          - CD_ATIVO
      n_emissores:
        description: "Number of unique issuers in the fund portfolio."
        method: nunique
        args:
          - CPF_CNPJ_EMISSOR
      credito_share:
        description: "Weighted share of credit-linked assets in the portfolio."
        method: credito_share_feature_fn
        args:
          - ["Debêntures", "Cédula de Crédito", "CRI", "CRA", "Notas Promissórias"]
        adjustment:
          - clip
      related_party_share:
        description: "Weighted share of related-party issuers."
        method: related_party_share_feature_fn
        adjustment:
          - clip
      issuer_hhi:
        description: "Herfindahl-Hirschman index based on issuer weights."
        method: hhi_feature_fn
        adjustment:
          - clip
          - coalesce
    cotas:
score:
  size_score:
    type: zscore
    description: >
      Measures the relative size of the fund based on its assets under
      management. Larger funds typically exhibit greater operational
      stability, better liquidity access, and lower idiosyncratic risk.
      Computed using the z-score of the log-transformed AUM (log_aum).
    args:
      feature: log_aum
  diversification_score:
    type: zscore
    description: >
      Evaluates how diversified the fund's portfolio is in terms of
      the number of unique assets held. Higher values indicate broader
      asset diversification, reducing exposure to security-specific risks.
    args:
      feature: n_ativos
  issuer_diversification_score:
    type: zscore
    description: >
      Measures diversification across issuers by counting how many distinct
      counterparties the fund is exposed to. Funds with exposures distributed
      across more issuers typically have lower concentration and reduced
      issuer-specific credit risk.
    args:
      feature: n_emissores
  credit_risk_score:
    type: zscore
    description: >
      Quantifies the fund's exposure to credit-linked instruments such as
      debentures, CRIs/CRAs, and promissory notes. A higher credit share
      typically increases sensitivity to credit events. The score is inverted
      so that higher credit exposure corresponds to a lower (worse) score.
    args:
      feature: credito_share
    adjustment:
      - invert
  governance_risk_score:
    type: zscore
    description: >
      Captures exposure to related-party transactions, which may increase
      governance risk due to potential conflicts of interest and reduced
      market discipline. The score is inverted, so funds with higher
      related-party share receive a lower (worse) score.
    args:
      feature: related_party_share
    adjustment:
      - invert
  concentration_risk_score:
    type: zscore
    description: >
      Measures portfolio concentration using the Herfindahl-Hirschman Index
      (HHI) computed over issuer exposure weights. Higher HHI values indicate
      more concentrated portfolios and elevated idiosyncratic and liquidity
      risks. Score is inverted so higher concentration yields a lower score.
    args:
      feature: issuer_hhi
    adjustment:
      - invert
profile:
  conservative:
    description: >
      Designed for risk-averse investors prioritizing capital preservation and stability.
      Emphasizes fund size, diversification, and issuer spread to minimize volatility,
      while keeping exposure to credit and governance risks tightly controlled.
    size_score: 0.25
    diversification_score: 0.20
    issuer_diversification_score: 0.20
    credit_risk_score: 0.15
    governance_risk_score: 0.10
    concentration_risk_score: 0.10
  balanced:
    description: >
      Suitable for investors seeking a middle ground between safety and return.
      Balances diversification and issuer exposure with moderate tolerance for credit
      and concentration risks, aiming for a stable but growth-oriented allocation.
    size_score: 0.20
    diversification_score: 0.15
    issuer_diversification_score: 0.15
    credit_risk_score: 0.20
    governance_risk_score: 0.15
    concentration_risk_score: 0.15
  institutional:
    description: >
      Targeted at large professional allocators who value scale and diversification
      but can tolerate more concentrated or complex positions. Prioritizes fund size
      and issuer spread while placing relatively lower weight on credit and governance constraints.
    size_score: 0.30
    diversification_score: 0.20
    issuer_diversification_score: 0.20
    credit_risk_score: 0.10
    governance_risk_score: 0.10
    concentration_risk_score: 0.10
""")
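To make the zscore score type and the invert adjustment more concrete, here is a minimal pandas sketch of what such a score could look like. It assumes a cross-sectional z-score over all funds with the population standard deviation; the package's actual scoring code may group or scale differently, and the function name and toy values are made up for illustration.

```python
import pandas as pd

def zscore(feature: pd.Series, invert: bool = False) -> pd.Series:
    """Standardize a feature across funds; optionally flip the sign."""
    std = feature.std(ddof=0)  # population std (an assumption of this sketch)
    if pd.isna(std) or std == 0:
        # A constant feature carries no ranking information.
        z = pd.Series(0.0, index=feature.index)
    else:
        z = (feature - feature.mean()) / std
    # invert: a higher raw value (e.g. more credit exposure) gives a lower score
    return -z if invert else z

# Toy credito_share values for three funds (made up for illustration).
credito_share = pd.Series([0.0, 0.5, 1.0], index=["fund_a", "fund_b", "fund_c"])
credit_risk_score = zscore(credito_share, invert=True)
print(credit_risk_score)
```

With invert=True the fund with no credit exposure gets the highest score and the fully credit-exposed fund gets the lowest, which matches the intent described for credit_risk_score.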
Fetch datasets¶
Use fetch_manifest to download and assemble datasets defined in the manifest. The function returns a dict mapping dataset names to pandas.DataFrame objects and writes partitioned files to output_dir/<dataset>/period=<period>/data.parquet when a Parquet engine is available (a CSV fallback is used otherwise).
Example usage (below) demonstrates both the programmatic fetch and a temporary offline fallback for quick demos.
from pathlib import Path
from fif_recsys.commands.data import fetch_manifest
data_sources_d = fetch_manifest(config_d['fetch'], output_dir=Path("/tmp"))
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202501.zip
Parsing cda_fie_202501.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1011: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2071: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2279: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 3391: field larger than field limit (131072)
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202502.zip
Parsing cda_fie_202502.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1073: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2175: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2373: field larger than field limit (131072)
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202503.zip
Parsing cda_fie_202503.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1090: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2316: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2480: field larger than field limit (131072)
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202504.zip
Parsing cda_fie_202504.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1190: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2459: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2625: field larger than field limit (131072)
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202505.zip
Parsing cda_fie_202505.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1215: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2627: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2816: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 3394: field larger than field limit (131072)
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202506.zip
Parsing cda_fie_202506.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 648: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1430: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2930: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 3148: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 3889: ';' expected after '"'
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202507.zip
Parsing cda_fie_202507.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 347: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1139: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2693: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2917: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 3737: ';' expected after '"'
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202508.zip
Parsing cda_fie_202508.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 241: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1044: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2640: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2733: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 3606: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 4049: field larger than field limit (131072)
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202509.zip
Parsing cda_fie_202509.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 3164: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 3922: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 5598: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 5688: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 6579: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 7054: field larger than field limit (131072)
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202510.zip
Parsing cda_fie_202510.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 174: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 614: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1935: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2260: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2459: field larger than field limit (131072)
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202511.zip
Parsing cda_fie_202511.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 145: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 355: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1248: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1411: ';' expected after '"'
df = pd.read_csv(
Downloading https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/cda_fi_202512.zip
Parsing cda_fie_202512.csv
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1060: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 1303: ';' expected after '"'
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2247: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2421: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 2465: field larger than field limit (131072)
df = pd.read_csv(
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/data.py:41: ParserWarning: Skipping line 8583: ';' expected after '"'
df = pd.read_csv(
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202501/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202502/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202503/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202504/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202505/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202506/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202507/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202508/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202509/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202510/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202511/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cda/period=202512/data.csv
Saved cda → /tmp/cda
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202301.zip
Parsing inf_diario_fi_202301.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202302.zip
Parsing inf_diario_fi_202302.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202303.zip
Parsing inf_diario_fi_202303.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202304.zip
Parsing inf_diario_fi_202304.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202305.zip
Parsing inf_diario_fi_202305.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202306.zip
Parsing inf_diario_fi_202306.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202307.zip
Parsing inf_diario_fi_202307.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202308.zip
Parsing inf_diario_fi_202308.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202309.zip
Parsing inf_diario_fi_202309.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202310.zip
Parsing inf_diario_fi_202310.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202311.zip
Parsing inf_diario_fi_202311.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202312.zip
Parsing inf_diario_fi_202312.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202401.zip
Parsing inf_diario_fi_202401.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202402.zip
Parsing inf_diario_fi_202402.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202403.zip
Parsing inf_diario_fi_202403.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202404.zip
Parsing inf_diario_fi_202404.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202405.zip
Parsing inf_diario_fi_202405.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202406.zip
Parsing inf_diario_fi_202406.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202407.zip
Parsing inf_diario_fi_202407.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202408.zip
Parsing inf_diario_fi_202408.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202409.zip
Parsing inf_diario_fi_202409.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202410.zip
Parsing inf_diario_fi_202410.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202411.zip
Parsing inf_diario_fi_202411.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202412.zip
Parsing inf_diario_fi_202412.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202501.zip
Parsing inf_diario_fi_202501.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202502.zip
Parsing inf_diario_fi_202502.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202503.zip
Parsing inf_diario_fi_202503.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202504.zip
Parsing inf_diario_fi_202504.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202505.zip
Parsing inf_diario_fi_202505.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202506.zip
Parsing inf_diario_fi_202506.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202507.zip
Parsing inf_diario_fi_202507.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202508.zip
Parsing inf_diario_fi_202508.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202509.zip
Parsing inf_diario_fi_202509.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202510.zip
Parsing inf_diario_fi_202510.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202511.zip
Parsing inf_diario_fi_202511.csv
Downloading https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202512.zip
Parsing inf_diario_fi_202512.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202301/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202302/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202303/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202304/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202305/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202306/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202307/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202308/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202309/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202310/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202311/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202312/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202401/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202402/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202403/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202404/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202405/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202406/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202407/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202408/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202409/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202410/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202411/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202412/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202501/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202502/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202503/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202504/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202505/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202506/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202507/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202508/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202509/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202510/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202511/data.csv
Pyarrow not available; wrote CSV instead: /tmp/cotas/period=202512/data.csv
Saved cotas → /tmp/cotas
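The download URLs and the partition paths seen in the log above follow directly from the manifest fields. The sketch below shows one way this could work, assuming the URL is base_url concatenated with the formatted filename_template and that the CSV fallback is triggered by a missing Parquet engine. The helpers expand_urls and write_partition are hypothetical and are not the functions in fif_recsys.

```python
import tempfile
from pathlib import Path

import pandas as pd

def expand_urls(dataset_cfg: dict) -> dict:
    """Build one download URL per period from a manifest dataset entry."""
    base = dataset_cfg["base_url"]
    template = dataset_cfg["filename_template"]
    return {p: base + template.format(period=p) for p in dataset_cfg["periods"]}

def write_partition(df: pd.DataFrame, output_dir, dataset: str, period: str) -> Path:
    """Write <output_dir>/<dataset>/period=<period>/data.parquet, or CSV as a fallback."""
    part_dir = Path(output_dir) / dataset / f"period={period}"
    part_dir.mkdir(parents=True, exist_ok=True)
    try:
        path = part_dir / "data.parquet"
        df.to_parquet(path, index=False)  # pandas raises ImportError without pyarrow/fastparquet
    except ImportError:
        path = part_dir / "data.csv"
        df.to_csv(path, index=False)
    return path

cda_cfg = {
    "base_url": "https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/",
    "periods": ["202501", "202502"],
    "filename_template": "cda_fi_{period}.zip",
}
urls = expand_urls(cda_cfg)
print(urls["202501"])

with tempfile.TemporaryDirectory() as tmp:
    # Toy frame with placeholder values, only to exercise the writer.
    toy = pd.DataFrame({"CNPJ_FUNDO_CLASSE": ["00.000.000/0001-00"], "VL_PATRIM_LIQ": [1.0]})
    written = write_partition(toy, tmp, "cda", "202501")
    relative = written.relative_to(tmp).as_posix()
    existed = written.exists()
print(relative)
```

Hive-style period=<period> directories keep each month independent, so a backfill can rewrite a single partition without touching the others.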
Compute features¶
Call compute_all_features with the FEATURE_ENGINE to aggregate fund-month features according to your feature_registry. The result is a DataFrame with one row per fund-month, with the computed features ready for scoring.
from fif_recsys.commands.feature import compute_all_features, FEATURE_ENGINE
feature_df = compute_all_features(data_sources_d, config_d, FEATURE_ENGINE)
feature_df.head()
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/feature.py:26: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
def build_feature_engine(feature_engine: Dict, group_keys: List[str], registry: Any):
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/feature.py:26: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
def build_feature_engine(feature_engine: Dict, group_keys: List[str], registry: Any):
/opt/homebrew/Caskroom/miniconda/base/envs/py313-fif/lib/python3.13/site-packages/pandas/core/arraylike.py:399: RuntimeWarning: invalid value encountered in log1p
result = getattr(ufunc, method)(*inputs, **kwargs)
Skipping dataset 'cotas': no features defined in registry.
/Users/gustavopolleti/dev/fixed-income-fund-recsys/fif_recsys/commands/feature.py:158: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
| | CNPJ_FUNDO_CLASSE | DENOM_SOCIAL | reference_date | patrimonio_liq | log_aum | total_posicao | n_ativos | n_emissores | credito_share | related_party_share | issuer_hhi |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 06.323.688/0001-27 | IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABILIDADE LIMITADA | 2026-01-23 | 9.965211e+08 | 20.719781 | 6.826644e+09 | 58 | 1 | 0.0 | 0.125166 | 1.000000 |
| 1 | 09.260.031/0001-56 | FUNDO DE INVESTIMENTO EM QUOTAS DE FUNDO DE INVESTIMENTO EM DIREITOS CREDITÓRIOS NÃO PADRONIZADO SRM | 2026-01-23 | 8.236450e+07 | 18.226665 | 5.039806e+08 | 0 | 8 | 0.0 | 0.479135 | 0.298536 |
| 2 | 10.292.322/0001-05 | KONDOR KOBOLD FUNDO DE INVESTIMENTO EM COTAS DE FIDC - RESP LIMITADA | 2026-01-23 | 5.389893e+08 | 20.105206 | 4.547051e+09 | 0 | 4 | 0.0 | 0.999696 | 0.610606 |
| 3 | 10.406.511/0001-61 | ISHARES IBOVESPA CLASSE DE ÍNDICE - RESPONSABILIDADE LIMITADA | 2026-01-23 | 1.499092e+10 | 23.430710 | 1.028544e+11 | 103 | 9 | 0.0 | 0.013466 | 0.364377 |
| 4 | 10.406.600/0001-08 | ISHARES BM&FBOVESPA SMALL CAP CLASSE DE ÍNDICE - RESPONSABILIDADE LIMITADA | 2026-01-23 | 2.112755e+09 | 21.471258 | 1.813606e+10 | 131 | 11 | 0.0 | 0.035685 | 0.856891 |
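
The warnings printed before the feature table point at two patterns that could be tightened. The sketch below is not the fif_recsys code: it uses a toy frame with made-up values and hypothetical names (`fund_id`, `position_value`, `hhi`) to show (1) selecting the value columns after `groupby` so that `apply` no longer operates on the grouping columns, and (2) masking negative values before `log1p` so that no invalid-value warning is raised. It assumes `log_aum` is `log1p(patrimonio_liq)`, which is consistent with the first row of the table above.

```python
import numpy as np
import pandas as pd

# Toy positions; column names and values are hypothetical.
toy = pd.DataFrame({
    "fund_id": ["A", "A", "B", "B"],
    "position_value": [60.0, 40.0, 100.0, 0.0],
})


def hhi(group: pd.DataFrame) -> float:
    """Herfindahl-Hirschman index of position shares within one fund."""
    shares = group["position_value"] / group["position_value"].sum()
    return float((shares ** 2).sum())


# Selecting the value columns explicitly keeps the grouping column
# out of the frame that is passed to `hhi`.
hhi_by_fund = toy.groupby("fund_id")[["position_value"]].apply(hhi)

# Negative net assets become NaN before the log transform instead of
# triggering "invalid value encountered in log1p".
aum = pd.Series([9.965211e08, -5.0, 0.0])
log_aum = np.log1p(aum.where(aum >= 0))
```

Whether negative net assets should become missing values or be clipped to zero is a modelling choice; masking keeps them visible as NaN for the later `fillna` step.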
Compute scores¶
Convert features into normalized scores using compute_scores_from_yaml. The score section in the configuration defines score types (e.g., zscore) and optional adjustments (e.g., invert). The resulting DataFrame will contain the base features and the derived score columns.
from fif_recsys.commands.model import compute_scores_from_yaml
score_df = compute_scores_from_yaml(feature_df, config_d)
score_df[['CNPJ_FUNDO_CLASSE', 'DENOM_SOCIAL', *[c for c in score_df.columns if 'score' in c]]].head()
| | CNPJ_FUNDO_CLASSE | DENOM_SOCIAL | size_score | diversification_score | issuer_diversification_score | credit_risk_score | governance_risk_score | concentration_risk_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 06.323.688/0001-27 | IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABILIDADE LIMITADA | 1.280804 | 2.536111 | -0.541774 | 0.070015 | 0.512272 | -1.087377 |
| 1 | 09.260.031/0001-56 | FUNDO DE INVESTIMENTO EM QUOTAS DE FUNDO DE INVESTIMENTO EM DIREITOS CREDITÓRIOS NÃO PADRONIZADO SRM | 0.213781 | -0.223430 | -0.031130 | 0.070015 | -0.363396 | 0.826859 |
| 2 | 10.292.322/0001-05 | KONDOR KOBOLD FUNDO DE INVESTIMENTO EM COTAS DE FIDC - RESP LIMITADA | 1.017773 | -0.223430 | -0.322927 | 0.070015 | -1.651189 | -0.024753 |
| 3 | 10.406.511/0001-61 | ISHARES IBOVESPA CLASSE DE ÍNDICE - RESPONSABILIDADE LIMITADA | 2.441048 | 4.677134 | 0.041819 | 0.070015 | 0.788599 | 0.647186 |
| 4 | 10.406.600/0001-08 | ISHARES BM&FBOVESPA SMALL CAP CLASSE DE ÍNDICE - RESPONSABILIDADE LIMITADA | 1.602427 | 6.009326 | 0.187718 | 0.070015 | 0.733635 | -0.696846 |
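
To make the zscore and invert options concrete, here is a minimal sketch of a single-feature z-score with an optional sign flip. It is an illustration, not the implementation behind compute_scores_from_yaml: the helper name `zscore`, the use of the population standard deviation, and the mapping of `log_aum` and `issuer_hhi` to these two scores are all assumptions.

```python
import pandas as pd


def zscore(feature: pd.Series, invert: bool = False) -> pd.Series:
    """Standardize a feature across funds; flip the sign when lower raw values are better."""
    std = feature.std(ddof=0)  # population std; the real pipeline may use ddof=1
    if pd.isna(std) or std == 0:
        # A constant column carries no ranking information.
        return pd.Series(0.0, index=feature.index)
    z = (feature - feature.mean()) / std
    return -z if invert else z


# Toy features; the values are made up.
toy = pd.DataFrame({"log_aum": [18.0, 20.0, 22.0], "issuer_hhi": [1.0, 0.5, 0.0]})
toy["size_score"] = zscore(toy["log_aum"])
# Higher concentration is worse, so the score is inverted (assumed mapping).
toy["concentration_risk_score"] = zscore(toy["issuer_hhi"], invert=True)
```

With invert, every score follows the same convention that higher is better, which keeps the weighted sum in the ranking step straightforward to reason about.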
Compute profile rankings¶
Use compute_profile_scores_from_yaml (from fif_recsys.commands.policy) to aggregate weighted scores into a single profile score and ranking for each fund. Profiles are defined in the profile section of the configuration (e.g., conservative, balanced, institutional).
import pandas as pd
pd.set_option('display.max_colwidth', None) # or set to a large integer value (e.g., 500)
from fif_recsys.commands.policy import compute_profile_scores_from_yaml
ranking_df = compute_profile_scores_from_yaml(score_df.fillna(0), config_d)
ranking_df[['CNPJ_FUNDO_CLASSE', 'DENOM_SOCIAL', 'reference_date', *[c for c in ranking_df.columns if 'rank' in c]]].head()
| | CNPJ_FUNDO_CLASSE | DENOM_SOCIAL | reference_date | rank_conservative | rank_balanced | rank_institutional |
|---|---|---|---|---|---|---|
| 0 | 06.323.688/0001-27 | IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABILIDADE LIMITADA | 2026-01-23 | 82 | 122 | 74 |
| 1 | 09.260.031/0001-56 | FUNDO DE INVESTIMENTO EM QUOTAS DE FUNDO DE INVESTIMENTO EM DIREITOS CREDITÓRIOS NÃO PADRONIZADO SRM | 2026-01-23 | 381 | 383 | 381 |
| 2 | 10.292.322/0001-05 | KONDOR KOBOLD FUNDO DE INVESTIMENTO EM COTAS DE FIDC - RESP LIMITADA | 2026-01-23 | 456 | 608 | 420 |
| 3 | 10.406.511/0001-61 | ISHARES IBOVESPA CLASSE DE ÍNDICE - RESPONSABILIDADE LIMITADA | 2026-01-23 | 7 | 7 | 7 |
| 4 | 10.406.600/0001-08 | ISHARES BM&FBOVESPA SMALL CAP CLASSE DE ÍNDICE - RESPONSABILIDADE LIMITADA | 2026-01-23 | 8 | 8 | 8 |
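
The aggregation behind these rank columns can be sketched as a weighted sum followed by a descending rank. The code below is only an illustration under stated assumptions: the weights, the toy scores, and the `profile_score_<name>` column name are hypothetical, and the real weights live in the profile section of the YAML configuration.

```python
import pandas as pd

# Hypothetical weights per profile (not taken from the project config).
profiles = {
    "conservative": {"size_score": 0.5, "concentration_risk_score": 0.5},
    "balanced": {"size_score": 0.8, "concentration_risk_score": 0.2},
}

# Toy scores for three made-up funds.
scores = pd.DataFrame(
    {
        "size_score": [1.0, 0.0, -1.0],
        "concentration_risk_score": [-2.0, 1.0, 0.5],
    },
    index=["fund_a", "fund_b", "fund_c"],
)

ranked = scores.copy()
for name, weights in profiles.items():
    w = pd.Series(weights)
    # Utility: weighted sum of the scores this profile cares about.
    ranked[f"profile_score_{name}"] = scores[w.index].dot(w)
    # Rank 1 is the fund with the highest utility for the profile.
    ranked[f"rank_{name}"] = (
        ranked[f"profile_score_{name}"]
        .rank(ascending=False, method="first")
        .astype(int)
    )
```

In this toy data the two profiles order the funds differently because the balanced profile puts more weight on size. In the table above, funds that score well on every criterion keep the same rank across profiles.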
Next steps & CLI¶
- Run the full pipeline from the command line using the Typer-based CLI:
  - fif-recsys data fetch: download and prepare datasets
  - fif-recsys feature build: compute and write feature tables
  - fif-recsys model score: compute scores
- Tips:
  - Install pyarrow for faster Parquet I/O when running on large datasets.
  - For reproducible fetches, consider passing a deterministic reference_date to fetch_manifest.
Feel free to update this notebook with real data paths and run the pipeline end-to-end.
Inspecting pipeline outputs¶
If you ran the Docker pipeline and mounted an output directory (e.g., /tmp/fif_data on the host → /data in the container), the pipeline writes the final profile-scored table to features_profile_scored.parquet or features_profile_scored.csv in that directory. The commented-out cell at the end of this section loads and previews that file; update output_dir if you used a different directory. The cell directly below instead shows the top five funds for the conservative profile, taken from the in-memory ranking_df.
ranking_df[['CNPJ_FUNDO_CLASSE', 'DENOM_SOCIAL', 'reference_date', *[c for c in ranking_df.columns if 'rank' in c]]].sort_values(by='rank_conservative', ascending=True)[:5]
| CNPJ_FUNDO_CLASSE | DENOM_SOCIAL | reference_date | rank_conservative | rank_balanced | rank_institutional | |
|---|---|---|---|---|---|---|
| 219 | 40.155.573/0001-09 | TREND ETF IBOVESPA CLASSE DE ÍNDICE - RESPONSABILIDADE LIMITADA | 2026-01-23 | 1 | 1 | 1 |
| 133 | 32.203.211/0001-18 | FUNDO DE INVESTIMENTO DE ÍNDICE - CLASSE DE INVESTIMENTO ETF BRADESCO IBOVESPA - RESP LIMITADA | 2026-01-23 | 2 | 2 | 2 |
| 143 | 34.606.480/0001-50 | BB ETF IBOVESPA FUNDO DE ÍNDICE RESPONSABILIDADE LIMITADA | 2026-01-23 | 3 | 3 | 3 |
| 424 | 48.643.130/0001-79 | FUNDO DE INVESTIMENTO DE ÍNDICE - CI B-INDEX MORNINGSTAR BRASIL PESOS IGUAIS - RESP LIMITADA | 2026-01-23 | 4 | 4 | 4 |
| 730 | 57.848.980/0001-02 | BB ETF ÍNDICE BOVESPA B3 BR+ FUNDO DE ÍNDICE RESPONSABILIDADE LIMITADA | 2026-01-23 | 5 | 5 | 5 |
# # Load and preview the profile-scored table
# from pathlib import Path
# import pandas as pd
# pd.set_option('display.max_colwidth', None) # or set to a large integer value (e.g., 500)
# # Update this path to the directory you mounted into the container (host path: /tmp/fif_data)
# output_dir = Path("/tmp/fif_data")
# pj = output_dir / "features_profile_scored.parquet"
# pcsv = output_dir / "features_profile_scored.csv"
# if pj.exists():
#     df = pd.read_parquet(pj)
# elif pcsv.exists():
#     df = pd.read_csv(pcsv)
# else:
#     raise FileNotFoundError(f"No profile-scored output found at {pj} or {pcsv}. Make sure you mounted the output dir and ran the pipeline.")
# # Quick preview
# print("Path:", pj if pj.exists() else pcsv)
# print("Rows:", len(df))
# print("Columns:", list(df.columns))
# df