Fix: QlibDataLoader drops the cols added by inst_processor #1430

Open
wants to merge 10 commits into
base: main

@xu-li xu-li commented Feb 5, 2023

Fix: QlibDataLoader drops the cols added by inst_processor

Description

When inst_processor creates new columns, the QlibDataLoader drops them.

Motivation and Context

How Has This Been Tested?

  • Pass the test by running pytest qlib/tests/test_all_pipeline.py from the parent directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

  1. Pipeline test:
  2. Your own tests:

Types of changes

  • [x] Fix bugs
  • Add new feature
  • Update documentation

@github-actions github-actions bot added the waiting for triage Cannot auto-triage, wait for triage. label Feb 5, 2023
@xu-li
Author

xu-li commented Feb 5, 2023 via email

@you-n-g
Collaborator

you-n-g commented Feb 6, 2023

@xu-li
Could you please add a test for it?
(It should fail on the previous version and succeed with this change.)

Thanks.

@xu-li
Author

xu-li commented Feb 6, 2023

The test pipeline seems to be broken already on Mac. See the output below. I am on 6295939.

Am I doing something wrong?

% git rev-parse HEAD
6295939346bfe619fe6eef5ffbe7c9252c3c9b09
% pytest tests/test_all_pipeline.py
================================================================================================= test session starts ==================================================================================================
platform darwin -- Python 3.8.16, pytest-7.2.1, pluggy-1.0.0
rootdir: /XXXXXX/tests, configfile: pytest.ini
collected 3 items                                                                                                                                                                                                      

tests/test_all_pipeline.py .F.                                                                                                                                                                                   [100%]

======================================================================================================= FAILURES =======================================================================================================
_____________________________________________________________________________________________ TestAllFlow.test_1_backtest ______________________________________________________________________________________________

self = <test_all_pipeline.TestAllFlow testMethod=test_1_backtest>

    @pytest.mark.slow
    def test_1_backtest(self):
        analyze_df = backtest_analysis(TestAllFlow.PRED_SCORE, TestAllFlow.RID, self.URI_PATH)
>       self.assertGreaterEqual(
            analyze_df.loc(axis=0)["excess_return_with_cost", "annualized_return"].values[0],
            0.05,
            "backtest failed",
        )
E       AssertionError: -0.09994674991447777 not greater than or equal to 0.05 : backtest failed

tests/test_all_pipeline.py:166: AssertionError
------------------------------------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------------------------------------
'The following are analysis results of benchmark return(1day).'
                       risk
mean               0.000477
std                0.012295
annualized_return  0.113561
information_ratio  0.598699
max_drawdown      -0.370479
'The following are analysis results of the excess return without cost(1day).'
                       risk
mean              -0.000417
std                0.012286
annualized_return -0.099245
information_ratio -0.523614
max_drawdown      -0.503287
'The following are analysis results of the excess return with cost(1day).'
                       risk
mean              -0.000420
std                0.012286
annualized_return -0.099947
information_ratio -0.527307
max_drawdown      -0.503838
'The following are analysis results of indicators(1day).'
     value
ffr    1.0
pa     0.0
pos    0.0
                                                  risk
excess_return_without_cost mean              -0.000417
                           std                0.012286
                           annualized_return -0.099245
                           information_ratio -0.523614
                           max_drawdown      -0.503287
excess_return_with_cost    mean              -0.000420
                           std                0.012286
                           annualized_return -0.099947
                           information_ratio -0.527307
                           max_drawdown      -0.503838
------------------------------------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------------------------------------
[73260:MainThread](2023-02-06 20:06:05,082) INFO - qlib.timer - [log.py:128] - Time cost: 71.856s | Loading data Done
[73260:MainThread](2023-02-06 20:06:06,395) INFO - qlib.timer - [log.py:128] - Time cost: 0.359s | DropnaLabel Done
[73260:MainThread](2023-02-06 20:06:10,730) INFO - qlib.timer - [log.py:128] - Time cost: 4.335s | CSZScoreNorm Done
[73260:MainThread](2023-02-06 20:06:10,759) INFO - qlib.timer - [log.py:128] - Time cost: 5.677s | fit & process data Done
[73260:MainThread](2023-02-06 20:06:10,760) INFO - qlib.timer - [log.py:128] - Time cost: 77.534s | Init data Done
[73260:MainThread](2023-02-06 20:06:10,789) INFO - qlib.backtest caller - [__init__.py:93] - Create new exchange
[73260:MainThread](2023-02-06 20:06:39,242) WARNING - qlib.online operator - [exchange.py:219] - $close field data contains nan.
[73260:MainThread](2023-02-06 20:06:39,244) WARNING - qlib.online operator - [exchange.py:219] - $close field data contains nan.
[73260:MainThread](2023-02-06 20:06:39,247) WARNING - qlib.online operator - [exchange.py:226] - factor.day.bin file not exists or factor contains `nan`. Order using adjusted_price.
[73260:MainThread](2023-02-06 20:06:39,247) WARNING - qlib.online operator - [exchange.py:228] - trade unit 100 is not supported in adjusted_price mode.
[73260:MainThread](2023-02-06 20:06:44,264) WARNING - qlib.BaseExecutor - [executor.py:121] - `common_infra` is not set for <qlib.backtest.executor.SimulatorExecutor object at 0x141f05160>
backtest loop: 100%|██████████| 871/871 [00:10<00:00, 82.92it/s]
[73260:MainThread](2023-02-06 20:06:55,254) INFO - qlib.workflow - [record_temp.py:505] - Portfolio analysis record 'port_analysis_1day.pkl' has been saved as the artifact of the Experiment 1
[73260:MainThread](2023-02-06 20:06:55,272) INFO - qlib.workflow - [record_temp.py:530] - Indicator analysis record 'indicator_analysis_1day.pkl' has been saved as the artifact of the Experiment 1

@xu-li
Author

xu-li commented Feb 6, 2023

This is the bug that I am trying to fix:

import pandas as pd

import qlib
from qlib.data import D
from qlib.data.inst_processor import InstProcessor

qlib.init()


class MyInstProcessor(InstProcessor):

    def __init__(self) -> None:
        super().__init__()

    def __call__(self, df: pd.DataFrame, instrument, *args, **kwargs):
        if df.empty:
            return df

        df['MyNewFeature'] = 1
        return df


df = D.features(['SH601216'], ['$close'], start_time='2020-05-01', end_time='2020-05-10',
                inst_processors=[MyInstProcessor()])

# Before fix:
#                          $close
# instrument datetime
# SH601216   2020-05-06  1.318379
#            2020-05-07  1.304631
#            2020-05-08  1.309334
# After fix:
#                          $close  MyNewFeature
# instrument datetime
# SH601216   2020-05-06  1.318379             1
#            2020-05-07  1.304631             1
#            2020-05-08  1.309334             1
print(df)

from qlib.data.dataset.loader import QlibDataLoader

qdl_config = {
    'feature': (['$close'], ['Close']),
    'label': (['$close / Ref($close, 10)'], ['RET10'])
}
qdl = QlibDataLoader(config=qdl_config, inst_processor={'feature': [MyInstProcessor()]})
df = qdl.load(instruments=['sh600519'], start_time='20190101', end_time='20190105')

# Before fix:
#                          feature     label
#                            Close     RET10
# datetime   instrument
# 2019-01-02 sh600519    72.881897  1.014326
# 2019-01-03 sh600519    71.789268  0.998409
# 2019-01-04 sh600519    73.249367  1.041884
# After fix:
#                          feature                  label
#                            Close MyNewFeature     RET10
# datetime   instrument
# 2019-01-02 sh600519    72.881897            1  1.014326
# 2019-01-03 sh600519    71.789268            1  0.998409
# 2019-01-04 sh600519    73.249367            1  1.041884
print(df)
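
The column drop can also be illustrated without any Qlib data. The following is a minimal pandas-only sketch, assuming (as the NOTE in the diff under review suggests) that the pre-fix code path restricts the frame to the requested fields. The helper select_requested_fields is a hypothetical stand-in for that step, not part of Qlib's API.

```python
import pandas as pd


def select_requested_fields(data: pd.DataFrame, fields: list) -> pd.DataFrame:
    """Hypothetical stand-in: keep only the columns that were requested."""
    return data.loc[:, fields]


index = pd.MultiIndex.from_product(
    [["SH601216"], pd.to_datetime(["2020-05-06", "2020-05-07"])],
    names=["instrument", "datetime"],
)
raw = pd.DataFrame({"$close": [1.318379, 1.304631]}, index=index)

# Mimic an inst_processor that appends a column to the per-instrument frame.
processed = raw.copy()
processed["MyNewFeature"] = 1

# Selecting by the requested field list silently discards the added column.
result = select_requested_fields(processed, ["$close"])
print(list(processed.columns))
print(list(result.columns))
```

Under this assumption, any column that is not in the requested field list is lost at the selection step, which matches the "Before fix" output above.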

@Fivele-Li
Contributor

Could you please handle the CI error message?


# NOTE: InstProcessors may add new columns and using cache_to_origin_data will remove those added columns.
if not len(inst_processors):
    data = DiskDatasetCache.cache_to_origin_data(data, column_names)
Contributor

Perhaps it would be better to modify cache_to_origin_data itself so that it keeps the extra columns, rather than checking len(inst_processors)?

    def cache_to_origin_data(data, fields):
        """cache data to origin data

        :param data: pd.DataFrame, cache data.
        :param fields: feature fields.
        :return: pd.DataFrame.
        """
        not_space_fields = remove_fields_space(fields)
        data_selected = data.loc[:, not_space_fields]
        # set features fields
        data_selected.columns = [str(i) for i in fields]

        # keep any extra columns (e.g. those added by inst_processors)
        _fields = [col for col in data.columns if col not in not_space_fields]
        _data_selected = data.loc[:, _fields]
        data = pd.concat([data_selected, _data_selected], axis=1)
        return data
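
The suggestion above can be tried in isolation. This is a rough sketch only: remove_fields_space is re-implemented here as a stand-in that strips spaces from field expressions (an assumption about what the Qlib helper does), and the sample frame is made up for illustration.

```python
import pandas as pd


def remove_fields_space(fields):
    """Stand-in for the Qlib helper (assumed behavior): strip spaces from each field."""
    return [f.replace(" ", "") if isinstance(f, str) else str(f) for f in fields]


def cache_to_origin_data(data, fields):
    """Restore the original field names and keep any columns that were not requested."""
    not_space_fields = remove_fields_space(fields)
    # Copy so that renaming the columns does not touch the input frame.
    data_selected = data.loc[:, not_space_fields].copy()
    data_selected.columns = [str(i) for i in fields]

    # Columns outside the requested fields, e.g. those added by inst_processors.
    extra_cols = [col for col in data.columns if col not in not_space_fields]
    return pd.concat([data_selected, data.loc[:, extra_cols]], axis=1)


# Made-up cached frame: one space-stripped field plus one processor-added column.
cached = pd.DataFrame(
    {
        "$close/Ref($close,10)": [1.014326, 0.998409],
        "MyNewFeature": [1, 1],
    }
)
restored = cache_to_origin_data(cached, ["$close / Ref($close, 10)"])
print(list(restored.columns))
```

With this variant the caller would not need to special-case inst_processors, since the requested fields come first and all other columns are appended unchanged.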
