Health checker for Qlib Data #854

Open
you-n-g opened this issue Jan 17, 2022 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@you-n-g
Collaborator

you-n-g commented Jan 17, 2022

🌟 Feature Description

A lot of users encountered data errors when integrating their own data.
It would help if Qlib provided a health checker that lets users automatically check the health of their Qlib data.


@you-n-g you-n-g added enhancement New feature or request help wanted Extra attention is needed labels Jan 17, 2022
@benheckmann

@you-n-g Looking into this at the moment. Is there any related work of such a checker (papers, repos) to point me in the right direction? Or could you specify some of the key features this checker would have?

@you-n-g
Collaborator Author

you-n-g commented Jun 13, 2023

Hi, @benheckmann

Thanks for your interest in Qlib.

Here is the motivation for a health checker.

Many users try to leverage their own data to run models with Qlib.

However, there is no guarantee that users have correctly converted their data or provided enough fields for related modules.

For example, the following checkers may be helpful for users (although some checkers may only be able to give warnings instead of errors, because they cannot distinguish between an error and a characteristic specific to a certain market).

  • To ensure data correctness,
    • The data are well-adjusted and do not present any significant step changes.
    • Users should be alerted when data is abnormally missing.
    • ...
  • To ensure completeness of the information,
    • The backtest cannot be run successfully if the close price is not provided.
    • If OHLCV data is not provided, the default datasets cannot be created correctly.
    • If a factor is not provided, the trading unit will be disabled.
    • ...
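
A minimal sketch of a few of the checks listed above, using only the Python standard library. The function names, the field names in `REQUIRED_FIELDS`, and the 50% step threshold are illustrative assumptions, not part of Qlib:

```python
import math

# Assumed field names; real Qlib data may use different names.
REQUIRED_FIELDS = ("open", "high", "low", "close", "volume")


def _is_missing(value):
    """True for None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fields(columns, required=REQUIRED_FIELDS):
    """Return the required field names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def missing_ratio(values):
    """Return the fraction of entries that are None or NaN."""
    if not values:
        return 0.0
    return sum(1 for v in values if _is_missing(v)) / len(values)


def step_changes(prices, threshold=0.5):
    """Return indices where the price changes by more than `threshold`
    (relative) compared with the previous non-missing price."""
    flagged = []
    prev = None
    for i, price in enumerate(prices):
        if _is_missing(price):
            continue
        if prev is not None and prev != 0 and abs(price / prev - 1.0) > threshold:
            flagged.append(i)
        prev = price
    return flagged
```

A large step may be a legitimate market move rather than an unadjusted split, so a check like `step_changes` is probably better reported as a warning than as an error.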

As Qlib continues to develop, there will be an increasing number of checkers. Therefore, the health checker is expected to be an extensible framework.
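
One possible shape for such an extensible framework is sketched below. Every class name here is hypothetical, the data is assumed to be a plain mapping from field name to values, and the `"close"` and `"factor"` field names are assumptions. New checks are added by subclassing `BaseChecker`:

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    checker: str
    level: str  # "ok", "warning" or "error"
    message: str


class BaseChecker:
    """Hypothetical base class; each subclass implements one check."""

    name = "base"

    def check(self, data):
        raise NotImplementedError


class CloseFieldChecker(BaseChecker):
    name = "close_field"

    def check(self, data):
        if "close" not in data:
            return CheckResult(self.name, "error", "close price is missing")
        return CheckResult(self.name, "ok", "close price found")


class FactorFieldChecker(BaseChecker):
    name = "factor_field"

    def check(self, data):
        if "factor" not in data:
            # Only a warning: the data is still usable without a factor.
            return CheckResult(self.name, "warning", "factor is missing")
        return CheckResult(self.name, "ok", "factor found")


class HealthChecker:
    """Runs every registered checker and collects the results."""

    def __init__(self, checkers):
        self.checkers = list(checkers)

    def run(self, data):
        return [checker.check(data) for checker in self.checkers]


report = HealthChecker([CloseFieldChecker(), FactorFieldChecker()]).run(
    {"close": [1.0, 1.1]}
)
for result in report:
    print(result.level, result.checker, result.message)
```

Returning a result object with a level, instead of raising, lets a checker emit a warning where it cannot tell an error from a market-specific characteristic.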

@benheckmann

Hi @you-n-g,

just drafted a first idea, and would really appreciate some feedback. Also, some questions I still have:

  • What do you mean by "If a factor is not provided, the trading unit will be disabled"? Would the factor column be named "factor", or can it have different names? And where is the trading unit specified?
  • Would it be beneficial if the checker was specific to Alpha360 and Alpha158?

Thank you.

@you-n-g
Collaborator Author

you-n-g commented Jul 11, 2023

  1. Here is some related code: https://github.com/microsoft/qlib/blob/main/qlib/backtest/exchange.py#L776. All the prices in Qlib are expected to be adjusted prices, which are different from the real trading prices. However, the trading amount should be an integer multiple of the trading unit, which is unadjusted. The calculation requires factors to make the trading amount align with the real world.
  2. I think checking the raw data is more general. If some checks for specific datasets like Alpha360 and Alpha158 can't be done at the raw data level, I think we can write specific code for them, for example as a subclass.
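
To illustrate point 1, here is a rough sketch of the rounding idea. It is not the code behind the link above. The function name is hypothetical, and it assumes the convention that factor = adjusted price / raw price, so that raw amount = adjusted amount × factor:

```python
import math


def round_to_trade_unit(adjusted_amount, factor, trade_unit=100):
    """Round an adjusted trading amount down so that the corresponding
    raw (real-world) amount is an integer multiple of `trade_unit`.

    Assumption: factor = adjusted price / raw price, hence
    raw amount = adjusted amount * factor.
    """
    if factor is None or trade_unit is None:
        # Without a factor the raw amount is unknown, so no rounding
        # is applied (the trading unit is effectively disabled).
        return adjusted_amount
    raw_amount = adjusted_amount * factor
    rounded_raw = math.floor(raw_amount / trade_unit) * trade_unit
    return rounded_raw / factor


print(round_to_trade_unit(1234.0, 0.5))   # 617 raw shares are cut to 600
print(round_to_trade_unit(1234.0, None))  # no factor, amount unchanged
```

This is why a missing factor matters for a health checker: the backtest can still run, but the traded amounts no longer respect the real-world trading unit.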

Thanks.

benheckmann added a commit to benheckmann/qlib that referenced this issue Jul 17, 2023
benheckmann added a commit to benheckmann/qlib that referenced this issue Jul 17, 2023
@benheckmann

@you-n-g Thank you. Regarding (2), you mean it makes sense to keep checking only raw data for now, right? I have added a simple function checking for a missing "factor" column or values. Also, I have added support for providing data in qlib format.

Have you had a chance to look at the draft yet?

@you-n-g
Collaborator Author

you-n-g commented Jul 31, 2023

Hi, @benheckmann

@Fivele-Li will help review this issue.

@dexter31

+1 for this feature. Also, looking at this holistically, data completeness is the easier thing to check for. But I've come across scenarios where consistency can cause issues. One example is a difference in how data collected after market close is normalized, which changes the precision.

While it is not the end of the world, this would result in a few percentage point differences in data collected at different times. See highlighted precision difference below in data collected 1 minute apart.

2006-07-27 05:30:00+05:30,SOLARINDS.NS,26.0,26.979999542236328,26.0,26.049999237060547,7790.0,22.**37462615966797**,0.0,0.0
2006-07-27 05:30:00+05:30,SOLARINDS.NS,26.0,26.979999542236328,26.0,26.049999237060547,7790.0,22.**37462043762207**,0.0,0.0
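
A consistency check for this case could compare two snapshots of the same row field by field with a relative tolerance. The sketch below uses a hypothetical `diff_rows` helper, assumes comma-separated rows as shown above, and picks the tolerance values only for illustration:

```python
import math


def diff_rows(row_a, row_b, rel_tol=1e-6):
    """Return the indices of fields that differ between two CSV rows.

    Numeric fields are compared with a relative tolerance; all other
    fields (timestamp, symbol) must match exactly.
    """
    fields_a = row_a.split(",")
    fields_b = row_b.split(",")
    if len(fields_a) != len(fields_b):
        raise ValueError("rows have a different number of fields")
    mismatches = []
    for i, (a, b) in enumerate(zip(fields_a, fields_b)):
        try:
            x, y = float(a), float(b)
        except ValueError:
            # Non-numeric field: fall back to exact string comparison.
            if a != b:
                mismatches.append(i)
            continue
        if not math.isclose(x, y, rel_tol=rel_tol):
            mismatches.append(i)
    return mismatches


row_1 = ("2006-07-27 05:30:00+05:30,SOLARINDS.NS,26.0,26.979999542236328,"
         "26.0,26.049999237060547,7790.0,22.37462615966797,0.0,0.0")
row_2 = ("2006-07-27 05:30:00+05:30,SOLARINDS.NS,26.0,26.979999542236328,"
         "26.0,26.049999237060547,7790.0,22.37462043762207,0.0,0.0")

print(diff_rows(row_1, row_2, rel_tol=1e-6))  # loose tolerance
print(diff_rows(row_1, row_2, rel_tol=1e-9))  # strict tolerance
```

With the loose tolerance the two rows are treated as equal; with the strict one the field at index 7 is flagged, so the tolerance decides whether this kind of precision drift is reported.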
