Comparing changes

base repository: wukan1986/expr_codegen
base: v0.10.3
head repository: wukan1986/expr_codegen
compare: main
  • 17 commits
  • 15 files changed
  • 1 contributor

Commits on Dec 5, 2024

  1. 31eaa3d
  2. Update tool.py

    wukan1986 committed Dec 5, 2024
    6662261
  3. Improve file-generation messages

    wukan1986 committed Dec 5, 2024
    2f0e3e0

Commits on Dec 19, 2024

  1. New approach to null handling

    wukan1986 committed Dec 19, 2024
    d231fad
  2. Add options for null handling

    wukan1986 committed Dec 19, 2024
    64a2d2a

Commits on Dec 22, 2024

  1. Update documentation

    wukan1986 committed Dec 22, 2024
    71d4bc0

Commits on Dec 24, 2024

  1. be42342
  2. Print module path

    wukan1986 committed Dec 24, 2024
    07b2621
  3. 269877c

Commits on Dec 30, 2024

  1. 41b8a37
  2. Minute-data processing example

    wukan1986 committed Dec 30, 2024
    2df86c9

Commits on Jan 9, 2025

  1. Minute-data example

    wukan1986 committed Jan 9, 2025
    cdcff0f

Commits on Jan 11, 2025

  1. 43ccc91

Commits on Jan 12, 2025

  1. Add logging

    wukan1986 committed Jan 12, 2025
    53c2166

Commits on Feb 17, 2025

  1. Fix latent shortcomings in the template

    wukan1986 committed Feb 17, 2025
    44ceb9b

Commits on Feb 18, 2025

  1. 3e57ba0

Commits on Mar 20, 2025

  1. Fix bug where CSE fails

    wukan1986 committed Mar 20, 2025
    38e24d4
61 changes: 40 additions & 21 deletions README.md
@@ -1,6 +1,4 @@
# expr_codegen: symbolic expression code generator

A tool for converting expressions to code
# expr_codegen: expression transpiler

## Project background

@@ -29,14 +27,9 @@ https://exprcodegen.streamlit.app

```python
import sys
from io import StringIO

# from polars_ta.prefix.talib import * # noqa
from polars_ta.prefix.cdl import * # noqa
from polars_ta.prefix.ta import * # noqa
from polars_ta.prefix.tdx import * # noqa
from polars_ta.prefix.wq import * # noqa

from expr_codegen.tool import codegen_exec
from expr_codegen import codegen_exec


def _code_block_1():
@@ -64,10 +57,15 @@ def _code_block_2():
CPV = cs_zscore(_corr) + cs_zscore(_beta)


code = StringIO()

df = None # replace with real polars data
df = codegen_exec(df, _code_block_1, _code_block_2, output_file=sys.stdout) # print the code
df = codegen_exec(df, _code_block_1, _code_block_2, output_file="output.py") # save to a file
df = codegen_exec(df, _code_block_1, _code_block_2) # execute only, do not save the code
df = codegen_exec(df, _code_block_1, _code_block_2, output_file=code) # save to a string
code.seek(0)
code.read() # read the code

df = codegen_exec(df.lazy(), _code_block_1, _code_block_2).collect() # Lazy CPU
df = codegen_exec(df.lazy(), _code_block_1, _code_block_2).collect(engine="gpu") # Lazy GPU
@@ -88,7 +86,7 @@ df = codegen_exec(df.lazy(), _code_block_1, _code_block_2).collect(engine="gpu")
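│ sympy_define.py # symbol definitions; extracted here because they are reused in many places
├─expr_codegen
│ │ expr.py # basic expression-handling functions
│ │ tool.py # core tool code; usually needs no modification
│ │ tool.py # core tool code
│ ├─polars
│ │ │ code.py # code generation for polars syntax
│ │ │ template.py.j2 # `Jinja2` template used to generate the corresponding .py file; usually needs no modification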
@@ -135,11 +133,32 @@ df = codegen_exec(df.lazy(), _code_block_1, _code_block_2).collect(engine="gpu")
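1. Classify by operator prefix (`get_current_by_prefix`); operators must start with `ts_`, `cs_`, or `gp_`
2. Classify by full operator name (`get_current_by_name`); operator names are no longer restricted. For example, `cs_rank` can be called `rank`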
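
The two classification strategies listed above can be sketched as follows. This is only an illustration: `classify_by_prefix`, `classify_by_name`, and `NAME_TABLE` are hypothetical names introduced here, not the library's `get_current_by_prefix` / `get_current_by_name`, whose signatures are not shown in this diff.

```python
from typing import Dict, Optional

# Prefixes named in the README; everything else in this sketch is an assumption.
PREFIXES = ("ts_", "cs_", "gp_")


def classify_by_prefix(name: str) -> Optional[str]:
    """Return 'ts', 'cs' or 'gp' when the operator name carries that prefix, else None."""
    for prefix in PREFIXES:
        if name.startswith(prefix):
            return prefix.rstrip("_")
    return None


def classify_by_name(name: str, table: Dict[str, str]) -> Optional[str]:
    """Look the full operator name up in an explicit table, so no prefix is required."""
    return table.get(name)


# Hypothetical lookup table: `rank` is treated as a cross-sectional operator.
NAME_TABLE = {"rank": "cs", "cs_rank": "cs", "ts_delay": "ts"}

by_prefix = classify_by_prefix("cs_rank")
unprefixed = classify_by_prefix("rank")
by_name = classify_by_name("rank", NAME_TABLE)
```

With prefix-based classification an unprefixed name such as `rank` cannot be classified, which is why the name-based variant needs an explicit table.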

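## Secondary development
## Null handling

How do `null` values arise?

1. Trading suspensions. These rows are filtered out before computation, so they do not affect later calculations.
2. Different instruments trade during different sessions.
3. Produced by computation. A `null` at either end of a series does not affect the results of later time-series operators, but a `null` in the middle does. For example: `if_else(close<2, None, close)`

https://github.com/pola-rs/polars/issues/12925#issuecomment-2552764629

A great idea from that thread. In summary, there are two ways to implement it:

1. Back up and then edit `demo_express.py`, and `import` the functions you need
2. `printer.py` may then need printing code added for the corresponding functions
- Note: check whether parentheses `()` are needed; without them operator precedence may be wrong. You can always add parentheses, or use the provided `parenthesize` to simplify this
1. Put `null` rows in one group and `not_null` rows in another. This requires two calls
2. Use a single group with a compound sort that puts `null` rows first and `not_null` rows after. This requires only one call and is slightly faster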

```python
X1 = (ts_returns(CLOSE, 3)).over(CLOSE.is_not_null(), _ASSET_, order_by=_DATE_),
X2 = (ts_returns(CLOSE, 3)).over(_ASSET_, order_by=[CLOSE.is_not_null(), _DATE_]),
X3 = (ts_returns(CLOSE, 3)).over(_ASSET_, order_by=_DATE_),
```

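With the second approach, whether the leading `null` region affects the result depends on the operator, especially with multi-column inputs, where the `null` region may still contain data in other columns

1. `over_null='partition_by'`: split into two regions
2. `over_null='order_by'`: keep a single region, with `null` rows sorted first
3. `over_null=None`: no handling, call directly, which is faster. Recommended if you are sure no `null` is produced mid-series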
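
To see why a `null` in the middle of a series matters, here is a minimal pure-Python sketch. It does not use polars or the library's operators: `ts_returns` below is a simplified stand-in written for this example, and `ts_returns_partitioned` only mimics the effect of partitioning by `is_not_null()` (the `over_null='partition_by'` idea), under the assumption that the operator should see the non-null rows as one contiguous series.

```python
from typing import List, Optional

Series = List[Optional[float]]


def ts_returns(values: Series, n: int) -> Series:
    """n-period simple return; None when either endpoint is missing."""
    out: Series = []
    for i, v in enumerate(values):
        prev = values[i - n] if i >= n else None
        out.append(None if v is None or prev is None else v / prev - 1)
    return out


def ts_returns_partitioned(values: Series, n: int) -> Series:
    """Run the operator on the non-null rows only, then scatter results back."""
    idx = [i for i, v in enumerate(values) if v is not None]
    compact = ts_returns([values[i] for i in idx], n)
    out: Series = [None] * len(values)
    for i, r in zip(idx, compact):
        out[i] = r
    return out


close = [1.0, 2.0, None, 4.0, 8.0]
plain = ts_returns(close, 1)              # the row after the null is lost as well
part = ts_returns_partitioned(close, 1)   # only the null row itself stays None
```

In the plain call the row following the `null` has no valid predecessor, so one missing value costs two outputs; the partitioned variant loses only the `null` row itself.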

## Limitations of `expr_codegen`

@@ -161,12 +180,12 @@ df = codegen_exec(df.lazy(), _code_block_1, _code_block_2).collect(engine="gpu")
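9. Functions starting with `gp_` all return the corresponding `cs_` function. For example, `gp_func(A,B,C)` is replaced with `cs_func(B,C)`, where `A` is used in `groupby([date, A])`
10. Tuple unpacking such as `A,B,C=MACD()` is supported; under the hood it is replaced with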

```python
_x_0 = MACD()
A = unpack(_x_0, 0)
B = unpack(_x_0, 1)
C = unpack(_x_0, 2)
```
```python
_x_0 = MACD()
A = unpack(_x_0, 0)
B = unpack(_x_0, 1)
C = unpack(_x_0, 2)
```
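
A rewrite of this shape can be sketched with the standard `ast` module. This is a minimal illustration, not the project's actual transformer: the class name `UnpackTransformer` and the helper `rewrite` are assumptions made for this example, and only plain `a, b, c = call()` assignments are handled. It needs Python 3.9+ for `ast.unparse`.

```python
import ast


class UnpackTransformer(ast.NodeTransformer):
    """Rewrite `A, B, C = f()` into a temporary plus one `unpack(tmp, i)` per target."""

    def __init__(self) -> None:
        self.counter = 0

    def visit_Assign(self, node: ast.Assign):
        target = node.targets[0]
        # Leave everything except single tuple-target assignments untouched.
        if len(node.targets) != 1 or not isinstance(target, ast.Tuple):
            return node
        tmp = f"_x_{self.counter}"
        self.counter += 1
        new_nodes = [
            ast.Assign(targets=[ast.Name(id=tmp, ctx=ast.Store())], value=node.value)
        ]
        for i, elt in enumerate(target.elts):
            call = ast.Call(
                func=ast.Name(id="unpack", ctx=ast.Load()),
                args=[ast.Name(id=tmp, ctx=ast.Load()), ast.Constant(value=i)],
                keywords=[],
            )
            new_nodes.append(ast.Assign(targets=[elt], value=call))
        # Returning a list replaces the one statement with several.
        return new_nodes


def rewrite(source: str) -> str:
    tree = UnpackTransformer().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


result = rewrite("A, B, C = MACD()")
print(result)
```

Routing the call through one temporary keeps `MACD()` evaluated once, and each output becomes an ordinary single-target assignment.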

## Variables starting with an underscore

14 changes: 7 additions & 7 deletions examples/demo_express.py
@@ -1,12 +1,7 @@
import sys
from io import StringIO

# from polars_ta.prefix.talib import * # noqa
from polars_ta.prefix.cdl import * # noqa
from polars_ta.prefix.ta import * # noqa
from polars_ta.prefix.tdx import * # noqa
from polars_ta.prefix.wq import * # noqa

from expr_codegen.tool import codegen_exec
from expr_codegen import codegen_exec


def _code_block_1():
@@ -34,10 +29,15 @@ def _code_block_2():
CPV = cs_zscore(_corr) + cs_zscore(_beta)


code = StringIO()

df = None # replace with real polars data
df = codegen_exec(df, _code_block_1, _code_block_2, output_file=sys.stdout) # print the code
df = codegen_exec(df, _code_block_1, _code_block_2, output_file="output.py") # save to a file
df = codegen_exec(df, _code_block_1, _code_block_2) # execute only, do not save the code
df = codegen_exec(df, _code_block_1, _code_block_2, output_file=code) # save to a string
code.seek(0)
code.read() # read the code

df = codegen_exec(df.lazy(), _code_block_1, _code_block_2).collect() # Lazy CPU
df = codegen_exec(df.lazy(), _code_block_1, _code_block_2).collect(engine="gpu") # Lazy GPU
67 changes: 46 additions & 21 deletions examples/demo_min.py
@@ -10,62 +10,87 @@
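If the minute data is already split into files by date, it can simply be processed in parallel with multiple processes, which is much less trouble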
"""
import sys
from datetime import datetime

import numpy as np
import pandas as pd
import polars as pl
from loguru import logger

from expr_codegen.tool import codegen_exec
from expr_codegen import codegen_exec # noqa

np.random.seed(42)

ASSET_COUNT = 500
DATE_COUNT = 250 * 20
DATE = pd.date_range(datetime(2020, 1, 1), periods=DATE_COUNT, freq='2h').repeat(ASSET_COUNT)
DATE_COUNT = 250 * 24 * 10 * 1
DATE = pd.date_range(datetime(2020, 1, 1), periods=DATE_COUNT, freq='1min').repeat(ASSET_COUNT)
ASSET = [f'A{i:04d}' for i in range(ASSET_COUNT)] * DATE_COUNT

df = pl.DataFrame(
{
'datetime': DATE,
'asset': ASSET,
"OPEN": np.random.rand(DATE_COUNT * ASSET_COUNT),
"HIGH": np.random.rand(DATE_COUNT * ASSET_COUNT),
"LOW": np.random.rand(DATE_COUNT * ASSET_COUNT),
"CLOSE": np.random.rand(DATE_COUNT * ASSET_COUNT),
"VOLUME": np.random.rand(DATE_COUNT * ASSET_COUNT),
"OPEN_INTEREST": np.random.rand(DATE_COUNT * ASSET_COUNT),
"FILTER": np.tri(DATE_COUNT, ASSET_COUNT, k=-2).reshape(-1),
}
).lazy()

df = df.filter(pl.col('FILTER') == 1)

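logger.info('timestamp adjustment started')
# Trading day: the futures night session belongs to the next trading day; shifting by 4 hours gives the night session the same date
df = df.with_columns(trading_day=pl.col('datetime').dt.offset_by("4h"))
# Friday night has already become Saturday; 双修 [sic] must be moved to Monday
# Friday night has already become Saturday; the weekend (双休) must be moved to Monday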
df = df.with_columns(trading_day=pl.when(pl.col('trading_day').dt.weekday() > 5)
.then(pl.col("trading_day").dt.offset_by("2d"))
.otherwise(pl.col("trading_day")))
df = df.with_columns(
# trading day
trading_day=pl.col("trading_day").dt.truncate("1d"),
trading_day=pl.col("trading_day").dt.date(),
# working day
action_day=pl.col('datetime').dt.truncate('1d'),
action_day=pl.col('datetime').dt.date(),
)


def _code_block_1():
OPEN_1 = ts_delay(OPEN, 1)
OPEN_RANK = cs_rank(OPEN_1)


df = df.collect()
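logger.info('timestamp adjustment finished')
# ---
# !!! Important: build a composite field used for ts_ ordering
# _asset_date starts with an underscore, so it is dropped automatically; remove the underscore to keep it
# use action_day for stocks, trading_day for futures
df = df.with_columns(_asset_date=pl.struct("asset", "trading_day"))
print(df.tail(5))
df = codegen_exec(df, _code_block_1, output_file=sys.stdout, # print the code
df = codegen_exec(df, """OPEN_RANK = cs_rank(OPEN[1]) # demo only""",
# !!! when using this, be clear about which field is used for grouping
date='datetime', asset='_asset_date')
# show the data for one day in the middle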
df = df.filter(pl.col('asset') == 'A0000', pl.col('trading_day') == pl.datetime(2020, 1, 6))

print(df.collect())
# df.write_csv('output.csv')
# ---
logger.info('1-minute to 15-minute bar conversion started')
df1 = df.sort('asset', 'datetime').group_by_dynamic('datetime', every="15m", closed='left', label="left", group_by=['asset', 'trading_day']).agg(
open_dt=pl.first("datetime"),
close_dt=pl.last("datetime"),
OPEN=pl.first("OPEN"),
HIGH=pl.max("HIGH"),
LOW=pl.min("LOW"),
CLOSE=pl.last("CLOSE"),
VOLUME=pl.sum("VOLUME"),
OPEN_INTEREST=pl.last("OPEN_INTEREST"),
)
logger.info('1-minute to 15-minute bar conversion finished')
print(df1)
# ---
logger.info('1-minute to daily bar conversion started')
# group_by_dynamic also works; daily bars just imply label="left"
df1 = df.sort('asset', 'datetime').group_by('asset', 'trading_day', maintain_order=True).agg(
open_dt=pl.first("datetime"),
close_dt=pl.last("datetime"),
OPEN=pl.first("OPEN"),
HIGH=pl.max("HIGH"),
LOW=pl.min("LOW"),
CLOSE=pl.last("CLOSE"),
VOLUME=pl.sum("VOLUME"),
OPEN_INTEREST=pl.last("OPEN_INTEREST"),
)
logger.info('1-minute to daily bar conversion finished')
print(df1)
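
The trading-day adjustment in `demo_min.py` above (shift by 4 hours, then push weekend dates forward by two days) can be checked on single timestamps with the standard library alone. This is a sketch under that reading of the demo; the function name `trading_day` is introduced here for illustration and is not part of the project.

```python
from datetime import date, datetime, timedelta


def trading_day(ts: datetime) -> date:
    """Map a bar timestamp to its trading day, mirroring the demo's two steps."""
    # Step 1: shift by 4 hours so a night session lands on the next calendar date.
    shifted = ts + timedelta(hours=4)
    # Step 2: ISO weekdays are 1 (Mon) to 7 (Sun), so > 5 means the weekend;
    # move it forward by two days, as the demo does.
    if shifted.isoweekday() > 5:
        shifted += timedelta(days=2)
    return shifted.date()


friday_night = trading_day(datetime(2020, 1, 3, 21, 0))   # Friday 21:00 night session
friday_day = trading_day(datetime(2020, 1, 3, 10, 0))     # Friday day session
monday_day = trading_day(datetime(2020, 1, 6, 10, 0))     # Monday day session
```

A Friday night bar and the following Monday's day-session bar end up with the same trading day, which is what lets the demo group by `trading_day` for futures.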
10 changes: 7 additions & 3 deletions examples/demo_tdx.py
@@ -100,11 +100,15 @@ def _code_block_2():
# =====================================
logger.info('computation started')
t1 = time.perf_counter()
df = codegen_exec(df.lazy(), _code_block_1, _code_block_2, output_file=sys.stdout)
df = codegen_exec(df, _code_block_1, _code_block_2, output_file='1_out.py', run_file=False, over_null=None)
t2 = time.perf_counter()
print(t2 - t1)
df = codegen_exec(df, _code_block_1, _code_block_2, output_file='1_out.py', run_file=True, over_null=None)
t3 = time.perf_counter()
df = codegen_exec(df, _code_block_1, _code_block_2, output_file='1_out.py', run_file=True, over_null=None)
t4 = time.perf_counter()
print(t2 - t1, t3 - t2, t4 - t3)
logger.info('computation finished')
df = df.filter(
~pl.col('is_st'),
)
print(df.collect())
print(df)
2 changes: 1 addition & 1 deletion expr_codegen/_version.py
@@ -1 +1 @@
__version__ = "0.10.3"
__version__ = "0.10.14"
5 changes: 5 additions & 0 deletions expr_codegen/codes.py
@@ -2,6 +2,7 @@
import re
from ast import expr

from black import Mode, format_str
from sympy import Add, Mul, Pow, Eq, Not, Xor

from expr_codegen.expr import register_symbols, dict_to_exprs
@@ -125,6 +126,7 @@ class RenameTransformer(ast.NodeTransformer):
def __init__(self, funcs_map, targets_map, args_map=None):

if args_map is None:
# reserved words
args_map = {'True': "_TRUE_", 'False': "_FALSE_", 'None': "_NONE_"}
self.funcs_old = set()
self.args_old = set()
@@ -315,6 +317,9 @@ def source_replace(source: str) -> str:
# break
# or, and
source = source.replace('||', '|').replace('&&', '&')
# IndentationError: unexpected indent
# a nested function preceded by whitespace raises an error
source = format_str(source, mode=Mode(line_length=600, magic_trailing_comma=True))
return source


22 changes: 20 additions & 2 deletions expr_codegen/model.py
@@ -7,6 +7,8 @@
from expr_codegen.dag import zero_indegree, hierarchy_pos, remove_paths_by_zero_outdegree
from expr_codegen.expr import CL, get_symbols, get_children, get_key, is_simple_expr

_RESERVED_WORD_ = {'_NONE_', '_TRUE_', '_FALSE_'}


class ListDictList:
"""Nested list
@@ -109,8 +111,7 @@ def drop_symbols(self):
l2 = [set()]
s = set()
for i in reversed(l1):
# these three variables must be excluded
s = s | i - {'_NONE_', '_TRUE_', '_FALSE_'}
s = s | i # - {'_NONE_', '_TRUE_', '_FALSE_'}
l2.append(s)
l2 = list(reversed(l2))

@@ -136,13 +137,27 @@ def chain_create(nested_list):
last_min = float('inf')
# the row with the fewest repeats
last_row = None
last_rows = set()
for row in product(*neighbor_inter):
# check each adjacent pair for repeats: 1 if repeated, otherwise 0
result = sum([x == y for x, y in zip(row[:-1], row[1:])])
if last_min > result:
last_min = result
last_row = row
if result == 0:
last_rows.add(last_row)
last_min = float('inf')
continue
last_rows.add(last_row)
last_rows = list(last_rows)

# when several rows in last_rows qualify, prefer the one whose last group is ts; after ts, filtering can be applied earlier to reduce computation
last_row = last_rows[0]
for row in last_rows:
if row[-1] is None:
continue
if row[-1][0] == 'ts':
last_row = row
break

# how to move is the hard part: if there are two consecutive ts/ts, how should they be moved
@@ -396,6 +411,9 @@ def dag_end(G):
key = G.nodes[node]['key']
expr = G.nodes[node]['expr']
symbols = G.nodes[node]['symbols']
# these special names do not count as field names
symbols = list(set(symbols) - _RESERVED_WORD_)

exprs_ldl.append(key, (node, expr, symbols))

exprs_ldl._list = exprs_ldl.values()[1:]
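
The row-selection idea added to `chain_create` above (fewest adjacent repeats, then a preference for a row ending in a `ts` group) can be sketched in isolation. This is a simplified variant written for illustration, not the author's exact loop: `pick_row`, `adjacent_repeats`, and the sample `layers` are assumptions, and ties are collected by a plain minimum rather than the running `last_min` bookkeeping in the diff.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Key = Optional[Tuple[str, str]]


def adjacent_repeats(row: Sequence[Key]) -> int:
    """Count adjacent pairs in the row that are equal."""
    return sum(x == y for x, y in zip(row[:-1], row[1:]))


def pick_row(layers: List[List[Key]]) -> Tuple[Key, ...]:
    """Pick one key per layer, minimising adjacent repeats and preferring a final 'ts'."""
    rows = list(product(*layers))
    fewest = min(adjacent_repeats(r) for r in rows)
    candidates = [r for r in rows if adjacent_repeats(r) == fewest]
    for row in candidates:
        # Skip empty rows and rows whose last entry is None, as the diff does for None.
        if row and row[-1] is not None and row[-1][0] == "ts":
            return row
    return candidates[0]


# Hypothetical input: two layers that could each be grouped either way.
layers = [
    [("cs", "date"), ("ts", "asset")],
    [("cs", "date"), ("ts", "asset")],
]
chosen = pick_row(layers)
```

Both `(cs, ts)` and `(ts, cs)` have zero adjacent repeats here; the tie-break selects the one that ends with `ts`.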
3 changes: 2 additions & 1 deletion expr_codegen/pandas/code.py
@@ -38,7 +38,8 @@ def codegen(exprs_ldl: ListDictList, exprs_src, syms_dst,
filename='template.py.j2',
date='date', asset='asset',
alias: Dict[str, str] = {},
extra_codes: Sequence[str] = ()):
extra_codes: Sequence[str] = (),
**kwargs):
"""Template-based code generation"""
# print Pandas-style code
p = PandasStrPrinter()
7 changes: 4 additions & 3 deletions expr_codegen/pandas/template.py.j2
@@ -1,6 +1,7 @@
# this code is auto generated by the expr_codegen
# https://github.com/wukan1986/expr_codegen
# this code was generated automatically by expr_codegen; issues and pull requests are welcome
from typing import Tuple

import numpy as np # noqa
import pandas as pd # noqa
@@ -63,6 +64,6 @@ def main(df: pd.DataFrame) -> pd.DataFrame:

return df

if __name__ in ("__main__", "builtins"):
# TODO: load data or pass it in externally
df_output = main(df_input)
# if __name__ in ("__main__", "builtins"):
# # TODO: load data or pass it in externally
# df_output = main(df_input)
3 changes: 2 additions & 1 deletion expr_codegen/polars_group/code.py
@@ -39,7 +39,8 @@ def codegen(exprs_ldl: ListDictList, exprs_src, syms_dst,
filename='template.py.j2',
date='date', asset='asset',
alias: Dict[str, str] = {},
extra_codes: Sequence[str] = ()):
extra_codes: Sequence[str] = (),
**kwargs):
"""Template-based code generation"""
# print Polars-style code
p = PolarsStrPrinter()