vLLM results are better than trt with the same request #1870

Closed
2 of 4 tasks
activezhao opened this issue Jun 30, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@activezhao

activezhao commented Jun 30, 2024

System Info

CPU x86_64

GPU NVIDIA L40

TensorRT-LLM branch: v0.10.0

CUDA: NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.4

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I have a model based on deepseek_coder_6.7b, to which I added some special tokens, such as <filename>, <reponame>, and so on, for better performance.

I have some requests, and they are executed on trt, vLLM, and transformers.generate respectively.

The results of vLLM and transformers.generate are very good, but the result of trt is a bad case, which is pretty weird.
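
For reference, the transformers.generate baseline is run roughly like this (a minimal sketch, not the exact script; the local model path is an assumption, and the prompt and sampling parameters mirror the curl request below):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed local path to the fine-tuned deepseek-coder-6.7b checkpoint
model_dir = "/data/deepseek-6.7b/"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)

# The prompt contains the added special tokens (<reponame>, <filename>, ...) exactly as in the request
prompt = "<reponame>programming-language-demo\n<neighbor><filename>prime-number.go..."  # truncated for brevity

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))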

Here are the commands for trt:

python /data/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir /data/deepseek-6.7b/ \
                            --output_dir /data/trt-v10-deepseek-6.7b-tp2-bs8 \
                            --dtype float16 \
                            --tp_size 2 \
                            --workers 2

trtllm-build --checkpoint_dir /data/trt-v10-deepseek-6.7b-tp2-bs8 \
            --output_dir /data/trt-v10-engines-deepseek-6.7b-bs8/2-gpu/  \
            --gemm_plugin float16 \
            --paged_kv_cache enable \
            --max_input_len 8192 \
            --max_output_len 1024 \
            --gpt_attention_plugin float16 \
            --max_batch_size 8 

Here is one of the requests:

curl -X POST localhost:8820/v2/models/ensemble/generate_stream -d '{"text_input": "\u003creponame\u003eprogramming-language-demo\n\u003cneighbor\u003e\u003cfilename\u003eprime-number.go\u003ccodeblock\u003e// }\n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError()\n// func main()\n// func isPrime(n int) bool\n// Compare this snippet from go/prime-number.go:\n// package main\n// \n// import (\n// \t\"fmt\"\n// \t\"os\"\n// \t\"strconv\"\n// )\n// \n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// \n// func exitWithError() {\n// \tfmt.Println(\"Usage: please input a non-negative integer\")\n// \tos.Exit(1)\n// }\n// \n// func main() {\n// \tif len(os.Args) != 2 {\n// \t\texitWithError()\n// \t}\n// \n// \tn, err := strconv.Atoi(os.Args[1])\n// \tif err != nil || n \u003c 0 {\n// \t\texitWithError()\n// \t}\n// \n// \tif isPrime(n) {\n// \t\tfmt.Println(\"Prime\")\n// \t} else {\n// \t    fmt.Println(\"Composite\")\n// \t}\n// }\u003cneighbor\u003e\u003cfilename\u003eprime-number.go\u003ccodeblock\u003e// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError() {\n// \tfmt.Println(\"Usage: please input a non-negative integer\")\n// \tos.Exit(1)\n// }\n// func main() {\n// \tif len(os.Args) != 2 {\n// \t\texitWithError()\n// \t}\n// \n// \tn, err := strconv.Atoi(os.Args[1])\n// \tif err != nil || n \u003c 0 {\n// \t\texitWithError()\n// \t}\n// \n// \tif isPrime(n) {\n// \t\tfmt.Println(\"Prime\")\n// \t} else {\n// \t    fmt.Println(\"Composite\")\n// \t}\n// }\n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError()\n// func main()\n// func isPrime(n int) bool\n// Compare this snippet from go/prime-number.go:\n// package main\n// \n// import (\n// \t\"fmt\"\n// \t\"os\"\n// \t\"strconv\"\n// )\n// \n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// \n// func exitWithError() {\u003cneighbor\u003e\u003cfilename\u003elongest-word.go\u003ccodeblock\u003e// Variables from import file go/longest-word.go can be referenced:\n// errorMessage = \"Usage: please provide a string\"\n// Functions from import file go/longest-word.go can be referenced:\n// func longestWordLength(str string) int {\n// \twords := strings.FieldsFunc(str, isLimitedWhitespace)\n// \treturn longestStringLength(words)\n// }\n// func isLimitedWhitespace(r rune) bool {\n// \treturn strings.ContainsRune(\" \\t\\n\\r\", r)\n// }\n// func longestStringLength(strs []string) (longest int) {\n// \tfor _, str := range strs {\n// \t\tif len(str) \u003e longest {\n// \t\t\tlongest = len(str)\n// \t\t}\n// \t}\n// \treturn\n// }\n// Functions from import file go/longest-word.go can be referenced:\n// func 
longestWordLength(str string) int\n// func isLimitedWhitespace(r rune) bool\n// func longestStringLength(strs []string) (longest int)\u003cneighbor\u003e\u003cfilename\u003efactorial.go\u003ccodeblock\u003e// Functions from import file go/factorial.go can be referenced:\n// func exitWithError(msg string) {\n// \tfmt.Println(msg)\n// \tos.Exit(1)\n// }\n// func factorial(n uint64) uint64 {\n// \tif n \u003c= 0 {\n// \t\treturn 1\n// \t}\n// \treturn n * factorial(n-1)\n// }\n// Functions from import file go/factorial.go can be referenced:\n// func exitWithError(msg string)\n// func factorial(n uint64) uint64\u003cfilename\u003elongest-common-subsequence.go\n\u003ccodecontent\u003epackage main\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"os\"\n\t\"regexp\"\n\t\"strconv\"\n\t\"strings\"\n)\n//exitWithError\n, ", "max_tokens": 50, "bad_words": "", "stop_words": "", "stream": false, "temperature": 0.2, "top_p": 0.95, "return_log_probs": true, "generation_logits": true}'

Expected behavior

The expected result is:

func exitWithError(msg string) {
	fmt.Println(msg)
	os.Exit(1)
}

In fact, vLLM and transformers.generate both produce the result above.

actual behavior

The trt result is:

data: {"context_logits":0.0,"cum_log_probs":-1.76106858253479,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[-0.0000066757424974639438,-0.10143566876649857,-0.1650305688381195,-0.00022062112111598253,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000067949526965094268,-0.0000010728841743912199,-9.536747711536009e-7,-9.536747711536009e-7,-0.00007355483830906451,-9.536747711536009e-7,-0.0000020265599687263604,-0.0000010728841743912199,-9.536747711536009e-7,-0.000012636264727916569,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000010728841743912199,-0.0001179049359052442,-9.536747711536009e-7,-0.0005595461116172373,-0.0000011920935776288389,-9.536747711536009e-7,-0.000048638572479831058,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.17364458739757539,-9.536747711536009e-7,-0.0004099851648788899,-9.536747711536009e-7,-0.000002861027041944908,-0.0005539401317946613,-0.0008925008587539196,-9.536747711536009e-7,-9.536747711536009e-7,-0.000003933914285880746,-0.0258316770195961,-9.536747711536009e-7,-0.022926615551114084,-9.536747711536009e-7,-9.536747711536009e-7,-0.000002145769485650817,-1.0269726514816285,-0.24228566884994508],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n//isLimitedWhitespace\n, \n//longestStringLength\n, \n//longestWordLength\n, \n//longestCommonSubsequence\n, \n//main\n, \n//parseArgs"}

And the text_output part is:

//isLimitedWhitespace
//longestStringLength
//longestWordLength
//longestCommonSubsequence
//main
//parseArgs

However, if I only use the last part of the request, the result is normal as well.

Here is the request:

curl -X POST localhost:8820/v2/models/ensemble/generate_stream -d '{"text_input": "\u003ccodecontent\u003epackage main\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"os\"\n\t\"regexp\"\n\t\"strconv\"\n\t\"strings\"\n)\n//exitWithError\n, ", "max_tokens": 50, "bad_words": "", "stop_words": "", "stream": false, "temperature": 0.2, "top_p": 0.95, "return_log_probs": true, "generation_logits": true}'

And here is the result:

data: {"context_logits":0.0,"cum_log_probs":-2.383721351623535,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[-0.000052334245992824438,-0.3010028600692749,-0.0016516967443749309,-9.536747711536009e-7,-0.0000010728841743912199,-9.536747711536009e-7,-0.0057563441805541519,-0.0000027418175250204514,-0.000046373490476980808,-0.0000019073504518019037,-9.536747711536009e-7,-0.008396431803703308,-9.536747711536009e-7,-9.536747711536009e-7,-0.21918922662734986,-0.0002970540663227439,-0.06785676628351212,-9.536747711536009e-7,-0.00040557264583185315,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000020265599687263604,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.6207338571548462,-0.15082910656929017,-0.4851605296134949,-0.39718568325042727,-0.0005019970703870058,-0.0011182717280462385,-0.0000017881409348774469,-9.536747711536009e-7,-0.0000010728841743912199,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000015497220147153712,-0.00020500138634815812,-0.12325640767812729,-0.000039816695789340886,-9.536747711536009e-7,-0.0000013113030945532956,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nfunc exitWithError(err error) {\n\tfmt.Println(err)\n\tos.Exit(1)\n}\n\n//getEnv\n, \nfunc getEnv(key, fallback string) string {"}

And the text_output part is:

func exitWithError(err error) {
    fmt.Println(err)
    os.Exit(1)
}

//getEnv
func getEnv(key, fallback string) string {

additional notes

This is so weird.

I have analyzed this for a long time, but I still don't know what is causing it.

Please help me.

Thank you.

@activezhao activezhao added the bug Something isn't working label Jun 30, 2024
@DreamGenX
Contributor

This might be related: #1788

@activezhao
Author

activezhao commented Jul 1, 2024

This might be related: #1788

@DreamGenX Thanks for your suggestion.

But it seems that my problem is not a RoPE problem; the value of rotary_base is correct.

This is the original config.json:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 32013,
  "eos_token_id": 32022,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 16384,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 100000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.0.dev0",
  "use_cache": true,
  "vocab_size": 32043
}

And this is the engine's config.json:

{
    "version": "0.10.0",
    "pretrained_config": {
        "architecture": "LlamaForCausalLM",
        "dtype": "float16",
        "logits_dtype": "float32",
        "vocab_size": 32043,
        "max_position_embeddings": 16384,
        "hidden_size": 4096,
        "num_hidden_layers": 32,
        "num_attention_heads": 32,
        "num_key_value_heads": 32,
        "head_size": 128,
        "qk_layernorm": false,
        "hidden_act": "silu",
        "intermediate_size": 11008,
        "norm_epsilon": 1e-06,
        "position_embedding_type": "rope_gpt_neox",
        "use_parallel_embedding": false,
        "embedding_sharding_dim": 0,
        "share_embedding_table": false,
        "mapping": {
            "world_size": 2,
            "tp_size": 2,
            "pp_size": 1,
            "gpus_per_node": 8
        },
        "quantization": {
            "quant_algo": null,
            "kv_cache_quant_algo": null,
            "group_size": 128,
            "smoothquant_val": null,
            "has_zero_point": false,
            "pre_quant_scale": false,
            "exclude_modules": [
                "lm_head"
            ]
        },
        "kv_dtype": "float16",
        "rotary_scaling": {
            "factor": 4.0,
            "type": "linear"
        },
        "residual_mlp": false,
        "moe_normalization_mode": null,
        "rotary_base": 100000,
        "moe_num_experts": 0,
        "moe_top_k": 0,
        "moe_tp_mode": 2,
        "attn_bias": false,
        "disable_weight_only_quant_plugin": false,
        "mlp_bias": false
    },
    "build_config": {
        "max_input_len": 8192,
        "max_output_len": 1024,
        "opt_batch_size": null,
        "max_batch_size": 8,
        "max_beam_width": 1,
        "max_num_tokens": 65536,
        "opt_num_tokens": 8,
        "max_prompt_embedding_table_size": 0,
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": false,
        "builder_opt": null,
        "profiling_verbosity": "layer_names_only",
        "enable_debug_output": false,
        "max_draft_len": 0,
        "speculative_decoding_mode": 1,
        "use_refit": false,
        "input_timing_cache": null,
        "output_timing_cache": "model.cache",
        "lora_config": {
            "lora_dir": [],
            "lora_ckpt_source": "hf",
            "max_lora_rank": 64,
            "lora_target_modules": [],
            "trtllm_modules_to_hf_modules": {}
        },
        "auto_parallel_config": {
            "world_size": 1,
            "gpus_per_node": 8,
            "cluster_key": "L40",
            "cluster_info": null,
            "sharding_cost_model": "alpha_beta",
            "comm_cost_model": "alpha_beta",
            "enable_pipeline_parallelism": false,
            "enable_shard_unbalanced_shape": false,
            "enable_shard_dynamic_shape": false,
            "enable_reduce_scatter": true,
            "builder_flags": null,
            "debug_mode": false,
            "infer_shape": true,
            "validation_mode": false,
            "same_buffer_io": {
                "past_key_value_(\\d+)": "present_key_value_\\1"
            },
            "same_spec_io": {},
            "sharded_io_allowlist": [
                "past_key_value_\\d+",
                "present_key_value_\\d*"
            ],
            "fast_reduce": true,
            "fill_weights": false,
            "parallel_config_cache": null,
            "profile_cache": null,
            "dump_path": null,
            "debug_outputs": []
        },
        "weight_sparsity": false,
        "weight_streaming": false,
        "use_strip_plan": false,
        "max_encoder_input_len": 1024,
        "use_fused_mlp": true,
        "plugin_config": {
            "bert_attention_plugin": "float16",
            "gpt_attention_plugin": "float16",
            "gemm_plugin": "float16",
            "smooth_quant_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": "float16",
            "lookup_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": null,
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "moe_plugin": "float16",
            "mamba_conv1d_plugin": "float16",
            "context_fmha": true,
            "context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "use_custom_all_reduce": true,
            "multi_block_mode": false,
            "enable_xqa": true,
            "attention_qk_half_accumulation": false,
            "tokens_per_block": 64,
            "use_paged_context_fmha": false,
            "use_fp8_context_fmha": false,
            "use_context_fmha_for_generation": false,
            "multiple_profiles": false,
            "paged_state": true,
            "streamingllm": false
        }
    }
}
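
As a sanity check, the RoPE settings can be compared mechanically between the two files (a minimal sketch using only the standard library; the file paths are assumptions based on the commands above):

import json

# Assumed paths: the HF checkpoint config and the built engine's config
with open("/data/deepseek-6.7b/config.json") as f:
    hf = json.load(f)
with open("/data/trt-v10-engines-deepseek-6.7b-bs8/2-gpu/config.json") as f:
    engine = json.load(f)["pretrained_config"]

# rope_theta in the HF config should match rotary_base in the engine config,
# and rope_scaling should match rotary_scaling
print("rope_theta :", hf["rope_theta"], "vs", engine["rotary_base"])
print("rope_scaling:", hf["rope_scaling"], "vs", engine["rotary_scaling"])

In this case both pairs match (100000 and {factor: 4.0, type: linear}), which is consistent with rotary settings not being the cause.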

@activezhao
Author

@kaiyux @byshiue Can you help me look into this issue?

Thanks.

@DreamGenX
Contributor

@activezhao In my case rotary_base was also not the root cause (was correctly set to 500000 for llama3). I am still not sure where the issue is.

@activezhao
Author

@activezhao In my case rotary_base was also not the root cause (was correctly set to 500000 for llama3). I am still not sure where the issue is.

@DreamGenX Yes, I agree with you.

I printed the input_ids and they look normal, so I really don't know why the results are abnormal.

Just so weird.

@netanel-haber
Collaborator

@activezhao - can you elaborate why you closed this issue as completed please?

@activezhao
Author

activezhao commented Jul 7, 2024

@activezhao - can you elaborate why you closed this issue as completed please?

Hi @netanel-haber, I have solved this problem.

I analyzed the BLS code and finally found that the inference quality dropped significantly in some scenarios because the temperature parameter was not being passed along.

And I have submitted a PR with the fix here:

triton-inference-server/tensorrtllm_backend#523
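
For intuition on why a dropped temperature hurts quality here (a minimal sketch, not the backend code): if the serving layer silently falls back to a temperature of 1.0 instead of the requested 0.2, the softmax over the logits flattens and low-probability continuations become far more likely to be sampled.

import numpy as np

def sample_probs(logits, temperature):
    # Softmax over temperature-scaled logits
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Hypothetical logits: the "correct" token is only slightly preferred
logits = np.array([3.0, 2.5, 2.0, 0.0])

print(sample_probs(logits, temperature=0.2))  # ~[0.92, 0.08, 0.01, 0.00] - sharply peaked
print(sample_probs(logits, temperature=1.0))  # ~[0.49, 0.30, 0.18, 0.02] - much flatter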
