vLLM results are better than trt with the same request #1870

Closed
2 of 4 tasks
activezhao opened this issue Jun 30, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@activezhao

activezhao commented Jun 30, 2024

System Info

CPU x86_64

GPU NVIDIA L40

TensorRT-LLM branch: v0.10.0

CUDA: NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.4

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I have a model based on deepseek_coder_6.7b, to which I added some special tokens, such as <filename>, <reponame>, and so on, for better performance.

I have some requests, and they are executed on trt, vLLM, and transformers.generate respectively.

The results of vLLM and transformers.generate are very good, but the result of trt is a bad case, which is pretty weird.
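
For reference, the transformers.generate baseline is run roughly like this (a minimal sketch, not the exact script; the local model path is an assumption, and the prompt and sampling parameters mirror the curl request below):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed local path to the fine-tuned deepseek-coder-6.7b checkpoint
model_dir = "/data/deepseek-6.7b/"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)

# The prompt contains the added special tokens (<reponame>, <filename>, ...) exactly as in the request
prompt = "<reponame>programming-language-demo\n<neighbor><filename>prime-number.go..."  # truncated for brevity

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))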

Here are the commands for trt:

python /data/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir /data/deepseek-6.7b/ \
                            --output_dir /data/trt-v10-deepseek-6.7b-tp2-bs8 \
                            --dtype float16 \
                            --tp_size 2 \
                            --workers 2

trtllm-build --checkpoint_dir /data/trt-v10-deepseek-6.7b-tp2-bs8 \
            --output_dir /data/trt-v10-engines-deepseek-6.7b-bs8/2-gpu/  \
            --gemm_plugin float16 \
            --paged_kv_cache enable \
            --max_input_len 8192 \
            --max_output_len 1024 \
            --gpt_attention_plugin float16 \
            --max_batch_size 8 

Here is one of the requests:

curl -X POST localhost:8820/v2/models/ensemble/generate_stream -d '{"text_input": "\u003creponame\u003eprogramming-language-demo\n\u003cneighbor\u003e\u003cfilename\u003eprime-number.go\u003ccodeblock\u003e// }\n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError()\n// func main()\n// func isPrime(n int) bool\n// Compare this snippet from go/prime-number.go:\n// package main\n// \n// import (\n// \t\"fmt\"\n// \t\"os\"\n// \t\"strconv\"\n// )\n// \n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// \n// func exitWithError() {\n// \tfmt.Println(\"Usage: please input a non-negative integer\")\n// \tos.Exit(1)\n// }\n// \n// func main() {\n// \tif len(os.Args) != 2 {\n// \t\texitWithError()\n// \t}\n// \n// \tn, err := strconv.Atoi(os.Args[1])\n// \tif err != nil || n \u003c 0 {\n// \t\texitWithError()\n// \t}\n// \n// \tif isPrime(n) {\n// \t\tfmt.Println(\"Prime\")\n// \t} else {\n// \t    fmt.Println(\"Composite\")\n// \t}\n// }\u003cneighbor\u003e\u003cfilename\u003eprime-number.go\u003ccodeblock\u003e// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError() {\n// \tfmt.Println(\"Usage: please input a non-negative integer\")\n// \tos.Exit(1)\n// }\n// func main() {\n// \tif len(os.Args) != 2 {\n// \t\texitWithError()\n// \t}\n// \n// \tn, err := strconv.Atoi(os.Args[1])\n// \tif err != nil || n \u003c 0 {\n// \t\texitWithError()\n// \t}\n// \n// \tif isPrime(n) {\n// \t\tfmt.Println(\"Prime\")\n// \t} else {\n// \t    fmt.Println(\"Composite\")\n// \t}\n// }\n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError()\n// func main()\n// func isPrime(n int) bool\n// Compare this snippet from go/prime-number.go:\n// package main\n// \n// import (\n// \t\"fmt\"\n// \t\"os\"\n// \t\"strconv\"\n// )\n// \n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// \n// func exitWithError() {\u003cneighbor\u003e\u003cfilename\u003elongest-word.go\u003ccodeblock\u003e// Variables from import file go/longest-word.go can be referenced:\n// errorMessage = \"Usage: please provide a string\"\n// Functions from import file go/longest-word.go can be referenced:\n// func longestWordLength(str string) int {\n// \twords := strings.FieldsFunc(str, isLimitedWhitespace)\n// \treturn longestStringLength(words)\n// }\n// func isLimitedWhitespace(r rune) bool {\n// \treturn strings.ContainsRune(\" \\t\\n\\r\", r)\n// }\n// func longestStringLength(strs []string) (longest int) {\n// \tfor _, str := range strs {\n// \t\tif len(str) \u003e longest {\n// \t\t\tlongest = len(str)\n// \t\t}\n// \t}\n// \treturn\n// }\n// Functions from import file go/longest-word.go can be referenced:\n// func 
longestWordLength(str string) int\n// func isLimitedWhitespace(r rune) bool\n// func longestStringLength(strs []string) (longest int)\u003cneighbor\u003e\u003cfilename\u003efactorial.go\u003ccodeblock\u003e// Functions from import file go/factorial.go can be referenced:\n// func exitWithError(msg string) {\n// \tfmt.Println(msg)\n// \tos.Exit(1)\n// }\n// func factorial(n uint64) uint64 {\n// \tif n \u003c= 0 {\n// \t\treturn 1\n// \t}\n// \treturn n * factorial(n-1)\n// }\n// Functions from import file go/factorial.go can be referenced:\n// func exitWithError(msg string)\n// func factorial(n uint64) uint64\u003cfilename\u003elongest-common-subsequence.go\n\u003ccodecontent\u003epackage main\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"os\"\n\t\"regexp\"\n\t\"strconv\"\n\t\"strings\"\n)\n//exitWithError\n, ", "max_tokens": 50, "bad_words": "", "stop_words": "", "stream": false, "temperature": 0.2, "top_p": 0.95, "return_log_probs": true, "generation_logits": true}'

Expected behavior

The expected result is:

func exitWithError(msg string) {
	fmt.Println(msg)
	os.Exit(1)
}

In fact, vLLM and transformers.generate both produce the result above.

actual behavior

The trt result is:

data: {"context_logits":0.0,"cum_log_probs":-1.76106858253479,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[-0.0000066757424974639438,-0.10143566876649857,-0.1650305688381195,-0.00022062112111598253,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000067949526965094268,-0.0000010728841743912199,-9.536747711536009e-7,-9.536747711536009e-7,-0.00007355483830906451,-9.536747711536009e-7,-0.0000020265599687263604,-0.0000010728841743912199,-9.536747711536009e-7,-0.000012636264727916569,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000010728841743912199,-0.0001179049359052442,-9.536747711536009e-7,-0.0005595461116172373,-0.0000011920935776288389,-9.536747711536009e-7,-0.000048638572479831058,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.17364458739757539,-9.536747711536009e-7,-0.0004099851648788899,-9.536747711536009e-7,-0.000002861027041944908,-0.0005539401317946613,-0.0008925008587539196,-9.536747711536009e-7,-9.536747711536009e-7,-0.000003933914285880746,-0.0258316770195961,-9.536747711536009e-7,-0.022926615551114084,-9.536747711536009e-7,-9.536747711536009e-7,-0.000002145769485650817,-1.0269726514816285,-0.24228566884994508],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n//isLimitedWhitespace\n, \n//longestStringLength\n, \n//longestWordLength\n, \n//longestCommonSubsequence\n, \n//main\n, \n//parseArgs"}

And the text_output part is:

//isLimitedWhitespace
//longestStringLength
//longestWordLength
//longestCommonSubsequence
//main
//parseArgs

However, if I only use the last part of the request, the result is normal as well.

Here is the request:

curl -X POST localhost:8820/v2/models/ensemble/generate_stream -d '{"text_input": "\u003ccodecontent\u003epackage main\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"os\"\n\t\"regexp\"\n\t\"strconv\"\n\t\"strings\"\n)\n//exitWithError\n, ", "max_tokens": 50, "bad_words": "", "stop_words": "", "stream": false, "temperature": 0.2, "top_p": 0.95, "return_log_probs": true, "generation_logits": true}'

And here is the result:

data: {"context_logits":0.0,"cum_log_probs":-2.383721351623535,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[-0.000052334245992824438,-0.3010028600692749,-0.0016516967443749309,-9.536747711536009e-7,-0.0000010728841743912199,-9.536747711536009e-7,-0.0057563441805541519,-0.0000027418175250204514,-0.000046373490476980808,-0.0000019073504518019037,-9.536747711536009e-7,-0.008396431803703308,-9.536747711536009e-7,-9.536747711536009e-7,-0.21918922662734986,-0.0002970540663227439,-0.06785676628351212,-9.536747711536009e-7,-0.00040557264583185315,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000020265599687263604,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.6207338571548462,-0.15082910656929017,-0.4851605296134949,-0.39718568325042727,-0.0005019970703870058,-0.0011182717280462385,-0.0000017881409348774469,-9.536747711536009e-7,-0.0000010728841743912199,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000015497220147153712,-0.00020500138634815812,-0.12325640767812729,-0.000039816695789340886,-9.536747711536009e-7,-0.0000013113030945532956,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nfunc exitWithError(err error) {\n\tfmt.Println(err)\n\tos.Exit(1)\n}\n\n//getEnv\n, \nfunc getEnv(key, fallback string) string {"}

And the text_output part is:

func exitWithError(err error) {
    fmt.Println(err)
    os.Exit(1)
}

//getEnv
func getEnv(key, fallback string) string {

additional notes

This is so weird.

I have analyzed this for a long time, but I still don't know what is causing it.

Please help me.

Thank you.

@activezhao activezhao added the bug Something isn't working label Jun 30, 2024
@DreamGenX
Contributor

This might be related: #1788

@activezhao
Author

activezhao commented Jul 1, 2024

This might be related: #1788

@DreamGenX Thanks for your suggestion.

But it seems that my problem is not a RoPE problem; the value of rotary_base is correct.

This is the original config.json:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 32013,
  "eos_token_id": 32022,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 16384,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 100000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.0.dev0",
  "use_cache": true,
  "vocab_size": 32043
}

And this is the engine's config.json:

{
    "version": "0.10.0",
    "pretrained_config": {
        "architecture": "LlamaForCausalLM",
        "dtype": "float16",
        "logits_dtype": "float32",
        "vocab_size": 32043,
        "max_position_embeddings": 16384,
        "hidden_size": 4096,
        "num_hidden_layers": 32,
        "num_attention_heads": 32,
        "num_key_value_heads": 32,
        "head_size": 128,
        "qk_layernorm": false,
        "hidden_act": "silu",
        "intermediate_size": 11008,
        "norm_epsilon": 1e-06,
        "position_embedding_type": "rope_gpt_neox",
        "use_parallel_embedding": false,
        "embedding_sharding_dim": 0,
        "share_embedding_table": false,
        "mapping": {
            "world_size": 2,
            "tp_size": 2,
            "pp_size": 1,
            "gpus_per_node": 8
        },
        "quantization": {
            "quant_algo": null,
            "kv_cache_quant_algo": null,
            "group_size": 128,
            "smoothquant_val": null,
            "has_zero_point": false,
            "pre_quant_scale": false,
            "exclude_modules": [
                "lm_head"
            ]
        },
        "kv_dtype": "float16",
        "rotary_scaling": {
            "factor": 4.0,
            "type": "linear"
        },
        "residual_mlp": false,
        "moe_normalization_mode": null,
        "rotary_base": 100000,
        "moe_num_experts": 0,
        "moe_top_k": 0,
        "moe_tp_mode": 2,
        "attn_bias": false,
        "disable_weight_only_quant_plugin": false,
        "mlp_bias": false
    },
    "build_config": {
        "max_input_len": 8192,
        "max_output_len": 1024,
        "opt_batch_size": null,
        "max_batch_size": 8,
        "max_beam_width": 1,
        "max_num_tokens": 65536,
        "opt_num_tokens": 8,
        "max_prompt_embedding_table_size": 0,
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": false,
        "builder_opt": null,
        "profiling_verbosity": "layer_names_only",
        "enable_debug_output": false,
        "max_draft_len": 0,
        "speculative_decoding_mode": 1,
        "use_refit": false,
        "input_timing_cache": null,
        "output_timing_cache": "model.cache",
        "lora_config": {
            "lora_dir": [],
            "lora_ckpt_source": "hf",
            "max_lora_rank": 64,
            "lora_target_modules": [],
            "trtllm_modules_to_hf_modules": {}
        },
        "auto_parallel_config": {
            "world_size": 1,
            "gpus_per_node": 8,
            "cluster_key": "L40",
            "cluster_info": null,
            "sharding_cost_model": "alpha_beta",
            "comm_cost_model": "alpha_beta",
            "enable_pipeline_parallelism": false,
            "enable_shard_unbalanced_shape": false,
            "enable_shard_dynamic_shape": false,
            "enable_reduce_scatter": true,
            "builder_flags": null,
            "debug_mode": false,
            "infer_shape": true,
            "validation_mode": false,
            "same_buffer_io": {
                "past_key_value_(\\d+)": "present_key_value_\\1"
            },
            "same_spec_io": {},
            "sharded_io_allowlist": [
                "past_key_value_\\d+",
                "present_key_value_\\d*"
            ],
            "fast_reduce": true,
            "fill_weights": false,
            "parallel_config_cache": null,
            "profile_cache": null,
            "dump_path": null,
            "debug_outputs": []
        },
        "weight_sparsity": false,
        "weight_streaming": false,
        "use_strip_plan": false,
        "max_encoder_input_len": 1024,
        "use_fused_mlp": true,
        "plugin_config": {
            "bert_attention_plugin": "float16",
            "gpt_attention_plugin": "float16",
            "gemm_plugin": "float16",
            "smooth_quant_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": "float16",
            "lookup_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": null,
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "moe_plugin": "float16",
            "mamba_conv1d_plugin": "float16",
            "context_fmha": true,
            "context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "use_custom_all_reduce": true,
            "multi_block_mode": false,
            "enable_xqa": true,
            "attention_qk_half_accumulation": false,
            "tokens_per_block": 64,
            "use_paged_context_fmha": false,
            "use_fp8_context_fmha": false,
            "use_context_fmha_for_generation": false,
            "multiple_profiles": false,
            "paged_state": true,
            "streamingllm": false
        }
    }
}
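
As a sanity check, the RoPE settings can be compared mechanically between the two files (a minimal sketch using only the standard library; the file paths are assumptions based on the commands above):

import json

# Assumed paths: the HF checkpoint config and the built engine's config
with open("/data/deepseek-6.7b/config.json") as f:
    hf = json.load(f)
with open("/data/trt-v10-engines-deepseek-6.7b-bs8/2-gpu/config.json") as f:
    engine = json.load(f)["pretrained_config"]

# rope_theta in the HF config should match rotary_base in the engine config,
# and rope_scaling should match rotary_scaling
print("rope_theta :", hf["rope_theta"], "vs", engine["rotary_base"])
print("rope_scaling:", hf["rope_scaling"], "vs", engine["rotary_scaling"])

In this case both pairs match (100000 and {factor: 4.0, type: linear}), which is consistent with rotary settings not being the cause.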

@activezhao
Author

@kaiyux @byshiue Can you help me look into this issue?

Thanks.

@DreamGenX
Contributor

@activezhao In my case rotary_base was also not the root cause (was correctly set to 500000 for llama3). I am still not sure where the issue is.

@activezhao
Author

@activezhao In my case rotary_base was also not the root cause (was correctly set to 500000 for llama3). I am still not sure where the issue is.

@DreamGenX Yes, I agree with you.

I printed the input_ids and they look normal, so I really don't know why the results are abnormal.

Just so weird.

@netanel-haber
Collaborator

@activezhao - can you elaborate why you closed this issue as completed please?

@activezhao
Author

activezhao commented Jul 7, 2024

@activezhao - can you elaborate why you closed this issue as completed please?

Hi @netanel-haber, I have solved this problem.

I analyzed the BLS code and finally found that the inference quality dropped significantly in some scenarios because the temperature parameter was not being passed along.

And I have submitted a PR with the fix here:

triton-inference-server/tensorrtllm_backend#523
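
For intuition on why a dropped temperature hurts quality here (a minimal sketch, not the backend code): if the serving layer silently falls back to a temperature of 1.0 instead of the requested 0.2, the softmax over the logits flattens and low-probability continuations become far more likely to be sampled.

import numpy as np

def sample_probs(logits, temperature):
    # Softmax over temperature-scaled logits
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Hypothetical logits: the "correct" token is only slightly preferred
logits = np.array([3.0, 2.5, 2.0, 0.0])

print(sample_probs(logits, temperature=0.2))  # ~[0.92, 0.08, 0.01, 0.00] - sharply peaked
print(sample_probs(logits, temperature=1.0))  # ~[0.49, 0.30, 0.18, 0.02] - much flatter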
