Change log

Generated on 2023-10-16

Release 23.10

Features

#9220 [FEA] Add GPU support for converting binary data to a hex string in REPL
#9171 [FEA] Add GPU version of ToPrettyString
#5314 [FEA] Support window.rowsBetween(Window.unboundedPreceding, -1)
#9057 [FEA] Add unbounded to unbounded fixers for min and max
#8121 [FEA] Add Spark 3.5.0 shim layer
#9224 [FEA] Allow } and }} to be transpiled to static strings
#8596 [FEA] Support spark.sql.legacy.parquet.datetimeRebaseModeInWrite=LEGACY
#8767 [AUDIT][SPARK-43302][SQL] Make Python UDAF an AggregateFunction
#9055 [FEA] Support Spark 3.3.3 official release
#8672 [FEA] Make GPU readers easier to debug on failure (any failure including OOM)
#8965 [FEA] Enable Bloom filter join acceleration by default
#8625 [FEA] Support outputTimestampType being INT96

Performance

#7803 [FEA] Accelerate Bloom filtered joins

Bugs Fixed

#9060 [BUG] OOM error in split and retry with multifile coalesce reader with parquet data
#8916 [BUG] Databricks - move init scripts off DBFS
#9416 [BUG] CDH build failed due to missing dependencies
#9357 [BUG] json_test failed on "NameError: name 'TimestampNTZType' is not defined"
#9271 [BUG] ThreadPool size is deduced incorrectly in MultiFileReaderThreadPool on YARN clusters
#9309 [BUG] bround and round do not return the correct result for some decimal values.
#9153 [BUG] netty OOM with MULTITHREADED shuffle
#9311 [BUG] test_hash_groupby_collect_list fails
#9180 [FEA][AUDIT][SPARK-44641] Incorrect result in certain scenarios when SPJ is not triggered
#9290 [BUG] delta_lake_test FAILED on "column mapping mode id is not supported for this Delta version"
#9255 [BUG] Unable to read DeltaTable with columnMapping.mode = name
#9261 [BUG] Leaks and Double Frees in Unit Tests
#9246 [BUG] test_predefined_character_classes failed with seed 4
#9208 [BUG] SplitAndRetryOOM query14_part1 at 100TB with spark.executor.cores=64
#9106 [BUG] Configuring GDS breaks new host spillable buffers and batches
#9131 [BUG] ConcurrentModificationException in ScalableTaskCompletion
#9263 [BUG] Unit test logging is not captured when running against Spark 3.5.0
#9168 [BUG] Calling RmmSpark.getAndResetNumRetryThrow from tests is not working
#8776 [BUG] FileCacheIntegrationSuite intermittent failure
#9223 [BUG] Failed to create memory map on query14_part1 at 100TB with spark.executor.cores=64
#9116 [BUG] spark350 shim build failed in mvn-verify github checks and nightly due to dependencies not released
#8984 [BUG] Check that keys are not null when creating a map
#9233 [BUG] test_parquet_testing_error_files - Failed: DID NOT RAISE <class 'Exception'> in databricks runtime 12.2
#9142 [BUG] AWS EMR 6.12 NDS SF3k query9 Failure on g4dn.4xlarge
#9214 [BUG] mvn resolve dependencies failed missing rapids-4-spark-sql-plugin-api_2.12 of 311 shim
#9204 [BUG] SplitAndRetryOOM query78 at 100TB with spark.executor.cores=64
#9213 [BUG] Missing revision info in databricks shims failed nightly build
#9206 [BUG] test_datetime_roundtrip_with_legacy_rebase failed in databricks runtimes
#9165 [BUG] Data gen for key groups produces type-mismatch columns
#9129 [BUG] Writing Parquet map(map) column can not set the outer key as non-null.
#9194 [BUG] missing sql-plugin-api databricks artifacts in the nightly CI
#9167 [BUG] Ensure no udf-compiler internal nodes escape
#9092 [BUG] NDS query 64 falls back to CPU only for a shuffle
#9071 [BUG] test_numeric_running_sum_window_no_part_unbounded failed in MT tests
#9154 [BUG] Spark 3.5.0 nightly build failures (test_parquet_testing_error_files)
#9149 [BUG] compile failed in databricks runtimes due to new added TestReport
#9041 [BUG] Fix regression in Python UDAF support when running against Spark 3.5.0
#9064 [BUG][Spark 3.5.0] Re-enable test_hive_empty_simple_udf when 3.5.0-rc2 is available
#9065 [BUG][Spark 3.5.0] Reinstate cast map/array to string tests when 3.5.0-rc2 is available
#9119 [BUG] Predicate pushdown doesn't work for parquet files written by GPU
#9103 [BUG] test_select_complex_field fails in MT tests
#9086 [BUG] GpuBroadcastNestedLoopJoinExec can assert in doUnconditionalJoin
#8939 [BUG] q95 odd task failure in query95 at 30TB
#9082 [BUG] Race condition while spilling and aliasing a RapidsBuffer (regression)
#9069 [BUG] ParquetFormatScanSuite does not pass locally
#8980 [BUG] invalid escape sequences in pytests
#7807 [BUG] Round robin partitioning sort check falls back to CPU for cases that can be supported
#8482 [BUG] Potential leak on SplitAndRetry when iterator not fully drained
#8942 [BUG] NDS query 14 parts 1 and 2 both fail at SF100K
#8778 [BUG] GPU Parquet output for TIMESTAMP_MICROS is misinterpreted by fastparquet as nanos

PRs

#9445 Only run test_csv_infer_schema_timestamp_ntz tests with PySpark >= 3.4.1
#9420 Update private and jni dep version to released 23.10.0
#9415 [BUG] fix docker modified check in premerge [skip ci]
#9392 Only run test_json_ts_formats_round_trip_ntz tests with PySpark >= 3.4.1
#9401 Remove using mamba before they fix the incompatibility issue [skip ci]
#9381 Change the executor core calculation to take into account the cluster manager
#9351 Put back in full decimal support for format_number
#9374 GpuCoalesceBatches should throw SplitAndRetryOOM on GPU OOM error
#9238 Simplified handling of GPU core dumps
#9362 [DOC] Removing User Guide pages that will be source of truth on docs.nvidia…
#9365 Update DataWriteCommandExec docs to reflect ORC support for nested types
#9277 [Doc]Remove CUDA related requirement from download page.[Skip CI]
#9352 Refine rules for skipping test_csv_infer_schema_timestamp_ntz_* tests
#9334 Add NaNs to Data Generators In Floating-Point Testing
#9344 Update MULTITHREADED shuffle maxBytesInFlight default to 128MB
#9330 Add Hao to blossom-ci whitelist
#9328 Building different Cuda versions section profile does not take effect [skip ci]
#9329 Add kuhushukla to blossom ci yml
#9281 Support format_number
#9335 Temporarily skip failing tests test_csv_infer_schema_timestamp_ntz*
#9318 Update authorized user in blossom-ci whitelist [skip ci]
#9221 Add GPU version of ToPrettyString
#9321 [DOC] Fix some incorrect config links in doc [skip ci]
#9314 Fix RMM crash in FileCacheIntegrationSuite with ARENA memory allocator
#9287 Allow checkpoint and restore on non-deterministic expressions in GpuFilter and GpuProject
#9146 Improve some CSV integration tests
#9159 Update tests and documentation for spark.sql.timestampType when reading CSV/JSON
#9313 Sort results of collect_list test before comparing since it is not guaranteed
#9286 [FEA][AUDIT][SPARK-44641] Incorrect result in certain scenarios when SPJ is not triggered
#9229 Support negative preceding/following for ROW-based window functions
#9297 Append new authorized user to blossom-ci whitelist [skip ci]
#9294 Fix test_delta_read_column_mapping test failures on Spark 3.2.x and 3.3.x
#9285 Add CastOptions to make GpuCast extendible to handle more options
#9279 Fix file format checks to be exact and handle Delta Lake column mapping
#9283 Refactor ExternalSource to move some APIs to converted GPU format or scan
#9264 Fix leak in test and double free in corner case
#9280 Fix some issues found with different seeds in integration tests
#9257 Have host spill use the new HostAlloc API
#9253 Enforce Scala method syntax over deprecated procedure syntax
#9273 Add arm64 profile to build arm artifacts
#9270 Remove GDS spilling
#9267 Roll our own BufferedIterator so we can close cleanly
#9266 Specify correct dependency versions for 350 build
#9262 Add Delta Lake support for Spark 3.4.1 and Delta Lake tests on Spark 3.4.x
#9256 Test Parquet double column stat without NaN
#9254 [Doc]update the emr getting started doc for emr-6130 release[skip ci]
#9228 Add in unbounded to unbounded optimization for min/max
#9252 Add Spark 3.5.0 to list of supported Spark versions [skip ci]
#9251 Enable a couple of retry asserts in internal row to cudf row iterator suite
#9239 Handle escaping the dangling right ] and right } in the regexp transpiler
#9090 Add test cases for Parquet statistics
#9240 Fix flaky ORC filecache test
#9053 [DOC] update the tuning guide document issues [skip ci]
#9211 Allow skipping host spill for a direct device->disk spill
#9234 Enable Spark 350 builds
#9237 Check for null keys when creating map
#9235 xfail fixed_length_byte_array.parquet test due to rapidsai/cudf#14104
#9231 Use conda libmamba solver to resolve intermittent libarchive issue [skip ci]
#8404 Add in support for FIXED_LEN_BYTE_ARRAY as binary
#9225 Add in a HostAlloc API for high priority and add in spilling
#9207 Support SplitAndRetry for GpuRangeExec
#9217 Fix leak in aggregate when there are retries
#9200 Fix a few minor things with scale test
#9222 Deploy classified aggregator for Databricks [skip ci]
#9209 Fix tests for datetime rebase in Databricks
#9181 [DOC] address document issues [skip ci]
#9132 Support spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY
#9196 Fix host memory leak for R2C
#9192 Throw overflow exception when interval seconds are outside of range [0, 59]
#9150 add error section in report and the rest queries
#9189 Expose host store spill
#9147 Make map column non-nullable when it's a key in another map.
#9193 Support Retry for GpuLocalLimitExec and GpuGlobalLimitExec
#9183 Add test to verify UDT fallback for parquet
#9195 Deploy sql-plugin-api artifact in DBR CI pipelines [skip ci]
#9170 Add in new HostAlloc API
#9182 Consolidate Spark vendor shim dependency management
#9190 Prevent returning internal compiler expressions when compiling UDFs
#9164 Support Retry for GpuTopN and GpuSortEachBatchIterator
#9134 Fix shuffle fallback due to AQE on AWS EMR
#9188 Fix flaky tests in FileCacheIntegrationSuite
#9148 Add minimum Maven module eventually containing all non-shimmable source code
#9169 Add retry-without-split in InternalRowToColumnarBatchIterator
#9172 Remove doSetSpillable in favor of setSpillable
#9152 Add test cases for testing Parquet compression types
#9157 XFAIL parquet lz4_raw tests for Spark 3.5.0 or later
#9128 Test parquet predicate pushdown for basic types and fields having dots in names
#9158 Add json4s dependencies for Databricks integration_tests build
#9102 Add retry support to GpuOutOfCoreSortIterator.mergeSortEnoughToOutput
#9089 Add application to run Scale Test
#9143 [DOC] update spark.rapids.sql.concurrentGpuTasks default value in tuning guide [skip ci]
#9141 Removed resultDecimalType in GpuIntegralDecimalDivide
#9099 Spark 3.5.0 follow-on work (rc2 support + Python UDAF)
#9140 Bump Jython to 2.7.3
#9136 Moving row column conversion code from cudf to jni
#9133 Add 350 tag to InSubqueryShims
#9124 Import scala.collection instead of collection
#9122 Fall back to CPU if spark.sql.execution.arrow.useLargeVarTypes is true
#9115 [DOC] updates documentation related to java compatibility [skip ci]
#9098 Add SpillableHostColumnarBatch
#9091 GPU support for DynamicPruningExpression and InSubqueryExec
#9117 Temporarily disable spark 350 shim build in nightly [skip ci]
#9113 Instantiate execution plan capture callback via shim loader
#8969 Initial support for Spark 3.5.0-rc1
#9100 Support broadcast nested loop existence joins with no condition
#8925 Add GpuConv operator for the conv 10<->16 expression
#9109 [DOC] adding java 11 to download docs [skip ci]
#9085 Retry with smaller split on CudfColumnSizeOverflowException
#8961 Save Databricks init scripts in the workspace
#9088 Add retry and SplitAndRetry support to AcceleratedColumnarToRowIterator
#9095 Support released spark 3.3.3
#9084 Fix race when a rapids buffer is aliased while it is spilled
#9093 Update ParquetFormatScanSuite to not call CUDF directly
#9068 Test ORC predicate pushdown (PPD) with timestamps decimals booleans
#9054 Initial entry point to data generation for scale test
#9070 Spillable host buffer
#9066 Add retry support to RowToColumnarIterator
#9073 Stop using invalid escape sequences
#9018 Add test for selecting a single complex field array and its parent struct array
#9067 Add array support for round robin partition; Refactor pluginSupportedOrderableSig
#9072 Revert "Implement SumUnboundedToUnboundedFixer (#8934)"
#9056 Add in configs for host memory limits
#9061 Fix import order
#8934 Implement SumUnboundedToUnboundedFixer
#9051 Use number of threads on executor instead of driver to set core count
#9040 Fix issues from 23.08 merge in join_test
#9045 Fix auto merge conflict 9043 [skip ci]
#9009 Add in a layer of indirection for task completion callbacks
#9013 Create a two-shim jar by default on Databricks
#8995 Add test case for ORC statistics test
#8970 Add ability to debug dump input data only on errors
#9003 Fix auto merge conflict 9002 [skip ci]
#8989 Mark lazy spillables as allowSpillable during gatherer construction
#8988 Move big data generator to a separate module
#8987 Fix host memory buffer leaks in SerializationSuite
#8968 Enable GPU acceleration of Bloom filter join expressions by default
#8947 Add ArrowUtilsShims in preparation for Spark 3.5.0
#8946 [Spark 3.5.0] Shim access to StructType.fromAttributes
#8824 Drop the in-range check at INT96 output path
#8924 Deprecate and delegate GpuCV.debug to cudf TableDebug
#8915 Move LegacyBehaviorPolicy references to shim layer
#8918 Output unified diff when GPU output deviates
#8857 Remove the pageable pool
#8854 Fix auto merge conflict 8853 [skip ci]
#8805 Bump up dep versions to 23.10.0-SNAPSHOT
#8796 Init version 23.10.0-SNAPSHOT

Release 23.08

Features

#5509 [FEA] Support order-by on Array
#7876 [FEA] Add initial support for Databricks 12.2 ML LTS
#8547 [FEA] Add support for Delta Lake 2.4 with Spark 3.4
#8633 [FEA] Add support for xxHash64 function
#4929 [FEA] Support min/max aggregation/reduction for arrays of structs and arrays of strings
#8668 [FEA] Support min and max for arrays
#4887 [FEA] Hash partitioning on ArrayType
#6680 [FEA] Support hashaggregate for Array[Any]
#8085 [FEA] Add support for MillisToTimestamp
#7801 [FEA] Window Expression orderBy column is not supported in a window range function, found DoubleType
#8556 [FEA] [Delta Lake] Add support for new metrics in MERGE
#308 [FEA] Spark 3.1 adding support for TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
#8122 [FEA] Add spark 3.4.1 snapshot shim
#8525 [FEA] Add support for org.apache.spark.sql.functions.flatten
#8202 [FEA] List supported Spark builds when the Shim is not found

Performance

#8231 [FEA] Add filecache support to ORC scans
#8141 [FEA] Explore how to best deal with large numbers of aggregations in the short term

Bugs Fixed

#9034 [BUG] java.lang.ClassCastException: com.nvidia.spark.rapids.RuleNotFoundExprMeta cannot be cast to com.nvidia.spark.rapids.GeneratorExprMeta
#9032 [BUG] Multiple NDS queries fail with Spark-3.4.1 with bloom filter exception
#8962 [BUG] Nightly build failed: ExecutionPlanCaptureCallback$.class is not bitwise-identical across shims
#9021 [BUG] test_map_scalars_supported_key_types failed in dataproc 2.1
#9020 [BUG] auto-disable snapshot shims test in github action for pre-release branch
#9010 [BUG] Customer failure 23.08: Cannot compute hash of a table with a LIST of STRUCT columns.
#8922 [BUG] integration map_test:test_map_scalars_supported_key_types failures
#8982 [BUG] Nightly prerelease failures - OrcSuite
#8978 [BUG] compiling error due to OrcSuite&OrcStatisticShim in databricks runtimes
#8610 [BUG] query 95 @ SF30K fails with OOM exception
#8955 [BUG] Bloom filter join tests can fail with multiple join columns
#45 [BUG] very large shuffles can fail
#8779 [BUG] Put shared Databricks test script together for ease of maintenance
#8930 [BUG] checkoutSCM plugin is unstable for pre-merge CI, it is often unable to clone submodules
#8923 [BUG] Mortgage test failing with 'JavaPackage' error on AWS Databricks
#8303 [BUG] GpuExpression columnarEval can return scalars from subqueries that may be unhandled
#8318 [BUG][Databricks 12.2] GpuRowBasedHiveGenericUDF ClassCastException
#8822 [BUG] Early terminate CI if submodule init failed
#8847 [BUG] github actions CI messed up w/ JDK versions intermittently
#8716 [BUG] test_hash_groupby_collect_set_on_nested_type and test_hash_reduction_collect_set_on_nested_type failed
#8827 [BUG] databricks cudf_udf night build failing with pool size exceeded errors
#8630 [BUG] Parquet with RLE encoded booleans loads corrupted data
#8735 [BUG] test_orc_column_name_with_dots fails in nightly EGX tests
#6980 [BUG] Partitioned writes release GPU semaphore with unspillable GPU memory
#8784 [BUG] hash_aggregate_test.py::test_min_max_in_groupby_and_reduction failed on "TypeError: object of type 'NoneType' has no len()"
#8756 [BUG] [Databricks 12.2] RapidsDeltaWrite queries that reference internal metadata fail to run
#8636 [BUG] AWS Databricks 12.2 integration tests failed due to Iceberg check
#8754 [BUG] databricks build broke after adding bigDataGen
#8726 [BUG] Test "parquet_write_test.py::test_hive_timestamp_value[INJECT_OOM]" failed on Databricks
#8690 [BUG] buildall script does not support JDK11 profile
#8702 [BUG] test_min_max_for_single_level_struct failed
#8727 [BUG] test_column_add_after_partition failed in databricks 10.4 runtime
#8669 [BUG] SpillableColumnarBatch doesn't always take ownership
#8655 [BUG] There are some potential device memory leaks in AbstractGpuCoalesceIterator
#8685 [BUG] install build fails with Maven 3.9.3
#8156 [BUG] Install phase for modules with Spark build classifier fails for install plugin versions 3.0.0+
#1130 [BUG] TIMESTAMP_MILLIS not handled in isDateTimeRebaseNeeded
#7676 [BUG] SparkShimsImpl class initialization in SparkShimsSuite for 340 too eager
#8278 [BUG] NDS query 16 hangs at SF30K
#8665 [BUG] EGX nightly tests fail to detect Spark version on startup
#8647 [BUG] array_test.py::test_array_min_max[Float][INJECT_OOM] failed mismatched CPU and GPU output in nightly
#8640 [BUG] Optimize Databricks pre-merge scripts, move it out into a new CI file
#8308 [BUG] Device Memory leak seen in integration_tests when AssertEmptyNulls are enabled
#8602 [BUG] AutoCloseable Broadcast results are getting closed by Spark
#8603 [BUG] SerializeConcatHostBuffersDeserializeBatch.writeObject fails with ArrayIndexOutOfBoundsException on rows-only table
#8615 [BUG] RapidsShuffleThreadedWriterSuite temp shuffle file test failure
#6872 [BUG] awk: cmd. line:1: warning: regexp escape sequence `\ ' is not a known regexp operator
#8588 [BUG] Spark 3.3.x integration tests failed due to missing jars
#7775 [BUG] scala version hardcoded irrespective of Spark dependency
#8548 [BUG] cache_test:test_batch_no_cols test FAILED on spark-3.3.0+
#8579 [BUG] build failed on Databricks clusters "GpuDeleteCommand.scala:104: type mismatch"
#8187 [BUG] Integration test test_window_running_no_part can produce non-empty nulls (cudf scan)
#8493 [BUG] branch-23.08 fails to build on Databricks 12.2

PRs

#9407 [Doc]Update docs for 23.08.2 version[skip ci]
#9382 Bump up project version to 23.08.2
#8476 Use retry with split in GpuCachedDoublePassWindowIterator
#9048 Update 23.08 changelog 23/08/15 [skip ci]
#9044 [DOC] update release version from v2308.0 to 2308.1 [skip ci]
#9036 Fix meta class cast exception when generator not supported
#9042 Bump up project version to 23.08.1-SNAPSHOT
#9035 Handle null values when merging Bloom filters
#9029 Update 23.08 changelog to latest [skip ci]
#9023 Allow WindowLocalExec to run on CPU for a map test.
#9024 Do not trigger snapshot spark version test in pre-release maven-verify checks [skip ci]
#8975 Init 23.08 changelog [skip ci]
#9016 Fix issue where murmur3 tried to work on array of structs
#9014 Updating link to download jar [skip ci]
#9006 Revert test changes to fix binary dedup error
#9001 [Doc]update the emr getting started doc for emr-6120 release[skip ci]
#8949 Update JNI and private version to released 23.08.0
#8977 Create an anonymous subclass of AdaptiveSparkPlanHelper in ExecutionPlanCaptureCallback.scala
#8972 [Doc]Add best practice doc[skip ci]
#8948 [Doc]update download docs for 2308 version[skip ci]
#8971 Fix test_map_scalars_supported_key_types
#8990 Remove doc references to 312db [skip ci]
#8960 [Doc] address profiling tool formatted issue [skip ci]
#8983 Revert OrcSuite to fix deployment build
#8979 Fix Databricks build error for new added ORC test cases
#8920 Add test case to test orc dictionary encoding with lots of rows for nested types
#8940 Add test case for ORC statistics test
#8909 Match Spark's NaN handling in collect_set
#8892 Experimental support for BloomFilterAggregate expression in a reduction context
#8957 Fix building dockerfile.cuda hanging at tzdata installation [skip ci]
#8944 Fix issues around bloom filter with multiple columns
#8744 Add test for selecting a single complex field array and its parent struct array
#8936 Device synchronize prior to freeing a set of RapidsBuffer
#8935 Don't go over shuffle limits on CPU
#8927 Skipping test_map_scalars_supported_key_types because of distributed …
#8931 Clone submodule using git command instead of checkoutSCM plugin
#8917 Databricks shim version for integration test
#8775 Support BloomFilterMightContain expression
#8833 Binary and ternary handling of scalar audit and some fixes
#7233 [FEA] Support order by on single-level array
#8893 Fix regression in Hive Generic UDF support on Databricks 12.2
#8828 Put shared part together for Databricks test scripts
#8872 Terminate CI if fail to clone submodule
#8787 Add in support for ExponentialDistribution
#8868 Add a test case for testing ORC version V_0_11 and V_0_12
#8795 Add ORC writing test cases for not implicitly lowercase columns
#8871 Adjust parallelism in spark-tests script to reduce memory footprint [skip ci]
#8869 Specify expected JAVA_HOME and bin for mvn-verify-check [skip ci]
#8785 Add test cases for ORC writing according to options orc.compress and compression
#8810 Fall back to CPU for deletion vectors writes on Databricks
#8830 Update documentation to add Databricks 12.2 as a supported platform [skip ci]
#8799 Add tests to cover some odd corner cases with nulls and empty arrays
#8783 Fix collect_set_on_nested_type tests failed
#8855 Fix bug: Check GPU file instead of CPU file [skip ci]
#8852 Update test scripts and dockerfiles to match cudf conda pkg change [skip ci]
#8848 Try mitigate mismatched JDK versions in mvn-verify checks [skip ci]
#8825 Add a case to test ORC writing/reading with lots of nulls
#8802 Treat unbounded windows as truly non-finite.
#8798 Add ORC writing test cases for dictionary compression
#8829 Enable rle_boolean_encoding.parquet test
#8667 Make state spillable in partitioned writer
#8801 Fix shuffling an empty Struct() column with UCX
#8748 Add driver log warning when GPU is limiting scheduling resource
#8786 Add support for row-based execution in RapidsDeltaWrite
#8791 Auto merge to branch-23.10 from branch-23.08[skip ci]
#8790 Update ubuntu dockerfiles default to 20.04 and deprecating centos one [skip ci]
#8777 Install python packages with shared scripts on Databricks
#8772 Test concurrent writer update file metrics
#8646 Add testing of Parquet files from apache/parquet-testing
#8684 Add 'submodule update --init' when build spark-rapids
#8769 Remove iceberg scripts from Databricks test scripts
#8773 Add a test case for reading/write null to ORC
#8749 Add test cases for read/write User Defined Type (UDT) to ORC
#8768 Add support for xxhash64
#8751 Ensure columnarEval always returns a GpuColumnVector
#8765 Add in support for maps to big data gen
#8758 Normal and Multi Distributions for BigDataGen
#8755 Add in dependency for databricks on integration tests
#8737 Fix parquet_write_test.py::test_hive_timestamp_value failure for Databricks
#8745 Conventional jar layout is not required for JDK9+
#8706 Add a tool to support generating large amounts of data
#8747 xfail hash_groupby_collect_set and hash_reduction_collect_set on nested type cases
#8689 Support nested arrays for min/max aggregations in groupby and reduction
#8699 Regression test for array of struct with a single field name "element" in Parquet
#8733 Avoid generating numeric null partition values on Databricks 10.4
#8728 Use specific mamba version and install libarchive explicitly [skip ci]
#8594 String generation from complex regex in integration tests
#8700 Add regression test to ensure Parquet doesn't interpret timestamp values differently from Hive 0.14.0+
#8711 Factor out modules shared among shim profiles
#8697 Spillable columnar batch takes ownership and improve code coverage
#8705 Add schema evolution integration tests for partitioned data
#8673 Fix some potential memory leaks
#8707 Update config docs for new filecache configs [skip ci]
#8695 Always create the main artifact along with a shim-classifier artifact
#8704 Add tests for column names with dots
#8703 Comment out min/max agg test for nested structs to unblock CI
#8698 Cache last ORC stripe footer to avoid redundant remote reads
#8687 Handle TIMESTAMP_MILLIS for rebase check
#8688 Enable the 340 shim test
#8656 Return result from filecache message instead of null
#8659 Filter out nulls for build batches when needed in hash joins
#8682 [DOC] Update CUDA requirements in documentation and Dockerfiles[skip ci]
#8637 Support Float order-by columns for RANGE window functions
#8681 changed container name to adapt to blossom-lib refactor [skip ci]
#8573 Add support for Delta Lake 2.4.0
#8671 Fix use-after-free bug in GpuFloatArrayMin
#8650 Support TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS
#8495 Speed up PCBS CPU read path by not recalculating as much
#8389 Add filecache support for ORC
#8658 Check if need to run Databricks pre-merge
#8649 Add Spark 3.4.1 shim
#8624 Rename numBytesAdded/Removed metrics and add deletion vector metrics in Databricks 12.2 shims
#8645 Fix "PytestUnknownMarkWarning: Unknown pytest.mark.inject_oom" warning
#8608 Matrix stages to dynamically build Databricks shims
#8517 Revert "Disable asserts for non-empty nulls (#8183)"
#8628 Enable Delta Write fallback tests on Databricks 12.2
#8632 Fix GCP examples and getting started guide [skip ci]
#8638 Support nested structs for min/max aggregations in groupby and reduction
#8639 Add iceberg test for nightly DB12.2 IT pipeline[skip ci]
#8618 Heuristic to speed up partial aggregates that get larger
#8605 [Doc] Fix demo link in index.md [skip ci]
#8619 Enable output batches metric for GpuShuffleCoalesceExec by default
#8617 Fixes broadcast spill serialization/deserialization
#8531 filecache: Modify FileCacheLocalityManager.init to pass in Spark context
#8613 Try print JVM core dump files if any test failures in CI
#8616 Wait for futures in multi-threaded writers even on exception
#8578 Add in metric to see how much computation time is lost due to retry
#8590 Drop ".dev0" suffix from Spark SNAPSHOT distro builds
#8604 Upgrade scalatest version to 3.2.16
#8555 Support flatten SQL function
#8599 Fix broken links in advanced_configs.md
#8589 Revert to the JVM-based Spark version extraction in pytests
#8582 Fix databricks shims build errors caused by DB updates
#8564 Fold verify-all-modules-with-headSparkVersion into verify-all-modules [skip ci]
#8553 Handle empty batch in ParquetCachedBatchSerializer
#8575 Corrected typos in CONTRIBUTING.md [skip ci]
#8574 Remove maxTaskFailures=4 for pre-3.1.1 Spark
#8503 Remove hard-coded version numbers for dependencies when building on
#8544 Fix auto merge conflict 8543 [skip ci]
#8521 List supported Spark versions when no shim found
#8520 Add support for first, last, nth, and collect_list aggregations for BinaryType
#8509 Remove legacy spark version check
#8494 Fix 23.08 build on Databricks 12.2
#8487 Move MockTaskContext to tests project
#8426 Pre-merge CI to support Databricks 12.2
#8282 Databricks 12.2 Support
#8407 Bump up dep version to 23.08.0-SNAPSHOT
#8359 Init version 23.08.0-SNAPSHOT

Older Releases

Changelog of older releases can be found at docs/archives