Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Extend escape analysis to account for arrays with non-gcref elements #104906

Open
wants to merge 94 commits into
base: main
Choose a base branch
from

Conversation

hez2010
Copy link
Contributor

@hez2010 hez2010 commented Jul 15, 2024

Positive case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[1] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;* V03 tmp2         [V03    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V05 tmp4         [V05    ] (  0,  0   )   short  ->  zero-ref    "V02.[018..020)"
;
; Lcl frame size = 40

G_M25548_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M25548_IG02:  ;; offset=0x0004
       mov      ecx, 84
       call     [System.Console:WriteLine(int)]
       nop
                                                ;; size=12 bbWeight=1 PerfScore 3.50
G_M25548_IG03:  ;; offset=0x0010
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 21, prolog size 4, PerfScore 5.00, instruction count 6, allocated bytes for code 21 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)

Negative case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[42] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;  V03 tmp2         [V03,T00] (  1,  0   )   byref  ->  rbx         must-init "dummy temp of must thrown exception"
;* V04 tmp3         [V04    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V05 tmp4         [V05    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V06 tmp5         [V06    ] (  0,  0   )   short  ->  zero-ref    single-def "V02.[018..020)"
;
; Lcl frame size = 32

G_M25548_IG01:  ;; offset=0x0000
       push     rbx
       sub      rsp, 32
       xor      ebx, ebx
                                                ;; size=7 bbWeight=0 PerfScore 0.00
G_M25548_IG02:  ;; offset=0x0007
       call     CORINFO_HELP_RNGCHKFAIL
       movsx    rcx, word  ptr [rbx]
       call     [System.Console:WriteLine(int)]
       int3
                                                ;; size=16 bbWeight=0 PerfScore 0.00

; Total bytes of code 23, prolog size 5, PerfScore 0.00, instruction count 7, allocated bytes for code 23 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)
; ============================================================

Benchmark on Mandelbrot:

Method Job Mean Error StdDev Code Size Allocated
MandelBrot NoStackAllocationArray 199.7 us 1.30 us 1.22 us 1,996 B 2.49 KB
MandelBrot StackAllocationArray 195.8 us 1.16 us 1.08 us 2,414 B 1.14 KB

Diff: https://www.diffchecker.com/bNP4qHdF/

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 15, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 15, 2024
Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For arrays (and also perhaps boxes and ref classes) we ought to have some kind of size limit... possibly similar to the one we use for stackallocs.

We need to be careful we don't allocate a lot of stack for an object that might not be heavily used, as we'll pay per-call prolog zeroing costs.

@AndyAyersMS
Copy link
Member

I am tempted to try and unify stack allocated array checks with our existing bounds check logic by having the stack case invoke a "helper method" at the allocation site (which will lower to nothing or perhaps stores to the vtable and length field) so that the heap and stack cases are handled similarly.

I'm going to prototype this, since I think it's better than relying on the overall size or on being able to propagate the element type/length through memory. Aside from sharing the inference logic with heap arrays, it will also generalize better to variable-length or MD arrays...

This seems simple enough, the allocation is already a helper call so we can mostly just leave it as is; however we likely also don't want to propagate the local address, so that dataflow on the array leads back to the helper, which means we lose the connection of the newarr temp and the local. We can add that as some kind of extra payload on the helper perhaps.

I also suspect what is in this PR doesn't always work right for R2R/NAOT, it looks like we embed the compile time handle into the array's method table slot, but then it's unlikely we ever look at this slot in cases where the array doesn't escape, so not sure if there's a test case that can be made to fail here.

@AndyAyersMS
Copy link
Member

If we defer "lowering" of these helpers until lower, we may miss out on forming address modes from the local directly... we'll have something like this during optimization

  t = newarr(len, type) [stack:V]
  ...
    = t[i]

and only later expose the assignment as to a local address:

    V[0] = method table
    V[8] = len 
    t = &V;
    ...
       = t[i]

so may end up needing an extra register perhaps.

Leave the newarr helper call in place, and don't rewrite the uses to be
uses of the local. Remove special handling in local morph and morph.

Lower the helper to the proper stores later on.

Update a few utilities to understand array base addresses may not be TYP_REF.
@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jan 10, 2025

I think this approach is looking pretty good.

Draft PR here: #111284

No changes now in morph or local morph or assertion prop. May not have the R2R embedding right yet either, but it may not actually matter until we try and support arrays with GC types.

We aren't as capable as optimizing the results as I'd like, but improving on that raises deeper questions of our aliasing model, and so likely should be deferred.

For example, liveness will mark all STOREINDs as modifying the GC heap, but now we have some that won't. Detecting that requires annotating the store with something like GTF_IND_TGT_NOT_HEAP, but I don't want to rely on that in liveness since it also can refer to pinned heap or unmanaged non-stack memory, so we probably want a different attribute like GTF_IND_TGT_STACK, and a separation of memory SSA tracking into heap, stack (byref), and unmanaged?

For now I have plumbed through some logic in VN to recognize these new local arrays can be treated via liberal VNs, this enables constant prop through known index elements as long as there is no ambiguous heap store in between. But without VN we don't have enough breadcrumbs to model these more aggressively.

I still need to look more closely at diffs between this and the above. SPMI doesn't show many, but has a few hundred missed contexts per collection, I suspect behavior may diverge there.

@AndyAyersMS
Copy link
Member

Merged in the changes from #111284.

@AndyAyersMS
Copy link
Member

@dotnet/jit-contrib PTAL
cc @jakobbotsch @hez2010

SPMI will be fairly accurate here... kicked off another collection which should have even fewer misses. Code size increases but mostly from clr test where small fixed-sized arrays are unusually frequent.

  • We currently limit array size to 528 bytes (so enough for a 512 byte array or a 128 int array). This can be altered by config.
  • I have disabled this opt for R2R because it inhibits some cross-module inlines. If/when we can prove that the vtable is never looked at (or perhaps rely on a runtime helper call to initialize it), we can revisit.
  • There is no "hotness" check so we will stack allocate arrays that are created in cold blocks. We should reconsider this since these arrays will frequently require prolog zeroing.
  • There is some special handling in VN to always rely on liberal VNs, but it is still simplistic, eg for a local array a the array load below will not be recognized as a constant.
 a[0] = 3;
 f();
     = a[0];
  • Arrays that are only stored to are not optimized away
  • Other opts mostly work as they do now, as these arrays don't really look any different than heap arrays to the jit.

@AndyAyersMS
Copy link
Member

Seeing AVs in osx crossgen2, oddly in both base and diff jits:

[17:42:02] Invoking: C:\h\w\AAC9097D\p\superpmi.exe -a -v ewi -f C:\h\w\AAC9097D\t\tmpwd8c5ge0\libraries.crossgen2.osx.arm64.checked.mch_fail.mcl -details C:\h\w\AAC9097D\t\tmpwd8c5ge0\libraries.crossgen2.osx.arm64.checked.mch_details.csv -target arm64 -jitoption force JitEnableNoWayAssert=1 -jitoption force JitNoForceFallback=1 -jitoption force JitAlignLoops=0 -jit2option force JitEnableNoWayAssert=1 -jit2option force JitNoForceFallback=1 -jit2option force JitAlignLoops=0 -p -failureLimit 100 C:\h\w\AAC9097D\p\base\checked\clrjit_universal_arm64_x64.dll C:\h\w\AAC9097D\p\diff\checked\clrjit_universal_arm64_x64.dll C:\h\w\AAC9097D\w\9D38092F\e\artifacts\spmi\mch\cc0e7adf-e397-40b6-9d14-a7149815c991.osx.arm64\libraries.crossgen2.osx.arm64.checked.mch

[17:45:49] ERROR: Unexpected exception c0000005 was thrown.

[17:45:49] ERROR: Method 86217 of size 23 failed to load and compile correctly by JIT1 (C:\h\w\AAC9097D\p\base\checked\clrjit_universal_arm64_x64.dll).

[17:45:49] ERROR: Unexpected exception c0000005 was thrown.

[17:45:49] ERROR: Method 86217 of size 23 failed to load and compile correctly by JIT2 (C:\h\w\AAC9097D\p\diff\checked\clrjit_universal_arm64_x64.dll).

[17:45:49] ERROR: Unexpected exception c0000005 was thrown.

Will try and look at this locally

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jan 17, 2025

Failing here: InitClass map is null

CorInfoInitClassResult MethodContext::repInitClass(CORINFO_FIELD_HANDLE   field,
                                                   CORINFO_METHOD_HANDLE  method,
                                                   CORINFO_CONTEXT_HANDLE context)
{
    Agnostic_InitClass key;
    ZeroMemory(&key, sizeof(key)); // Zero key including any struct padding
    key.field   = CastHandle(field);
    key.method  = CastHandle(method);
    key.context = CastHandle(context);

    DWORD value = InitClass->Get(key);

Looks like this map should always be present since morph always calls InitClass. But at any rate we can stop SPMI from AVing.

Tolerating this in #111555.

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 18, 2025

Merged the main branch to include #111555.

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 18, 2025

Try a late dead store removal
@MihuBot

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 18, 2025

@AndyAyersMS Just experimented a bit with late dead stores removal by unexposing locals in the liveness and repeating to convergence: commit 3450580

Before adding late dead stores removal: diffs
After adding late dead stores removal: diffs

Relative diffs:

linux arm64: -26.588 bytes
linux x64: -27,113 bytes
osx arm64: -14,360 bytes
windows arm64: -25,528 bytes
windows x64: -26,628 bytes

Diffs look interesting but TP impacts are too high. Now I'm reverting the experiment commit.

@hez2010 hez2010 force-pushed the value-array-stack-alloc branch from 3450580 to 1915450 Compare January 18, 2025 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants