Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dev3: mi_bin(...) function returns non-existing bin (233) #1016

Open
Noxybot opened this issue Feb 18, 2025 · 10 comments
Open

dev3: mi_bin(...) function returns non-existing bin (233) #1016

Noxybot opened this issue Feb 18, 2025 · 10 comments

Comments

@Noxybot
Copy link

Noxybot commented Feb 18, 2025

Reproed on x86 ChromeOS:

static mi_decl_noinline size_t mi_bin(size_t size) { // size == 248 <--------
  size_t wsize = _mi_wsize_from_size(size); // wsize == 31  <--------
#if defined(MI_ALIGN4W)
  if mi_likely(wsize <= 4) {
    return (wsize <= 1 ? 1 : (wsize+1)&~1); // round to double word sizes
  }
#elif defined(MI_ALIGN2W)
  if mi_likely(wsize <= 8) { // false <--------
    return (wsize <= 1 ? 1 : (wsize+1)&~1); // round to double word sizes
  }
#else
  if mi_likely(wsize <= 8) {
    return (wsize == 0 ? 1 : wsize);
  }
#endif
  else if mi_unlikely(wsize > MI_LARGE_MAX_OBJ_WSIZE) { // false <--------
    return MI_BIN_HUGE;
  }
  else {
    #if defined(MI_ALIGN4W)
    if (wsize <= 16) { wsize = (wsize+3)&~3; } // round to 4x word sizes
    #endif
    wsize--; // wsize == 30 <--------
    // find the highest bit
    const size_t b = (MI_SIZE_BITS - 1 - mi_clz(wsize));  // note: wsize != 0 // b == 59  <--------
    // and use the top 3 bits to determine the bin (~12.5% worst internal fragmentation).
    // - adjust with 3 because we use do not round the first 8 sizes
    //   which each get an exact bin
    const size_t bin = ((b << 2) + ((wsize >> (b - 2)) & 0x03)) - 3; // bin == 233  <-------- ERROR? Non existing bin
    mi_assert_internal(bin > 0 && bin < MI_BIN_HUGE);
    return bin;
  }
}
@Noxybot Noxybot changed the title dev3 mi_bin(...) function return non-existing bin (233) dev3 mi_bin(...) function returns non-existing bin (233) Feb 18, 2025
@Noxybot
Copy link
Author

Noxybot commented Feb 18, 2025

The issue that mi_clz works differently on dev3 comparing to dev2. On dev2 we have this:

static inline size_t mi_clz(size_t x) {
  if (x==0) return MI_SIZE_BITS;
  #if (SIZE_MAX == ULONG_MAX)
    return __builtin_clzl(x);
  #else
    return __builtin_clzll(x);
  #endif
}

So this line:

 const size_t b = (MI_SIZE_BITS - 1 - mi_clz(wsize));

results in b = 4, when wsize = 30 on dev3, however, b = 59 when wsize = 30

@Noxybot Noxybot changed the title dev3 mi_bin(...) function returns non-existing bin (233) dev3: mi_bin(...) function returns non-existing bin (233) Feb 18, 2025
@daanx
Copy link
Collaborator

daanx commented Feb 18, 2025

Yikes -- that is not good. It seems the clz is returning the bit position instead of the leading zeros (MI_SIZE_BITS - 1 - clz == bsr). I cannot reproduce this on my systems -- can you somehow check which preprocessor case is used in bits.h:mi_clz when compiling to chromeOS? Can you try to use -DMI_ARCH_OPT=ON (or OFF) with cmake and see if it still happens?

(maybe it is an old processor that does not support lzcnt and it gets executed as bsr reversing the bit position? -DMI_ARCH_OPT=OFF should prevent this)

@Noxybot
Copy link
Author

Noxybot commented Feb 19, 2025

Just pulled the latest dev3 and gave it a shot. Here are the results: (running on ChromeOS + Intel Celeron N4120 CPU @1.10Hz)

  1. The code branch taken is this one :
  #if defined(__GNUC__) && MI_ARCH_X64 && defined(__BMI1__) // on x64 lzcnt is defined for 0
    size_t r;
    __asm ("lzcnt\t%1, %0" : "=r"(r) : "r"(x) : "cc");
    return r;

when static inline size_t mi_clz(size_t x) { is called with x=15, the function returns 3, which does not look like "leading zeros count"...

  1. Setting -DMI_ARCH_OPT=OFF seem to not help.

UPD:
Applied CMake option incorrectly. Seems like one of those options -march=haswell;-mavx2 is the culprit. If I remove those options(exactly what -DMI_ARCH_OPT=OFF does) the code now takes this branch:

  #elif mi_has_builtinz(clz)
    return (x!=0 ? (size_t)mi_builtinz(clz)(x) : MI_SIZE_BITS);

and works fine.

@Noxybot
Copy link
Author

Noxybot commented Feb 19, 2025

OK, so setting it to -march=x86-64 also works fine. With -march=haswell;-mavx2 non-working mi_clz was just the first issue. Later, almost any code in mimalloc was hitting SIGILL. Apparently, when compiled with those options mimalloc is just not functional on Intel Celeron N4120 CPU @1.10Hz.
IDK what are the the perf implications here, but we can't just enable optimization for haswell / avx2 for ALL x64 CPUs...

@daanx
Copy link
Collaborator

daanx commented Feb 19, 2025

Thanks!

  #if defined(__GNUC__) && MI_ARCH_X64 && defined(__BMI1__) // on x64 lzcnt is defined for 0
    size_t r;
    __asm ("lzcnt\t%1, %0" : "=r"(r) : "r"(x) : "cc");
    return r;

This means __BMI1__ is defined, which should be mean it is compiled with -march=haswell ? But if MI_OPT_ARCH=OFF this should not be the case? Ah wait, I think you used MI_ARCH_OPT but the option is called MI_OPT_ARCH -- can you try that again and see if it works in that case? (the MI_OPT_ARCH is especially meant for this case in a way where one doesn't want to use any "modern" instructions)

I am still surprised though that that processor (Intel Celeron N4120) cannot handle lzcnt... that seems still very strange as BMI1 is defined since 2013 or so. Is this running in emulation?

@Noxybot
Copy link
Author

Noxybot commented Feb 19, 2025

Yeah, -D MI_OPT_ARCH=OFF solves the issue. It's not an emulator, it's a physical Chromebook device running ChromeOS. Here is the flags queried from /proc/cpuinfo:

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand hypervisor lahf_lm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdseed smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves arat vnmi umip rdpid md_clear arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml tsc_scaling

@daanx
Copy link
Collaborator

daanx commented Feb 21, 2025

I changed the code such that the clz/ctz code should work now even with MI_OPT_ARCH=ON (the default). Can you perhaps check this ?

I am quite surprised to see that Intel is still releasing processors without bmi1 (and avx2). Even though the Celeron is technically a (post) Haswell architecture (~2013), I think the current -march=haswell implies avx2 and bmi1 for the compiler so it might be that we still need to set MI_OPT_ARCH=OFF when compiling for chromeOS.

@Noxybot
Copy link
Author

Noxybot commented Feb 27, 2025

Yeah, it works, but mimalloc itself is not functional with MI_OPT_ARCH=ON on this CPU (like any random assignment triggers "SIGILL"... We might want to probably disable MI_OPT_ARCH for x86 Androids (but we can do this on our end, I do not think it needs to be committed)
Or make the optimization less aggressive

@daanx
Copy link
Collaborator

daanx commented Feb 27, 2025

Right, that is expected as the compiler would still emit other BMI instructions. I will need to consider to perhaps change the default MI_OPT_ARCH to OFF ... I was always under the impression that any Intel CPU after haswell (in 2013) would have BMI1 extensions so I was surprised to discover these more recent Celerons still don't have lzcnt etc.

@daanx
Copy link
Collaborator

daanx commented Mar 6, 2025

I pushed an update to make MI_OPT_ARCH=OFF by default on x64 to avoid this trouble -- the code for ctz is still quite good without it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants