Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose raw descriptor string slice of ClassName, FieldDescriptor and MethodDescriptor for bindgens #52

Open
wuwbobo2021 opened this issue Feb 5, 2025 · 14 comments

Comments

@wuwbobo2021
Copy link

I'm porting the java-spaghetti-gen to use this library, because the previously used jreflection crate is currently unmaintained.

It looks strange to me that ClassFile<'_>::this_class is Cow<'a, str>, but the object type class name contained in the FieldDescriptor is ClassName in which the original string slice cannot be read.

These desired functions are available in jreflection crate:

https://docs.rs/jreflection/latest/jreflection/field/struct.Field.html#method.descriptor_str
https://docs.rs/jreflection/latest/jreflection/method/struct.Method.html#method.descriptor_str
https://docs.rs/jreflection/latest/jreflection/field/enum.BasicType.html (Note: Id<'a> is a wrapper of &'a str)

Crate noak exposes these original string slices too, but it doesn't parse these field/method descriptors.

PS: It seems like ClassName's parse function splits the class binary name by /, but it doesn't care about $.

@staktrace
Copy link
Owner

the object type class name contained in the FieldDescriptor is ClassName in which the original string slice cannot be read.

The unqualified segments can be glued back together to make the string, right? I'd take a PR that exposes that as a function on ClassName.

PS: It seems like ClassName's parse function splits the class binary name by /, but it doesn't care about $.

Sorry, is there a bug here? If so please provide a test case to demonstrate it, thanks!

@wuwbobo2021
Copy link
Author

wuwbobo2021 commented Feb 5, 2025

The unqualified segments can be glued back together to make the string, right?

I've already tried to do it, but it's not an ideal solution: it constructs a new String which involves allocation. I'm changing it to use std::fmt::Formatter (write immediately).

Sorry, is there a bug here?

Here it is (based on your test_owned_cow):

    #[test]
    fn test_owned_cow() {
        let chars = Cow::from("(Ljava/lang/Object$Obj;)V".to_string());

        let descriptor = parse_method_descriptor(&chars, 0).unwrap();
        let mut parameters = descriptor.parameters.into_iter();
        let return_type = descriptor.return_type;

        assert_eq!(
            parameters.next().unwrap(),
            FieldDescriptor {
                dimensions: 0,
                field_type: FieldType::Object(ClassName {
                    segments: vec![
                        UnqualifiedSegment {
                            name: Cow::Borrowed("java")
                        },
                        UnqualifiedSegment {
                            name: Cow::Borrowed("lang")
                        },
                        UnqualifiedSegment {
                            name: Cow::Borrowed("Object$Obj")
                        },
                    ],
                }),
            },
        );
        assert!(parameters.next().is_none());
        assert_eq!(return_type, ReturnDescriptor::Void);
    }

@wuwbobo2021
Copy link
Author

wuwbobo2021 commented Feb 6, 2025

Sorry, I think it's not easy to make the change: parsed fields available in ClassFile do use the borrowed class data, but the original class data isn't available as a whole.

I've built a self-referencing wrapper for the parsed ClassFile (possibly off-topic):

pub use unsafe_class::Class;
mod unsafe_class { 
    use std::pin::Pin;
    use std::marker::PhantomPinned;

    pub struct Class {
        raw_bytes: Pin<Box<(Vec<u8>, PhantomPinned)>>,
        inner: cafebabe::ClassFile<'static>,
    }

    impl Class {
        pub fn read(raw_bytes: Vec<u8>) -> Result<Self, cafebabe::ParseError> {
            let pinned = Box::pin((raw_bytes, PhantomPinned));
            // SAFETY: `get<'a>(&'a self)` restricts the lifetime
            let fake_static = unsafe {
                std::slice::from_raw_parts(pinned.0.as_ptr(), pinned.0.len())
            };
            let inner = cafebabe::parse_class(fake_static)?;
            Ok(Self { raw_bytes: pinned, inner })
        }

        pub fn get<'a>(&'a self) -> &'a cafebabe::ClassFile<'a> {
            // SAFETY: casts `self.inner` into `cafebabe::ClassFile<'a>` forcefully.
            // Why is `ClassFile` invariant over `'a`?
            unsafe { &*(&raw const (self.inner)).cast() }
        }
    }
}

Do you think it's actually safe?

@staktrace
Copy link
Owner

Sorry, I think it's not easy to make the change: parsed fields available in ClassFile do use the borrowed class data, but the original class data isn't available as a whole.

Yes, unfortunately there's the pesky case of Java UTF-8 being different from regular UTF-8, so we can't unconditionally use the original class data when representing things as rust strings.

I've built a self-referencing wrapper for the parsed ClassFile (possibly off-topic):

neat!

Do you think it's actually safe?

I don't know, your grasp of advanced rust is better than mine, I haven't kept up with the language much recently.

PS: It seems like ClassName's parse function splits the class binary name by /, but it doesn't care about $.

I looked again the JVM spec and per https://docs.oracle.com/javase/specs/jvms/se21/html/jvms-4.html#jvms-4.2.2 unqualified names are allowed to contain the $ character so unless there's an actual bug here I don't think there's anything to change. The test case you provided passes as-written, but even if you change the expectation there it's not clear to me why you would expect it to behave differently.

@wuwbobo2021
Copy link
Author

I'm about to finish porting java-spaghetti to cafebabe with some workarounds, but I can't make previous unit tests like method_mangling_style_mangle_test working, because functions like parse_method_descriptor aren't exposed by cafebabe. I'll switch to the noak crate if cafebabe cannot be changed (I made this choice because you are the author of some popular crates, it's a wrong idea of mine...). Anyway, thank you for the reply.

Yes, unfortunately there's the pesky case of Java UTF-8 being different from regular UTF-8, so we can't unconditionally use the original class data when representing things as rust strings.

How about holding a &'a [u8] reference of the original data in these descriptors to make getting Cow<'a, str> possible? On the other hand, how could you get the parsed ClassName via this_class: Cow<'a, str> field of ClassFile?

Note about nested classes:

"The binary name of a member class or interface consists of the binary name of its immediately enclosing class or interface, followed by $, followed by the simple name of the member." https://docs.oracle.com/javase/specs/jls/se17/html/jls-13.html#jls-13.1.

Have a A.java source file: public class A { public class B {} }, execute javac A.java, you can see the name A$B in both A.class and A$B.class. However, A$B is probably one unqualified name, I'm not sure...

jreflection manages to parse nested class names from the string: https://docs.rs/jreflection/latest/src/jreflection/class.rs.html#177-181.

@wuwbobo2021
Copy link
Author

Well, a workaround in method_mangling_style.rs is possible, I may construct MethodDescriptor manually for the test case, because its fields are all public. I may keep using cafebabe with less effort.

@staktrace
Copy link
Owner

staktrace commented Feb 7, 2025

How about holding a &'a [u8] reference of the original data in these descriptors to make getting Cow<'a, str> possible?

While technically this is possible, I think it would be a pretty invasive change and require a massive overhaul of how data is plumbed through the parser. If I'm wrong and there's an easy way to do this I'll consider it.

On the other hand, how could you get the parsed ClassName via this_class: Cow<'a, str> field of ClassFile?

Not sure what you mean by this, can you elaborate?

Note about nested classes:

"The binary name of a member class or interface consists of the binary name of its immediately enclosing class or interface, followed by $, followed by the simple name of the member." https://docs.oracle.com/javase/specs/jls/se17/html/jls-13.html#jls-13.1.

This is in the JLS which is specific to Java. cafebabe is a class file parser which means it has to work with all JVM languages not just Java. I'd like to avoid adding Java-specific things into cafebabe directly.

@wuwbobo2021
Copy link
Author

It is an issue about API designing. Why should the field type (or method argument/return type) class binary name available in FieldType be ClassName, while the name of the class itself is exposed as Cow<'a, str>? (the API doesn't provide methods for convertions in between ClassName and Cow<'a, str>)

@staktrace
Copy link
Owner

Ah I see what you mean. That's a fair point. When I wrote it I was mostly following the spec and in the one case it's a descriptor string and in the other it's not, so I used different types. But I can see your point that from the consumer's point of view it's inconsistent.

@wuwbobo2021
Copy link
Author

Current workarounds in Dirbaio/java-spaghetti#5:

When I wrote it I was mostly following the spec and in the one case it's a descriptor string and in the other it's not, so I used different types.

Description of this_class in https://docs.oracle.com/javase/specs/jvms/se21/html/jvms-4.html#jvms-4.1: The value of the this_class item must be a valid index into the constant_pool table. The constant_pool entry at that index must be a CONSTANT_Class_info structure (§4.4.1) representing the class or interface defined by this class file.

https://docs.oracle.com/javase/specs/jvms/se21/html/jvms-4.html#jvms-4.4.1: The value of the name_index item must be a valid index into the constant_pool table. The constant_pool entry at that index must be a CONSTANT_Utf8_info structure (§4.4.7) representing a valid binary class or interface name encoded in internal form (§4.2.1).

Then I took a glance at https://docs.oracle.com/javase/specs/jvms/se21/html/jvms-4.html#jvms-4.5, https://docs.oracle.com/javase/specs/jvms/se21/html/jvms-4.html#jvms-4.6, still I can't understand your words "in the one case it's a descriptor string and in the other it's not".

@staktrace
Copy link
Owner

staktrace commented Feb 15, 2025

Ok after reading over the spec again I agree with you, the way I have it is inconsistent and can be improved. What I'm thinking is that

  1. The ClassName structure should have a Cow for the whole string, and instead of exposing the Vec of unqualified names as a member, could have an impl method that computes and returns the Vec
  2. The this_class (and super_class etc.) members of the ClassFile structure should also be ClassName structures that wrap the Cow values currently present.

Does that seem reasonable?

@wuwbobo2021
Copy link
Author

wuwbobo2021 commented Feb 15, 2025

Maybe. However, I had worried about the lifetime of ClassName previously: if it should be a wrapper of Cow by itself, the lifetime of unqualified names got from that ClassName might be limited by ClassName, but not ClassFile. If ClassName remembers the original &'a [u8], then it's possible to have an impl method that returns a Cow<'a, str>, and another method that returns a sequence (or iterator) of unqualified names having 'a lifetime parameter (instead of being limited by the lifetime of ClassName).

@wuwbobo2021
Copy link
Author

wuwbobo2021 commented Feb 15, 2025 via email

@staktrace
Copy link
Owner

I think in this case the unqualified names would be returned as Strings with copying. I don't mind doing that since it would be an explicit function call from the user instead of happening as part of parsing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants