Is TorchText implemented? #1381

Open
przemyslawbak opened this issue Oct 12, 2024 · 3 comments

Comments

@przemyslawbak

In the TorchSharp text classification example, TorchText is used to load the data set.

I am not sure what I am doing wrong, but I cannot find any using directive that imports this library.

For the TorchSharp MNIST example, I managed to find and install the proper NuGet package to use torchvision.

Is TorchText implemented for .NET?

If not, how can I load data from a CSV file instead? I do not know what data type should be used for var reader in the example. I'm confused.
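
For the CSV part of the question, here is a minimal sketch that uses only the .NET base class library, so it does not depend on TorchText or on the helper from the example. The LabeledText record, the CsvLoader class, and the "label,text" column layout are all assumptions made for illustration; they are not part of TorchSharp.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical row type for a CSV file laid out as "label,text".
public sealed record LabeledText(int Label, string Text);

// Hypothetical helper; not part of TorchSharp or its examples.
public static class CsvLoader
{
    // Parses "label,text" lines. Everything after the first comma is the text.
    // Assumes fields are not quoted; use a CSV library if your data has quotes.
    public static IEnumerable<LabeledText> Parse(IEnumerable<string> lines, bool hasHeader = true)
    {
        foreach (var line in lines.Skip(hasHeader ? 1 : 0))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;   // skip empty lines

            int comma = line.IndexOf(',');
            if (comma < 0) continue;                         // skip rows without a separator

            if (!int.TryParse(line.Substring(0, comma), out int label)) continue;

            yield return new LabeledText(label, line.Substring(comma + 1).Trim());
        }
    }

    // Streams the file line by line instead of loading it all into memory.
    public static IEnumerable<LabeledText> Load(string path, bool hasHeader = true)
        => Parse(File.ReadLines(path), hasHeader);
}
```

Calling CsvLoader.Load with the path to your own file gives a lazy sequence of rows that you can then tokenize and batch yourself.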

@yueyinqiu
Contributor

yueyinqiu commented Oct 12, 2024

I don't think we have torchtext support currently. I found the class that the example uses in Examples.Utils.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Oct 15, 2024

We do not have that implemented.

Maybe @luisquintanilla can comment on some of the text-based preprocessing primitives we've added to ML.NET. There are a few new tokenizers there, which should be usable with TorchSharp.
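
As a rough illustration of how the two libraries might fit together, the sketch below encodes a string with a Microsoft.ML.Tokenizers tokenizer and wraps the ids in a TorchSharp tensor. It assumes the Microsoft.ML.Tokenizers, Microsoft.ML.Tokenizers.Data.Cl100kBase, and TorchSharp packages are installed, and that the "gpt-4" model name resolves to the cl100k_base vocabulary; the TokenizeToTensor class is a hypothetical name, and the exact API may differ between package versions.

```csharp
using System;
using System.Linq;
using Microsoft.ML.Tokenizers;
using TorchSharp;
using static TorchSharp.torch;

// Hypothetical helper; not part of either library.
public static class TokenizeToTensor
{
    public static Tensor Encode(Tokenizer tokenizer, string text)
    {
        // EncodeToIds returns the token ids as a read-only list of int.
        var ids = tokenizer.EncodeToIds(text);

        // Index tensors are usually 64-bit integers, so widen the ids to long.
        long[] data = ids.Select(i => (long)i).ToArray();

        // Add a leading batch dimension: shape [1, sequenceLength].
        return torch.tensor(data).unsqueeze(0);
    }

    public static void Main()
    {
        // Assumption: "gpt-4" maps to cl100k_base, whose vocabulary ships
        // in the Microsoft.ML.Tokenizers.Data.Cl100kBase package.
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

        using var input = Encode(tokenizer, "TorchSharp text classification");
        Console.WriteLine(string.Join(", ", input.shape));
    }
}
```

The resulting tensor could be fed to an embedding layer, provided the layer's vocabulary size matches the tokenizer's.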

@GeorgeS2019

GeorgeS2019 commented Oct 27, 2024

@LittleLittleCloud

Could you share your view on which recent developments in ML.NET regarding deep NLP could be relevant for advancing a TorchText project using TorchSharp?

References


TorchText from PyTorch

PyTorch TorchText

  • torchtext.nn
  • torchtext.data.functional
  • torchtext.data.metrics
  • torchtext.data.utils
  • torchtext.datasets
  • torchtext.vocab
  • torchtext.utils
  • torchtext.transforms
  • torchtext.functional
  • torchtext.models

Tutorials


Tokenizers/Transforms from PyTorch

https://pytorch.org/text/stable/transforms.html

Tokenizers

  • SentencePieceTokenizer
  • GPT2BPETokenizer
  • CLIPTokenizer
  • RegexTokenizer
  • BERTTokenizer
  • CharBPETokenizer

Transform

  • VocabTransform
  • PadTransform
  • StrToIntTransform

Utils

  • ToTensor
  • LabelToIndex
  • Truncate
  • AddToken
  • Sequential


Microsoft.ML.Tokenizers

  • Microsoft.ML.Tokenizers
  • Microsoft.ML.Tokenizers.Data.Cl100kBase
  • Microsoft.ML.Tokenizers.Data.Gpt2
  • Microsoft.ML.Tokenizers.Data.O200kBase
  • Microsoft.ML.Tokenizers.Data.P50kBase
  • Microsoft.ML.Tokenizers.Data.R50kBase

# Microsoft.ML.Tokenizers

Models

  • BPETokenizer.cs
  • BertTokenizer.cs
  • CodeGenTokenizer.cs
  • EnglishRobertaTokenizer.cs
  • LlamaTokenizer.cs
  • Phi2Tokenizer.cs
  • SentencePieceTokenizer.cs
  • TiktokenTokenizer.cs
  • WordPieceTokenizer.cs

  • Merge.cs
  • ModelSourceGenerationContext.cs
  • Pair.cs
  • Symbol.cs
  • Word.cs
  • Cache.cs

Normalizers

  • BertNormalizer.cs
  • LowerCaseNormalizer.cs
  • Normalizer.cs
  • SentencePieceNormalizer.cs
  • UpperCaseNormalizer.cs

PreTokenizers

  • PreTokenizer.cs
  • RegexPreTokenizer.cs
  • RobertaPreTokenizer.cs
