Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically inject GPU resources and node selector for NIM deployment #134

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

shivamerla
Copy link
Collaborator

No description provided.

Comment on lines +85 to +90
// Reusable error handler to log and return errors
handleError := func(err error, reason string, resource string) (ctrl.Result, error) {
logger.Error(err, fmt.Sprintf("Failed to reconcile %s", resource), "resource", resource)
return ctrl.Result{}, err
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not in line with Commit message or PR description

Comment on lines +153 to +196
// Setup PVC for model store
modelPVC, err := r.setupPVC(ctx, nimService)
if err != nil {
return handleError(err, "Failed to setup PVC", "PVC")
}

// Setup deployment parameters
deploymentParams := nimService.GetDeploymentParams()
if err := r.setupDeploymentParams(ctx, nimService, modelPVC, deploymentParams); err != nil {
return handleError(err, "Failed to setup deployment parameters", "DeploymentParams")
}

// Sync deployment
err = r.renderAndSyncResource(ctx, nimService, &renderer, &appsv1.Deployment{}, func() (client.Object, error) {
return renderer.Deployment(deploymentParams)
}, "deployment", conditions.ReasonDeploymentFailed)
if err != nil {
return ctrl.Result{}, err
}

// Wait for deployment readiness
msg, ready, err := r.isDeploymentReady(ctx, &namespacedName)
if err != nil {
return handleError(err, "Failed to check deployment readiness", "Deployment")
}

// Update status
if !ready {
if err := r.updater.SetConditionsNotReady(ctx, nimService, conditions.NotReady, msg); err != nil {
return handleError(err, "Failed to update NotReady status", "Status")
}
} else {
if err := r.updater.SetConditionsReady(ctx, nimService, conditions.Ready, msg); err != nil {
return handleError(err, "Failed to update Ready status", "Status")
}
}

return ctrl.Result{}, nil
}

func (r *NIMServiceReconciler) setupPVC(ctx context.Context, nimService *appsv1alpha1.NIMService) (*appsv1alpha1.PersistentVolumeClaim, error) {
logger := log.FromContext(ctx)

// Initialize variable for the PVC
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not in line with Commit message or PR description

Copy link
Collaborator Author

@shivamerla shivamerla Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactoring was necessary as we had to customize deployment params for multiple reasons now. (PVC, node selector, resources etc). Earlier all were done in reconcileNIMService call as only PVC was customized.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not saying I am against it, just please have it as a separate commit or a separate PR all together

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, let me refactor this into TP change and one for node selector itself.

@@ -0,0 +1,81 @@
/*
Copyright 2024.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Copyright 2024.
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.

}

if !crdExists {
return false, nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return a new error saying that NFD CRD does not exist instead of nil?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intent of this function is to check if a specific NodeFeatureRule CR exists, so if the CRD itself is missing we can assume CR is not present as well. Hence error is not thrown in this case.

// TODO: Make the resource name configurable
const gpuResourceName = corev1.ResourceName("nvidia.com/gpu")

deploymentParams.Resources.Requests[corev1.ResourceName(gpuResourceName)] = gpuQuantity
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the customer set a gpu resource request / limit in the nim service spec? (I believe yes) and this logic may overwrite the value, is it correct? Also, do we have unit test coverage for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if a profile is selected, then we populate the number of GPUs and schedule them to matching GPU type. For generic profile selection, they can specify resources to be consumed.


func (r *NIMServiceReconciler) getDeviceIDByProfile(ctx context.Context, profile *appsv1alpha1.NIMProfile) (string, error) {
deviceID := ""
if device, exists := profile.Config["gpu_device"]; exists {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the model point of view, the same (model, profile) should be able to run on multiple types of gpus. But here the logic seem to indicate that it is always the 1:1 mapping. Does nim enforce the latter?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slu2011 models are optimized for specific GPU SKUs, they cannot be run on other types.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about different mig profiles from the same physical gpu? like 1g.10gb vs 2g.20gb. Does nim provide different model even for different mig profiles, or it would be the same?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slu2011 currently NIMs need a full GPU, they don't support MIG.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants