HuggingFace Inference Endpoints

10/10/2024

Inference Endpoints

Inference Endpoints offers a secure production solution to easily deploy any Transformers, Sentence-Transformers and Diffusers models from the Hub on dedicated and autoscaling infrastructure managed by Hugging Face.
A Hugging Face Endpoint is built from a Hugging Face Model Repository.

Create custom Inference Handler

Hugging Face Endpoints supports all of the Transformers and Sentence-Transformers tasks and can support custom tasks. The customization can be done through a handler.py file in your model repository on the Hugging Face Hub.


    class EndPointHandlerBaseConfig:
        
        ( ... )


    class EndpointHandler(EndPointHandlerBaseConfig):
        
        def __init__(self, path=""):

            self.controlnet = ControlNetModel.from_pretrained(
                self.controlnet_model,
                torch_dtype=torch.float16,
                use_safetensor=True,
                variant="fp16",
                device=DEVICE,
            )
            
            self.pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
                self.diffusion_model,
                controlnet=self.controlnet,
                torch_dtype=torch.float16,
                variant="fp16",
                use_safetensors=True,
                device=DEVICE,
            )

            self.pipeline.enable_model_cpu_offload()
            self.pipeline.load_lora_weights(self.lora_model)
            self.pipeline.fuse_lora(lora_scale=self.lora_scale)
            
            self.inpaint_pipeline = StableDiffusionXLInpaintPipeline.from_pretrained(
                self.inpainting_model,
                torch_dtype=torch.float16,
                variant="fp16",
                use_safetensors=True,
                device=DEVICE,
            )

            self.inpaint_pipeline.enable_model_cpu_offload()
            self.inpaint_pipeline.load_lora_weights(self.lora_model)
            self.inpaint_pipeline.fuse_lora(lora_scale=self.lora_scale)

        def __call__(self, data):

            try:

                ( Processing... )

                return {"res": img_base64, "error": None}

            except:
                return {"res": None, "error": traceback.format_exc()}

Inference Endpoints (dedicated)

Creating Endpoint

Automatic Scale-to-Zero: You can configure your Endpoint to scale to zero GPUs/CPUs after a certain amount of time. Scaled-to-zero Endpoints are not billed anymore. (restarting the Endpoint requires the model to be re-loaded into memory)
Endpoint Security Level: he standard security level is Protected, which requires an authorized HF token for accessing the Endpoint. Public Endpoints are accessible by anyone without token authentification.

Deploying Endpoint

After successful endpoint creation, the inference endpoint is initialized as Fig 2.
The Endpoint has Initializing, Running, and Scaled to Zero status.