TGI Inference Service
Text Generation Inference (TGI) is a toolkit from HuggingFace for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5.
This service serves models via a TGIClientInferenceWorkflow object, which encapsulates the backend, preprocessing, and postprocessing logic.
Infernet Configuration
The service can be configured as part of the overall Infernet configuration in config.json.
{
    "log_path": "infernet_node.log",
    //...... contents abbreviated
    "containers": [
        {
            "id": "tgi_client_inference_service",
            "image": "ritualnetwork/tgi_client_inference_service:latest",
            "external": true,
            "port": "3000",
            "allowed_delegate_addresses": [],
            "allowed_addresses": [],
            "allowed_ips": [],
            "command": "--bind=0.0.0.0:3000 --workers=2",
            "env": {
                "TGI_INF_TOKEN": "YOUR_TOKEN_HERE",
                "TGI_INF_WORKFLOW_POSITIONAL_ARGS": "[\"http://FILL_HOSTNAME_HERE\", 30]",
                "TGI_INF_WORKFLOW_KW_ARGS": "{\"retry_params\": {\"tries\": 3, \"delay\": 1, \"backoff\": 2, \"max_delay\": 10, \"jitter\": [0.5, 1.5]}, \"max_new_tokens\": 30, \"temperature\": 0.01}"
            }
        }
    ]
}
Environment Variables
TGI_INF_TOKEN
- Description: The HuggingFace token for authenticated API requests. Not required, but will increase API limits and enable access to private models.
- Default: None
TGI_INF_WORKFLOW_POSITIONAL_ARGS
- Description: Arguments passed positionally to the TGI workflow.
server_url
- Description: The TGI service URL.
connection_timeout
- Description: The connection timeout.
headers (optional)
- Description: Additional headers to pass to the TGI service.
cookies (optional)
- Description: The cookies to pass to the TGI service.
TGI_INF_WORKFLOW_KW_ARGS
- Description: Any argument passed here is passed as a keyword argument to the TGI workflow. Used to set the TGI inference parameters; refer to the TGI documentation for a full list of available parameters. Like TGI_INF_WORKFLOW_POSITIONAL_ARGS, the value is a JSON-encoded string (see the parsing sketch after this list).
retry_params
- Description: The retry parameters for the inference workflow. (optional)
tries
- Description: The number of retries for the inference workflow.
- Default: 3
delay
- Description: The delay (in seconds) between retries.
- Default: 3
max_delay
- Description: The maximum delay (in seconds) between retries.
- Default: null
backoff
- Description: The backoff multiplier applied to the delay between successive retries.
- Default: 2
jitter
- Description: The jitter (in seconds) added to the delay between retries.
- Default: [0.5, 1.5]
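Note that both TGI_INF_WORKFLOW_POSITIONAL_ARGS and TGI_INF_WORKFLOW_KW_ARGS are JSON-encoded strings, which is why the values in config.json contain escaped quotes. The following is a minimal, illustrative sketch of how such values decode; it is not the service's actual code:
import json
import os

# The env var values are JSON strings, hence the escaped quotes in config.json.
positional_args = json.loads(os.environ.get("TGI_INF_WORKFLOW_POSITIONAL_ARGS", "[]"))
kw_args = json.loads(os.environ.get("TGI_INF_WORKFLOW_KW_ARGS", "{}"))

server_url, connection_timeout = positional_args[0], positional_args[1]  # e.g. "http://...", 30
retry_params = kw_args.get("retry_params", {})  # e.g. {"tries": 3, "delay": 1, ...}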
Usage
Offchain requests to the service can be initiated with Python or the CLI using the infernet_client package, as well as with HTTP requests against the Infernet Node directly (using a client like cURL).
The schema format of an infernet_client job request looks like the following:
class JobRequest(TypedDict):
    """Job request.

    Attributes:
        containers: The list of container names.
        data: The data to pass to the containers.
        requires_proof: Whether the job requires a proof.
    """

    containers: list[str]
    data: dict[str, Any]
    requires_proof: NotRequired[bool]
The schema format of an infernet_client job result looks like the following:
class JobResult(TypedDict):
    """Job result.

    Attributes:
        id: The job ID.
        status: The job status.
        result: The job result.
        intermediate: Job result from intermediate containers.
    """

    id: str
    status: JobStatus
    result: Optional[ContainerOutput]
    intermediate: NotRequired[list[ContainerOutput]]


class ContainerOutput(TypedDict):
    """Container output.

    Attributes:
        container: The container name.
        output: The output of the container.
    """

    container: str
    output: Any
Offchain (web2) Request
Please note: The examples below assume that you have an Infernet Node running locally on port 4000.
# Note that the sync flag is optional and will wait for the job to complete.
# If you do not pass the sync flag, the job will be submitted and you will receive a job id, which you can use to get the result later.
infernet-client job -c tgi_client_inference_service -i input.json --sync
input.json looks like this (the "text" field matches the data payload used in the examples below):
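{
    "text": "Is the sky blue during a clear day?"
}
You can also submit the same job from Python with the infernet_client NodeClient. The following is a minimal sketch; the request_job and get_job_result_sync helper names are assumptions about the client API, so verify them against your installed infernet_client version:
import asyncio

from infernet_client.node import NodeClient


async def main() -> None:
    client = NodeClient("http://127.0.0.1:4000")

    # Submit the job to the TGI client inference container.
    # NOTE: request_job / get_job_result_sync are assumed helper names.
    job_id = await client.request_job(
        {
            "containers": ["tgi_client_inference_service"],
            "data": {"text": "Is the sky blue during a clear day?"},
        }
    )

    # Wait for the job to complete and print the container output.
    result = await client.get_job_result_sync(job_id)
    print(result)


asyncio.run(main())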
Onchain (web3) Subscription
You will need to import the infernet-sdk in your requesting contract. In this example we showcase the Callback pattern, which is an example of a one-off subscription. Please refer to the infernet-sdk documentation for further details.
Input requests should be passed in as an encoded byte string. Here is an example of how to generate this for a TGI Client Inference request:
from eth_abi.abi import encode
input_bytes = encode(
    ["string"],
    ["Is the sky blue during a clear day?"],
)
Assuming your contract inherits from the CallbackConsumer provided by infernet-sdk, you can use the following functions to request and receive compute:
pragma solidity ^0.8.0;

import {CallbackConsumer} from "infernet-sdk/consumer/Callback.sol";
// console.log requires a console library; forge-std's is assumed here.
import {console} from "forge-std/console.sol";

contract MyOnchainSubscription is CallbackConsumer {
    constructor(address registry) CallbackConsumer(registry) {}

    // Function to chat with LLM
    function chatWithLLM(bytes memory inputs) public {
        string memory containerId = "my-container";
        uint16 redundancy = 1;
        address paymentToken = address(0);
        uint256 paymentAmount = 0;
        address wallet = address(0);
        address verifier = address(0);

        _requestCompute(
            containerId,
            inputs,
            redundancy,
            paymentToken,
            paymentAmount,
            wallet,
            verifier
        );
        console.log("Requested compute");
    }

    // Function to receive the compute result
    function receiveCompute(
        bytes32 taskId,
        bytes memory output,
        bytes memory proof
    ) public {
        console.log("Received output!");
        console.logBytes(output);
        // Handle the received output and proof
    }
}
You can call the chatWithLLM function with the encoded byte string from Python like so:
from web3 import Web3

# Connect to your RPC endpoint (a local node is assumed here)
w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))

# Assuming you have a contract instance
contract = w3.eth.contract(address=contract_address, abi=contract_abi)

# Call the function; `input_bytes` here is the same as the one generated above
tx_hash = contract.functions.chatWithLLM(input_bytes).transact()
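If you want to block until the call is mined, you can wait for the receipt (standard web3.py usage):
# Wait for the transaction to be included in a block
receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
print(receipt.status)  # 1 indicates success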
Delegated Subscription Request
Please note: The examples below assume that you have an Infernet Node running locally on port 4000.
import random
from time import time

from infernet_client.node import NodeClient
from infernet_client.chain_utils import Subscription, RPC

# Zero address used for the unset payment, verifier, and wallet fields
ZERO_ADDRESS = "0x0000000000000000000000000000000000000000"

sub = Subscription(
    owner="0x...",
    active_at=int(time()),
    period=0,
    frequency=1,
    redundancy=1,
    containers=["tgi_client_inference_service"],
    lazy=False,
    verifier=ZERO_ADDRESS,
    payment_amount=0,
    payment_token=ZERO_ADDRESS,
    wallet=ZERO_ADDRESS,
)

client = NodeClient("http://127.0.0.1:4000")
nonce = random.randint(0, 2**32 - 1)

await client.request_delegated_subscription(
    sub=sub,
    rpc=RPC("http://127.0.0.1:8545"),
    coordinator_address=global_config.coordinator_address,
    expiry=int(time() + 10),
    nonce=nonce,
    private_key="0x...",
    data={
        "text": "Is the sky blue during a clear day?"
    },
)
Alternatively, you can create the delegated subscription with the infernet-client CLI:
infernet-client sub --rpc_url http://some-rpc-url.com --address 0x.. --expiry 1713376164 --key key-file.txt \
    --params params.json --input input.json
# Success: Subscription created.
where params.json looks like this:
{
    "owner": "0x00Bd138aBD7....................", // Subscription Owner
    "active_at": 0, // Instantly active
    "period": 3, // 3 seconds between intervals
    "frequency": 2, // Process 2 times
    "redundancy": 2, // 2 nodes respond each time
    "containers": ["tgi_client_inference_service"], // comma-separated list of containers
    "lazy": false,
    "verifier": "0x0000000000000000000000000000000000000000",
    "payment_amount": 0,
    "payment_token": "0x0000000000000000000000000000000000000000",
    "wallet": "0x0000000000000000000000000000000000000000"
}
and where input.json mirrors the data payload from the Python example above:
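{
    "text": "Is the sky blue during a clear day?"
}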