Managing Workloads with Node Affinity + Resource Quota

마늘김 2025. 5. 6. 03:38

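When you operate a Kubernetes cluster, you eventually run into requirements such as "I want to place a specific workload on specific nodes" or "I want to divide resources efficiently between teams." For example, what should you do when Team A's applications belong on high-performance GPU nodes and Team B's on general-purpose compute nodes? Node Affinity can be the answer. Node Affinity steers the Kubernetes scheduler toward placing Pods on the nodes you want, which enables flexible resource management and workload isolation.

This post explains the basic concepts of Node Affinity and then walks through hands-on examples to see how it actually behaves.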

Ways to assign a Pod to a specific Node in Kubernetes

The Kubernetes documentation, under Concepts > Scheduling, Preemption and Eviction > Assigning Pods to Nodes, lists the following ways to choose where (on which Node) Kubernetes schedules a particular Pod.

nodeSelector field matching against node labels
Affinity and anti-affinity
nodeName field
Pod topology spread constraints

This post looks at how to assign Pods to specific Nodes with Affinity, and uses that to find clues about dividing resources efficiently between teams.

Affinity

Affinity (and anti-affinity) lets you control how Pods are scheduled while still leaving the scheduler room to be flexible. It is divided into Node affinity and Inter-pod affinity; each type is briefly described below.

1. Node affinity

Node affinity is conceptually similar to nodeSelector but allows more flexible configuration. There are two types of Node affinity.

requiredDuringSchedulingIgnoredDuringExecution: The scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector, but with a more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.

As the descriptions above show, with requiredDuringSchedulingIgnoredDuringExecution the Pod is not scheduled unless the rule is met. preferredDuringSchedulingIgnoredDuringExecution is more lenient: the scheduler favors nodes that satisfy the rule when they have room, and otherwise schedules the Pod on another node. Node affinity is defined in the .spec.affinity.nodeAffinity field of the Pod spec.
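
The difference between the two types can be pictured as a filter versus a score. The sketch below is a deliberately simplified model, not the real kube-scheduler: Node, free_slots, schedule_required, and schedule_preferred are hypothetical names introduced for illustration, and the real scheduler combines many more filtering and scoring steps.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    free_slots: int = 0  # hypothetical: how many more Pods still fit on the node


def matches(node, key, values):
    """Mimics a matchExpressions entry with operator In."""
    return node.labels.get(key) in values


def schedule_required(nodes, key, values):
    """Hard rule: nodes that do not match are filtered out entirely."""
    candidates = [n for n in nodes if n.free_slots > 0 and matches(n, key, values)]
    return candidates[0].name if candidates else None  # None: the Pod would stay Pending


def schedule_preferred(nodes, key, values, weight):
    """Soft rule: any node with room is a candidate; matching nodes score higher."""
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda n: weight if matches(n, key, values) else 0).name


nodes = [
    Node("worker-01", {"team": "team-a"}, free_slots=0),  # team-a node, already full
    Node("worker-05", {"team": "team-b"}, free_slots=1),  # another team's node with room
]
print(schedule_required(nodes, "team", ["team-a"]))       # None
print(schedule_preferred(nodes, "team", ["team-a"], 80))  # worker-05
```

With the hard rule the Pod has nowhere to go once the matching node is full, while the soft rule falls back to a node that merely has room.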

2. Inter-pod affinity

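Inter-pod affinity constrains which nodes a Pod can be scheduled on based on the labels of other Pods already running on each node. The rules take the form "this Pod should run in X if X is already running one or more Pods that meet rule Y" (or, for Pod anti-affinity, "should not run"). Here X is a topology domain such as a Node, a server rack, or a cloud provider zone or region, and Y is the rule Kubernetes tries to satisfy, expressed as a label selector. Like Node affinity, Inter-pod affinity has two types.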

requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution

The Kubernetes documentation gives the following caution about using Inter-pod affinity (or anti-affinity).

Note:
Inter-pod affinity and anti-affinity require substantial amounts of processing which can slow down scheduling in large clusters significantly. We do not recommend using them in clusters larger than several hundred nodes.
In short, because of the scheduling overhead, the documentation advises against using this feature in clusters with more than a few hundred nodes.

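Managing per-team workloads with Node Affinity

Consider the following scenario. The Kubernetes cluster has 10 Worker nodes, each with 2 CPU cores and 8GB of RAM. Three teams (Team A, Team B, and Team C) run workloads on this cluster. The cluster administrator wants each team's workloads to run in isolation so that they do not affect one another, so the nodes are to be split between Team A, Team B, and Team C in a 4:3:3 ratio.

1. Distributing nodes per team and applying Node affinity

The administrator decides to give nodes 1 to 4 to Team A, nodes 5 to 7 to Team B, and nodes 8 to 10 to Team C. To do this, each node gets a Label whose key is team and whose value is the name of the owning team: team-a, team-b, or team-c.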

kubectl label nodes worker-01 worker-02 worker-03 worker-04 team=team-a
kubectl label nodes worker-05 worker-06 worker-07 team=team-b
kubectl label nodes worker-08 worker-09 worker-10 team=team-c

To check whether Node affinity is applied, write the example Deployment YAML below and test it.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-a-required
  namespace: team-a
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: team
                operator: In
                values:
                - team-a
      containers:
      - name: nginx
        image: nginx
        resources:         #requests=limits sets the Pod's QoS class to Guaranteed
          requests:
            cpu: "500m"
            memory: "2Gi"
          limits:
            cpu: "500m"
            memory: "2Gi"

kubectl get pods -n team-a -o wide --sort-by=.spec.nodeName                                

NAME                                 READY   STATUS    RESTARTS   AGE   IP          NODE             NOMINATED NODE   READINESS GATES
workload-a-required-7b9fd6bd-p4rp6   1/1     Running   0          32s   10.0.5.80   worker-02        <none>           <none>

As shown above, one Pod was assigned to worker-02, one of the nodes available to Team A. To check that Pods are still deployed only on Team A's nodes when there are more of them, scale the Deployment out to 10 Pods.

#Scale out to 10 replicas
kubectl scale deployments.apps -n team-a workload-a-required --replicas=10
deployment.apps/workload-a-required scaled

#Check which nodes the scaled-out Pods landed on
kubectl get pods -n team-a -o wide --sort-by=.spec.nodeName

NAME                                 READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES
workload-a-required-7b9fd6bd-rk7qz   1/1     Running   0          84s     10.0.2.253    worker-01        <none>           <none>
workload-a-required-7b9fd6bd-vp2tq   1/1     Running   0          84s     10.0.2.140    worker-01        <none>           <none>
workload-a-required-7b9fd6bd-4srtd   1/1     Running   0          84s     10.0.5.134    worker-02        <none>           <none>
workload-a-required-7b9fd6bd-9vr6h   1/1     Running   0          84s     10.0.5.22     worker-02        <none>           <none>
workload-a-required-7b9fd6bd-p4rp6   1/1     Running   0          7m30s   10.0.5.80     worker-02        <none>           <none>
workload-a-required-7b9fd6bd-8kzpk   1/1     Running   0          84s     10.0.0.165    worker-03        <none>           <none>
workload-a-required-7b9fd6bd-r68sx   1/1     Running   0          84s     10.0.0.250    worker-03        <none>           <none>
workload-a-required-7b9fd6bd-4pc9g   1/1     Running   0          84s     10.0.12.184   worker-04        <none>           <none>
workload-a-required-7b9fd6bd-9x8g5   1/1     Running   0          84s     10.0.12.60    worker-04        <none>           <none>
workload-a-required-7b9fd6bd-clg6l   1/1     Running   0          84s     10.0.12.20    worker-04        <none>           <none>

As expected, the Pods are spread across Worker nodes 1 to 4.

2. A problem appears! Pod scheduling fails!

With this setup, each team was able to run its workloads without problems for a while. Then Team A started reporting a problem: Pods for their workloads were failing to schedule. On inspection, the Pods were stuck in Pending and the workloads could not run.

#Pods in Pending state are found
kubectl get pods -n team-a -o wide --sort-by=.spec.nodeName                                  

NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE             NOMINATED NODE   READINESS GATES
workload-a-required-7b9fd6bd-t82b5   0/1     Pending   0          103s   <none>        <none>           <none>           <none>
workload-a-required-7b9fd6bd-ft6pj   0/1     Pending   0          103s   <none>        <none>           <none>           <none>
workload-a-required-7b9fd6bd-xdkws   0/1     Pending   0          103s   <none>        <none>           <none>           <none>
workload-a-required-7b9fd6bd-xk24m   0/1     Pending   0          103s   <none>        <none>           <none>           <none>
workload-a-required-7b9fd6bd-9vmx8   0/1     Pending   0          103s   <none>        <none>           <none>           <none>
workload-a-required-7b9fd6bd-z22vd   0/1     Pending   0          103s   <none>        <none>           <none>           <none>
workload-a-required-7b9fd6bd-5tp6w   0/1     Pending   0          103s   <none>        <none>           <none>           <none>
workload-a-required-7b9fd6bd-zn2zj   0/1     Pending   0          103s   <none>        <none>           <none>           <none>
workload-a-required-7b9fd6bd-rk7qz   1/1     Running   0          10m    10.0.2.253    worker-01        <none>           <none>
workload-a-required-7b9fd6bd-vp2tq   1/1     Running   0          10m    10.0.2.140    worker-01        <none>           <none>
workload-a-required-7b9fd6bd-p4gjs   1/1     Running   0          103s   10.0.2.45     worker-01        <none>           <none>
workload-a-required-7b9fd6bd-9vr6h   1/1     Running   0          10m    10.0.5.22     worker-02        <none>           <none>
workload-a-required-7b9fd6bd-p4rp6   1/1     Running   0          17m    10.0.5.80     worker-02        <none>           <none>
workload-a-required-7b9fd6bd-4srtd   1/1     Running   0          10m    10.0.5.134    worker-02        <none>           <none>
workload-a-required-7b9fd6bd-r68sx   1/1     Running   0          10m    10.0.0.250    worker-03        <none>           <none>
workload-a-required-7b9fd6bd-pww95   1/1     Running   0          103s   10.0.0.252    worker-03        <none>           <none>
workload-a-required-7b9fd6bd-8kzpk   1/1     Running   0          10m    10.0.0.165    worker-03        <none>           <none>
workload-a-required-7b9fd6bd-clg6l   1/1     Running   0          10m    10.0.12.20    worker-04        <none>           <none>
workload-a-required-7b9fd6bd-9x8g5   1/1     Running   0          10m    10.0.12.60    worker-04        <none>           <none>
workload-a-required-7b9fd6bd-4pc9g   1/1     Running   0          10m    10.0.12.184   worker-04        <none>           <none>

The details of a Pending Pod show the following message under Events.

#Describe a Pod in Pending state
kubectl describe pod -n team-a workload-a-required-7b9fd6bd-t82b5

...

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  28s   default-scheduler  0/13 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 4 Insufficient cpu, 4 Insufficient memory, 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/13 nodes are available: 4 No preemption victims found for incoming pod, 9 Preemption is not helpful for scheduling.

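The message breaks down as follows.

0/13 nodes are available - no node can accept the Pod
3 nodes have a taint the Pod does not tolerate - the Control Plane nodes
4 nodes have insufficient CPU and memory - the nodes that satisfy the Node affinity (team: team-a) lack the compute resources to schedule the Pod
The remaining 6 nodes (Worker 05 - 10) do not match the scheduling condition - Node affinity mismatch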
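
A rough capacity calculation shows why Team A's nodes fill up so quickly. The sketch below assumes each Pod requests 500m CPU and 2Gi memory, as in the Deployment above, and treats the 8GB of the scenario as 8Gi. The allocatable figures are hypothetical: the post does not state them, and they are chosen only to illustrate that a node usually offers Pods less than its nominal size once system reservations and other Pods are accounted for.

```python
def pods_per_node(alloc_cpu_m, alloc_mem_mi, pod_cpu_m=500, pod_mem_mi=2048):
    """Number of Pods that fit on one node, limited by the scarcer resource."""
    return min(alloc_cpu_m // pod_cpu_m, alloc_mem_mi // pod_mem_mi)


TEAM_A_NODES = 4

# Nominal node size from the scenario: 2 cores, 8Gi.
nominal = pods_per_node(2000, 8 * 1024)
# Hypothetical allocatable values after system reservations (not from the post).
realistic = pods_per_node(1900, 7000)

print(nominal * TEAM_A_NODES)    # 16
print(realistic * TEAM_A_NODES)  # 12
```

The second figure happens to match the 12 Running Pods in the output above. On a real cluster, the Allocatable and Allocated resources sections of kubectl describe node show the actual numbers.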

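The key point is that the resources of the nodes assigned to Team A were exhausted, so no more Pods could be scheduled there. The easiest fix would be to increase the number of nodes assigned to Team A. That inevitably raises cost, however, so the operations team had to look for a more cost-efficient approach.

What if Team A could use the spare resources of other teams?
Could that solve the problem without increasing cluster cost?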

3. Introducing Preferred node affinity

As we saw earlier, Node affinity has two types. With preferredDuringSchedulingIgnoredDuringExecution, the scheduler places Pods on nodes that match the rule first and, once no matching node has room, is allowed to schedule them on other nodes as well. Compared with nodeSelector or Required node affinity, this is a more flexible setting that raises cluster efficiency by putting spare capacity in the cluster to use. Let's start with the example YAML file.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-a-preferred
  namespace: team-a
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:        #field that enables Preferred node affinity
          - weight: 80                                            #weight must be between 1 and 100
            preference:
              matchExpressions:
              - key: team
                operator: In
                values:
                - team-a
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: "500m"
            memory: "2Gi"
          limits:
            cpu: "500m"
            memory: "2Gi"

Deploy the example Deployment above, scale it out to 15 Pods, and check how they were scheduled across the nodes.

#Deploy the example Deployment
kubectl apply -f team-a-deployment-preferred.yaml

deployment.apps/workload-a-preferred created

#Scale out to 15 Pods
kubectl scale deployments.apps -n team-a workload-a-preferred --replicas=15

deployment.apps/workload-a-preferred scaled

#Check how the Pods were scheduled
kubectl get pods -n team-a -o wide --sort-by=.spec.nodeName                                  

NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES
workload-a-preferred-c79d9dc5d-jqszv   1/1     Running   0          2m28s   10.0.2.202    worker-01        <none>           <none>
workload-a-preferred-c79d9dc5d-8qb88   1/1     Running   0          2m29s   10.0.2.65     worker-01        <none>           <none>
workload-a-preferred-c79d9dc5d-chf95   1/1     Running   0          2m28s   10.0.2.37     worker-01        <none>           <none>
workload-a-preferred-c79d9dc5d-vv4xm   1/1     Running   0          2m28s   10.0.5.249    worker-02        <none>           <none>
workload-a-preferred-c79d9dc5d-2r6pj   1/1     Running   0          2m28s   10.0.5.17     worker-02        <none>           <none>
workload-a-preferred-c79d9dc5d-fmtgr   1/1     Running   0          2m52s   10.0.5.239    worker-02        <none>           <none>
workload-a-preferred-c79d9dc5d-rjn8c   1/1     Running   0          2m29s   10.0.0.215    worker-03        <none>           <none>
workload-a-preferred-c79d9dc5d-65cbx   1/1     Running   0          2m28s   10.0.0.105    worker-03        <none>           <none>
workload-a-preferred-c79d9dc5d-qxvgh   1/1     Running   0          2m28s   10.0.0.157    worker-03        <none>           <none>
workload-a-preferred-c79d9dc5d-6hl94   1/1     Running   0          2m28s   10.0.12.146   worker-04        <none>           <none>
workload-a-preferred-c79d9dc5d-49hcc   1/1     Running   0          2m28s   10.0.12.117   worker-04        <none>           <none>
workload-a-preferred-c79d9dc5d-mddvp   1/1     Running   0          2m29s   10.0.12.63    worker-04        <none>           <none>
workload-a-preferred-c79d9dc5d-5sq8q   1/1     Running   0          2m28s   10.0.6.165    worker-05        <none>           <none>
workload-a-preferred-c79d9dc5d-fh4qt   1/1     Running   0          2m28s   10.0.8.240    worker-06        <none>           <none>
workload-a-preferred-c79d9dc5d-lm5sb   1/1     Running   0          2m28s   10.0.11.3     worker-10        <none>           <none>

As shown above, Pods were scheduled on Worker nodes 01 - 04 first. When the preferred nodes run out of resources and cannot take more Pods, the scheduler places Pods on other nodes that do not match the rule (non-preferred nodes). Indeed, the remaining 3 Pods ended up on worker-05, worker-06, and worker-10.
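
The spill-over seen above can be reproduced with a toy model. It assumes every worker has room for 3 of these Pods (inferred from the output above, not stated in the post) and that the affinity weight is the only score that matters. The real scheduler also weighs other factors, so the specific non-preferred nodes it picks will differ from this model.

```python
SLOTS_PER_NODE = 3  # assumption inferred from the observed output
workers = [f"worker-{i:02d}" for i in range(1, 11)]
preferred = set(workers[:4])  # worker-01..04 carry team=team-a
free = {w: SLOTS_PER_NODE for w in workers}

placements = []
for _ in range(15):  # 15 replicas, as in the scale-out above
    candidates = [w for w in workers if free[w] > 0]
    # Matching nodes get the affinity weight (80); everything else scores 0.
    target = max(candidates, key=lambda w: 80 if w in preferred else 0)
    free[target] -= 1
    placements.append(target)

on_preferred = sum(1 for w in placements if w in preferred)
print(on_preferred, len(placements) - on_preferred)  # 12 3
```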

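Workloads are now scheduled the way the operations team wanted. The problem looks solved, but this can in fact cause another problem: a particular workload may claim an excessive share of resources.

4. Capping resource usage with Resource Quota

Preferred node affinity made cluster operation flexible and efficient, but it only means that scheduling starts on the designated nodes; when a workload scales out far enough, nothing stops it from spilling past those nodes and spreading across the whole cluster. Dividing nodes between teams then loses its meaning, and a resource-hungry workload could claim so much of the cluster that it affects other teams' workloads. Are we back to square one with no solution?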

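Not at all! Kubernetes provides a policy called Resource Quota for exactly this situation. A Resource Quota is a policy object that provides constraints limiting aggregate resource consumption per namespace. It can limit the number of objects that can be created in a namespace by type, as well as the total amount of compute resources that may be consumed. This prevents excessive resource grabbing while still providing a reasonably flexible resource isolation environment.

Resource quotas work like this.

Each team works in its own namespace; this can be enforced with RBAC
The cluster administrator creates a Resource Quota for each namespace
Users create resources in the namespace, and the quota system tracks usage to ensure it does not exceed the hard limits defined in the Resource Quota
If creating or updating a resource violates a quota constraint, the request fails with HTTP status code 403 FORBIDDEN and a message explaining the constraint that would have been violated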

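The cluster operators want to use Resource Quota to restrict which nodes each team uses only loosely, while still preventing any one team from using too many resources. So, starting from the original 4:3:3 ratio, they allow roughly one node of overcommit per team and set Resource Quotas in a 5:4:4 ratio, then test whether this works as intended.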

---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:                          #sum of the specs of 5 nodes
  hard:
    limits.cpu: "10"
    limits.memory: "40Gi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-b-quota
  namespace: team-b
spec:                          #sum of the specs of 4 nodes
  hard:
    limits.cpu: "8"
    limits.memory: "32Gi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-c-quota
  namespace: team-c
spec:                          #sum of the specs of 4 nodes
  hard:
    limits.cpu: "8"
    limits.memory: "32Gi"
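
The hard limits in the manifests above are node counts multiplied by the per-node size. A small sketch of that arithmetic, using the 2-core / 8Gi node size and the 5:4:4 split from the text (the function and variable names are made up for illustration):

```python
NODE_CPU = 2       # cores per worker node
NODE_MEM_GI = 8    # memory per worker node, taken as 8Gi
WORKER_COUNT = 10


def quota_for(nodes):
    """Hard limits equal to the combined size of `nodes` worker nodes."""
    return {"limits.cpu": str(nodes * NODE_CPU),
            "limits.memory": f"{nodes * NODE_MEM_GI}Gi"}


split = {"team-a": 5, "team-b": 4, "team-c": 4}
for team, nodes in split.items():
    print(team, quota_for(nodes))

# The quotas add up to more than the cluster has: 13 node-equivalents on 10 nodes.
print(sum(split.values()) - WORKER_COUNT)  # 3
```

Because the three quotas together exceed the cluster, a quota does not guarantee that a team can actually reach its limit; it only caps how far each team can grow.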

#Check Team A's Resource Quota
kubectl describe quota -n team-a

Name:          team-a-quota
Namespace:     team-a
Resource       Used   Hard
--------       ----   ----
limits.cpu     7500m  10          #7.5 of 10 CPUs in use
limits.memory  30Gi   40Gi        #30Gi of 40Gi RAM in use

Looking up the Resource Quota in the team-a namespace shows the resources claimed by the 15 Pods of the workload-a-preferred Deployment deployed earlier. There appears to be room for about 5 more Pods. Let's scale the Deployment out to 25 Pods to deliberately exceed the Resource Quota and see what happens.

#Scale out to 25 Pods
kubectl scale deployments.apps -n team-a workload-a-preferred --replicas=25

deployment.apps/workload-a-preferred scaled

#Check the Deployment details
kubectl describe deployments.apps -n team-a

Name:                   workload-a-preferred
Namespace:              team-a
CreationTimestamp:      Tue, 06 May 2025 01:43:42 +0900
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=nginx
Replicas:               25 desired | 20 updated | 20 total | 20 available | 5 unavailable    #5 Pods could not be created
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:      nginx
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     500m
      memory:  2Gi
    Requests:
      cpu:         500m
      memory:      2Gi
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  <none>
  Tolerations:     <none>
Conditions:
  Type             Status  Reason
  ----             ------  ------
  Progressing      True    NewReplicaSetAvailable
  ReplicaFailure   True    FailedCreate
  Available        True    MinimumReplicasAvailable
OldReplicaSets:    <none>
NewReplicaSet:     workload-a-preferred-c79d9dc5d (20/25 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  3m3s  deployment-controller  Scaled up replica set workload-a-preferred-c79d9dc5d from 15 to 25

#Resource Quota status of the team-a namespace
kubectl describe quota -n team-a

Name:          team-a-quota
Namespace:     team-a
Resource       Used  Hard
--------       ----  ----
limits.cpu     10    10
limits.memory  40Gi  40Gi       #CPU and RAM are both fully used

#Check the Pods in the team-a namespace
kubectl get pods -n team-a -o wide --sort-by=.spec.nodeName                                  

NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE             NOMINATED NODE   READINESS GATES
workload-a-preferred-c79d9dc5d-jqszv   1/1     Running   0          64m    10.0.2.202    worker-01        <none>           <none>
workload-a-preferred-c79d9dc5d-8qb88   1/1     Running   0          64m    10.0.2.65     worker-01        <none>           <none>
workload-a-preferred-c79d9dc5d-chf95   1/1     Running   0          64m    10.0.2.37     worker-01        <none>           <none>
workload-a-preferred-c79d9dc5d-2r6pj   1/1     Running   0          64m    10.0.5.17     worker-02        <none>           <none>
workload-a-preferred-c79d9dc5d-vv4xm   1/1     Running   0          64m    10.0.5.249    worker-02        <none>           <none>
workload-a-preferred-c79d9dc5d-fmtgr   1/1     Running   0          64m    10.0.5.239    worker-02        <none>           <none>
workload-a-preferred-c79d9dc5d-65cbx   1/1     Running   0          64m    10.0.0.105    worker-03        <none>           <none>
workload-a-preferred-c79d9dc5d-qxvgh   1/1     Running   0          64m    10.0.0.157    worker-03        <none>           <none>
workload-a-preferred-c79d9dc5d-rjn8c   1/1     Running   0          64m    10.0.0.215    worker-03        <none>           <none>
workload-a-preferred-c79d9dc5d-49hcc   1/1     Running   0          64m    10.0.12.117   worker-04        <none>           <none>
workload-a-preferred-c79d9dc5d-mddvp   1/1     Running   0          64m    10.0.12.63    worker-04        <none>           <none>
workload-a-preferred-c79d9dc5d-6hl94   1/1     Running   0          64m    10.0.12.146   worker-04        <none>           <none>
workload-a-preferred-c79d9dc5d-5sq8q   1/1     Running   0          64m    10.0.6.165    worker-05        <none>           <none>
workload-a-preferred-c79d9dc5d-vhkjx   1/1     Running   0          6m1s   10.0.6.246    worker-05        <none>           <none>
workload-a-preferred-c79d9dc5d-fh4qt   1/1     Running   0          64m    10.0.8.240    worker-06        <none>           <none>
workload-a-preferred-c79d9dc5d-dgqjl   1/1     Running   0          6m1s   10.0.9.18     worker-07        <none>           <none>
workload-a-preferred-c79d9dc5d-mf27l   1/1     Running   0          6m1s   10.0.10.171   worker-08        <none>           <none>
workload-a-preferred-c79d9dc5d-bj8wt   1/1     Running   0          6m1s   10.0.7.152    worker-09        <none>           <none>
workload-a-preferred-c79d9dc5d-lm5sb   1/1     Running   0          64m    10.0.11.3     worker-10        <none>           <none>
workload-a-preferred-c79d9dc5d-2jgkw   1/1     Running   0          6m1s   10.0.11.53    worker-10        <none>           <none>

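Only 20 Pods are running; the remaining 5 were never started. We can also see that Pods were placed on the preferred nodes first and only then on the other nodes.

In short, by combining Preferred node affinity with Resource Quota we get loose per-node isolation of workloads plus a cap on excessive resource grabbing, which makes for flexible yet efficient cluster operation. Of course, each team's resource usage should be monitored going forward so that the allocation can be adjusted as needed. It is also worth watching for unexpected behavior.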

Additional notes

1. Why are there no Pending Pods when the resource limit is exceeded with Preferred node affinity and Resource Quota?

With Required node affinity earlier, when there was no node left to schedule on, the Pods waited in the Pending state. In the exercise above, however, only 20 of the 25 Pods are running and the other 5 are nowhere to be seen. Why is there no Pending state here?

This comes down to how Resource Quota works. As we saw, Team A's Resource Quota is 10 CPU cores and 40Gi of RAM, and the Kubernetes API server rejects any Pod creation request that would exceed it. The Pod object is never created, so there is nothing that could be in the Pending state. Instead, a closer look turns up failure messages caused by the API server's rejection. Let's look at the Deployment first.

#Check the Deployment details
kubectl describe deployments.apps -n team-a

Name:                   workload-a-preferred
Namespace:              team-a
CreationTimestamp:      Tue, 06 May 2025 01:43:42 +0900
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=nginx
Replicas:               25 desired | 20 updated | 20 total | 20 available | 5 unavailable    #5 Pods could not be created
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:      nginx
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     500m
      memory:  2Gi
    Requests:
      cpu:         500m
      memory:      2Gi
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  <none>
  Tolerations:     <none>
Conditions:
  Type             Status  Reason
  ----             ------  ------
  Progressing      True    NewReplicaSetAvailable
  ReplicaFailure   True    FailedCreate               #shows that replica creation failed
  Available        True    MinimumReplicasAvailable
OldReplicaSets:    <none>
NewReplicaSet:     workload-a-preferred-c79d9dc5d (20/25 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  3m3s  deployment-controller  Scaled up replica set workload-a-preferred-c79d9dc5d from 15 to 25

In the Conditions section of the Deployment details, ReplicaFailure is True with the reason FailedCreate, which means something went wrong with the ReplicaSet. So let's look at the Conditions and Events sections of the ReplicaSet details.

#Check the ReplicaSet details
kubectl describe replicasets.apps -n team-a 

...
Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason            Age                 From                   Message
  ----     ------            ----                ----                   -------
  Normal   SuccessfulCreate  14m                 replicaset-controller  Created pod: workload-a-preferred-c79d9dc5d-mf27l
  Normal   SuccessfulCreate  14m                 replicaset-controller  Created pod: workload-a-preferred-c79d9dc5d-dgqjl
  Normal   SuccessfulCreate  14m                 replicaset-controller  Created pod: workload-a-preferred-c79d9dc5d-bj8wt
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-5wj25" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-hxbcd" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Normal   SuccessfulCreate  14m                 replicaset-controller  Created pod: workload-a-preferred-c79d9dc5d-vhkjx
  Normal   SuccessfulCreate  14m                 replicaset-controller  Created pod: workload-a-preferred-c79d9dc5d-2jgkw
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-ck4sk" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-jw6ff" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-w47z6" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-7c9dq" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-dtqqh" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-vq4bw" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Warning  FailedCreate      14m                 replicaset-controller  Error creating: pods "workload-a-preferred-c79d9dc5d-4dn6k" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi
  Warning  FailedCreate      13m (x11 over 14m)  replicaset-controller  (combined from similar events): Error creating: pods "workload-a-preferred-c79d9dc5d-z6kjf" is forbidden: exceeded quota: team-a-quota, requested: limits.cpu=500m,limits.memory=2Gi, used: limits.cpu=10,limits.memory=40Gi, limited: limits.cpu=10,limits.memory=40Gi

The ReplicaSet details show that Pod creation failed and that the cause is the exhausted quota of the namespace.
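
The rejection seen in the events above can be modelled as a check-then-charge step. This is a simplified model of quota enforcement, not the actual API server code; try_create_pod and the unit choices (millicores, Mi) are assumptions made for the example. It starts from the 15 running Pods and replays the scale-out to 25.

```python
HARD = {"limits.cpu": 10_000, "limits.memory": 40 * 1024}  # millicores, Mi
POD = {"limits.cpu": 500, "limits.memory": 2 * 1024}


def try_create_pod(used, hard, pod):
    """Admit the Pod only if every tracked resource stays within its hard limit."""
    if any(used[r] + pod[r] > hard[r] for r in pod):
        return False  # corresponds to the 403 Forbidden "exceeded quota" response
    for r in pod:
        used[r] += pod[r]
    return True


# 15 Pods already running: 7500m CPU and 30Gi memory in use.
used = {r: 15 * POD[r] for r in POD}

results = [try_create_pod(used, HARD, POD) for _ in range(10)]  # scale 15 -> 25
print(results.count(True), results.count(False))          # 5 5
print(used["limits.cpu"], used["limits.memory"] // 1024)  # 10000 40
```

The five rejected requests never turn into Pod objects, which is why nothing shows up as Pending.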

2. Scale in and where the remaining Pods end up

What happens to the Pods when the Deployment is scaled in to 5 Pods?

#Scale in to 5 Pods
kubectl scale deployments.apps -n team-a workload-a-preferred --replicas=5

deployment.apps/workload-a-preferred scaled

#Check which nodes the remaining Pods are on
kubectl get pods -n team-a -o wide --sort-by=.spec.nodeName                                  

NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
workload-a-preferred-c79d9dc5d-5sq8q   1/1     Running   0          93m   10.0.6.165    worker-05        <none>           <none>
workload-a-preferred-c79d9dc5d-fh4qt   1/1     Running   0          93m   10.0.8.240    worker-06        <none>           <none>
workload-a-preferred-c79d9dc5d-dgqjl   1/1     Running   0          35m   10.0.9.18     worker-07        <none>           <none>
workload-a-preferred-c79d9dc5d-mf27l   1/1     Running   0          35m   10.0.10.171   worker-08        <none>           <none>
workload-a-preferred-c79d9dc5d-bj8wt   1/1     Running   0          35m   10.0.7.152    worker-09        <none>           <none>

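The result shows that, on scale in, Node affinity has no influence on which Pods get deleted. I repeated the test several times; sometimes Pods did remain on the nodes targeted by Node affinity, but for the most part the Pods to delete appeared to be chosen at random.

This happens because Node affinity only takes part when a Pod is being scheduled. Scale in terminates existing Pods, and the choice of which Pods to terminate is made without consulting Node affinity, so Node affinity cannot influence it. This can potentially lead to the following problems.

Pods of the scaled-in workload remain on non-preferred nodes and hold those nodes' resources
Workloads that prefer those nodes (e.g. the workloads of Team B and Team C) get scheduled on their own non-preferred nodes for lack of resources
If this continues, the per-team node distribution can lose its meaning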
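
One way to notice this drift is to compare where a team's Pods run with the nodes labeled for that team. The sketch below hard-codes the placements from the scale-in output above and the node labels from the earlier kubectl label commands; in practice this data would come from the cluster, and stray_pods is a hypothetical helper name.

```python
NODE_TEAM = {f"worker-{i:02d}": "team-a" for i in range(1, 5)}
NODE_TEAM.update({f"worker-{i:02d}": "team-b" for i in range(5, 8)})
NODE_TEAM.update({f"worker-{i:02d}": "team-c" for i in range(8, 11)})

# Pod -> node, copied from the scale-in output above.
PLACEMENTS = {
    "workload-a-preferred-c79d9dc5d-5sq8q": "worker-05",
    "workload-a-preferred-c79d9dc5d-fh4qt": "worker-06",
    "workload-a-preferred-c79d9dc5d-dgqjl": "worker-07",
    "workload-a-preferred-c79d9dc5d-mf27l": "worker-08",
    "workload-a-preferred-c79d9dc5d-bj8wt": "worker-09",
}


def stray_pods(placements, node_team, team):
    """Pods of `team` that run on nodes labeled for a different team."""
    return sorted(p for p, n in placements.items() if node_team[n] != team)


strays = stray_pods(PLACEMENTS, NODE_TEAM, "team-a")
print(len(strays), "of", len(PLACEMENTS), "Pods are on non-preferred nodes")
```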

Workloads may therefore need to be monitored and rebalanced continuously. In a large cluster where many teams run workloads, doing this by hand is practically impossible. To address this, adopting the Kubernetes Descheduler is worth considering.

GitHub - kubernetes-sigs/descheduler: Descheduler for Kubernetes (github.com)

So far we have looked at managing multi-tenancy workloads with Node affinity and Resource Quota. Operating Kubernetes is not static; it has to change flexibly to match business requirements and the needs of the cluster's users. I hope this post helps you operate your Kubernetes cluster flexibly.