March 25, 2026 · about 18 min read · Personal Project / Wiki Engine

Category Search Filtering + Facet Aggregation: Designing the Lucene FILTER Clause

Lucene Search Engine Faceted Search BM25 Performance Spring Boot Wiki

In "Distributed Stability Verification: Stress Test + Limit Analysis", we used stress tests to find the limits of the distributed architecture (2 App + MySQL Replication + Redis 3-shard + Kafka CDC).

Metric	100 VU	200 VU (stress)
Avg response	42.8ms	897ms
P95	190ms	1,911ms
Error rate	0.00%	0.09%
Bottleneck	App CPU ~50%	App CPU 80-100% (Lucene BM25 + Nori)

At 100 VU, P95 was 190ms, within the 300ms SLA, and MySQL, Redis, and Kafka all had headroom. With the infrastructure bottlenecks resolved, the posts from here on focus on improving the search functionality itself.

1. Steady State — Current Search Architecture

Search flow

User search request: GET /api/v1.0/posts/search?q=프로그래밍&page=0&size=20
  → PostService.search(keyword, pageable)
  → TieredCacheService L1(Caffeine) → L2(Redis) → Origin:
    → LuceneSearchService.search(keyword, pageable)
    → buildQuery(keyword):
        MUST:   BM25(title^3, content^1) via MultiFieldQueryParser + Nori
        SHOULD: FeatureField.saturation(viewCount, w=3.0, pivot=1000)
        SHOULD: FeatureField.saturation(likeCount, w=2.0, pivot=100)
        SHOULD: RecencyDecay(halfLife=30 days)
    → TopDocs → extract IDs from Lucene docs → DB findAllById → Slice<Post>
  → PostSearchResponse(id, title, snippet, viewCount, likeCount, createdAt)
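To get a feel for how the two popularity boosts in this flow behave, the sketch below evaluates the saturation function that the Lucene Javadoc gives for FeatureField.newSaturationQuery, weight * S / (S + pivot), in plain Java. SaturationBoost is a hypothetical helper, not project code, and real scores can differ slightly because FeatureField stores feature values with reduced precision. The recency decay is left out because the post does not show its formula.

```java
// Plain-Java sketch of the saturation formula: weight * S / (S + pivot).
// SaturationBoost is a hypothetical helper, not part of Lucene or the project.
final class SaturationBoost {

    /** Returns weight * value / (value + pivot); equals weight / 2 when value == pivot. */
    static double score(double value, double weight, double pivot) {
        if (value < 0 || pivot <= 0) {
            throw new IllegalArgumentException("value must be >= 0 and pivot > 0");
        }
        return weight * value / (value + pivot);
    }

    public static void main(String[] args) {
        // viewCount boost from the flow above: w=3.0, pivot=1000
        System.out.println(score(1_000, 3.0, 1_000));   // 1.5, half of the weight
        System.out.println(score(100_000, 3.0, 1_000)); // approaches but never reaches 3.0
        // likeCount boost: w=2.0, pivot=100
        System.out.println(score(100, 2.0, 100));       // 1.0
    }
}
```

The pivot is the value at which a post earns half of the boost weight, so with these settings a post needs about 1,000 views or 100 likes to reach the midpoint of its boost.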

Lucene index field layout (verified in code)

LuceneIndexService.toDocument() (LuceneIndexService.java:161-179):

Document doc = new Document();
doc.add(new KeywordField("id", post.getId().toString(), Field.Store.YES));
doc.add(new TextField("title", post.getTitle(), Field.Store.YES));
doc.add(new TextField("content", post.getContent(), Field.Store.NO));

if (post.getCategoryId() != null) {
    doc.add(new LongField("categoryId", post.getCategoryId(), Field.Store.YES));
}

doc.add(new LongField("viewCount", post.getViewCount(), Field.Store.YES));
doc.add(new LongField("createdAt", post.getCreatedAt().toEpochMilli(), Field.Store.YES));
doc.add(new FeatureField("features", "viewCount", Math.max(post.getViewCount(), 1)));
doc.add(new FeatureField("features", "likeCount", Math.max(post.getLikeCount(), 1)));

Key fact: categoryId is already in the Lucene index as a LongField. Adding category “filtering” is not about adding a field — it is about adding a filter clause to the search query (buildQuery).

Category data state

posts table:
  category_id BIGINT (nullable), FK to categories table

categories table:
  id BIGINT PK
  name VARCHAR (NOT NULL, UNIQUE)
  parent_id BIGINT (nullable), supports hierarchy

Post entity: private Long categoryId; (nullable — uncategorized posts allowed)
Category entity: name + parentId (hierarchical)
Categories are created together with the Wikipedia import

Existing category-related features

PostController already supports category filtering on the listing endpoint (PostController.java:41-51):

@GetMapping
public Slice<PostListResponse> getPosts(
    @RequestParam(required = false) Long categoryId,
    Pageable pageable
) {
    if (categoryId != null) {
        return postService.getPostsByCategory(categoryId, pageable);
    }
    return postService.getLatestPosts(pageable);
}

But that is SQL-based filtering (postRepository.findByCategoryIdOrderByCreatedAtDesc). The search API (GET /posts/search) has no category filtering.

2. Problem — No Category Filtering in Search

Problem 1: the search API has no category filter parameter

The search endpoint at PostController.java:128-134:

@GetMapping("/search")
public Slice<PostSearchResponse> search(
    @RequestParam String q,
    Pageable pageable
) {
    return postService.search(q, pageable);
}

The endpoint has no categoryId parameter. Searching “프로그래밍” returns results from all 14.25M documents, with no way to narrow them to a specific category.

A request such as “show me only the Java-related ones among these results” cannot be satisfied. The listing endpoint (GET /posts) supports category filtering but search does not, which is a functional asymmetry.

There is also no way to aggregate how the 1,233 hits for “프로그래밍” (measured in the search-quality evaluation) are distributed across categories, so the user has to scroll through the results blindly.

Faceted Navigation in real search engines:

Google: "프로그래밍" → tabs (All/Images/News/Video) + tools (date filter)
Naver: "프로그래밍" → tabs (Integrated/Blog/Cafe/Knowledge iN) + category filters
Stack Overflow: tag-based filtering (java, python, etc.) + count per tag
Amazon: product search → left category tree + count per category

In community search, Faceted Navigation is a baseline feature.

3. Analysis — Structural Cause

Why category filtering does not work

categoryId exists in the Lucene index, but buildQuery() (LuceneSearchService.java:176-197) has no category filter clause:

// current buildQuery(): no category filter
return new BooleanQuery.Builder()
    .add(textQuery, BooleanClause.Occur.MUST)        // text matching
    .add(viewBoost, BooleanClause.Occur.SHOULD)       // popularity
    .add(likeBoost, BooleanClause.Occur.SHOULD)       // likes
    .add(recencyBoost, BooleanClause.Occur.SHOULD)    // recency
    .build();

The field exists but is not used in the query. Adding LongField.newExactQuery("categoryId", categoryId) as a FILTER clause makes filtering work.

Facet aggregation means counting matching docs per category across the entire matching set. A normal search query returns only top-K, so a separate Collector is needed.

Lucene provides a Facet API in the lucene-facet module, but the current build.gradle has no lucene-facet dependency:

// build.gradle: current
implementation 'org.apache.lucene:lucene-core:10.3.2'
implementation 'org.apache.lucene:lucene-analysis-nori:10.3.2'
implementation 'org.apache.lucene:lucene-queryparser:10.3.2'
implementation 'org.apache.lucene:lucene-queries:10.3.2'
// no lucene-facet!

Two paths to implement Facets:

Approach	Required	Pros / Cons
Lucene Facet API (`lucene-facet`)	`SortedSetDocValuesFacetField` + `FacetsConfig` + `SortedSetDocValuesFacetCounts`	Native Facet with exact aggregation. Requires adding `SortedSetDocValuesFacetField` to the index plus a full reindex
Manual aggregation (use existing LongField)	`LongField("categoryId")` already exists → DB GROUP BY on result postIds	No reindex and no extra dependencies. But it aggregates only the top-K result IDs passed to the DB, not the full match set (not an exact Facet)

4. Alternatives — Why I Picked This

Category filtering approach

Option	Pro	Con	Verdict
Lucene LongField.newExactQuery + FILTER	already in index, no reindex, pagination correct	-	chosen
DB Post-filter (Lucene results → DB WHERE category_id=?)	no Lucene change	breaks pagination (100 → 50 after filter → page shows half)	rejected
Elasticsearch	native filter + Aggregation	needs separate cluster, impossible on Free Tier (≥6GB RAM)	rejected

Why DB Post-filter is rejected, concretely: Lucene returns 20, then DB filtering by category drops it to 10 — only 10 shown on that page. Same problem repeats on the next page. Doing it in Lucene with FILTER returns exactly 20 results from that category from the start.
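The pagination problem can be reproduced with a small in-memory model. The sketch below is only an illustration under made-up data: PaginationDemo, Hit, and the alternating category layout are hypothetical, and no Lucene or DB call is involved. It compares "page first, then filter" (the DB post-filter) with "filter first, then page" (the effect of a FILTER clause inside the query).

```java
import java.util.List;
import java.util.stream.IntStream;

// Toy model: each "hit" is a post id plus its category, already sorted by relevance.
// PaginationDemo and Hit are hypothetical names used only for this illustration.
final class PaginationDemo {

    record Hit(long postId, long categoryId) {}

    /** Post-filter: take a page from the unfiltered ranking, then drop other categories. */
    static List<Hit> postFilterPage(List<Hit> ranked, long categoryId, int page, int size) {
        return ranked.stream()
                .skip((long) page * size)
                .limit(size)
                .filter(h -> h.categoryId() == categoryId)
                .toList();
    }

    /** Filter first (the effect of a FILTER clause inside the query), then paginate. */
    static List<Hit> filterThenPage(List<Hit> ranked, long categoryId, int page, int size) {
        return ranked.stream()
                .filter(h -> h.categoryId() == categoryId)
                .skip((long) page * size)
                .limit(size)
                .toList();
    }

    /** 100 ranked hits alternating between category 7 and category 42. */
    static List<Hit> sampleHits() {
        return IntStream.range(0, 100)
                .mapToObj(i -> new Hit(i, i % 2 == 0 ? 7L : 42L))
                .toList();
    }

    public static void main(String[] args) {
        List<Hit> ranked = sampleHits();
        System.out.println(postFilterPage(ranked, 7L, 0, 20).size()); // 10: half the page is lost
        System.out.println(filterThenPage(ranked, 7L, 0, 20).size()); // 20: a full page
    }
}
```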

Option	Pro	Con	Verdict
Lucene SortedSetDocValuesFacetCounts	exact full-match aggregation, native	adds `lucene-facet` dependency + `SortedSetDocValuesFacetField` + full reindex required	applied during reindex
DB GROUP BY (over result IDs)	no reindex, instant to implement	aggregates only top-K, not the whole match set, extra DB round-trip	applied first
Taxonomy Index	supports hierarchical Facets	high cost of managing a separate index	rejected

Phased approach:

Now: category filtering (LongField FILTER, no reindex) + DB-based approximate Facet
At the reindex for the query expansion work: add SortedSetDocValuesFacetField + switch to native Lucene Facet

This delivers the feature immediately, and the upgrade to exact Facets happens together with the build-out of the reindex infrastructure.

5. Implementation

5-1. Category filtering — modifying LuceneSearchService

Since categoryId is already indexed as LongField, we add a categoryId parameter to search() and a FILTER clause.

// LuceneSearchService: change
public Slice<Post> search(String keyword, Long categoryId, Pageable pageable) throws IOException {
    IndexSearcher searcher = searcherManager.acquire();
    try {
        Query query = buildQuery(keyword, categoryId);  // pass categoryId
        // ... existing logic identical
    } finally {
        searcherManager.release(searcher);  // every acquire() needs a matching release()
    }
}

private Query buildQuery(String keyword, Long categoryId) throws ParseException {
    // existing BM25 + popularity + recency clauses (textQuery, viewBoost, likeBoost, recencyBoost)
    BooleanQuery.Builder builder = new BooleanQuery.Builder()
        .add(textQuery, BooleanClause.Occur.MUST)
        .add(viewBoost, BooleanClause.Occur.SHOULD)
        .add(likeBoost, BooleanClause.Occur.SHOULD)
        .add(recencyBoost, BooleanClause.Occur.SHOULD);

    // add the category filter only when a category was requested
    if (categoryId != null) {
        builder.add(LongField.newExactQuery("categoryId", categoryId),
                     BooleanClause.Occur.FILTER);
    }

    return builder.build();
}

Why Occur.FILTER:

MUST affects scoring. A category filter only needs to answer “is this post in the category?”, so it should not influence the relevance score.
FILTER is a required clause like MUST but does not contribute to the score. Because FILTER clauses are evaluated without scoring, Lucene's query cache can cache their matching document sets, which may help repeated searches on the same category.
Source: Lucene BooleanClause.Occur Javadoc
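The difference between the two clause types can be shown with a toy scoring model. This is not Lucene code: OccurDemo, Doc, and the constant clause score of 1.0 are assumptions chosen only to illustrate that both clause types restrict the result set in the same way, while only MUST changes the scores.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy scoring model, not Lucene itself: a required clause either adds its score (MUST)
// or only gates matching (FILTER). OccurDemo, Doc and the constant 1.0 are assumptions.
final class OccurDemo {

    enum Occur { MUST, FILTER }

    record Doc(long id, long categoryId, double textScore) {}

    /** Scores docs that match the category clause; non-matching docs are excluded. */
    static Map<Long, Double> search(List<Doc> docs, long categoryId, Occur occur) {
        Map<Long, Double> scores = new LinkedHashMap<>();
        for (Doc d : docs) {
            if (d.categoryId() != categoryId) {
                continue; // both MUST and FILTER require the clause to match
            }
            // An exact-match clause scores every match identically; assume 1.0 here.
            double clauseScore = (occur == Occur.MUST) ? 1.0 : 0.0;
            scores.put(d.id(), d.textScore() + clauseScore);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc(1, 7, 2.5), new Doc(2, 42, 9.0), new Doc(3, 7, 0.5));
        System.out.println(search(docs, 7, Occur.MUST));   // {1=3.5, 3=1.5}
        System.out.println(search(docs, 7, Occur.FILTER)); // {1=2.5, 3=0.5}
    }
}
```

In this model the ranking is the same either way, but FILTER keeps the text relevance scores untouched and makes it explicit that the clause never needs to be scored.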

5-2. API change — PostController + PostService

// PostController: add categoryId to the search endpoint
@GetMapping("/search")
public Slice<PostSearchResponse> search(
    @RequestParam String q,
    @RequestParam(required = false) Long categoryId,
    Pageable pageable
) {
    return postService.search(q, categoryId, pageable);
}

// PostService: pass categoryId down to LuceneSearchService
public Slice<PostSearchResponse> search(String keyword, Long categoryId, Pageable pageable) {
    // include categoryId in the cache key
    String cacheKey = keyword + ":" + categoryId + ":" + pageable.getPageNumber()
                    + ":" + pageable.getPageSize();
    // L1 → L2 → origin same logic; pass categoryId at origin
    Slice<Post> posts = luceneSearchService.search(keyword, categoryId, pageable);
    // ...
}

Cache-key caveat: categoryId must be in the cache key. The same keyword returns different results per category, so without adding categoryId to the existing key (keyword:page:size), cache pollution would occur.
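A minimal sketch of such a key builder follows. SearchCacheKey and the "all" marker for unfiltered searches are hypothetical choices made for this example; the project's actual key format may differ.

```java
import java.util.Objects;

// Hypothetical cache-key helper; the project's real key format may differ.
final class SearchCacheKey {

    /** Builds keyword:category:page:size, writing "all" when no category is selected. */
    static String of(String keyword, Long categoryId, int page, int size) {
        Objects.requireNonNull(keyword, "keyword");
        String category = (categoryId == null) ? "all" : categoryId.toString();
        return keyword + ":" + category + ":" + page + ":" + size;
    }

    public static void main(String[] args) {
        System.out.println(of("java", null, 0, 20)); // java:all:0:20
        System.out.println(of("java", 7L, 0, 20));   // java:7:0:20
    }
}
```

Writing an explicit marker instead of relying on string concatenation of a null Long keeps the unfiltered key readable and clearly distinct from any real category id.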

5-3. Category Facet: DB-based approximate implementation

Instead of native Lucene Facets, the category distribution is provided via DB aggregation after the search.

// PostService: category Facet (DB-based approximate implementation)
public List<CategoryFacet> getCategoryFacets(String keyword, int topN) throws IOException {
    // 1. extract result IDs from Lucene (capped at the top 1,000)
    List<Long> postIds = luceneSearchService.searchIds(keyword, 1000);
    if (postIds.isEmpty()) {
        return List.of();  // avoid running an IN () query with an empty list
    }

    // 2. aggregate counts per category in DB
    return postRepository.countByCategoryIdIn(postIds).stream()
        .sorted(Comparator.comparing(CategoryFacet::count).reversed())
        .limit(topN)
        .toList();
}

// PostRepository: counts per category
// (inner join: posts without a category are not counted)
@Query("SELECT new com.wiki.engine.post.dto.CategoryFacet(c.id, c.name, COUNT(p)) " +
       "FROM Post p JOIN Category c ON p.categoryId = c.id " +
       "WHERE p.id IN :postIds " +
       "GROUP BY c.id, c.name " +
       "ORDER BY COUNT(p) DESC")
List<CategoryFacet> countByCategoryIdIn(@Param("postIds") List<Long> postIds);

Acknowledged limit: this aggregates only the top-1,000 results, so it is not an exact Facet over the full match set. But in search engines what users actually care about is the distribution of top results, not the category of the 100,000th. The top-1,000 distribution mirrors the overall trend, so it is good enough UX-wise. Will switch to the Lucene Facet API during the query expansion reindex.
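How a top-K cap can differ from an exact Facet is easy to see on a tiny example. The sketch below is an in-memory illustration with made-up data, not the project's code: FacetApproximation is a hypothetical name, and the list stands for the category ids of all matching posts in relevance order.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// In-memory illustration only: category ids of matching posts, ordered by relevance.
// FacetApproximation is a hypothetical name; no Lucene or DB access is involved.
final class FacetApproximation {

    /** Counts posts per category over the first topK entries of the ranked list. */
    static Map<Long, Long> countTopK(List<Long> rankedCategoryIds, int topK) {
        return rankedCategoryIds.stream()
                .limit(topK)
                .collect(Collectors.groupingBy(id -> id, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Six matches in total; category 15 only appears lower in the ranking.
        List<Long> ranked = List.of(42L, 42L, 7L, 42L, 15L, 15L);
        System.out.println(countTopK(ranked, ranked.size())); // exact: {7=1, 15=2, 42=3}
        System.out.println(countTopK(ranked, 4));             // top-4: {7=1, 42=3}
    }
}
```

Categories whose posts rank below the cutoff disappear from the approximate counts, which is the trade-off accepted here until the native Facet takes over.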

5-4. Response DTO extension

// PostSearchResponse: add categoryId
public record PostSearchResponse(
    Long id,
    String title,
    String snippet,
    Long viewCount,
    Long likeCount,
    Instant createdAt,
    Long categoryId       // added
) {}

// CategoryFacet: facet aggregation result
public record CategoryFacet(
    Long id,
    String name,
    Long count
) {}

// SearchWithFacetsResponse: combined search + facet response
public record SearchWithFacetsResponse(
    Slice<PostSearchResponse> results,
    List<CategoryFacet> facets
) {}

5-5. Full API design

Before:  GET /api/v1.0/posts/search?q=프로그래밍&page=0&size=20
After:   GET /api/v1.0/posts/search?q=프로그래밍&categoryId=42&page=0&size=20

Response:
{
  "data": {
    "results": {
      "content": [
        { "id": 123, "title": "...", "snippet": "...", "categoryId": 42, ... }
      ],
      "hasNext": true
    },
    "facets": [
      { "id": 42, "name": "Programming Languages", "count": 342 },
      { "id": 15, "name": "Software Engineering", "count": 189 },
      { "id": 7, "name": "OS", "count": 127 }
    ]
  }
}

Facets are returned only when the search has no categoryId parameter. Showing Facets when a category is already selected adds nothing, because the drilled-down results all belong to that one category.
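One way to express this rule is to pass the facet aggregation as a lazy loader, so it only runs for unfiltered searches. The sketch below assumes hypothetical stand-ins (FacetGate, Facet, Response, and plain strings for the search results) instead of the post's actual DTOs and services.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical response assembly; Facet and Response stand in for the post's DTOs.
final class FacetGate {

    record Facet(long id, String name, long count) {}

    record Response(List<String> results, List<Facet> facets) {}

    /** Runs the facet aggregation only when the search is not already narrowed to a category. */
    static Response assemble(List<String> results, Long categoryId, Supplier<List<Facet>> facetLoader) {
        List<Facet> facets = (categoryId == null) ? facetLoader.get() : List.of();
        return new Response(results, facets);
    }

    public static void main(String[] args) {
        Supplier<List<Facet>> loader = () -> List.of(new Facet(42, "Programming Languages", 342));
        System.out.println(assemble(List.of("post-123"), null, loader).facets().size()); // 1
        System.out.println(assemble(List.of("post-123"), 42L, loader).facets().size());  // 0
    }
}
```

Because the loader is never invoked when a category is selected, the extra ID search and DB aggregation are skipped entirely for drilled-down requests.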

6. Verification — Before/After

After: with categoryId=7 filter

Searching “프로그래밍” with categoryId=7 (Computer Science): LongField.newExactQuery("categoryId", 7) with Occur.FILTER returns only that category's results. Only Computer Science posts appear, such as mechatronics, spaghetti code, and reconfigurable computing.

Items applied together at the reindex

Added lucene-facet dependency (org.apache.lucene:lucene-facet:10.3.2)
Added SortedSetDocValuesFacetField("category", categoryName) to the index
FacetsConfig setup + config.build(doc) applied
Switched to exact Facets via SortedSetDocValuesFacetCounts
Added lucene-highlighter + snippetSource StoredField + UnifiedHighlighter
Full reindex completed (12,156,589 docs, 42GB, ~2 hours)

In Query Expansion + Query Understanding — Search Quality Enhancement we add synonym expansion (“AI” → “인공지능”), typo correction (DirectSpellChecker), the Nori user dictionary, snippet improvements via UnifiedHighlighter, and the full reindex infrastructure.

Written by @범수
