Groundx doc pipeline #191

namanvirk18 · 2025-09-05T17:45:53Z

Summary by CodeRabbit

New Features
- Inline PDF and image previews in the upload flow, with file metadata display.
- Embedded document preview and expanded analysis results (summary, sample content, keywords, extracted text).
UI/UX
- Streamlined header and redesigned tab navigation (analysis/chat) with segmented control.
- Richer analysis sections: JSON Output, Narrative Summary, File Summary, Suggested Text, Extracted Text, Keywords.
- Two-column processing steps with automatic switch to analysis on completion.
- Persistent chat history and focused chat view.
- Updated branding and styling for tabs, previews, buttons, and layout.
Behavior Changes
- Progress bar replaced with a spinner during processing.

coderabbitai · 2025-09-05T17:45:59Z

Walkthrough

Introduces PDF and image preview in the upload flow, restructures UI with an active_tab state and segmented tabs, expands analysis displays and chat handling, and replaces granular progress tracking with a spinner-based polling loop. Adds display_pdf(file) and updates processing/preview/analysis sequences and styling.

Changes

Cohort / File(s)	Summary
UI overhaul and previews `groundX-doc-pipeline/app.py`	Added display_pdf(file) using base64 and iframe; integrated PDF/image previews and file metadata in upload flow; introduced active_tab session state and segmented tab UI; reworked processing steps with auto-tab switch; expanded analysis sections (document preview, summaries, extracted text, keywords, JSON); strengthened chat flow with context and history; updated branding and CSS.
Processing status polling simplification `groundX-doc-pipeline/groundx_utils.py`	Replaced detailed progress parsing with st.spinner-based loop polling gx.documents.get_processing_status_by_id(...).ingest until terminal states or timeout; preserved timeout; now raises RuntimeError if not complete; no public signatures changed.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant UI as App UI (app.py)
  participant GX as Ground X API
  participant Utils as groundx_utils.poll_until_complete

  U->>UI: Upload file
  alt application/pdf
    UI->>UI: display_pdf(file) via base64 iframe
  else image/*
    UI->>UI: Render image preview
  else docx/other
    UI->>UI: Show "preview after processing" notice
  end
  U->>UI: Click "Process"
  UI->>GX: Start processing (create document)
  UI->>Utils: poll_until_complete(process_id)
  activate Utils
  Utils->>GX: get_processing_status_by_id(...).ingest (poll)
  GX-->>Utils: status (processing|complete|error|cancelled)
  loop until terminal or timeout
    Utils->>GX: poll status
    GX-->>Utils: status
  end
  Utils-->>UI: completion or raise error
  deactivate Utils
  alt complete
    UI->>GX: Fetch X-Ray data
    UI->>UI: Switch active_tab -> analysis
    UI->>UI: Render Analysis tabs (JSON, Summary, File Summary, Extracted Text, Keywords)
    UI->>UI: Embedded document preview and sample content
  else error/cancelled/timeout
    UI->>UI: Show error message
  end

  U->>UI: Open Chat tab
  UI->>UI: prepare_chat_context(xray, prompt)
  UI->>UI: generate_chat_response(prompt, context)
  UI-->>U: Stream/Show response (chat history maintained)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nibbled bytes and flipped a tab,
A spinner twirled—no progress drab.
I framed a PDF with base64 flair,
Preview here, analysis there.
Chat squeaks wise with context tight—
Hop-hop! Your docs are clear in sight.
🐇📄✨

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

groundX-doc-pipeline/groundx_utils.py (1)
70-83: Add HTTP client timeouts when fetching X-Ray JSON

External GETs lack timeouts; this can hang the app indefinitely.
-        response = requests.get(document.xray_url)
+        response = requests.get(document.xray_url, timeout=15)
-                response = requests.get(doc.xray_url)
+                response = requests.get(doc.xray_url, timeout=15)
Also applies to: 92-105
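
A minimal sketch of the same fix wrapped in a helper, so both call sites share one timeout policy. `fetch_xray_json` and its `get` parameter are hypothetical names, not part of groundx_utils.py; `get` is assumed to behave like `requests.get` (it accepts `timeout=` and returns an object with `raise_for_status()` and `json()`). The tuple form sets separate connect and read timeouts, which requests supports.

```python
from typing import Any, Callable, Tuple


def fetch_xray_json(
    get: Callable[..., Any],
    url: str,
    timeout: Tuple[float, float] = (5.0, 15.0),
) -> Any:
    """Fetch X-Ray JSON with explicit (connect, read) timeouts.

    `get` is injected (for example `requests.get`) so the helper can be
    exercised without network access.
    """
    response = get(url, timeout=timeout)
    # Surface HTTP 4xx/5xx as an exception instead of parsing an error body.
    response.raise_for_status()
    return response.json()
```

In the app this would be called as `fetch_xray_json(requests.get, document.xray_url)`; a timeout then surfaces as `requests.exceptions.Timeout`, which the caller can report instead of hanging.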

🧹 Nitpick comments (7)

groundX-doc-pipeline/groundx_utils.py (3)
52-66: Use monotonic clock and configurable poll interval; keep UX spinner

Loop works, but timeouts should use time.monotonic and a poll_interval param to tune cadence. Also tolerate brief API hiccups without breaking the spinner.
-def poll_until_complete(gx: GroundX, process_id: str, timeout: int = 600) -> None:
+def poll_until_complete(gx: GroundX, process_id: str, timeout: int = 600, poll_interval: float = 3.0) -> None:
     """Monitor document processing status until completion"""
-    start_time = time.time()
+    start_time = time.monotonic()
     
     # Use a spinner container for better UX
     with st.spinner("Processing document..."):
         while True:
-            status = gx.documents.get_processing_status_by_id(process_id=process_id).ingest
+            try:
+                status = gx.documents.get_processing_status_by_id(process_id=process_id).ingest
+            except Exception as e:
+                # brief backoff on transient errors
+                if time.monotonic() - start_time > timeout:
+                    raise TimeoutError("Ground X ingest timed out.") from e
+                time.sleep(min(1.0, poll_interval))
+                continue
             
             if status.status in {"complete", "error", "cancelled"}:
                 break
-            if time.time() - start_time > timeout:
+            if time.monotonic() - start_time > timeout:
                 raise TimeoutError("Ground X ingest timed out.")
-            time.sleep(3)
+            time.sleep(poll_interval)
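
The loop above can also be sketched without Streamlit or the Ground X client, which makes the deadline logic easy to test. `poll_status`, `get_status`, `clock`, and `sleep` are hypothetical names introduced for illustration; the terminal states are the ones used in the diff.

```python
import time
from typing import Callable, Optional

TERMINAL_STATES = {"complete", "error", "cancelled"}


def poll_status(
    get_status: Callable[[], str],
    timeout: float = 600.0,
    poll_interval: float = 3.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it returns a terminal state or the deadline passes."""
    deadline = clock() + timeout
    while True:
        status: Optional[str]
        try:
            status = get_status()
        except Exception:
            # Treat any failure as transient; the deadline check below bounds the retries.
            status = None
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError("Ground X ingest timed out.")
        sleep(poll_interval)
```

In the app, `get_status` would presumably be `lambda: gx.documents.get_processing_status_by_id(process_id=process_id).ingest.status`, with the call wrapped in `st.spinner` as before. Using `time.monotonic` keeps the deadline unaffected by wall-clock adjustments.
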
25-35: Fix return type hint for ensure_bucket (actual id is int)

Function returns bucket.bucket_id which appears to be an int. Align the annotation (or use int | str) to avoid downstream confusion.
-@st.cache_resource(show_spinner=False)
-def ensure_bucket(_gx: GroundX, name: str = "gx_demo") -> str:
+@st.cache_resource(show_spinner=False)
+def ensure_bucket(_gx: GroundX, name: str = "gx_demo") -> int:
36-51: Align bucket_id typing across helpers

ingest_document accepts Union[str,int] at runtime; reflect this in type hints for clarity.
-def ingest_document(gx: GroundX, bucket_id: str, path: Path, mime: str) -> str:
+from typing import Union
+
+def ingest_document(gx: GroundX, bucket_id: Union[str, int], path: Path, mime: str) -> str:
groundX-doc-pipeline/app.py (4)
820-838: Remove unused in_chat_mode flag

in_chat_mode is set but never read; dead state.
-            # Ensure we stay in chat mode
-            st.session_state.in_chat_mode = True
47-246: Reduce CSS duplication and risky negative margins

Large repeated blocks with aggressive overrides/negative margins make the layout brittle and harder to maintain. Consolidate shared button/column styles into a single CSS block and avoid overlapping z-index hacks unless needed.

503-538: Minor: collapse upload status steps into a single status container

UX copy looks good. Consider wrapping step messages in a single st.status or st.container to avoid jitter.

482-486: Clean up temp files after processing

NamedTemporaryFile(delete=False) leaves files behind. After successful processing, unlink the file unless you intentionally keep it for re-processing.
-    st.session_state.uploaded_file_path = tmp_file.name
+    st.session_state.uploaded_file_path = tmp_file.name
+    # TODO: after processing completes, consider: Path(tmp_file.name).unlink(missing_ok=True)
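
One way to make the cleanup automatic is to scope the temp file to the processing call. This is a sketch under the assumption that the file is not needed after processing; `process_with_temp_copy` and `process` are hypothetical names, and the app as written keeps the path in `st.session_state`, so it would need adapting if re-processing is intended.

```python
import tempfile
from pathlib import Path
from typing import Any, Callable


def process_with_temp_copy(data: bytes, suffix: str, process: Callable[[Path], Any]) -> Any:
    """Write `data` to a temp file, run `process(path)`, and always remove the file."""
    # delete=False so the file can be reopened by name after this block closes it.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
        tmp_file.write(data)
        tmp_path = Path(tmp_file.name)
    try:
        return process(tmp_path)
    finally:
        # Runs on success and on error; missing_ok requires Python 3.8+.
        tmp_path.unlink(missing_ok=True)
```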

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e597969 and 1fc4085.

📒 Files selected for processing (2)

groundX-doc-pipeline/app.py (9 hunks)
groundX-doc-pipeline/groundx_utils.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

groundX-doc-pipeline/groundx_utils.py (1)

groundX-doc-pipeline/evaluation_geval.py (2)

_poll_until_complete (129-139)

process_invoice (117-127)

groundX-doc-pipeline/app.py (2)

groundX-doc-pipeline/groundx_utils.py (1)

process_document (107-128)

firecrawl-agent/app.py (1)

display_pdf (60-71)

🪛 Ruff (0.12.2)

groundX-doc-pipeline/groundx_utils.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)

groundX-doc-pipeline/app.py

262-262: SyntaxError: Expected a statement

262-262: SyntaxError: Simple statements must be separated by newlines or semicolons

265-265: SyntaxError: Expected a statement

265-266: SyntaxError: Expected a statement

266-266: SyntaxError: Expected a statement

266-266: SyntaxError: Simple statements must be separated by newlines or semicolons

267-267: SyntaxError: Unexpected indentation

271-271: SyntaxError: Expected a statement

271-271: SyntaxError: Simple statements must be separated by newlines or semicolons

274-274: SyntaxError: Unexpected indentation

277-277: SyntaxError: Expected a statement

277-278: SyntaxError: Expected a statement

278-278: SyntaxError: Unexpected indentation

280-280: SyntaxError: Expected a statement

280-280: SyntaxError: Simple statements must be separated by newlines or semicolons

405-405: SyntaxError: Expected a statement

405-405: SyntaxError: Simple statements must be separated by newlines or semicolons

407-407: SyntaxError: Unexpected indentation

434-434: SyntaxError: Expected a statement

434-435: SyntaxError: Expected a statement

435-435: SyntaxError: Unexpected indentation

448-448: SyntaxError: Expected a statement

448-448: SyntaxError: Simple statements must be separated by newlines or semicolons

552-552: SyntaxError: Expected a statement

552-552: SyntaxError: Simple statements must be separated by newlines or semicolons

554-554: SyntaxError: Unexpected indentation

598-598: SyntaxError: Expected a statement

598-599: SyntaxError: Expected a statement

600-600: SyntaxError: Unexpected indentation

601-601: SyntaxError: Expected a statement

601-601: SyntaxError: Simple statements must be separated by newlines or semicolons

603-603: SyntaxError: Unexpected indentation

742-742: SyntaxError: Expected a statement

742-742: SyntaxError: Simple statements must be separated by newlines or semicolons

744-744: SyntaxError: Expected a statement

744-745: SyntaxError: Expected a statement

745-745: SyntaxError: Expected a statement

745-745: SyntaxError: Simple statements must be separated by newlines or semicolons

746-746: SyntaxError: Unexpected indentation

762-762: SyntaxError: unindent does not match any outer indentation level

770-770: SyntaxError: unindent does not match any outer indentation level

772-772: SyntaxError: Expected a statement

772-772: SyntaxError: Simple statements must be separated by newlines or semicolons

774-774: SyntaxError: Expected a statement

774-775: SyntaxError: Expected a statement

775-775: SyntaxError: Expected a statement

775-775: SyntaxError: Simple statements must be separated by newlines or semicolons

776-776: SyntaxError: Unexpected indentation

792-792: SyntaxError: unindent does not match any outer indentation level

794-794: SyntaxError: Expected a statement

794-794: SyntaxError: Simple statements must be separated by newlines or semicolons

796-796: SyntaxError: Expected a statement

796-797: SyntaxError: Expected a statement

797-797: SyntaxError: Expected a statement

797-797: SyntaxError: Simple statements must be separated by newlines or semicolons

798-798: SyntaxError: Unexpected indentation

812-812: SyntaxError: unindent does not match any outer indentation level

820-820: SyntaxError: unindent does not match any outer indentation level

820-820: SyntaxError: Invalid annotated assignment target

820-821: SyntaxError: Expected an expression

821-821: SyntaxError: Unexpected indentation

835-835: SyntaxError: Expected a statement

835-835: SyntaxError: Simple statements must be separated by newlines or semicolons

837-837: SyntaxError: Unexpected indentation

840-840: SyntaxError: Expected a statement

840-841: SyntaxError: Expected a statement

841-841: SyntaxError: Expected a statement

841-841: SyntaxError: Simple statements must be separated by newlines or semicolons

857-857: SyntaxError: Expected a statement
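
The errors at 262-280 and 405-448 coincide with the conflict markers flagged in the inline comments; whether the remaining ranges have the same cause is an inference, not something the Ruff output states. A small check like the following sketch could confirm it before pushing. `find_conflict_markers` is a hypothetical helper, not part of the PR.

```python
import re
from typing import List, Tuple

# A conflict marker line starts with exactly seven '<', '=' or '>' characters,
# followed by whitespace or the end of the line.
_MARKER = re.compile(r"^(<{7}|={7}|>{7})(\s|$)")


def find_conflict_markers(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for leftover Git conflict markers."""
    return [
        (number, line)
        for number, line in enumerate(source.splitlines(), start=1)
        if _MARKER.match(line)
    ]
```

Running it over the text of app.py (for example `find_conflict_markers(Path("groundX-doc-pipeline/app.py").read_text())`) lists every marker line that still needs resolving.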

coderabbitai · 2025-09-05T17:53:36Z

groundX-doc-pipeline/app.py

+def display_pdf(file):
+    """Display PDF preview using embedded iframe"""
+    st.markdown("### PDF Preview")
+    base64_pdf = base64.b64encode(file.read()).decode("utf-8")
+<<<<<<< Updated upstream
+
+    # Embedding PDF in HTML
+=======
+>>>>>>> Stashed changes
+    pdf_display = f"""<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="400" type="application/pdf"
+                        style="border: 1px solid #ddd; border-radius: 8px;"
+                    >
+                    </iframe>"""
+<<<<<<< Updated upstream
+
+    # Displaying File
+    st.markdown(pdf_display, unsafe_allow_html=True)
+
 # Chat Interface Functions
+=======
+    st.markdown(pdf_display, unsafe_allow_html=True)
+
+>>>>>>> Stashed changes
 def prepare_chat_context(xray_data, prompt):


⚠️ Potential issue

Resolve merge conflict in display_pdf and reset file pointer

Unresolved conflict markers will crash the app. Also reset the file object after reading for downstream use.

-def display_pdf(file):
-    """Display PDF preview using embedded iframe"""
-    st.markdown("### PDF Preview")
-    base64_pdf = base64.b64encode(file.read()).decode("utf-8")
-<<<<<<< Updated upstream
-
-    # Embedding PDF in HTML
-=======
->>>>>>> Stashed changes
-    pdf_display = f"""<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="400" type="application/pdf"
-                        style="border: 1px solid #ddd; border-radius: 8px;"
-                    >
-                    </iframe>"""
-<<<<<<< Updated upstream
-
-    # Displaying File
-    st.markdown(pdf_display, unsafe_allow_html=True)
-
-=======
-    st.markdown(pdf_display, unsafe_allow_html=True)
-
->>>>>>> Stashed changes
+def display_pdf(file):
+    """Display PDF preview using embedded iframe"""
+    st.markdown("### PDF Preview")
+    data = file.getvalue() if hasattr(file, "getvalue") else file.read()
+    base64_pdf = base64.b64encode(data).decode("utf-8")
+    pdf_display = f'''<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="400" type="application/pdf"
+                        style="border: 1px solid #ddd; border-radius: 8px;"></iframe>'''
+    st.markdown(pdf_display, unsafe_allow_html=True)
+    if hasattr(file, "seek"):
+        try:
+            file.seek(0)
+        except Exception:
+            pass

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-def display_pdf(file):
-    """Display PDF preview using embedded iframe"""
-    st.markdown("### PDF Preview")
-    base64_pdf = base64.b64encode(file.read()).decode("utf-8")
-<<<<<<< Updated upstream
-
-    # Embedding PDF in HTML
-=======
->>>>>>> Stashed changes
-    pdf_display = f"""<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="400" type="application/pdf"
-                        style="border: 1px solid #ddd; border-radius: 8px;"
-                    >
-                    </iframe>"""
-<<<<<<< Updated upstream
-
-    # Displaying File
-    st.markdown(pdf_display, unsafe_allow_html=True)
-
-# Chat Interface Functions
-=======
-    st.markdown(pdf_display, unsafe_allow_html=True)
-
->>>>>>> Stashed changes
+def display_pdf(file):
+    """Display PDF preview using embedded iframe"""
+    st.markdown("### PDF Preview")
+    # Read bytes: Streamlit's UploadedFile is a BytesIO subclass, so getvalue() is available;
+    # fall back to read() for other file-like objects
+    data = file.getvalue() if hasattr(file, "getvalue") else file.read()
+    base64_pdf = base64.b64encode(data).decode("utf-8")
+    pdf_display = f'''<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="400" type="application/pdf"
+                        style="border: 1px solid #ddd; border-radius: 8px;"></iframe>'''
+    st.markdown(pdf_display, unsafe_allow_html=True)
+    # Reset pointer so the file can be re-read later
+    if hasattr(file, "seek"):
+        try:
+            file.seek(0)
+        except Exception:
+            pass
 def prepare_chat_context(xray_data, prompt):

🧰 Tools

🪛 Ruff (0.12.2)

262-262: SyntaxError: Expected a statement

262-262: SyntaxError: Expected a statement

262-262: SyntaxError: Expected a statement

262-262: SyntaxError: Expected a statement

262-262: SyntaxError: Simple statements must be separated by newlines or semicolons

265-265: SyntaxError: Expected a statement

265-265: SyntaxError: Expected a statement

265-265: SyntaxError: Expected a statement

265-265: SyntaxError: Expected a statement

265-266: SyntaxError: Expected a statement

266-266: SyntaxError: Expected a statement

266-266: SyntaxError: Expected a statement

266-266: SyntaxError: Expected a statement

266-266: SyntaxError: Expected a statement

266-266: SyntaxError: Simple statements must be separated by newlines or semicolons

267-267: SyntaxError: Unexpected indentation

271-271: SyntaxError: Expected a statement

271-271: SyntaxError: Expected a statement

271-271: SyntaxError: Expected a statement

271-271: SyntaxError: Expected a statement

271-271: SyntaxError: Simple statements must be separated by newlines or semicolons

274-274: SyntaxError: Unexpected indentation

277-277: SyntaxError: Expected a statement

277-277: SyntaxError: Expected a statement

277-277: SyntaxError: Expected a statement

277-277: SyntaxError: Expected a statement

277-278: SyntaxError: Expected a statement

278-278: SyntaxError: Unexpected indentation

280-280: SyntaxError: Expected a statement

280-280: SyntaxError: Expected a statement

280-280: SyntaxError: Expected a statement

280-280: SyntaxError: Expected a statement

280-280: SyntaxError: Simple statements must be separated by newlines or semicolons

🤖 Prompt for AI Agents

In groundX-doc-pipeline/app.py around lines 258 to 281, remove the unresolved Git conflict markers and duplicate lines in display_pdf, keep a single coherent implementation that builds the base64 PDF iframe and calls st.markdown once; after reading the file to create base64_pdf call file.seek(0) to reset the file pointer so downstream code can reuse the file, and ensure there are no leftover "<<<<<<<", "=======" or ">>>>>>>" strings.
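
As an illustration of the resolved function, the HTML construction can be separated from the Streamlit call so it is testable on its own. This is a sketch, not the PR's code: `pdf_iframe_html` is a hypothetical name, and it assumes a seekable binary file object such as the one `st.file_uploader` returns.

```python
import base64
import io
from typing import BinaryIO


def pdf_iframe_html(file: BinaryIO, height: int = 400) -> str:
    """Return an <iframe> that embeds the PDF as a base64 data URI."""
    file.seek(0)  # read from the start even if something already consumed the stream
    data = file.read()
    file.seek(0)  # rewind so later readers see the whole file again
    encoded = base64.b64encode(data).decode("ascii")
    return (
        f'<iframe src="data:application/pdf;base64,{encoded}" '
        f'width="100%" height="{height}" type="application/pdf" '
        'style="border: 1px solid #ddd; border-radius: 8px;"></iframe>'
    )


if __name__ == "__main__":
    sample = io.BytesIO(b"%PDF-1.4 demo")
    print(pdf_iframe_html(sample)[:60])
```

`display_pdf` would then reduce to a heading plus `st.markdown(pdf_iframe_html(file), unsafe_allow_html=True)`.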

coderabbitai · 2025-09-05T17:53:36Z

groundX-doc-pipeline/app.py

+for key in ["xray_data", "uploaded_file_path", "uploaded_file_name", "uploaded_file_type", "processing_complete", "used_existing_file", "auto_loaded_file", "active_tab"]:
    if key not in st.session_state:
        st.session_state[key] = None if key == "xray_data" else False



🛠️ Refactor suggestion

Initialize session_state keys with correct types

Current defaults set strings/paths to False, but later code checks for None (e.g., Line 852). This causes wrong branch selection and UX issues.

-for key in ["xray_data", "uploaded_file_path", "uploaded_file_name", "uploaded_file_type", "processing_complete", "used_existing_file", "auto_loaded_file", "active_tab"]:
-    if key not in st.session_state:
-        st.session_state[key] = None if key == "xray_data" else False
+defaults = {
+    "xray_data": None,
+    "uploaded_file_path": None,
+    "uploaded_file_name": None,
+    "uploaded_file_type": None,
+    "processing_complete": False,
+    "used_existing_file": False,
+    "auto_loaded_file": False,
+    "active_tab": None,
+}
+for k, v in defaults.items():
+    if k not in st.session_state:
+        st.session_state[k] = v

📝 Committable suggestion

Suggested change

-for key in ["xray_data", "uploaded_file_path", "uploaded_file_name", "uploaded_file_type", "processing_complete", "used_existing_file", "auto_loaded_file", "active_tab"]:
-    if key not in st.session_state:
-        st.session_state[key] = None if key == "xray_data" else False
+defaults = {
+    "xray_data": None,
+    "uploaded_file_path": None,
+    "uploaded_file_name": None,
+    "uploaded_file_type": None,
+    "processing_complete": False,
+    "used_existing_file": False,
+    "auto_loaded_file": False,
+    "active_tab": None,
+}
+for k, v in defaults.items():
+    if k not in st.session_state:
+        st.session_state[k] = v

🤖 Prompt for AI Agents

In groundX-doc-pipeline/app.py around lines 354 to 357, the session_state initialization sets string/path keys to False which later breaks None checks; change the defaults so xray_data, uploaded_file_path, uploaded_file_name, uploaded_file_type, and active_tab are initialized to None, while processing_complete, used_existing_file, and auto_loaded_file remain initialized to False so boolean flags keep correct types and downstream None checks behave as expected.
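
The effect of typed defaults can be checked with a plain dict standing in for `st.session_state`, which supports the same `in` test and item assignment. `SESSION_DEFAULTS` and `init_session_state` are hypothetical names used only for this sketch.

```python
from typing import Any, Dict, MutableMapping

SESSION_DEFAULTS: Dict[str, Any] = {
    "xray_data": None,
    "uploaded_file_path": None,
    "uploaded_file_name": None,
    "uploaded_file_type": None,
    "processing_complete": False,
    "used_existing_file": False,
    "auto_loaded_file": False,
    "active_tab": None,
}


def init_session_state(state: MutableMapping[str, Any]) -> None:
    """Fill in missing keys only, so values from earlier reruns are preserved."""
    for key, value in SESSION_DEFAULTS.items():
        if key not in state:
            state[key] = value
```

With the original loop, `uploaded_file_path` would start as `False`, so a check such as `state["uploaded_file_path"] is None` would fail on first load; with these defaults it passes.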

coderabbitai · 2025-09-05T17:53:36Z

groundX-doc-pipeline/app.py

+<<<<<<< Updated upstream
+        # Document Preview Section
+        st.markdown("---")
+        st.markdown("### 📄 Document Preview")
+
+        # Show preview based on file type
+        if uploaded.type == "application/pdf":
+            # For PDF files, show the actual PDF preview using iframe
+            display_pdf(uploaded)
+
+        elif uploaded.type.startswith("image/"):
+            # For image files, show the actual image
+            st.image(uploaded, caption=f"Preview: {uploaded.name}", use_column_width=True)
+
+        elif uploaded.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+            # For DOCX files
+            st.info("📝 **Word Document** - Preview will be available after processing")
+            st.markdown(f"**Content**: Text extraction in progress...")
+
+        else:
+            # For other file types
+            st.info(f"📄 **{uploaded.type}** - Preview will be available after processing")
+
+        # Show file metadata
+        st.markdown("**File Details:**")
+        st.markdown(f"- **Name**: {uploaded.name}")
+        st.markdown(f"- **Size**: {uploaded.size / 1024:.1f} KB")
+        st.markdown(f"- **Type**: {uploaded.type}")
+        st.markdown(f"- **Status**: Ready for processing")
+=======
+        st.markdown("---")
+        st.markdown("### 📄 Document Preview")
+
+        if uploaded.type == "application/pdf":
+            display_pdf(uploaded)
+        elif uploaded.type.startswith("image/"):
+            st.image(uploaded, caption=f"Preview: {uploaded.name}", use_column_width=True)
+        elif uploaded.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+            st.info("📝 **Word Document** - Preview will be available after processing")
+            st.markdown(f"**Content**: Text extraction in progress...")
+        else:
+            st.info(f"📄 **{uploaded.type}** - Preview will be available after processing")
+
+>>>>>>> Stashed changes


⚠️ Potential issue

Resolve merge conflict in Document Preview block (keep metadata + preview)

Conflict markers remain and will break execution. Recommend keeping the richer version with file metadata.

-<<<<<<< Updated upstream
-        # Document Preview Section
-        st.markdown("---")
-        st.markdown("### 📄 Document Preview")
-
-        # Show preview based on file type
-        if uploaded.type == "application/pdf":
-            # For PDF files, show the actual PDF preview using iframe
-            display_pdf(uploaded)
-
-        elif uploaded.type.startswith("image/"):
-            # For image files, show the actual image
-            st.image(uploaded, caption=f"Preview: {uploaded.name}", use_column_width=True)
-
-        elif uploaded.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
-            # For DOCX files
-            st.info("📝 **Word Document** - Preview will be available after processing")
-            st.markdown(f"**Content**: Text extraction in progress...")
-
-        else:
-            # For other file types
-            st.info(f"📄 **{uploaded.type}** - Preview will be available after processing")
-
-        # Show file metadata
-        st.markdown("**File Details:**")
-        st.markdown(f"- **Name**: {uploaded.name}")
-        st.markdown(f"- **Size**: {uploaded.size / 1024:.1f} KB")
-        st.markdown(f"- **Type**: {uploaded.type}")
-        st.markdown(f"- **Status**: Ready for processing")
-=======
-        st.markdown("---")
-        st.markdown("### 📄 Document Preview")
-
-        if uploaded.type == "application/pdf":
-            display_pdf(uploaded)
-        elif uploaded.type.startswith("image/"):
-            st.image(uploaded, caption=f"Preview: {uploaded.name}", use_column_width=True)
-        elif uploaded.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
-            st.info("📝 **Word Document** - Preview will be available after processing")
-            st.markdown(f"**Content**: Text extraction in progress...")
-        else:
-            st.info(f"📄 **{uploaded.type}** - Preview will be available after processing")
-
->>>>>>> Stashed changes
+        st.markdown("---")
+        st.markdown("### 📄 Document Preview")
+        if uploaded.type == "application/pdf":
+            display_pdf(uploaded)
+        elif uploaded.type.startswith("image/"):
+            st.image(uploaded, caption=f"Preview: {uploaded.name}", use_column_width=True)
+        elif uploaded.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+            st.info("📝 Word Document — preview will be available after processing")
+            st.markdown("Content: text extraction in progress…")
+        else:
+            st.info(f"📄 {uploaded.type} — preview will be available after processing")
+        st.markdown("**File Details:**")
+        st.markdown(f"- Name: {uploaded.name}")
+        st.markdown(f"- Size: {uploaded.size / 1024:.1f} KB")
+        st.markdown(f"- Type: {uploaded.type}")
+        st.markdown("- Status: Ready for processing")

📝 Committable suggestion

Suggested change

-<<<<<<< Updated upstream
-        # Document Preview Section
-        st.markdown("---")
-        st.markdown("### 📄 Document Preview")
-
-        # Show preview based on file type
-        if uploaded.type == "application/pdf":
-            # For PDF files, show the actual PDF preview using iframe
-            display_pdf(uploaded)
-
-        elif uploaded.type.startswith("image/"):
-            # For image files, show the actual image
-            st.image(uploaded, caption=f"Preview: {uploaded.name}", use_column_width=True)
-
-        elif uploaded.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
-            # For DOCX files
-            st.info("📝 **Word Document** - Preview will be available after processing")
-            st.markdown(f"**Content**: Text extraction in progress...")
-
-        else:
-            # For other file types
-            st.info(f"📄 **{uploaded.type}** - Preview will be available after processing")
-
-        # Show file metadata
-        st.markdown("**File Details:**")
-        st.markdown(f"- **Name**: {uploaded.name}")
-        st.markdown(f"- **Size**: {uploaded.size / 1024:.1f} KB")
-        st.markdown(f"- **Type**: {uploaded.type}")
-        st.markdown(f"- **Status**: Ready for processing")
-=======
-        st.markdown("---")
-        st.markdown("### 📄 Document Preview")
-
-        if uploaded.type == "application/pdf":
-            display_pdf(uploaded)
-        elif uploaded.type.startswith("image/"):
-            st.image(uploaded, caption=f"Preview: {uploaded.name}", use_column_width=True)
-        elif uploaded.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
-            st.info("📝 **Word Document** - Preview will be available after processing")
-            st.markdown(f"**Content**: Text extraction in progress...")
-        else:
-            st.info(f"📄 **{uploaded.type}** - Preview will be available after processing")
-
->>>>>>> Stashed changes
+        st.markdown("---")
+        st.markdown("### 📄 Document Preview")
+        if uploaded.type == "application/pdf":
+            display_pdf(uploaded)
+        elif uploaded.type.startswith("image/"):
+            st.image(uploaded, caption=f"Preview: {uploaded.name}", use_column_width=True)
+        elif uploaded.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+            st.info("📝 Word Document — preview will be available after processing")
+            st.markdown("Content: text extraction in progress…")
+        else:
+            st.info(f"📄 {uploaded.type} — preview will be available after processing")
+        st.markdown("**File Details:**")
+        st.markdown(f"- Name: {uploaded.name}")
+        st.markdown(f"- Size: {uploaded.size / 1024:.1f} KB")
+        st.markdown(f"- Type: {uploaded.type}")
+        st.markdown("- Status: Ready for processing")

🧰 Tools

🪛 Ruff (0.12.2)

405-405: SyntaxError: Expected a statement

405-405: SyntaxError: Expected a statement

405-405: SyntaxError: Expected a statement

405-405: SyntaxError: Expected a statement

405-405: SyntaxError: Simple statements must be separated by newlines or semicolons

407-407: SyntaxError: Unexpected indentation

434-434: SyntaxError: Expected a statement

434-434: SyntaxError: Expected a statement

434-434: SyntaxError: Expected a statement

434-434: SyntaxError: Expected a statement

434-435: SyntaxError: Expected a statement

435-435: SyntaxError: Unexpected indentation

448-448: SyntaxError: Expected a statement

448-448: SyntaxError: Expected a statement

448-448: SyntaxError: Expected a statement

448-448: SyntaxError: Expected a statement

448-448: SyntaxError: Simple statements must be separated by newlines or semicolons

🤖 Prompt for AI Agents

In groundX-doc-pipeline/app.py around lines 405 to 448, remove the leftover Git conflict markers and merge the two variants so the richer block is kept: keep the "Document Preview" heading, the conditional preview rendering for PDF, images and DOCX, the fallback info message, and also retain the file metadata lines (Name, Size, Type, Status). Replace the conflict markers (<<<<<<<, =======, >>>>>>>) with a single coherent block that includes both the preview logic and the metadata st.markdown lines.
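
Since the same class of error recurs in several places in this review, a quick scan for leftover markers before pushing may help. The following is a minimal sketch, not part of the PR: the helper names `find_conflict_markers` and `check_file` are hypothetical, and the pattern assumes Git's default seven-character markers at the start of a line.

```python
import re
from pathlib import Path

# Git writes conflict markers at the start of a line: seven '<', '=', '>'
# (or '|' in diff3 style), followed by whitespace or the end of the line.
CONFLICT_MARKER = re.compile(r"^(<{7}|={7}|>{7}|\|{7})(\s|$)")


def find_conflict_markers(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for every leftover conflict marker."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if CONFLICT_MARKER.match(line):
            hits.append((lineno, line.rstrip()))
    return hits


def check_file(path: str) -> bool:
    """Print any markers found in `path`; return True when the file is clean."""
    hits = find_conflict_markers(Path(path).read_text(encoding="utf-8"))
    for lineno, line in hits:
        print(f"{path}:{lineno}: {line}")
    return not hits
```

Calling `check_file("groundX-doc-pipeline/app.py")` from the repository root would list each offending line. A line consisting only of `=======` in a non-Python file (for example a reStructuredText heading underline) would be reported as well, so the check is best limited to source files.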

coderabbitai · 2025-09-05T17:53:36Z

groundX-doc-pipeline/app.py

+<<<<<<< Updated upstream
+    # Document Preview Section (after processing)
+    with st.expander("📄 Document Preview", expanded=False):
+        st.markdown("### 📋 Document Summary")
+        file_summary = xray.get("fileSummary")
+        if file_summary:
+            st.markdown(file_summary)
+        else:
+            st.info("No summary available")
+
+        st.markdown("### 📝 Sample Content")
+        # Show first few chunks of extracted text
+        if "documentPages" in xray and xray["documentPages"]:
+            sample_texts = []
+            for page in xray["documentPages"][:2]:  # First 2 pages
+                if "chunks" in page:
+                    for chunk in page["chunks"][:2]:  # First 2 chunks per page
+                        if "text" in chunk and chunk["text"]:
+                            text = chunk["text"]
+                            if len(text) > 200:
+                                text = text[:200] + "..."
+                            sample_texts.append(text)
+
+            if sample_texts:
+                for i, text in enumerate(sample_texts, 1):
+                    st.markdown(f"**Sample {i}:**")
+                    st.markdown(text)
+                    st.markdown("---")
+            else:
+                st.info("No text content available for preview")
+
+        st.markdown("### 🏷️ Key Topics")
+        if xray.get("fileKeywords"):
+            keywords_list = xray["fileKeywords"].split(",")
+            # Show first 10 keywords
+            display_keywords = keywords_list[:10]
+            keyword_tags = " ".join([f"`{kw.strip()}`" for kw in display_keywords])
+            st.markdown(keyword_tags)
+        else:
+            st.info("No keywords available")
+
    # Primary interface tabs for analysis and interaction
    main_tabs = st.tabs([
        "📊 X-Ray Analysis",
        "💬 Chat"
    ])
+=======
+    # Create a left-aligned container for the tab buttons
+    col1, col2 = st.columns([1, 4])
+>>>>>>> Stashed changes


⚠️ Potential issue

Resolve conflict: choose segmented buttons over Streamlit tabs

Per PR summary, primary navigation moved to segmented buttons. Remove the old st.tabs() block and keep the segmented-control layout.

-<<<<<<< Updated upstream
-    # Document Preview Section (after processing)
-    with st.expander("📄 Document Preview", expanded=False):
-        ...
-    # Primary interface tabs for analysis and interaction
-    main_tabs = st.tabs([
-        "📊 X-Ray Analysis",
-        "💬 Chat"
-    ])
-=======
-    # Create a left-aligned container for the tab buttons
-    col1, col2 = st.columns([1, 4])
->>>>>>> Stashed changes
+    # Segmented buttons for primary nav (Analysis vs Chat)
+    col1, col2 = st.columns([1, 4])

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.12.2)

552-552: SyntaxError: Expected a statement

552-552: SyntaxError: Expected a statement

552-552: SyntaxError: Expected a statement

552-552: SyntaxError: Expected a statement

552-552: SyntaxError: Simple statements must be separated by newlines or semicolons

554-554: SyntaxError: Unexpected indentation

598-598: SyntaxError: Expected a statement

598-598: SyntaxError: Expected a statement

598-598: SyntaxError: Expected a statement

598-598: SyntaxError: Expected a statement

598-599: SyntaxError: Expected a statement

600-600: SyntaxError: Unexpected indentation

601-601: SyntaxError: Expected a statement

601-601: SyntaxError: Expected a statement

601-601: SyntaxError: Expected a statement

601-601: SyntaxError: Expected a statement

601-601: SyntaxError: Simple statements must be separated by newlines or semicolons

🤖 Prompt for AI Agents

In groundX-doc-pipeline/app.py around lines 552 to 601, resolve the merge conflict by removing the old Streamlit tabs block and any conflict markers; keep the segmented-control layout (the col1, col2 = st.columns([1, 4]) section) as the primary navigation. Delete the entire st.tabs(...) block and the surrounding conflict markers (<<<<<<<, =======, >>>>>>>) so only the segmented buttons layout remains and follow-up code uses that layout for navigation.

coderabbitai · 2025-09-05T17:53:36Z

groundX-doc-pipeline/app.py

+        with tabs[0]:
+            st.subheader("🔍 Raw JSON Data")
+            st.json(xray)

-    with tabs[1]:
-        st.subheader("📝 Narrative Summary")
-        # Extract and display narrative content from document chunks
-        narratives = []
-        if "documentPages" in xray:
-            for page in xray["documentPages"]:
-                if "chunks" in page:
-                    for chunk in page["chunks"]:
-                        if "narrative" in chunk and chunk["narrative"]:
-                            narratives.extend(chunk["narrative"])
-
-        if narratives:
-            for i, narrative in enumerate(narratives, 1):
-                st.markdown(f"**Narrative {i}:**")
-                st.markdown(narrative)
-                st.divider()
-        else:
-            st.info("No narrative text found in the X-Ray data")
+        with tabs[1]:
+            st.subheader("📝 Narrative Summary")
+<<<<<<< Updated upstream
+            # Extract and display narrative content from document chunks
+=======
+>>>>>>> Stashed changes
+            narratives = []
+            if "documentPages" in xray:
+                for page in xray["documentPages"]:
+                    if "chunks" in page:
+                        for chunk in page["chunks"]:
+                            if "narrative" in chunk and chunk["narrative"]:
+                                narratives.extend(chunk["narrative"])
+
+            if narratives:
+                for i, narrative in enumerate(narratives, 1):
+                    st.markdown(f"**Narrative {i}:**")
+                    st.markdown(narrative)
+                    st.divider()
+            else:
+                st.info("No narrative text found in the X-Ray data")



⚠️ Potential issue

Remove remaining conflict markers in analysis sub-tabs

Only comments differ between versions; keep the code as-is and drop the markers.

-<<<<<<< Updated upstream
-            # Extract and display narrative content from document chunks
-=======
->>>>>>> Stashed changes

-<<<<<<< Updated upstream
-            # Extract and display suggested text content from document chunks
-=======
->>>>>>> Stashed changes

-<<<<<<< Updated upstream
-            # Extract and display raw text content from document chunks
-=======
->>>>>>> Stashed changes

Also applies to: 770-791, 792-811

🧰 Tools

🪛 Ruff (0.12.2)

742-742: SyntaxError: Expected a statement

742-742: SyntaxError: Expected a statement

742-742: SyntaxError: Expected a statement

742-742: SyntaxError: Expected a statement

742-742: SyntaxError: Simple statements must be separated by newlines or semicolons

744-744: SyntaxError: Expected a statement

744-744: SyntaxError: Expected a statement

744-744: SyntaxError: Expected a statement

744-744: SyntaxError: Expected a statement

744-745: SyntaxError: Expected a statement

745-745: SyntaxError: Expected a statement

745-745: SyntaxError: Expected a statement

745-745: SyntaxError: Expected a statement

745-745: SyntaxError: Expected a statement

745-745: SyntaxError: Simple statements must be separated by newlines or semicolons

746-746: SyntaxError: Unexpected indentation

🤖 Prompt for AI Agents

In groundX-doc-pipeline/app.py around lines 736 to 761 (and also at ranges 770-791 and 792-811), there are leftover Git conflict markers (<<<<<<<, =======, >>>>>>>) in the analysis sub-tabs; remove those markers and retain the existing code as-is (keep the narrative extraction/display logic unchanged), ensuring no extra whitespace or commented markers remain and the file compiles/runs cleanly.
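
The narrative tab quoted in this comment walks `documentPages` and `chunks` with nested key checks, and the comment notes that two further sub-tabs follow the same shape. As an optional follow-up, that traversal could be factored into one helper. This is a sketch only: `collect_chunk_field` is a hypothetical name, and the X-Ray structure is assumed to be exactly what the quoted code implies (a `documentPages` list whose items carry a `chunks` list).

```python
from typing import Any


def collect_chunk_field(xray: dict[str, Any], field: str) -> list[str]:
    """Gather every non-empty value of `field` from all chunks on all pages."""
    values: list[str] = []
    for page in xray.get("documentPages") or []:
        for chunk in page.get("chunks") or []:
            value = chunk.get(field)
            if not value:
                continue
            # The quoted code calls extend(), which implies a list; a bare
            # string is appended whole so it is not split into characters.
            if isinstance(value, str):
                values.append(value)
            else:
                values.extend(value)
    return values
```

With this in place, the narrative tab would reduce to `narratives = collect_chunk_field(xray, "narrative")` followed by the existing display loop, and a tab that reads `chunk["text"]` could pass `"text"` instead.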

namanvirk18 added 2 commits September 5, 2025 21:12

add doc preview in app.py

0ca97fc

Update app.py

1fc4085

coderabbitai bot reviewed Sep 5, 2025


-for key in ["xray_data", "uploaded_file_path", "uploaded_file_name", "uploaded_file_type", "processing_complete", "used_existing_file", "auto_loaded_file", "active_tab"]:
-    if key not in st.session_state:
-        st.session_state[key] = None if key == "xray_data" else False
+defaults = {
+    "xray_data": None,
+    "uploaded_file_path": None,
+    "uploaded_file_name": None,
+    "uploaded_file_type": None,
+    "processing_complete": False,
+    "used_existing_file": False,
+    "auto_loaded_file": False,
+    "active_tab": None,
+}
+for k, v in defaults.items():
+    if k not in st.session_state:
+        st.session_state[k] = v
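
The effect of this suggestion can be checked outside Streamlit. In the sketch below a plain dict stands in for `st.session_state` (an assumption made so the example runs without the app), and `init_state` is a hypothetical helper name. The keys and default values are copied from the diff above. Note the behavior change: the removed loop set every key except `xray_data` to `False`, including the file path and `active_tab`, whereas the mapping gives those keys `None`.

```python
from collections.abc import MutableMapping
from typing import Any

# Defaults copied from the suggested diff.
DEFAULTS: dict[str, Any] = {
    "xray_data": None,
    "uploaded_file_path": None,
    "uploaded_file_name": None,
    "uploaded_file_type": None,
    "processing_complete": False,
    "used_existing_file": False,
    "auto_loaded_file": False,
    "active_tab": None,
}


def init_state(state: MutableMapping[str, Any], defaults: dict[str, Any]) -> None:
    """Fill in missing keys only, so values from earlier reruns survive."""
    for key, value in defaults.items():
        if key not in state:
            state[key] = value


# Stand-in for st.session_state after the user has already picked a tab.
state: dict[str, Any] = {"active_tab": "chat"}
init_state(state, DEFAULTS)
```

After the call, `state["active_tab"]` is still `"chat"` while the remaining keys hold their defaults, which is the property that keeps the selected tab stable across Streamlit reruns.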
